The Lord of the Coins: The Two Towers

9 minute read

Recently there was an interesting puzzle posted on the ODS Slack channel. An ancient evil dragon captures Alice and Bob and puts them in two different towers. The dragon tosses a fair coin infinitely many times for Alice, and then does the same for Bob. The adventures should each name a number, and the dragon checks the coin toss outcome of the Alice’s infinite sequence at the position number, named by Bob, and then the coin toss outcome of the Bob’s sequence at the position number, named by Alice. If the outcomes are the same, our heroes are free. What strategy can you think of to maximise the probability of the successful release?

If you want to read the final thoughts on the possible solution, please go to the section ‘Scalar products, decision vectors and gradients’. Otherwise, here’s the story how I came up with it.

First things first

This problem looked suspicious to me, because neither Alice nor Bob know each others sequences, and naturally don’t hear the second person’s choice (and therefore can’t infer anything from that).

Three questions you might want to ask before approaching this problem. First, why does the dragon have towers, if they usually live in caves filled with gold and gems? Second, how fast the dragon should toss the coins before Alice and Bob get bored? Third, is there a way to predict a simingly stochastic value?

We can start from a simple question: apparently, there is a bunch of possible strategies.

I can put the success condition in a slighltly more formal way. Let’s call the sequence Alice sees \(A = \left(a_1, a_2, \dots \right)\), and the sequence Bob sees \(B = \left(b_1, b_2, \dots \right)\). If Alice picks \(k\) as the index for Bob, and Bob’s choice is \(m\), they are doing great if \(A[m] = a_m = b_k = B[k]\). Bad luck otherwise! They are going to solve stupid puzzles for the rest of their days.

Minimum viable strategy

Intuitively, the odds of winning are 50/50. Indeed, a simple strategy: “pick 1” works like that. A way to illustrate this is to use this table:


The first row and the first column are the players’ possible outcomes at their first coin toss. In two out of four times the first items are the same, i.e. the total winning probability is \(\mathcal{P} = 0.5\).

Interestingly, since Alice and Bob have never actually used any information about their sequences, it doesn’t matter what numbers the pick.

Can the first coin help us?

The next stage of developing our strategy is to think whether the outcome of the first coin toss can anyhow help us. Let’s try out the following strategy:

If this first coin is heads (H), pick "1". Otherwise, pick any other number.

We can explicitly calculate the probability of winning in all possible cases, and then get the full probability. Without loss of generality we assume Alice picks number \(k\), nd Bob picks \(m\) (when they don’t see heads at the start of their sequences). The corresponding outcomes table looks somewhat like this:

$$a_1 = H, a_m = H$$ $$a_1 = H, a_m = T$$ $$a_1 = T, a_m = H$$ $$a_1 = T, a_m = T$$
$$b_1 = H, b_k = H$$
$$b_1 = H, b_k = T$$
$$b_1 = T, b_k = H$$
$$b_1 = T, b_k = T$$

The table provides all possible outcomes for current strategy. For example, if Alice and Bob both have heads first \(a_1 = b_1 = H\), they win regardless of what are the other elements of their sequences. Otherwise, there’s a 50% match probability. Overall, this gives us \(\mathcal{P} = 10 / 16 = 0.625 > 0.5\), which is already great!

I believe this ‘imbalance’ comes from the fact Alice and Bob have actually communicated before the game. We have removed some uncertainty in the decision making process, and improved the odds.

Which came first, the mathematician or the puzzle?

Let’s try to add some extra a priory information into the process. Now, the decision is based on the first appearance of, say, heads in the sequence, instead of just the first item.

Pick number equal to the position of the first heads in your sequence.

Naturally, the probability of having the first heads on \(k\)-th trial is given by geometric distribution with \(p = 0.5\), and therefore equals to \(P(a_1 = \dots = a_{k-1} = T, a_k = H) = 2^{-k}\). Let’s consider as an example the case when Alice has heads on the first position \(a_1 = H\).

Bob has a 0.5 probability to have heads on the first position, and therefore it’s a match - they both call “1” and point to each other’s heads. If Bob has his first heads on any other position (say, \(m\)), then the probability of winning is essentially 0.5 (we don’t know anything about Alice’s \(m\)-th item, which should be tails, because Alice will point to Bob’s first item, tails). So, the probability of the event “Bob has first heads at \(m\) and Alice has tails at \(m\)” is \(P(b_1 = \dots = b_{m-1} = T, b_m = H, a_m = T)=0.5\cdot 2^{-m} = 2^{-m-1}\).

Summing up the last paragraph, if Alice has heads first, the probability of winning is

\[\mathcal{P}(win | a_1 = H) = \frac{1}{2} + \sum_{m=1}^\infty \frac{1}{2^m+1} = \frac{1}{2} + \frac{1}{4}.\]

We can continue and calculate the winning odds for the cases Alice has first heads after one tail, two tails, and so on. I will leave the proof to the reader:

\[\mathcal{P}(win | a_1 = \dots = a_{k-1} = T, a_k = H) = \frac{1}{2} + \sum_{m=1}^\infty \frac{1}{2^m+1} = \frac{1}{2} + \frac{1}{2^{k+1}}.\]

Finally, explicit treatment of probability of all possible Alice’s cases results in

\[\mathcal{P} = \sum_{k = 1}^\infty \frac{1}{2^k} \left(\frac{1}{2} + \frac{1}{2^{k+1}} \right) = \frac{2}{3}.\]

Three is a lucky number

Before we dive into boring maths, this is a solution proposed by @mihaild. I like its symmetry, and the way it enriches the ‘first appearance’ approach.

Alice and Bob split their sequences into triplets, and focus on the first triplet with not all the same elemets (\(HHH\) and \(TTT\) are called trivial and are skipped). We can show, that the probability Alice and Bob have the first non-trivial triplet at the same position is 0.6.

Then, we use this rule to pick an intermediate number:

{'HHT': 1, 'HTH': 2, 'HTT': 2, 'THH': 0, 'THT': 1, 'TTH': 0}

We add the intermediate number to the global index of the first element in the first non-trivial triplet, and this will be the number we tell to the dragon. (Here by global index I mean the index of the element in the original sequence given to us).

Again, direct simulation shows that Alice and Bob win in 5 out of 6 games (this applies to the case of Alice and Bob have the first non-trivial triplet at the same position). Otherwise, they win in 1 out of 2 games, equivalently to the Minimum Viable Strategy. This gives us \(\mathcal{P} = 0.7\) probability of winning.

Scalar products, decision vectors and gradients.

I am going to use numerical values for heads and tails: let \(H \equiv 1, T \equiv -1\), from now on. The Heaviside step function \(\mathbb{H}(x)\) will be also very useful.

Let’s denote a sequence of coin toss outcomes of length \(N\) by vector \(\mathbf{a}\). The elements of this vector \(a_i, i = 1, \dots, N\) are either +1 or -1, which is heads or tails as we called them before. There are \(2^N\) possible different such vectors, and to distinquish one from another the upper index is used, \(\mathbf{a}^m\).

The general idea of all the discussed strategies was to find a set of numbers, one per each possible state of your own sequence. Indeed, what that number really does, is pulls out a certain element of the unknown sequence. This is exactly how a scalar product works. I prefer to think about it as the dragon takes a scalar product of the Bob’s heads-tails sequence and the Alice’s decision vector. The decision vector has all zeroes (do not pick) and a single ‘one’ (pick this) value. This scalar product is:

\[(\mathbf{b} \cdot \mathbf{d}) = \sum_i b_i \delta_{ik} = b_k,\]

where \(\mathbf{b}\) is the Bob’s sequence, \(\mathbf{d}\) is the Alice’s decision vector with elements \(d_i = \delta_{ik}\) (Alice picks \(k\)).

I am going to focus on a strategy, which only operates with \(N\) first elements of the sequences. Also, Alice and Bob should follow the same strategy. I assume for every \(\mathbf{a}^m\) there exists a decision vector, \(\mathbf{d}^m\), even though I have no idea what it looks like. Then, if Alice has got a sequence with the first \(N\) elements being \(\mathbf{a}^m\) and for Bob it is \(\mathbf{a}^n\), they win if and only if

\[(\mathbf{a}^m \cdot \mathbf{d}^n) = (\mathbf{a}^n \cdot \mathbf{d}^m).\]

For given \((m,n)\) they win exactly if

\[\sum_{i,j} d^n_i d^m_j \mathbb{H}(a^m_i a^n_j ) = 1.\]

Therefore, if we sum this expression over all possible \(m\) and \(n\), we are getting the total number of wins. And the total winning probability is just this number, normalized by the number of all possible Alice-Bob combinations. Namely,

\[\mathcal{P} = \frac{1}{2^{2N}} \sum_{m,n} \sum_{i,j} d^n_i d^m_j \mathbb{H}(a^m_i a^n_j)\]

Now, carefully: sleight of hand. The decision vector elements can be not only zeroes and ones. The elements of the decision vector represent the probability of choosing this particular index. From now on, for any appropriate \(a\) and \(b\), \(0 \le d^b_a \le 1\), \(\sum_a d^b_a = 1\).

This is a typical optimization problem with analytical gradients. The gradient of the probability function \(\mathcal{P}\) with respect to a decision vector element \(d^b_a\) equals to

\[\frac{\partial \mathcal{P}}{\partial d^b_a} = \mathbb{C}_b \sum_{i,m} d^m_i \left(\mathbb{H}(a^m_i a^b_a) - \sum_j \left( d^b_j \mathbb{H}(a^m_i a^b_j)\right) \right).\]

A positive non-zero constant \(\mathbb{C}_ b\) doesn’t change the gradient, so it is possible to use this expression for actual decision vectors tuning.

Gradient-based optimization: results

I could find a local optimum with \(\mathcal{P} = 44/64\), for a case with rather small \(N = 3\). The optimal sequence vectors - decision vectors system looks like:

A = [[-1, -1, -1],
    [-1, -1,  1],
    [-1,  1, -1],
    [-1,  1,  1],
    [ 1, -1, -1],
    [ 1, -1,  1],
    [ 1,  1, -1],
    [ 1,  1,  1]]

D = [[ 0.33, 0.33, 0.34],
    [ 0.,  0.,  1.],
    [ 1.,  0.,  0.],
    [ 0.,  0.,  1.],
    [ 0.,  1.,  0.],
    [ 0.,  1.,  0.],
    [ 1.,  0.,  0.],
    [ 0.33, 0.33, 0.34]]

This solution is also a global optimum, which can be verified by brute force.

The solution is not unique. For example, the first and the last row of the decision matrix D can be shuffled as you wish: the trivial triplets (supposedly) should always result in the same contribution regardless of the decision vector elements.

If I increase the sequence length to \(N=5\), the winning probability is slightly higher, \(\mathcal{P} = 358/512\).

Some conclusions

I have strong doubts we can reach \(\mathcal{P} = 1\) as \(N \to \infty\). Moreover, I have shown that for \(N \le 15\) the best probability you can get is below 0.7, which is reachable with the ‘triples’ strategy mentioned above. My implementation of the optimization algorithm doesn’t scale well with \(N\), so it is hard to find optimal solutions for large Ns. I’m looking forward to make it reasonably fast, even though the decision matrix size grows exponentially.

If you have any ideas on the symmetries of the optimal solutions, or, more important, you can spot any mistakes and typos, please let me know.

Thanks for reading and happy escape from the dragon!