Garbled Email, Round 1B 2013.

This is a tricky problem. The first tricky part is figuring out an efficient way to store the dictionary so that we can quickly identify whether a particular substring is a prefix of a word in the dictionary. This is a well-studied problem, and the right data structure is a trie. Suppose we’re working with a small dictionary containing the words **a, aac, acb, baa, bbc, bc, caba, caab**. We’ll start our trie with an empty node, and then keep adding words until we’re finished. We start by adding the word **a**. We link a node with the letter *a* to the root, and then link a marker node to that to show that this is the end of a word.

Next, we’ll add in the word **aac**. We already have the letter *a* linked to the root, so we go there and add more letters to that node.

And we repeat this until all the words are in the tree. For each letter in the new word, if that letter is already a child of the current node, we walk to that child. If not, we create a new child. Once we’re at the end of the word, we add a marker child to the last node denoting the end of the word. After we’re finished with our small dictionary, we end up with this:

Each of the red nodes marks the end of a word, so we can see that we have all eight words in the trie.

Now that we have our data stored in an easily-accessible manner, let’s start with a simpler problem than the one asked. Suppose that there was no garbling, just spaces removed, and we’re given some string, say, **aacaabbc**. How can we use this structure to quickly find out whether this string could be divided up into words from our dictionary? Start at the root of the tree and look at the first letter a of the string. *a* is a child of our current node, so we’ll walk to that node and move to the next letter. Here we see an end node as a child, marking that we’ve now finished a possible node. So we know we can split off the word **a** from our string, leaving **acaabbc** behind. This split will produce a valid word separation for the whole string if and only if **acaabbc** can itself be divided up into words from our dictionary.

First, though, we need to finish walking down and make sure we’ve found the only possible split. The next letter is *a*, which is a child of our current node, so we move there. This node has no end node as a child, so it’s not a possible word separation. The third letter is *c*, which is a child of this node, so we’ll move there next. This node is another word separation, giving us **aac** as the first word and **aabbc** as the remainder of the string. There are no other children of this node, so these are the only possible word separations.

So we now know that **aacaabbc** is divisible into words if and only if at least one of its two suffixes **acaabbc** and **aabbc** are divisible into words. This suggests that we might be able to use dynamic programming for our problem. We can create an array of boolean variables telling us whether each suffix of our string can be divided into words. We can look at the suffixes of our string in order of increasing size and calculate whether each of them is divisible. The larger suffixes can then use the calculations previously done on the smaller suffixes to quickly calculate whether they themselves are divisible.

Starting with the smallest suffix, **c**, we quickly find that it is not a word, so we mark that index as False. **bc**, the next smallest, so we mark it as True, and similarly for **bbc**. **abbc** gives a potential split at **a|bbc**, and we know that **bbc** is True, so **abbc** must be True as well. We then keep working backwards until we reach the beginning of the string, and then the final answer is simply the result in the first element of our array. Doing this we quickly find that **aacaabba** is in fact divisible.

So this works for messages where there is no garbling at all. Now let’s add the garbling in and see if we can work out a solution. We’ll simplify the problem slightly, though, by allowing any character in the string to be garbled, ignoring the constraints on how close the garbled characters are allowed to be. Suppose we have the string **abbbcbba**. We can work backwards, just as before, from smaller to larger suffixes. This time, though, instead of only storing whether or not a suffix is divisible into words, we’ll store how many changes are needed to make it divisible.

**c** alone, for instance, isn’t a word, but we can change it into **a** with a single substitution, so our answer is one. The next two suffixes, **bc** and **bbc**, are both words, so they get values of zero. **cbbc** isn’t valid, but if we swap it to **abbc** we can split it at **a|bbc**, giving us two valid words, for an answer of one. We keep working backwards just as before, looking for all possible words we can construct at the start of our suffix, splitting it, and then putting the minimum changes required into our array. Once we’re finished, the final answer is just the result in the first element of our array. So we find that **abbbcbbc** can be made divisible with only a single character swap.

Finally, we want to incorporate the closeness requirements for our garbled characters – changes must have at least four non-garbled characters between them. Now for each suffix we want to store five different values, giving the minimum changes required for this suffix when the first *x* characters are required to match perfectly, with *x* ranging between 0 and 4 inclusive. Let’s look at this array for the string **abbbcbba**.

Here **N** is marking divisions that are not possible at all. Otherwise, we fill the array with the number of changes required. Again, let’s start at the back and move towards the front of the string. **a**, the last suffix, is a word itself, so we fill its column with zeros. **ba** is not a word, but could be divisible if we change one of the characters, to either **bc** or **a|a**. The bottom three rows require us to keep the first two characters unchanged, so those entries are marked as **N. **Both of the top two are possible with only a single change. **bba** is valid only if changed into **bbc**, which requires no more than two characters held unchanged, so we mark up the column appropriately. We continue in the same manner, looking for potential word divisions and marking down the minimum changes required that allow the required distance between changes. Then, just as with the simpler problems, we read off our final answer as the entry in the first column and the first row.

def precalculate(): #We're going to load the dictionary into a trie global tree global end tree = {} end = '' def add_branch(tree, word): for char in word: if not char in tree: tree[char] = {} tree = tree[char] tree[end] = True infile = open('./garbled_email_dictionary.txt') for line in infile.readlines(): add_branch(tree, line.strip()) infile.close() def match_words(S, prefix, index, required): suffix = S[index:] #Walk down the tree to find the right branch branch = tree for char in prefix: branch = branch[char] if suffix == '': if end in branch: return 0 else: return -1 def check(best, new, inc=0): if new == -1: return best if best is None: best = new + inc else: best = min(best, new + inc) return best best = None char = suffix[0] if char in branch: #valid match if end in branch[char]: #end of a word best = check(best, mincosts[index + 1, max(0, required - 1)]) best = check(best, match_words(S, prefix + char, index + 1, max(0, required - 1))) if required == 0: for this in branch: if this == end or this == char: continue if end in branch[this]: best = check(best, mincosts[index + 1, 4], 1) if best is None or best > 1: #Don't bother checking if best is already good enough best = check(best, match_words(S, prefix + this, index + 1, 4), 1) if best is None: return -1 else: return best def solve_case(S): global mincosts mincosts = np.zeros((len(S) + 1, 5), dtype=int) for index in range(len(S))[::-1]: for required in range(5): mincosts[index, required] = match_words(S, '', index, required) return mincosts[0,0]

Pingback: Google Code Jam – Summary | Reflections of Dusk