GCJ – Garbled Email

Garbled Email, Round 1B 2013.

This is a tricky problem.  The first tricky part is figuring out an efficient way to store the dictionary so that we can quickly identify whether a particular substring is a prefix of a word in the dictionary.  This is a well-studied problem, and the right data structure is a trie.  Suppose we’re working with a small dictionary containing the words a, aac, acb, baa, bbc, bc, caba, caab.  We’ll start our trie with an empty node, and then keep adding words until we’re finished.  We start by adding the word a.  We link a node with the letter a to the root, and then link a marker node to that to show that this is the end of a word.

A simple trie.

Next, we’ll add in the word aac. We already have the letter a linked to the root, so we go there and add more letters to that node.

A larger trie.

And we repeat this until all the words are in the tree. For each letter in the new word, if that letter is already a child of the current node, we walk to that child. If not, we create a new child. Once we’re at the end of the word, we add a marker child to the last node denoting the end of the word. After we’re finished with our small dictionary, we end up with this:

A full trie.

Each of the red nodes marks the end of a word, so we can see that we have all eight words in the trie.

Now that we have our data stored in an easily-accessible manner, let’s start with a simpler problem than the one asked.  Suppose that there was no garbling, just spaces removed, and we’re given some string, say, aacaabbc.  How can we use this structure to quickly find out whether this string could be divided up into words from our dictionary?  Start at the root of the tree and look at the first letter a of the string.  a is a child of our current node, so we’ll walk to that node and move to the next letter.  Here we see an end node as a child, marking that we’ve now finished a possible node.  So we know we can split off the word a from our string, leaving acaabbc behind.  This split will produce a valid word separation for the whole string if and only if acaabbc can itself be divided up into words from our dictionary.

First, though, we need to finish walking down and make sure we’ve found the only possible split.  The next letter is a, which is a child of our current node, so we move there.  This node has no end node as a child, so it’s not a possible word separation.  The third letter is c, which is a child of this node, so we’ll move there next.  This node is another word separation, giving us aac as the first word and aabbc as the remainder of the string.  There are no other children of this node, so these are the only possible word separations.

So we now know that aacaabbc is divisible into words if and only if at least one of its two suffixes acaabbc and aabbc are divisible into words.  This suggests that we might be able to use dynamic programming for our problem.  We can create an array of boolean variables telling us whether each suffix of our string can be divided into words.  We can look at the suffixes of our string in order of increasing size and calculate whether each of them is divisible.  The larger suffixes can then use the calculations previously done on the smaller suffixes to quickly calculate whether they themselves are divisible.

Dynamic programming results.

Starting with the smallest suffix, c, we quickly find that it is not a word, so we mark that index as False.  bc, the next smallest, so we mark it as True, and similarly for bbc.  abbc gives a potential split at a|bbc, and we know that bbc is True, so abbc must be True as well.  We then keep working backwards until we reach the beginning of the string, and then the final answer is simply the result in the first element of our array.  Doing this we quickly find that aacaabba is in fact divisible.

So this works for messages where there is no garbling at all.  Now let’s add the garbling in and see if we can work out a solution.  We’ll simplify the problem slightly, though, by allowing any character in the string to be garbled, ignoring the constraints on how close the garbled characters are allowed to be.  Suppose we have the string abbbcbba.  We can work backwards, just as before, from smaller to larger suffixes.  This time, though, instead of only storing whether or not a suffix is divisible into words, we’ll store how many changes are needed to make it divisible.

Dynamic programming with counts.c alone, for instance, isn’t a word, but we can change it into a with a single substitution, so our answer is one.  The next two suffixes, bc and bbc, are both words, so they get values of zero.  cbbc isn’t valid, but if we swap it to abbc we can split it at a|bbc, giving us two valid words, for an answer of one.  We keep working backwards just as before, looking for all possible words we can construct at the start of our suffix, splitting it, and then putting the minimum changes required into our array.  Once we’re finished, the final answer is just the result in the first element of our array.  So we find that abbbcbbc can be made divisible with only a single character swap.

Finally, we want to incorporate the closeness requirements for our garbled characters – changes must have at least four non-garbled characters between them.  Now for each suffix we want to store five different values, giving the minimum changes required for this suffix when the first x characters are required to match perfectly, with x ranging between 0 and 4 inclusive.  Let’s look at this array for the string abbbcbba.

Final dynamic programming array.

Here N is marking divisions that are not possible at all.  Otherwise, we fill the array with the number of changes required.  Again, let’s start at the back and move towards the front of the string.  a, the last suffix, is a word itself, so we fill its column with zeros.  ba is not a word, but could be divisible if we change one of the characters, to either bc or a|a.  The bottom three rows require us to keep the first two characters unchanged, so those entries are marked as N.  Both of the top two are possible with only a single change.  bba is valid only if changed into bbc, which requires no more than two characters held unchanged, so we mark up the column appropriately.  We continue in the same manner, looking for potential word divisions and marking down the minimum changes required that allow the required distance between changes.  Then, just as with the simpler problems, we read off our final answer as the entry in the first column and the first row.  


def precalculate():
    #We're going to load the dictionary into a trie
    global tree
    global end
    tree = {}
    end = ''

    def add_branch(tree, word):
        for char in word:
            if not char in tree:
                tree[char] = {}
            tree = tree[char]
        tree[end] = True

    infile = open('./garbled_email_dictionary.txt')
    for line in infile.readlines():
        add_branch(tree, line.strip())
    infile.close()

def match_words(S, prefix, index, required):
    suffix = S[index:]

    #Walk down the tree to find the right branch
    branch = tree
    for char in prefix:
        branch = branch[char]

    if suffix == '':
        if end in branch:
            return 0
        else:
            return -1

    def check(best, new, inc=0):
        if new == -1:
            return best
        if best is None:
            best = new + inc
        else:
            best = min(best, new + inc)
        return best

    best = None

    char = suffix[0]
    if char in branch: #valid match
        if end in branch[char]: #end of a word
            best = check(best, mincosts[index + 1, max(0, required - 1)])
        best = check(best, match_words(S, prefix + char, index + 1, max(0, required - 1)))
    if required == 0:
        for this in branch:
            if this == end or this == char:
                continue
            if end in branch[this]:
                best = check(best, mincosts[index + 1, 4], 1)
            if best is None or best > 1: #Don't bother checking if best is already good enough
                best = check(best, match_words(S, prefix + this, index + 1, 4), 1)

    if best is None:
        return -1
    else:
        return best

def solve_case(S):
    global mincosts

    mincosts = np.zeros((len(S) + 1, 5), dtype=int)

    for index in range(len(S))[::-1]:
        for required in range(5):
            mincosts[index, required] = match_words(S, '', index, required)
    return mincosts[0,0]

One thought on “GCJ – Garbled Email

  1. Pingback: Google Code Jam – Summary | Reflections of Dusk

Leave a comment