Given a hypothetical plaintext, we apply an automatic sanity check based on an English dictionary. The dictionary is stored in a trie. A trie is a tree-like data structure in which each node represents a letter, so any path from the root down the tree spells out a potential word. If the word obtained by following a path from the root to a node v is in the dictionary, we mark v as a word-ending node. With this data structure it is possible to test whether a word is present in the dictionary in time linear in the length of the word.
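The trie described above can be sketched as follows. This is a minimal illustrative implementation (the class and method names are our own, not from the original work): each node maps a letter to a child node, and nodes that complete a dictionary word are flagged.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> child TrieNode
        self.is_word = False  # True if this node is a word-ending node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add a word to the dictionary, marking its final node."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Follow the path from the root; time linear in len(word)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

Note that `contains("ca")` returns False even after inserting "cat": the node for "ca" exists on the path but is not marked as word-ending.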
The potential plaintext is evaluated using dynamic programming. A word w is given the score |w|^2 if its length |w| > 2 and w is present in the dictionary; otherwise the score of the word is 0. Our aim is to partition the plaintext into disjoint words so that the sum of the scores of the words is maximized. This is easy to do with dynamic programming. Let S(i) be the maximum score for the first i letters of the plaintext, and let s(i, j) be the score of the word beginning at index i and ending at index j in the plaintext. Then
S(i) = max_{1 ≤ k ≤ i} { S(k − 1) + s(k, i) },   with S(0) = 0.
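The recurrence can be sketched directly in code. In this hypothetical version a plain set stands in for the trie, and the function name is our own; the scoring rule is the one above (|w|^2 for dictionary words longer than two letters, 0 otherwise):

```python
def best_segmentation_score(text, dictionary):
    """Maximum total score over all partitions of text into disjoint words.

    dictionary: a set of words (standing in for the trie for simplicity).
    """
    def s(word):
        # Score of a single candidate word: |w|^2 if |w| > 2 and in dictionary.
        return len(word) ** 2 if len(word) > 2 and word in dictionary else 0

    n = len(text)
    S = [0] * (n + 1)  # S[i] = best score for the first i letters; S[0] = 0
    for i in range(1, n + 1):
        # Try every word ending at position i (1-based), starting at k.
        S[i] = max(S[k - 1] + s(text[k - 1:i]) for k in range(1, i + 1))
    return S[n]
```

For example, `best_segmentation_score("thecat", {"the", "cat"})` yields 9 + 9 = 18, while nonsense text containing no dictionary words scores 0. This quadratic-time loop could be made faster by walking the trie backwards from position i, but the simple version suffices to show the recurrence.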
The reason for using a quadratic scoring function is that we wanted to reward longer words more than shorter ones: short real words tend to appear, en masse, in nonsense text as well. With a quadratic function and the selected threshold, the number of false hits was reasonably small. The hope was that the correct plaintext would be detected with this methodology, and this was eventually confirmed.