Win a copy of The Way of the Web Tester: A Beginner's Guide to Automating Tests this week in the Testing forum!

# Parsing strings filled with binary

dave hopkins
Greenhorn
Posts: 25
I was looking at a Puzzle from a website called BrainBashers this morning. Here's the Q

During the recent BrainBashers cipher convention, a binary code contest took place. The contest consisted of a binary code transmission where the spaces between the letters were missing and there was no punctuation. Each letter of the alphabet was translated into its binary equivalent based on its position in the alphabet, a=1, b=10, c=11, d=100, e=101, f=110, g=111, h=1000, i=1001, j=1010, k=1011, l=1100, m=1101, n=1110, o=1111, p=10000, q=10001, r=10010, s=10011, t=10100, u=10101, v=10110, w=10111, x=11000, y=11001, z=11010. Can you find 10 countries?

101011000011110
101111110010110011
100001111110011110100
110100101111011101
11100010011100101
1101111110010111111111111
11010111011010011
10100110011011111110
1000011110111011
1101111111101111111110010011

I thought it would be interesting to write a program to give me a list of all the possible answers but I found it to be extremely challenging.

Here's my solution, it does work but I fear it's inefficient as for example the last country takes about 5seconds to finish. Any feedback appreciated!

The code for Letters is here to keep this post tidy.

http://pastie.org/8036886

Ryan McGuire
Ranch Hand
Posts: 1085
4
• 1
I think I'd use a recursive method to solve this:

dave hopkins
Greenhorn
Posts: 25
Ryan McGuire wrote:I think I'd use a recursive method to solve this:

Wow! I just ran yours and it's an order of magnitude faster than my solution. Can you explain why?

Martin Vajsar
Sheriff
Posts: 3752
62
Having a list of possible countries at hand, my solution would be to encode all country names into the binary code and then look for matches in the given encoded strings. But it would probably count as cheating.

Ryan McGuire
Ranch Hand
Posts: 1085
4
Martin Vajsar wrote:Having a list of possible countries at hand, my solution would be to encode all country names into the binary code and then look for matches in the given encoded strings. But it would probably count as cheating.

Good one. Ok, let's pretend that there isn't a List<> but rather a bool method that takes a string and returns whether or not the string is a country name. Yeah, that's it.

Piet Souris
Rancher
Posts: 1403
29
Why would you think these binary strings represent countrynames written in English?
Would you find "Sverige" or "Nederland"?

Nevertheless: disappointing puzzle. Wrote myself a dream of a recursive solution, only to find the last string alone having 680+
translations (possibly not all unique; I used an ArrayList in stead of a Hashset). I need a list of countries to see if there's a country in it, but in what language?

Mike Simmons
Ranch Hand
Posts: 3090
14
Piet Souris wrote:I need a list of countries to see if there's a country in it, but in what language?

Perhaps a reasonable default is to use the same language the problem is specified in.

dave hopkins
Greenhorn
Posts: 25
Here's the model answer list for those who wanted to check. Middle column being the number of potential matches.

http://pastie.org/8056214

Piet Souris
Rancher
Posts: 1403
29
Got the same number of words, to my relief.

I tried three ways for storing the solutions, to see what times were needed. The solution method was equal in all three cases:

HashMap<String, ArrayList<String>>: build time 1 second, solution in 1.2 seconds
HashMap<String, HashSet<String>>: build time 2 s, solution 2.1 s
HashMap<String, TreeSet<String>>: build time 2 s, solution 2.2 s

(all done in NetBeans).

I found it surprising that the build times differ, and to find that HashSet and TreeSet hardly differ in speed.
But maybe that's only by the fact that there are only 2 binary strings that give rise to 300.000+ words.

Indeed, you do need a list of country names to solve this puzzle, but having such a list at your disposal: I don't think you can beat Martin's solution.
(well, assuming Mike's default language, and that the possibly non native English speaking author of the puzzle got every spelling correct)

Nevertheless, it reminds me of how fast computers are, nowadays: generating nearly 700.000 words, in about two seconds,
at least on my 7 year old E6400 processor. Truly amazing!

Ryan McGuire
Ranch Hand
Posts: 1085
4
The next step is to try to identify heuristics that might make the search terminate quicker.

For instance, if we assume that the letter encodings are tried in order (A, B, C, ...) the last string (1101111111101111111110010011) would first find "ABAAAAAAABAAAAAAAADDAA" as a possible country name. Next would be "ABAAAAAAABAAAAAAAADDC". Would it be any more efficient to loop through the letters in reverse order (Z, Y, X, ...). In that case, the first word to try would be "MONOODS". One last option might be to use letter frequency, either for all of the problem space language (English or otherwise) or for some possibly non-authoritative list of countries (E, T, A, O, I, N, S, H, R, ...).

(My code above loops through letterCodes.keys(), so it's essentially using a random order of the letters.)

For any of those strategies, we could identify a country name that makes it inefficient. If you start search with "AAAAA", then it will take a while to find ZIMBABWE. Conversely, if you start with "Z" and work backwards, then it will take a while to find ALBANIA. The trick is to find the strategy that is most likely to find the most country names the fastest.

Any thoughts on either this or any other optimization?

Mike Simmons
Ranch Hand
Posts: 3090
14
Well, I think you guys were too quick to add rules precluding Martin's solution, which doesn't seem like cheating at all to me. But I think there may be faster solutions possible, given a complete list of possible country names. I would be inclined to store the binary-coded names in a binary trie, so that each digit is a node, and countries with the same starting letters could share the same nodes, until they diverge. This would allow you to search a pattern for matches against all possible country names, simultaneously, rather than searching for each country name separately. And it allows you to terminate a search much earlier, if the sequence of chars so far cannot match any possible country name.

Ah, after a bit of research, what I'm thinking of seems to be similar to the Aho-Corasick string matching algorithm. Modified for a binary alphabet, which should allow a few nice optimizations.

Steve Fahlbusch
Bartender
Posts: 605
7
well for a first optimization i can see is the letter p.

This pattern is unique over all countries.

Then i would recurse right to left as there are no letters that begin with a zero.

Also there are many two options (i.e.: has to be this letter or another) that can be exercised to reduce the problem set.

Myke Enriq
Ranch Hand
Posts: 115
Get a list of all the country names in English , translate it into that binary encoding , then parse the input checking for any occurrence in the list.

This is probably the best architecture for a solution because, no matter what other solution you can come up with:
- at some point you need to compare a string (chars in the input) with the name of a country to see if they match , thus you need to have a list of all countries
- encoding a char into binary has a lower or equal cost than encoding a binary in a char
- whatever algorithm you chose for a solution , it will iterate on the input so it has at least the complexity O(n) , n = number of digits in the input
- this algorithm has O(n)

Anyhow , the real question then is how given an input String find the first occurrence of a certain sub string in it.
This is where you need to optimize. Looks like O(n) to me , but how fast can you make it go.

Martin Vajsar
Sheriff
Posts: 3752
62
Myke Enriq wrote:- at some point you need to compare a string (chars in the input) with the name of a country to see if they match , thus you need to have a list of all countries

The puzzle can be reformulated in a way which precludes this solution - see Ryan's note here.