best performance using minimum memory

 
Ranch Hand
Posts: 798
Here is an interview question I was once asked.

We have a HUGE plain-text dictionary file, like
Life: ddddd
Live: eeeeee

Now we need to build a dictionary application: search for the word a user inputs and fetch its meaning back to the user. Assumption: the file is HUGE (40 GB), so we can't load all the data into memory.

The question is: how do we achieve the best performance with the minimum memory footprint? Which algorithm and which Collection class would you use?

Any ideas?

Thanks.
 
lowercase baba
Posts: 13089
67
Chrome Java Linux
a lot of times, they don't want a specific answer; they just want to see how you approach the problem. are you aware of the tradeoffs you're making? are you making bogus assumptions? can you defend whatever answer you give?

often these are more important than the exact algorithm you give.
 
Edward Chen
Ranch Hand
Posts: 798
Here is my thinking. Any comments are welcome.

Let me assume:
1. the dictionary file is sorted. (of course)
2. each word is on its own line.

Here we really only have one choice: RandomAccessFile. So we build an index for that dictionary:
Dic['A'] = "100-200"; // words beginning with "A" are located from line 100 to line 200.
Dic['A']['A'] = "100-150"; // words beginning with "AA" are located from line 100 to line 150.
So we have a 26x26 array. The first time we need to scan the huge file to build it, but after that we can serialize the array to a file and just reload it.

When a user searches for a word, we take its first two letters, look up the range in the array, then use RandomAccessFile to jump there and readLine() until we match.
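That two-letter index could look something like this. A minimal sketch, with two assumptions of my own: entries are "word&lt;TAB&gt;meaning" lines, and the index stores byte offsets rather than line numbers, since RandomAccessFile.seek() works on byte offsets and that avoids counting lines on every lookup:

```java
import java.io.*;
import java.util.*;

// Sketch of the two-letter prefix index described above. Maps each
// two-letter prefix to the byte-offset range of its entries, so a lookup
// can seek() straight to the right slice of the huge file.
// Assumes a sorted file, one "word<TAB>meaning" entry per line.
public class PrefixIndex {
    // prefix -> {startOffset, endOffset (exclusive)}
    private final Map<String, long[]> index = new HashMap<>();

    public void build(File dict) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(dict, "r")) {
            long offset = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                long next = raf.getFilePointer();   // start of the next line
                if (line.length() >= 2) {
                    String prefix = line.substring(0, 2).toLowerCase();
                    long[] range = index.get(prefix);
                    if (range == null) {
                        index.put(prefix, new long[]{offset, next});
                    } else {
                        range[1] = next;            // extend the range (file is sorted)
                    }
                }
                offset = next;
            }
        }
    }

    // Scan only the slice of the file covered by the word's two-letter prefix.
    public String lookup(File dict, String word) throws IOException {
        long[] range = index.get(word.substring(0, 2).toLowerCase());
        if (range == null) return null;
        try (RandomAccessFile raf = new RandomAccessFile(dict, "r")) {
            raf.seek(range[0]);
            String line;
            while (raf.getFilePointer() < range[1]
                    && (line = raf.readLine()) != null) {
                if (line.startsWith(word + "\t")) {
                    return line.substring(word.length() + 1);
                }
            }
        }
        return null;
    }
}
```

After the first scan, the Map itself is small enough to serialize and reload, so later runs never pay the full-file scan again.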

///////////////////////////////////////
One problem is left: how do we scan the file in the first place?

Let me assume we know the total number of lines in the file.
We use multithreading: cut the whole file into n parts of 100 lines each (n*100 = total_lines), then create n threads, each with its own RandomAccessFile.

Then each thread uses a binary-search-like algorithm. For example, to find where the "L" words begin:
1. the thread jumps to line 200 (2*100); the word there is "Life".
2. "Life" may not be the first "L" word, so we go back 100 lines, to line 100.
3. the word at line 100 is "Knife", so we go forward 100/2 = 50 lines.
4. repeat, repeat...
5. finally, we get the line number where the "L" words begin.
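That binary-search idea can also be run over byte offsets instead of line numbers, so we never need the total line count, only the file length: seek to the middle, skip the partial line we landed in, compare the next full word, and halve the range. A minimal sketch, assuming a file sorted by String.compareTo with one "word&lt;TAB&gt;meaning" entry per line (the tab-separated format is my assumption, not from the original question):

```java
import java.io.*;

// Binary search over a sorted dictionary file by byte offset.
// Invariants: lo always sits at a line start, and the target line (if it
// exists) starts somewhere in [lo, hi].
public class DictBinarySearch {

    public static String search(File dict, String word) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(dict, "r")) {
            long lo = 0, hi = raf.length();
            while (lo < hi) {
                long mid = (lo + hi) / 2;
                raf.seek(mid);
                if (mid > 0) raf.readLine();   // skip the partial line we landed in
                String line = raf.readLine();  // first full line after mid
                if (line == null) {            // ran off the end of the file
                    hi = mid;
                    continue;
                }
                String[] parts = line.split("\t", 2);
                int cmp = parts[0].compareTo(word);
                if (cmp == 0) return parts.length > 1 ? parts[1] : "";
                if (cmp < 0) lo = raf.getFilePointer(); // target lies after this line
                else hi = mid;                          // target lies at or before mid
            }
            // the match, if any, can only be the line starting exactly at lo
            raf.seek(lo);
            String line = raf.readLine();
            if (line != null) {
                String[] parts = line.split("\t", 2);
                if (parts[0].equals(word)) return parts.length > 1 ? parts[1] : "";
            }
        }
        return null;
    }
}
```

Each lookup touches only O(log n) lines with a few KB of memory, so this may even make the 26x26 index unnecessary, though the index is still a reasonable first cut.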

Just a thought. What do you think? Thanks.
 