
Best performance using minimum memory

 
Edward Chen
Ranch Hand
Posts: 798
Here is an interview question I was once asked.

We have a HUGE plain txt dictionary file, like
Life ddddd
Live:eeeeee

Now we need to build a dictionary application: the user inputs a word, and we fetch its meaning back. Assumption: the file is HUGE (40 GB), so we can't load all the data into memory.

The question is: how do we achieve the best performance using the minimum amount of memory? With what algorithm, and what Collection object?

Any ideas?

Thanks.
 
fred rosenberger
lowercase baba
Bartender
Posts: 12542
a lot of times, they don't want a specific answer, but just want to see how you approach the problem. are you aware of the tradeoffs you are making? are you making bogus assumptions? can you defend whatever answer you give?

often these are more important than the exact algorithm you give.
 
Edward Chen
Ranch Hand
Posts: 798
Here is my thinking. Any comments are welcome.

Let me assume:
1. the dictionary file is sorted (of course).
2. each word is on its own line.

Here we really have only one choice: RandomAccessFile. So we build an index for that dictionary:
Dic['A'] = "100-200"; // words beginning with "A" are located from line 100 to line 200.
Dic['A']['A'] = "100-150"; // words beginning with "AA" are located from line 100 to line 150.
So we have a 26x26 array. The first time we need to scan the huge file, but afterwards we can serialize this array to a file.

When a user searches for a word, we just need its first two letters: look up the array, then use RandomAccessFile to read the lines in that range and match.
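A minimal sketch of that two-letter-prefix index in Java. The class and method names (`PrefixIndex`, `build`, `lookup`) are mine, and I store byte ranges rather than line numbers, since RandomAccessFile seeks by byte offset, not by line:

```java
import java.io.*;
import java.util.*;

// Sketch: a two-letter-prefix index over a sorted dictionary file.
// Instead of line numbers we store BYTE ranges, since RandomAccessFile
// seeks by byte offset. The index itself is tiny (at most 26*26
// entries) and can be serialized to disk after the first scan.
public class PrefixIndex implements Serializable {
    private final Map<String, long[]> ranges = new HashMap<>(); // prefix -> {start, end}

    // One sequential pass over the huge file: record the byte range
    // occupied by each two-letter prefix. Done once, then serialized.
    public static PrefixIndex build(File dict) throws IOException {
        PrefixIndex idx = new PrefixIndex();
        try (RandomAccessFile raf = new RandomAccessFile(dict, "r")) {
            long offset = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                final long start = offset;
                long next = raf.getFilePointer();
                if (line.length() >= 2) {
                    String prefix = line.substring(0, 2).toUpperCase();
                    long[] r = idx.ranges.computeIfAbsent(prefix, k -> new long[]{start, start});
                    r[1] = next;               // extend the range's end
                }
                offset = next;
            }
        }
        return idx;
    }

    // Look up a word: jump straight to its prefix's byte range and
    // scan only that small slice of the file.
    public String lookup(File dict, String word) throws IOException {
        if (word.length() < 2) return null;
        long[] r = ranges.get(word.substring(0, 2).toUpperCase());
        if (r == null) return null;
        try (RandomAccessFile raf = new RandomAccessFile(dict, "r")) {
            raf.seek(r[0]);
            String line;
            while (raf.getFilePointer() < r[1] && (line = raf.readLine()) != null) {
                // headword = the text before the first ' ' or ':' separator
                String head = line.split("[ :]", 2)[0];
                if (head.equalsIgnoreCase(word)) return line;
            }
        }
        return null;
    }
}
```

Memory use is just the small prefix map, and each lookup reads only one prefix's slice of the 40 GB file.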

///////////////////////////////////////
One problem remains: how do we scan the file in the first place?

Let me assume we know the total number of lines in this file.
We use multithreading: for example, we cut the whole file into n parts of 100 lines each (n * 100 = total_lines), then create n threads, each random-accessing its own part of the file.

Then we use an algorithm like binary search. For example:
1. a thread jumps to line 2*100; the word there is "Life".
2. "Life" may not be the first word beginning with "L", so we go back to line 200 - 100.
3. the word at line 100 is "Knife", so we go forward 100/2 lines.
4. repeat, repeat...
5. finally, we get the line number where the "L" words begin.
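The jump-back-and-forth search in steps 1-5 can actually be done without knowing the line count at all: a classic trick for sorted flat files is to binary-search on byte offsets, seek to the middle, discard the partial line we landed in, and compare the next full line's headword. A sketch, assuming the same "word meaning" / "word:meaning" format as above (`FileBisect`, `seekToPrefix`, and `findFirst` are my names, not a standard API):

```java
import java.io.*;

// Sketch: binary search on BYTE OFFSETS in a sorted dictionary file.
// No line count is needed: we seek to the middle, throw away the
// partial line we landed in, and compare the next full line's headword.
public class FileBisect {

    // Returns a byte offset at (or just before) the first line whose
    // headword is >= prefix, and leaves the file pointer there.
    static long seekToPrefix(RandomAccessFile raf, String prefix) throws IOException {
        long lo = 0, hi = raf.length();
        while (lo < hi) {
            long mid = (lo + hi) / 2;
            raf.seek(mid);
            raf.readLine();                 // discard the partial line
            String line = raf.readLine();   // first full line after mid
            if (line != null && headword(line).compareToIgnoreCase(prefix) < 0) {
                lo = mid + 1;               // answer lies after mid
            } else {
                hi = mid;                   // answer lies at or before mid
            }
        }
        raf.seek(lo);
        if (lo > 0) raf.readLine();         // align to a line start
        return raf.getFilePointer();
    }

    // Scan forward from the bisected offset; the file is sorted, so we
    // can stop as soon as we have passed where the prefix would be.
    static String findFirst(RandomAccessFile raf, String prefix) throws IOException {
        seekToPrefix(raf, prefix);
        String line;
        while ((line = raf.readLine()) != null) {
            String word = headword(line);
            if (word.regionMatches(true, 0, prefix, 0, prefix.length())) return line;
            if (word.compareToIgnoreCase(prefix) > 0) return null;
        }
        return null;
    }

    // Headword = the text before the first ' ' or ':' separator.
    static String headword(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ' || c == ':') return line.substring(0, i);
        }
        return line;
    }
}
```

With this, each probe costs one seek plus two short reads, so finding any word in the 40 GB file takes O(log fileSize) probes and constant memory, and you arguably don't need the index or the n threads at all.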

Just a thought. What's your idea? Thanks.
 