• Post Reply Bookmark Topic Watch Topic
  • New Topic

Searching a large text file...  RSS feed

 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi all,

This was a question asked in a job interview.

What is the best way to search the number of occurances of a string in a very large text file (>1 gb) assuming that we must use a Java program.

I am sure that the usual String.indexOf() will be very inefficient. Considering the size of file, I feel it will be better to read the file in chunks(lines) and search. Please let me know other options.

John
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally make a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.

If you don't have access to JDK 5 but you can use JDK 1.4, then you can use java.util.regex and FileChannel yourself, but it will be more complex. Reading the file in chunks (e.g. into a CharBuffer) is a good idea, but you need to consider that it's possible that the target string may be split between reads. If you know the maximum size of the target string (and if it's not a prohibitively large number), you can handle this by copying length-1 chars from the end of the last read to the beginning of the buffer for the next read. You'll have to be careful to get this right though.

Simply reading the file line by line and searching each line individually (preferably using java.util.regex) may be acceptable, if you're certain that the target string does not contain any line separators. It will still most likely be slower than the other techniques suggested so far, because it will take more time to transfer each and every char into a char[] and then into String object. The nice thing about NIO classes is they allow most of that stuff to happen outside the JVM, and thanks to Boyer-Moore you don't really need to look at all the charcters anyway. But reading line by line does have the advantage of being a common idiom that most people understand. If you can't use a Scanner and findWithinHorizon(), then reading line by line will probably be simpler than anything else I've suggested. And simplicity is almost always a virtue in programming.
[ June 19, 2006: Message edited by: Jim Yingst ]
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jim,

Thanks a lot for the hints. I have tried by using FileChannel with Scanner. findWithinHorizon(pattern, horizon) method. However I could not get it right yet. I am also not sure of what the horizon parameter means. Its not seen documented in the Javadocs properly.

Could you please post a sampler code of the same?

Thanks
John
 
Joe Ess
Bartender
Posts: 9406
12
Linux Mac OS X Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The Pattern is a "regular expression", which is a series of characters that describes a set of strings. There's an introduction to using Pattern here and a more in-depth tutorial in the Java Tutorial
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thank you Joe. I am aware of regex patterns. My problem was trying to get the program run using the FileChannel and findWithinHorizon methods. I could not find any sample code on the usage of these APIs either.

Please see my sample code below. Not sure if this is the best way to do it.



John
[ June 29, 2006: Message edited by: John Calabasas ]
 
Joe Ess
Bartender
Posts: 9406
12
Linux Mac OS X Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I understand the horizion parameter to be a limit set on the amount of data to search:

public String findWithinHorizon(Pattern pattern,int horizon)
Attempts to find the next occurrence of the specified pattern.
. . .
A scanner will never search more than horizon code points beyond its current position. . .
If horizon is 0, then the horizon is ignored and this method continues to search through the input looking for the specified pattern without bound.

java.util.Scanner
It looks like your code should work. Does it?
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Joe,

Thanks for the input. Yes my code works. But its slower than using String.indexOf() in all cases I have tested. Can you please help me optimize this code?

John
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hmmm. Well, since you're using the same regular expression repeatedly, it would make sense to compile it once into a Pattern (outside the while loop), and search with the Pattern instead. Otherwise you're effectively recompiling the Pattern (internally) every time you search. Additionally, it slows things down to also count the total number of words in the file, creating a String object for each and every one. That wasn't in the original problem statement, and it's something that changes the performance characteristics considerably. If it's not necessary, drop it, and things should speed up considerably:
Lastly, I would note that exception handling with e.getMessage() is unreliable in my experience. Some expressions have no message; instead the name of the exception class itself may be all the information you can get. Additionally the stack trace is often extremely useful information. I recommend using e.printStackTrace() if you want to print an exception, since this will also print the name and message of the exception.
[ June 29, 2006: Message edited by: Jim Yingst ]
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks Jim for your suggestions. I tried the same thing and it seems to be much faster now. Exception handling was to be changed anyway.

I could not find much advantage for using the NIO though...
[ June 30, 2006: Message edited by: John Calabasas ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
NIO is being used internally within Scanner here:

So the effect is exactly as if you'd called new FileInputStream and getChannel() yourself.
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jim,

I am very thankful to your help on this.

I think this program still loads the whole file into memory. I am facing OOM exceptions when trying with large files.

Is there anyway I can make this program work on files of any size?
Can I control the extend of file loaded into memory at any time?
Does using MemoryMappedBuffer help?

John.
[ July 01, 2006: Message edited by: John Calabasas ]
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
A stream with Boyer-Moore sounds ideal. That algorithm goes much faster on larger search strings and probably slightly faster over fewer larger input than running more times on smaller inputs.

For your current method, you could read the file into memory in parts. Make sure to overlap the parts by the size of the needle (the string you're looking for). I'd guess RandomAccessFile would be good at that, but compare it to a read-forward-only stream and manage the overlaps yourself.
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi all,

I tried with doing a mini benchmarking exercise for comparing the performance of a searching a 1 gb text file using the method described here and also using String.indexOf(). Trial was done for about 15 different Strings as search param.


Using Regex and NIO:- 120 to 210 seconds
Using indexOf():- 70 to 100 seconds


As shown above, the results are not very encouraging and I find that in many cases, indexOf() performed atleast 25% faster. Also another thing I noticed with the approach here is that the time taken varies widely depending on the String that is used to search, whereas it was more or less consistent when using indexOf() method.

Hence I have doubt if using Regex patterns is the best way to search for plain strings in a large input. Please let me know your comments.

-John.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Interesting. I'd like to see the exact code you're using at this point. Is it the same code that was throwing OutOfMemoryError, or has it been further modified?

Boyer-Moore definitely will be more advantageous the longer the target string is, as Stan said. It sounds like the strings you're looking for are short enough that it doesn't offer a great advantage. Also I guess Scanner is such a generalized class, it's designed to be able to do so many different things, it's not necessarily optimized to be as fast as possible in all things. Oh well.

If you want to continue optimizing this stuff, it would probably be worthwhile to start using a profiler on the code to see what's really taking up most of the time. Guesswork has taken us this far, but real data would be preferable.
 
John Calabasas
Greenhorn
Posts: 17
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jim,

I have tried with the -server option in JVM which gave me a good 25% performance boost and also using Java 1.6 (Mustang) Beta which gave me another 10%. But still I find that indexOf() is faster with these options.

Here is the code. It is essentially the same that you suggested earlier. To avoid OOM, I am using a higher heap size (2gb) for now.


[ July 02, 2006: Message edited by: John Calabasas ]
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!