Problems with Matcher objects and the last line of a file

 
B Mayes
Ranch Hand
Posts: 47
Android Eclipse IDE Ubuntu
I am working on a program that parses files. Previously I was using a Scanner object in a loop to read each line, like this:



That's all well and good, and it's very easy to read and understand, but it's horribly slow for really large files. I assume BufferedReader's ready() method is going to be slow as well. Not too long ago I came across the Grep class on Sun's website, which uses a Matcher object to break the input into lines:

http://download.oracle.com/javase/1.4.2/docs/guide/nio/example/Grep.java

Specifically:




This method works great: it's incredibly efficient and can easily handle regular expressions -- or not, if I just invoke the Pattern.compile() method with the Pattern.LITERAL flag. I recently discovered a problem with this implementation, however: it won't read the last line of a file when the file doesn't end in a newline! The reason appears to be the regex defined in the linePattern object:



Ultimately I would like to stick with the Matcher object, but how do I get the find() method to return the final line to me, even if it doesn't end in a newline? Can I add in something to the object named linePattern so that it will recognize end of file as well as end of line?
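
One possible answer, offered as a sketch rather than a definitive fix: the snippets from the post did not survive, so the code below assumes linePattern is `.*\r?\n`, as in the linked Grep.java. Adding a second alternative, `.+\z`, lets find() also return a non-empty final line that has no terminator. The class and method names (`LineSplitter`, `lines`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineSplitter {
    // Assumed original pattern (as in the linked Grep.java): ".*\r?\n".
    // The added alternative ".+\z" matches a non-empty last line that
    // runs to the very end of the input without a line terminator.
    private static final Pattern LINE_PATTERN = Pattern.compile(".*\\r?\\n|.+\\z");

    // Returns each line found by the matcher, terminator included if present.
    public static List<String> lines(CharSequence input) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE_PATTERN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The last line has no trailing newline but is still returned.
        for (String line : lines("first\nsecond\r\nlast")) {
            System.out.println("[" + line.trim() + "]");
        }
    }
}
```

Using `.+` rather than `.*` in the second alternative is deliberate: it avoids an extra empty match at the end of input when the file does end in a newline.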
 
Paul Clapham
Sheriff
Posts: 22185
38
Eclipse IDE Firefox Browser MySQL Database
B Mayes wrote: I assume BufferedReader's ready() method is going to be slow as well.


Assuming things about potential performance is an error-prone strategy. It's possible that BufferedReader is buffered (well, we actually know it is) and that Scanner is not buffered (I have no idea). This would make the performance of the two classes very different.
 
B Mayes
My experience with BufferedReader has been less than stellar... I'll try it, I suppose, and see what happens, but ultimately I need regex matching as well. Thus far the Grep.java implementation is the most efficient thing I have found, so I would prefer to stick with that.
 
B Mayes
Wow...I absolutely have to eat my words. I take it all back -- BufferedReader is FAST! I guess my previous negative experiences must have been the result of poor coding on my part. Perhaps I didn't compile a Pattern and instead used the matches() method from the String class again and again...I don't really know. I have to admit that I haven't really used BufferedReader since I was an undergrad. So to be fair, I probably wasn't a very good coder at that time.

What I do know is that giant logs which once took me 65-70 seconds to parse are now done in about 33 seconds! Forget Sun's implementation -- I'm going with my own. Oddly enough, the new code (using BufferedReader) is much more straightforward and way easier to follow. I was not expecting this but am very pleasantly surprised.
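
The new code is not shown in the thread. As a rough sketch of the approach described (a readLine() loop with the Pattern compiled once, outside the loop), something along these lines would work. The names `LogScanner` and `matchingLines` and the sample pattern are hypothetical. Note that readLine() returns the final line whether or not it ends in a newline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogScanner {
    // Collects every line in which the regex is found somewhere.
    public static List<String> matchingLines(Reader source, String regex) {
        Pattern pattern = Pattern.compile(regex); // compiled once, not per line
        Matcher matcher = pattern.matcher("");    // one Matcher, reset for each line
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() strips the terminator and returns null at end of input.
            while ((line = reader.readLine()) != null) {
                if (matcher.reset(line).find()) {
                    hits.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The last line has no trailing newline and is still examined.
        String log = "INFO start\nERROR disk full\nERROR on last line";
        System.out.println(matchingLines(new StringReader(log), "ERROR"));
    }
}
```

For a real log file, the StringReader would be replaced by a file-backed Reader such as a FileReader.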

Thanks for setting me straight, Paul! I owe you a beer if we ever meet up on the street.
 
Paul Clapham
Well, there you go... they always say "Don't guess, test" when talking about performance, and in this case that turned out to be the right thing to do.
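
A minimal sketch of what "testing" could look like here: the same text is read once with Scanner and once with BufferedReader, and each pass is timed with System.nanoTime(). The class and method names are hypothetical, the input is synthetic, and a single pass like this ignores JIT warm-up, so treat the numbers as a rough indication only. A serious comparison would use the real log files and repeated runs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

public class ReadTiming {
    // Counts lines by looping over Scanner.hasNextLine()/nextLine().
    public static int countWithScanner(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(new StringReader(text))) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                count++;
            }
        }
        return count;
    }

    // Counts lines by looping over BufferedReader.readLine().
    public static int countWithBufferedReader(String text) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            while (reader.readLine() != null) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Build synthetic input; a real test would read the actual logs.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        String text = sb.toString();

        long t0 = System.nanoTime();
        int a = countWithScanner(text);
        long t1 = System.nanoTime();
        int b = countWithBufferedReader(text);
        long t2 = System.nanoTime();

        System.out.printf("Scanner:        %d lines in %d ms%n", a, (t1 - t0) / 1_000_000);
        System.out.printf("BufferedReader: %d lines in %d ms%n", b, (t2 - t1) / 1_000_000);
    }
}
```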
 
B Mayes
"Don't guess, test"

I like it!

Thanks again for forcing me to "prove you wrong"... ultimately that didn't work so well for me, but it certainly improved my program. I had my teammate rerun the same log he used to demonstrate the bug, and this time the program caught the error even though it was on the final line of the log. Hooray!

I switched several other instances of Scanner over to BufferedReader today and improved performance even further. A 32MB log that previously took roughly 28 seconds to parse now takes about 13. That is similar to my previous results on even larger logs: it cut parsing time in HALF (give or take). Let's not forget the added bonus that changing the code fixed the bug in my program. Results are even more impressive if you use the console option (to print output to STDOUT and skip the Swing GUI entirely). If you can't tell, I'm very excited about this!

Changes are safe and sound (committed to SVN). Just need to do some further regression testing and then I get to release a new version that fixes this as well as a few other bugs. Then I get to move on to other fun stuff, like changing the PHP code to display the "top contributors" and so forth...
 