
Suggestions on fastest way to parse a String?

 
Ron Ditch
Ranch Hand
Posts: 33
Hello...
Does anyone have any suggestions on the fastest (and hopefully most efficient) way to parse a string?
Let's say I have a string that is comma delimited, and I wanted to convert it to a Collection. Also, the elements in the string that are comma delimited are of unequal length.
For example - item1,items22,item333,item55555
I was thinking of using an array of characters, but I don't know the speed implication of for loops versus creating sub-strings using String.substring(int,int).
Any suggestions?
 
Ilja Preuss
author
Sheriff
Posts: 14112
Use java.util.StringTokenizer - it's optimized for exactly this type of parsing.
[ September 26, 2002: Message edited by: Ilja Preuss ]
 
Blake Minghelli
Ranch Hand
Posts: 331
Just a warning about StringTokenizer if you have never used it before...
The default behavior ignores empty "tokens".
For example: "token1,token2,,token3"
A StringTokenizer created on that string will return 3 tokens, not 4: the empty token between the two consecutive commas is skipped.
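Here is a small sketch that shows the behavior described above. The class and method names (TokenizerEmptyTokens, tokenize) are made up for illustration, and it uses current Java syntax (generics) rather than what was available in 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not from the original post.
class TokenizerEmptyTokens {

    // Collects every token StringTokenizer returns for a comma-delimited input.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, ",");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The empty field between the two adjacent commas is dropped.
        System.out.println(tokenize("token1,token2,,token3"));
        // prints [token1, token2, token3]
    }
}
```

If the position of each field matters (for example, column 3 must stay column 3), this silent skipping is the thing to watch for.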
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
If you really want the fastest parsing possible, you can probably improve on StringTokenizer a little bit, because StringTokenizer spends a little bit of time checking for multiple delimiters, and even checking to see if the set of delimiters has changed since the last time nextToken() was called. You can omit this for your situation, and thereby speed things up a bit, I imagine. But I doubt you'll see a big difference, so don't spend too much time on it unless you're sure performance is a real problem. I'd probably just store the input as a String, and use indexOf(',', startPos) to find delimiters, and substring(int, int) to create a String for each token. You could also store the input as a char[] array; I'm not sure if that will end up any faster or not. You'd have to try both ways and measure, I suppose.
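A rough sketch of the indexOf / substring approach described above might look like the following. The names IndexOfSplitter and split are invented for this example, it uses current Java syntax, and whether it actually beats StringTokenizer on your data is something you would have to measure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, written only to illustrate the indexOf/substring idea.
class IndexOfSplitter {

    // Splits input on a single delimiter character, keeping empty fields.
    static List<String> split(String input, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int pos;
        // Find each delimiter in turn, starting just past the previous one.
        while ((pos = input.indexOf(delimiter, start)) >= 0) {
            tokens.add(input.substring(start, pos));
            start = pos + 1;
        }
        // Whatever follows the last delimiter is the final token.
        tokens.add(input.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("item1,items22,item333,item55555", ','));
        // prints [item1, items22, item333, item55555]
    }
}
```

Unlike StringTokenizer, this version keeps empty fields, so "a,,b" yields three elements with an empty string in the middle.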
Now in terms of development speed (rather than execution speed), the easiest solution is probably
String[] tokens = inputStr.split(",");
Try it; you may well find it's already fast enough for you. (You need to be using SDK 1.4 though.) It also fixes the annoying "feature" of StringTokenizer which Blake mentioned.
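For completeness, here is a short sketch of split() feeding a Collection, since that was the original goal. SplitDemo and its method names are hypothetical. One thing worth knowing: split(",") keeps empty fields in the middle but drops trailing empty strings, and passing a negative limit keeps those too.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class illustrating String.split for comma-delimited input.
class SplitDemo {

    // Default split: interior empty fields are kept, trailing ones are dropped.
    static List<String> toList(String input) {
        return Arrays.asList(input.split(","));
    }

    // A negative limit tells split to keep trailing empty fields as well.
    static List<String> toListKeepTrailing(String input) {
        return Arrays.asList(input.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(toList("token1,token2,,token3").size());   // prints 4
        System.out.println(toList("a,b,,").size());                   // prints 2
        System.out.println(toListKeepTrailing("a,b,,").size());       // prints 4
    }
}
```

Note that Arrays.asList returns a fixed-size list backed by the array; wrap it in a new ArrayList if you need to add or remove elements.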
 
Ron Ditch
Ranch Hand
Posts: 33
Thanks Jim, that's what I was looking for.
 
Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
You should keep in mind that StringTokenizer was designed to parse Java programs. The token to split on was assumed to be a space. The reason we have the default behavior of StringTokenizer is that multiple spaces don't mean anything special in Java source.
 
Yarik Chinskiy
Greenhorn
Posts: 11
Hi,
What if I want to parse records of a file?
Wouldn't the StringTokenizer be a killer?
I want to monitor a log file and reformat the records for the output based on a pattern submitted by a user.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Tom's comment may be a bit misleading - it's possible to use StringTokenizer to parse a lot of things other than Java code. But it has a number of limitations - nowadays it's probably more powerful and flexible to learn how to parse using the classes in java.util.regex (at least, for anything more complicated than the split() method I showed above).
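As a very rough sketch of what parsing with java.util.regex could look like for the log-reformatting case: the record layout below ("date LEVEL message"), the output format, and the class name LogLineReformatter are all assumptions made up for this example, not anything from the posts above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example; the log layout here is an assumption, not a real format.
class LogLineReformatter {

    // Assumed record layout: "<yyyy-mm-dd> <LEVEL> <message>".
    // Compiled once and reused, since compiling a Pattern per line is wasteful.
    private static final Pattern RECORD =
            Pattern.compile("(\\d{4}-\\d{2}-\\d{2}) (\\w+) (.*)");

    // Rewrites a matching record as "[LEVEL] date: message";
    // lines that do not match the assumed layout are returned unchanged.
    static String reformat(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            return line;
        }
        return "[" + m.group(2) + "] " + m.group(1) + ": " + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(reformat("2002-09-26 WARN disk almost full"));
        // prints [WARN] 2002-09-26: disk almost full
    }
}
```

A user-supplied output pattern would need more work than this, but capture groups give you the named pieces of each record to rearrange.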
 
John Coffey
Greenhorn
Posts: 2
I have some sample code to test out "log" parsing. It looks like StringTokenizer isn't too good as far as performance is concerned. Using jdk 1.4.1, I got the following results:

Can anyone come up with a faster version? Is there a better IO class?
First a utility to create a big log file:

Now the Split code:

Now the StringTokenizer code:

Now the Pos code:
 