String.split Vs StringTokenizer

 
Karthik Veeramani
Ranch Hand
Posts: 132
Any idea whether JDK 1.4's String.split() method is faster than the traditional
StringTokenizer? I'm skeptical about using the split() and replaceAll()
methods, as I have a feeling they might compile the regular expression every time, which is an expensive operation.

Please advise.
 
sander hautvast
Ranch Hand
Posts: 71
I guess you're right about the compiling.
The source for String (JDK 1.4.1) says:

    public String[] split(String regex, int limit) {
        return Pattern.compile(regex).split(this, limit);
    }
 
Karthik Veeramani
Ranch Hand
Posts: 132
StringTokenizer, from what I've heard, is very inefficient. I want to know how it compares with the split() method... Even if split() compiles the regex every time, I'm OK with that if it's faster than the tokenizer.
 
Blake Minghelli
Ranch Hand
Posts: 331
Why don't you try some performance tests on the 2 options?
Personally, I hate that with StringTokenizer, if you have 2 delimiters back-to-back (e.g. "1,2,,3") then the empty element gets completely ignored. I believe String.split() does not have that problem, but I've never actually used it.
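
A quick way to see the difference, as a minimal sketch (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyElementDemo {
    // Collects the tokens StringTokenizer reports for a comma-separated line.
    static List<String> withTokenizer(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Splits the same line with String.split(); interior empty fields are kept.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split(","));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("1,2,,3")); // [1, 2, 3]
        System.out.println(withSplit("1,2,,3"));     // [1, 2, , 3]
    }
}
```

One caveat: called without a limit, split() still drops trailing empty strings, so "1,2,," comes back as two elements unless a limit is passed.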
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13074
What is this "from what I've heard" stuff? Where do people hear things like this, and why do you believe it?

Look at the source code for StringTokenizer - it looks pretty simple to me, given the flexibility it provides.

Seems to me that if you REALLY want to know which is faster, you could write a little test program using data similar to your usual data and set of separators, and run it. Be sure to do some "warmup" loops first so that the JIT has had time to optimize.
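
A rough sketch of such a test program follows. The sample line, the iteration counts and all the names are assumptions for illustration; substitute data that looks like your own. It is a crude timing loop, not a rigorous benchmark, so treat the numbers as indicative only.

```java
import java.util.StringTokenizer;
import java.util.function.ToIntFunction;
import java.util.regex.Pattern;

class SplitTiming {
    // Compiled once up front, unlike String.split(), which is handed the regex as a string.
    private static final Pattern COMMA = Pattern.compile(",");

    static int countWithTokenizer(String line) {
        StringTokenizer st = new StringTokenizer(line, ",");
        int n = 0;
        while (st.hasMoreTokens()) {
            st.nextToken();
            n++;
        }
        return n;
    }

    static int countWithSplit(String line) {
        return line.split(",").length;
    }

    static int countWithPattern(String line) {
        return COMMA.split(line).length;
    }

    // Runs one approach `iterations` times and returns the elapsed nanoseconds.
    static long time(ToIntFunction<String> approach, String line, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += approach.applyAsInt(line);
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            throw new IllegalStateException("unreachable; keeps the results in use");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String line = "alpha,beta,gamma,delta,epsilon"; // made-up sample data
        int warmup = 100_000;
        int measured = 1_000_000;

        // Warm-up passes: results are discarded, they only give the JIT time to optimize.
        time(SplitTiming::countWithTokenizer, line, warmup);
        time(SplitTiming::countWithSplit, line, warmup);
        time(SplitTiming::countWithPattern, line, warmup);

        System.out.println("StringTokenizer: " + time(SplitTiming::countWithTokenizer, line, measured) / 1_000_000 + " ms");
        System.out.println("String.split:    " + time(SplitTiming::countWithSplit, line, measured) / 1_000_000 + " ms");
        System.out.println("Pattern.split:   " + time(SplitTiming::countWithPattern, line, measured) / 1_000_000 + " ms");
    }
}
```

The results will vary with the JDK version, the delimiter and the data, which is exactly why it is worth running on your own input.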
Bill
 
Stefan Wagner
Ranch Hand
Posts: 1923
Blake:

    public StringTokenizer(String str, String delim, boolean returnDelims)

Constructs a string tokenizer for the specified string. All characters in the delim argument are the delimiters for separating tokens.

If the returnDelims flag is true, then the delimiter characters are also returned as tokens. ...

src: javadocs.
I still don't know why the docs discourage usage of StringTokenizer.

(well - I could google, and so I will do...)
[ May 27, 2004: Message edited by: Stefan Wagner ]
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13074
I'm looking at the JavaDocs for java.util.StringTokenizer right now and I don't see anything discouraging the use.
Bill
 
Tim West
Ranch Hand
Posts: 539
(Gah, I just realised my entire post is redundant. Still, I'll leave it here)

I think Stefan's referring to this, from the StringTokenizer JavaDoc:


StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.


I have no idea why really, aside from redundancy - I don't think StringTokenizer does anything String.split() doesn't do...


--Tim
[ May 27, 2004: Message edited by: Tim West ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The split() method can do anything a StringTokenizer can, and more. No need for both tools really, so you might as well use the more powerful one. There are a couple other reasons to discourage StringTokenizer. One is that StringTokenizer is not very good for detecting empty fields, e.g. to interpret

    one|two||four

as

    { "one", "two", "", "four" }

With split() this is easy. With StringTokenizer it's possible, thanks to the returnDelims parameter (as pointed out by Stefan) - but it's still a bit difficult. You need several more lines of logic to say that two successive delims translate into an empty field. E.g.:
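
A minimal sketch of what that extra logic can look like, with the split() version alongside for comparison. The class and method names are invented for illustration, and the limit of 10 assumes a record with ten expected fields:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFieldDemo {
    // StringTokenizer version: delimiters are returned as tokens, and two
    // delimiters in a row (or one at either end) are turned into an empty field.
    static List<String> fieldsWithTokenizer(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "|", true);
        boolean lastWasDelim = true; // so a leading delimiter yields an empty first field
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals("|")) {
                if (lastWasDelim) {
                    fields.add("");
                }
                lastWasDelim = true;
            } else {
                fields.add(token);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            fields.add(""); // a trailing delimiter ends with an empty field
        }
        return fields;
    }

    // split() version: the limit keeps trailing empty fields, up to `expected` in total.
    static List<String> fieldsWithSplit(String line, int expected) {
        return Arrays.asList(line.split("\\|", expected));
    }

    public static void main(String[] args) {
        System.out.println(fieldsWithTokenizer("one|two||four")); // [one, two, , four]
        System.out.println(fieldsWithSplit("one|two||four", 10)); // [one, two, , four]
        // Nine delimiters, so ten fields, the last six of them empty.
        System.out.println(fieldsWithSplit("one|two||four||||||", 10).size()); // 10
    }
}
```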


The split() method seems a lot simpler, to me. Though it does force users to learn about regex escape sequences. And note that I was able to specify that ten fields were expected, total, so the tenth empty field was reported as "" rather than a null - which can be convenient. The StringTokenizer code would require some extra logic if you need to try to access the tenth field.

Another common problem people have is that they want to use a delimiter of more than one character. E.g. they might see something like

    foo and bar and baz

and then try to use a StringTokenizer with delimiter string " and ". Except this doesn't work, because " and " means that space or a or n or d will be considered a delimiter, and the results will be:

    "foo", "b", "r", "b", "z"

rather than the intended

    "foo", "bar", "baz"

We can say this is the user's fault for failing to read the documentation for StringTokenizer before using it. But still, the way " and " is interpreted as a delimiter string is counterintuitive for most of us. And it would be nice if there were a way to handle multi-character words as delimiters. Again, the split() method handles this sort of thing easily.
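
The difference is easy to reproduce. A minimal sketch (class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class WordDelimiterDemo {
    // Every character of " and " (space, a, n, d) is treated as a separate delimiter.
    static List<String> withTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, " and ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // The whole string " and " is treated as one delimiter.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.split(" and "));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("foo and bar and baz")); // [foo, b, r, b, z]
        System.out.println(withSplit("foo and bar and baz"));     // [foo, bar, baz]
    }
}
```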

For what it's worth, JDK 1.5 also offers the Scanner class, which offers the same basic functionality with a few more improvements. A Scanner makes it very easy to read from a file or other IO stream, and its API does not force you to load all the results into memory at once (which can be a problem if you're reading from a really big file). Plus it adds some methods giving you access to any groups matched in the regex you used as a delimiter, which gives you many more flexible options in text processing. For those who lament how many lines of code it takes them to process a simple text file in Java (as opposed to, say, Perl) - Scanner does a nice job of simplifying things.
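
As a small sketch of the Scanner approach: the example below reads from a String to stay self-contained, but Scanner also has constructors for a File or an InputStream. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerFields {
    // Reads tokens one at a time; nothing forces the whole input into memory,
    // the list here exists only so the result can be printed.
    static List<String> fields(String text, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text).useDelimiter(delimiterRegex)) {
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("foo and bar and baz", " and ")); // [foo, bar, baz]
    }
}
```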

BTW, for those of you familiar with the new for loop: check out this RFE. Basically, this enhancement would allow us to write

rather than

Iterable is the new interface that allows us to use the new for loop syntax with a given construct. Yeah, it's a minor point. But what was the point of making Scanner implement Iterator if it isn't going to be Iterable? Seems like an oversight; easy to fix at this point. Please vote for this bug if you agree. Assuming you haven't already used your 3 votes on more important things.
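
To make the comparison concrete, here is a sketch that assumes the RFE simply asks for Scanner to implement Iterable<String>. The loop one would like to write appears only as a comment, since it does not compile as things stand; the `once` adapter is a hypothetical helper, not part of the JDK.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Scanner;

class ScannerLoop {
    // Hypothetical adapter: exposes an Iterator as a one-shot Iterable.
    static <T> Iterable<T> once(final Iterator<T> iterator) {
        return new Iterable<T>() {
            @Override
            public Iterator<T> iterator() {
                return iterator;
            }
        };
    }

    // What one would like to write if Scanner were Iterable:
    //     for (String word : scanner) { ... }
    // Without that, the explicit loop is needed:
    static List<String> explicitLoop(String text) {
        List<String> words = new ArrayList<>();
        Scanner scanner = new Scanner(text);
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        scanner.close();
        return words;
    }

    // The new for loop works once the Scanner is wrapped in the adapter.
    static List<String> adaptedLoop(String text) {
        List<String> words = new ArrayList<>();
        Scanner scanner = new Scanner(text);
        for (String word : once(scanner)) {
            words.add(word);
        }
        scanner.close();
        return words;
    }

    public static void main(String[] args) {
        System.out.println(explicitLoop("foo bar baz")); // [foo, bar, baz]
        System.out.println(adaptedLoop("foo bar baz"));  // [foo, bar, baz]
    }
}
```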
 
Stefan Wagner
Ranch Hand
Posts: 1923
Thanks, Jim, for that deep information.

Of course I would say 'user fault'.
And of course I made the same mistake, when I was new to StringTokenizer.

But doesn't this build the community? You're burned from the same fire, and may show your injuries.
Well, of course 'split' has its own fire, since a newbie wouldn't read 8 pages of regex syntax when trying to understand split("\\|") - but would guess that it splits around '|' and '\'.
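
One way around the escaping, as a sketch: Pattern.quote() (available from JDK 1.5) turns any string into a literal pattern, so no regex syntax needs to be learned for plain delimiters. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class LiteralSplit {
    // Splits on the delimiter taken literally, whatever regex metacharacters it contains.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String[] parts = splitLiteral("one|two|three", "|");
        System.out.println(parts.length); // 3
        System.out.println(parts[1]);     // two
    }
}
```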

Compactness seems to be a point, but Sun could decide to give StringTokenizer a 'toArray' or 'splitAll' method too, which returns an array of Strings.
OK - I agree in advance: could, but wouldn't.

The 'ST.nextElement' method looks very suspicious - hmmm.

In C there is a similar function, 'strtok', which might be a kind of ancestor of StringTokenizer.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13074
Since String.split() builds a new Pattern every time, it is bound to be slower than StringTokenizer. If you have a lot of Strings to operate on, building the Pattern once and using the Pattern's split() method would be the way to go for maximum speed.
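
A minimal sketch of that approach, assuming a '|' delimiter (the class and method names are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once and reused for every line.
    private static final Pattern PIPE = Pattern.compile("\\|");

    static String[] fields(String line) {
        return PIPE.split(line);
    }

    // Splits many lines with the same Pattern and counts the fields found.
    static int totalFields(List<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += fields(line).length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("one|two||four", "a|b|c");
        System.out.println(totalFields(lines)); // 7
    }
}
```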
Bill
 
ashraf karim
Greenhorn
Posts: 2
Since StringTokenizer does not report empty tokens, it is sometimes the more convenient choice.
I was trying to parse a string like " This is test " and needed only the words.
StringTokenizer returns only the words.
But String.split("\\s+") still returns an empty token at the start.
Any comments?
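
One common workaround is to trim the string before splitting. A minimal sketch (the class and method names are made up for illustration):

```java
class WordSplit {
    // Trims first so that leading whitespace cannot produce an empty first token.
    static String[] words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // otherwise split() would return one empty string
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(" This is test ".split("\\s+").length); // 4, the first element is ""
        System.out.println(words(" This is test ").length);        // 3
    }
}
```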
 
Henry Wong
author
Marshal
Pie
Posts: 21506
ashraf karim wrote:any comments?


Thanks for the info.... but you do know that this topic is over 5 years old, right?

Henry
 
Jill Iyer
Greenhorn
Posts: 8
How can we specify multiple delimiters???
 
Darryl Burke
Bartender
Posts: 5148
 
Jan Cumps
Bartender
Posts: 2602
Henry Wong wrote:
Thanks for the info.... but you do know that this topic is over 5 years old, right?
...
Henry
Six years old now.
 
Henry Wong
author
Marshal
Pie
Posts: 21506
Jill Iyer wrote:How can we specify multiple delimiters???


For StringTokenizer, there is a constructor, with a parameter, that allows you to specify possible delimiter characters.

For String.split(), it takes a regular expression -- which can be used to define everything from the very simplest of patterns to the ridiculously complex. A list of possible delimiter characters falls under the simple category.
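
A minimal sketch of both, assuming comma, semicolon and space as the delimiters (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class MultiDelimiter {
    // Each character in the second constructor argument is a delimiter.
    static List<String> withTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, ",; ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // A character class lists the delimiters; the + treats a run of them as one.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.split("[,; ]+"));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("a,b;c d")); // [a, b, c, d]
        System.out.println(withSplit("a,b;c d"));     // [a, b, c, d]
    }
}
```

Dropping the + from the character class makes split() report an empty string between adjacent delimiters instead of skipping it.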

Henry
 
Sagy Drucker
Greenhorn
Posts: 3
I have run a few simple tests of string tokenizing.

The result is conclusive:
StringTokenizer is MUCH faster than regex or String.split().

results:
for 1000 iterations on a large text:
StringTokenizer: 0:00:01.586 seconds

using pattern: 0:00:02.925 seconds

using String.split: 0:00:02.776 seconds
which makes sense, since split() uses a regex Pattern internally.

hope this is useful.
 
Joanne Neal
Rancher
Posts: 3742
Jan Cumps wrote:
Henry Wong wrote:
Thanks for the info.... but you do know that this topic is over 5 years old, right?
...
Henry
Six years old now.

Seven years old now
 
Sagy Drucker
Greenhorn
Posts: 3
Oh, true.
I didn't notice.
I read it while googling StringTokenizer...
Well... never too late.
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
Sagy Drucker wrote:the result is conclusive:
StringTokenizer is MUCH faster than regex, or String.split().
results:
for 1000 iterations on a large text:
StringTokenizer: 0:00:01.586 seconds
using pattern: 0:00:02.925 seconds...

So you've just spent an hour (I reckon it would take me at least that to write a comprehensive test) to prove that String.split() would take 1.2 seconds longer to check a thousand large strings than a class whose use has now been discouraged for 4 releases (I checked back to 1.4.2).

Optimization is fun, but it's worth remembering that your time is more valuable than any old computer's. You might also want to check out my quote below.

Winston
 
Sagy Drucker
Greenhorn
Posts: 3
I see your point, and you are correct.
But two things:
1. Writing a few for loops with split() and StringTokenizer took me 5 minutes (10 minutes tops).

2. At my job, we need to process millions upon millions of strings, so even if it saves us a little time, we might feel it in the long run.
 
Winston Gutkowski
Bartender
Pie
Posts: 10527
Sagy Drucker wrote:1. Writing a few for loops with split() and StringTokenizer took me 5 minutes (10 minutes tops).

Sounds like a fairly cursory test then.

Sagy Drucker wrote:2. At my job, we need to process millions upon millions of strings, so even if it saves us a little time, we might feel it in the long run.

Hmmm. 20 minutes of computer time per million as against using a class that may well get deprecated? I think I'd let the machines chug a bit more myself, especially since this particular test is so...well...particular.

Between them, String.split(), java.util.regex.Pattern and java.util.regex.Matcher provide a lot more variety than you'll ever get out of StringTokenizer, and they also have the great advantage of being more familiar to new Java bods.

Winston
 