• Post Reply Bookmark Topic Watch Topic
  • New Topic

How to do a fast substring search on an Array of strings?  RSS feed

 
Ranch Hand
Posts: 60
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
What i do now is loop through the array and do an indexof on all Strings, which i I think isn't very efficient. So I wonder if their is a smarter/faster algorithm.
 
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I guess I'm not sure how else you'd do it, without changing the way the strings are represented.
 
Ranch Hand
Posts: 152
Linux VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
you could iterate over your array and use
regular expressions instead of indexOf().

If you are using a fix set of regular expressions, compiling them (e.g. Pattern.compile())
would boost performance.

hope that helps
Matt

 
David Newton
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That's not really any different, though.
 
Ranch Hand
Posts: 1183
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Matt Cartwright wrote:you could iterate over your array and use
regular expressions instead of indexOf().

If you are using a fix set of regular expressions, compiling them (e.g. Pattern.compile())
would boost performance.

hope that helps
Matt



I'm not even sure if the heavy regex engine is really faster compared to a plain indexOf.
 
Matt Cartwright
Ranch Hand
Posts: 152
Linux VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Man was I wrong!

Sorry for passing on second hand knowledge.

My original post was based of what I've been told
by developers in different projects when I asked them
why we have to use regular expressions.

Well, the statement is true only in comparison with
the matches(regex) method on String.

I ran multiple tests with indexOf() and matches(regex) on String.
The latter one is implemented as Pattern.matches(regex, input).






I also compared against the pre-compiled regular expression.



Output of the test runs is:


Filling the array with 7,000,000 String objects. DONE!
5% of these String objects contain a neelde.
350,000 neeldes have been hidden in the hay stack.

Searching the needle using indexOf("needle").
Found 350,000 needles in 407 milliseconds.

Searching the needle using Pattern.matches(".*needle*.*").
Found 350,000 needles in 139,653 milliseconds.

Searching the needle using pattern.matcher(hayStack[i]).matches().
Found 350,000 needles in 33,042 milliseconds.


and:


Filling the array with 7,000,000 String objects. DONE!
20% of these String objects contain a neelde.
1,400,000 neeldes have been hidden in the hay stack.

Searching the needle using indexOf("needle").
Found 1,400,000 needles in 416 milliseconds.

Searching the needle using Pattern.matches(".*needle*.*").
Found 1,400,000 needles in 149,735 milliseconds.

Searching the needle using pattern.matcher(hayStack[i]).matches().
Found 1,400,000 needles in 34,379 milliseconds.



Hope this is useful.
Matt

PS: I wanted to attach the source code I've used. Unfortunately
.java, .zip, .tar and .gz are not valid extensions.
 
Sheriff
Posts: 21202
87
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
It's odd that Pattern.matches and pattern.matcher.matches provide such different results. Pattern.matches compiles a new Pattern, gets a Matcher for the input and calls matches on that:
So in fact, they are the same (save one method call).

Oh wait, of course. The compiling of the Pattern object is taken outside of the loop. That's a bit cheating, isn't it?
 
Matt Cartwright
Ranch Hand
Posts: 152
Linux VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
no, not cheating, that was intentional.
That is exactly what I meant when saying

compiling them (e.g. Pattern.compile()) would boost performance.


 
Greenhorn
Posts: 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Matt Cartwright wrote:PS: I wanted to attach the source code I've used. Unfortunately
.java, .zip, .tar and .gz are not valid extensions.



Could you try to rename file into *.txt or *.jpg? Or could you please send your test example to me via private message? I've found no performance measure of the regular expressions in the internet yet.


By the way, matching ".*" + needle + "*.*" means quite a bit of work including a huge amount of backtracking especially for strings which don't contain needle in them.

See, it first greedily matches ".*" until the end of string (because dot matches every character except newline), then tries to match "needle", fails, backtracks one character back, tries to match once more - and this in cycle moving backwards in the string until finds the last entry of "needle" (or reaches the beginning of string finishing the whole roundtrip). Then it matches the rest of string by the last ".*".

Moreover, regex ".*needle*.*" isn't actually doing the thing you want (for example, "needle*" doesn't mean 0 or more times "needle" but instead it means "needl" and 0 or more times "e"). If you want to find any number of repeated "needle" substring within the string you should use ".*?(needle)+.*" instead.

First ".*?" means shy matching (it doesn't tries to possess the whole string and then backtrack - instead it tries to match the next expression after each character match).

So could you please try testing match of the this example?


Next, indexOf() performs only search of the substring. Using the whole matching regular expression is quite unfair since the regular expression with the matches() method should match each string character.

Exact equivalent of text.indexOf("needle") would be Pattern.compile("needle").matcher(text).find();

Could you please try the last example as well?
 
Matt Cartwright
Ranch Hand
Posts: 152
Linux VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

rather than mailing the source code, I think it is much better to post it here.

So here we go:









Hope that helps.
Matt

 
Nikolay Yurchenko
Greenhorn
Posts: 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks.



Filling the array with 500000 String objects. DONE!
33% of these String objects contain a needle.
166666 needles have been hidden in the hay stack.

Searching the needle using hayStack[i].indexOf("needle").
Found 166666 needles in 46 milliseconds.

Searching the needle using Pattern.matches(".*needle*.*", hayStack[i]).
Found 166666 needles in 1641 milliseconds.

Searching the needle using Pattern.compile(".*needle*.*").matcher(hayStack[i]).matches().
Found 166666 needles in 922 milliseconds.

Searching the needle using Pattern.compile(".*?needle.*").matcher(hayStack[i]).matches().
Found 166666 needles in 734 milliseconds.

Searching the needle using Pattern.compile("needle").matcher(hayStack[i]).find().
Found 166666 needles in 250 milliseconds.



So, simple regex search 5 times slower than indexOf - it's not bad from my point of view.

Measure.zip.jpg
[Thumbnail for Measure.zip.jpg]
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!