• Post Reply Bookmark Topic Watch Topic
  • New Topic

Is there an "existence" binary search in the API  RSS feed

 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I basically just need to know if something's in a list and want to avoid full iteration. I've been aGooglin' and so far no luck.

I'll start writing my own in anticipation of the answer....

Cheers,
Shawn
 
Mori Gad
Greenhorn
Posts: 16
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Collections.html

binarySearch(List list, Object key)
and
binarySearch(List list, Object key, Comparator c)
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Those only return the potential insertion point for a new object. They don't tell you whether an object exists in the collection. I can use that to "look left and right", but I'm really just looking for a true or false.
 
Mori Gad
Greenhorn
Posts: 16
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
"index of the search key, if it is contained in the list"

You can easily do a check to see if you got a hit or not. The potential insertion point is if it misses.
 
Stephan van Hulst
Saloon Keeper
Posts: 7992
143
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Shawn Smith wrote:Those only return the potential insertion point for a new object. They don't tell you whether an object exists in the collection. I can use that to "look left and right", but I'm really just looking for a true or false.


It does. The result is negative if the key isn't contained. So binarySearch(list, key) >= 0 will return your true/false.
 
Mori Gad
Greenhorn
Posts: 16
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Not so, as per the api

"index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(), if all elements in the list are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found."

For my shame, I misread it.
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
TreeSet provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
 
Campbell Ritchie
Marshal
Posts: 56570
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
A binary search might run in log(n) time, but it requires the List be sorted beforehand. Sorting may destroy information about insertion order, and runs in nlog(n) time, and copying the entire List to a TreeSet runs in nlog(n) time, too, so you might be no better off with a binary search than a linear search.
If you really want a fast contains procedure, running multiple times, copy the entire List into a HashSet; both inserting a single element and searching [edit: look for contains() methods in the Set interface /edit] run in constant time, and are fast, so the entire procedure would run in linear time. Remember a HashSet requires correct overriding of the hashCode and equals methods.
 
Rob Spoor
Sheriff
Posts: 21135
87
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Is't n log(n) < n?
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Rob Spoor wrote:Is't n log(n) < n?


For small enough values of n, yes, but it grows faster than n, and that's what matters. d(n log(n)/dn > dn/dn
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I go home for a good nights rest and come back to a wealth of information. Thanks to all for the feedback.

Cheers,
Shawn
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
So, as a follow up for those keeping score at home:

using



Total time for the for loop: 600 msec

Switching toand using contains took 552 msec

doubling the toSearchFor (actually just running the list more than once) resulted in:
binarySearch
2x - 913 msec (+313)
3x - 1245 msec (+332)
4x - 1626 msec (+381)

HashSet.contains
2x - 874 msec (+322)
3x - 1350 msec (+476)
4x - 1757 msec (+407)

And just to make it interesting, I threw in some c++ standard library (i.e. )
These timings aren't accumulative, just the loop time for each iteration.
1x - 264 msec
2x - 265 msec
3x - 267 msec
etc...

So my conclusions were (and feel free to throw rocks), there's really not a enough of a significant difference in the timed tests between HashSet and binarySearch for something as simple as determining String containment.
 
Stephan van Hulst
Saloon Keeper
Posts: 7992
143
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Yeah, but that's still assuming the Strings are already sorted. If you're fair, you should add the time it takes to do the sorting to the time it takes binarySearch() to find an element.
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Shawn Smith wrote:
So my conclusions were (and feel free to throw rocks), there's really not a enough of a significant difference in the timed tests between HashSet and binarySearch for something as simple as determining String containment.


I don't think your tests are sufficient to show that. A HashSet lookup with a proper hash function is O(1) and a binary search is O(log N). I haven't looked at your testing code, but I'll bet at least one of the following is true: 1) The scale is too small, so that the differences aren't showing yet, and/or the search times are not large relative to the precision of the clock. 2) You're timing superfluous stuff, not just search times, and that superfluous time is not tiny relative to the search time. 3) GC is throwing your search times off. 4) You picked values such that binary searches happen to be hitting very early on, without having to go further down the tree.

If you get a large enough sample, have a proper hash function, eliminate GC from the equation, and run many searches, with hits at many different spots in the structures, and with a fair number of misses as well, you will see that average hashed search times grow more slowly than binary search times.
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Stephan van Hulst wrote:Yeah, but that's still assuming the Strings are already sorted. If you're fair, you should add the time it takes to do the sorting to the time it takes binarySearch() to find an element.


No, that's just a constant, and its significance shrinks as we do more searches, which is what matters for this performance comparison. Comparing a single binary search to a single hash search will not give consistent results and is a meaningless comparison.
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Did you use a HashMap or a HashSet? Just askin

Another thing to consider when evaluating HashSet is the number of collisions in your strings that need to be searched. For a truly random set, the search time on a HashSet should be O(1). However, if all of your input data collides (very unlikely), the performance can degrade to O(n)
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I can certainly add the sort to the timing, though that's only going to affect the first iteration.

Essentially this is a dirty word checker. There are 622 dirty words and I ran the entire content of my /usr/share/dict/words file against them (479830) words.

There is a very small subset, and once the data is loaded I do simple iteration:

The test code looks like:


and the passesDirtyWord looks like:



With the commented areas being the difference between the HashSet and the ArrayList.

Please forgive the swear word, it is a dirty word checker after all...

All other code in the test is just loading collections of String's from files.

I'm not sure if this information helps the analysis, but...
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
There is a very small subset, and once the data is loaded I do simple iteration:


Sorry, I should have pre-read. There is a very small code set.
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jayesh A Lalwani wrote:Did you use a HashMap or a HashSet? Just askin

Another thing to consider when evaluating HashSet is the number of collisions in your strings that need to be searched. For a truly random set, the search time on a HashSet should be O(1). However, if all of your input data collides (very unlikely), the performance can degrade to O(n)


HashSet, that HashMap was a typo.
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
How long does split take?
 
Shawn Smith
Ranch Hand
Posts: 41
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jayesh A Lalwani wrote:How long does split take?


I can measure it, but is should be constant across both tests.
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Shawn Smith wrote:
Jayesh A Lalwani wrote:How long does split take?


I can measure it, but is should be constant across both tests.


Doesn't matter. If it's part of your timing and not at least an order of magnitude smaller that what you're testing, it is invalidating your tests.

If A takes time 0.001 and B takes time 0.005, then A is 5 times faster than B, but if S takes time 1.0 and you're comparing S+A vs. S+B, you'll get 1.001 vs. 1.005 and conclude (erroneously) that A and B are about the same.
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
split might be throwing off GC. Essentially every call to split will compile a new Pattern. That means you will have 479830 Pattern objects created in memory. Depending on your memory settings, it might be triggerring GC too many times. Try changing the code to create a Pattern once outside the loop.
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Shawn Smith wrote:I can certainly add the sort to the timing, though that's only going to affect the first iteration.

Essentially this is a dirty word checker. There are 622 dirty words and I ran the entire content of my /usr/share/dict/words file against them (479830) words.


I still haven't studied your test, and to be honest, I probably won't. Because, unless there's something funky in the implementations, I know how the two approaches behave. Now it certainly may be the case that the difference doesn't show up until we get to extremely large search sets, and that in itself is worth knowing. In general, though, hash search times will grow more slowly than binary search times. (Although, what could reverse the trend is when your number of buckets gets saturated, and then the O(n) linear search within the bucket will eventually surpass binary search as the buckets grow larger.)
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jeff Verdegan wrote:
Shawn Smith wrote:Now it certainly may be the case that the difference doesn't show up until we get to extremely large search sets, and that in itself is worth knowing.



He will never have extremely large search sets. Outside of college campuses (:p), the list of dirty words are going to be much smaller than the list of all possible words.
 
Stephan van Hulst
Saloon Keeper
Posts: 7992
143
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jeff Verdegan wrote:No, that's just a constant, and its significance shrinks as we do more searches, which is what matters for this performance comparison. Comparing a single binary search to a single hash search will not give consistent results and is a meaningless comparison.


Errr yes of course. D'oh. I was still confused with just doing one search XD
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!