I'll start writing my own in anticipation of the answer....

Cheers,

Shawn

`binarySearch(List list, Object key)`

and

`binarySearch(List list, Object key, Comparator c)`

You can easily check whether you got a hit or not. The potential insertion point is only returned if it misses.
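For example, a quick sketch of how the return value of `java.util.Collections.binarySearch` encodes both cases:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BinarySearchDemo {
    public static void main(String[] args) {
        // The list must already be sorted for binarySearch to work.
        List<String> list = Arrays.asList("apple", "cherry", "pear");

        int hit = Collections.binarySearch(list, "cherry");
        System.out.println(hit);        // 1 (index of the element)

        int miss = Collections.binarySearch(list, "banana");
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        System.out.println(miss);       // -2
        System.out.println(-miss - 1);  // 1 (where "banana" would be inserted)
    }
}
```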

Shawn Smith wrote:Those only return the potential insertion point for a new object. They don't tell you whether an object exists in the collection. I can use that to "look left and right", but I'm really just looking for a true or false.

It does. The result is negative if the key isn't contained. So

`binarySearch(list, key) >= 0` will return your true/false.

*The mind is a strange and wonderful thing. I'm not sure that it will ever be able to figure itself out, everything else, maybe. From the atom to the universe, everything, except itself.*

"index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(), if all elements in the list are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found."

To my shame, I misread it.

A binary search runs in *log*(n) time, but it requires the List to be sorted beforehand. Sorting may destroy information about insertion order and runs in n *log*(n) time, and copying the entire List to a TreeSet runs in n *log*(n) time too, so you might be no better off with a binary search than with a linear search.
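To make that trade-off concrete, a minimal sketch: the one-time n log(n) sort buys log(n) lookups afterwards, but the original ordering is gone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortThenSearch {
    public static void main(String[] args) {
        // Unsorted input: calling binarySearch on this directly would give garbage.
        List<String> words = new ArrayList<>(Arrays.asList("pear", "apple", "cherry"));

        Collections.sort(words);  // n log(n), and insertion order is lost

        // Each lookup afterwards is log(n).
        boolean found = Collections.binarySearch(words, "apple") >= 0;
        System.out.println(found); // true
    }
}
```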

If you really want a fast contains procedure, running multiple times, copy the entire List into a HashSet; both inserting a single element and searching [edit: look for contains() methods in the Set interface /edit] run in constant time, and are fast, so the entire procedure would run in linear time. Remember that a HashSet requires correct overriding of the hashCode and equals methods.
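For String keys, hashCode and equals are already correct, so that approach is just a copy plus lookups:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetContains {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "cherry", "apple");

        // One O(n) copy; every contains() afterwards is expected O(1).
        Set<String> set = new HashSet<>(words);

        System.out.println(set.contains("apple"));  // true
        System.out.println(set.contains("banana")); // false
    }
}
```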

SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6 - OCEJPAD 6

How To Ask Questions How To Answer Questions

Rob Spoor wrote:Isn't n log(n) < n?

For small enough values of n, yes, but it grows faster than n, and that's what matters: d(n log(n))/dn = log(n) + 1 > 1 = dn/dn for n > 1 (natural log).

Cheers,

Shawn

Using binarySearch:

Total time for the for loop: 600 msec

Switching to a HashSet and using contains took 552 msec

Doubling toSearchFor (actually just running the list more than once) resulted in:

binarySearch

2x - 913 msec (+313)

3x - 1245 msec (+332)

4x - 1626 msec (+381)

HashSet.contains

2x - 874 msec (+322)

3x - 1350 msec (+476)

4x - 1757 msec (+407)

And just to make it interesting, I threw in some C++ standard library timings:

These timings aren't cumulative, just the loop time for each iteration.

1x - 264 msec

2x - 265 msec

3x - 267 msec

etc...

So my conclusions were (and feel free to throw rocks): there really isn't a significant enough difference in the timed tests between HashSet and binarySearch for something as simple as determining String containment.


Shawn Smith wrote:

So my conclusions were (and feel free to throw rocks): there really isn't a significant enough difference in the timed tests between HashSet and binarySearch for something as simple as determining String containment.

I don't think your tests are sufficient to show that. A HashSet lookup with a proper hash function **is** O(1) and a binary search **is** O(log N). I haven't looked at your testing code, but I'll bet at least one of the following is true:

1) The scale is too small, so that the differences aren't showing yet, and/or the search times are not large relative to the precision of the clock.

2) You're timing superfluous stuff, not just search times, and that superfluous time is not tiny relative to the search time.

3) GC is throwing your search times off.

4) You picked values such that binary searches happen to be hitting very early on, without having to go further down the tree.

If you get a large enough sample, have a proper hash function, eliminate GC from the equation, and run many searches, with hits at many different spots in the structures, and with a fair number of misses as well, you will see that average hashed search times grow more slowly than binary search times.
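A rough harness along those lines might look like the sketch below: warm up first, time only the searches with System.nanoTime(), and mix hits with misses at different spots. The sizes and probe values are my own; a real harness such as JMH would handle warm-up and isolation properly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SearchBench {
    public static void main(String[] args) {
        // Fixed-width keys so the generated list is already sorted.
        List<String> sorted = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) sorted.add(String.format("w%08d", i));
        Set<String> set = new HashSet<>(sorted);

        // Hits at the start, middle, and end, plus two misses.
        String[] probes = { "w00000000", "w00050000", "w00099999", "miss-a", "miss-b" };

        for (int round = 0; round < 3; round++) {  // round 0 doubles as warm-up
            long t0 = System.nanoTime();
            int hits = 0;
            for (int i = 0; i < 200_000; i++) {
                if (Collections.binarySearch(sorted, probes[i % probes.length]) >= 0) hits++;
            }
            long binNs = System.nanoTime() - t0;

            t0 = System.nanoTime();
            int hits2 = 0;
            for (int i = 0; i < 200_000; i++) {
                if (set.contains(probes[i % probes.length])) hits2++;
            }
            long hashNs = System.nanoTime() - t0;

            System.out.printf("binary=%d ms, hash=%d ms (hits %d/%d)%n",
                    binNs / 1_000_000, hashNs / 1_000_000, hits, hits2);
        }
    }
}
```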

Stephan van Hulst wrote:Yeah, but that's still assuming the Strings are already sorted. If you're fair, you should add the time it takes to do the sorting to the time it takes binarySearch() to find an element.

No, that's just a constant, and its significance shrinks as we do more searches, which is what matters for this performance comparison. Comparing a single binary search to a single hash search will not give consistent results and is a meaningless comparison.

Another thing to consider when evaluating HashSet is the number of collisions in your strings that need to be searched. For a truly random set, the search time on a HashSet should be O(1). However, if all of your input data collides (very unlikely), the performance can degrade to O(n).
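To see that worst case, imagine a key type whose hashCode is broken (a hypothetical BadKey class, not anything from this thread):

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionDemo {
    // Every instance hashes to the same bucket, forcing the worst case.
    static final class BadKey {
        final String value;
        BadKey(String value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value.equals(value);
        }
        @Override public int hashCode() { return 42; }  // everything collides
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 10_000; i++) set.add(new BadKey("k" + i));

        // Still correct, but each lookup now searches one giant bucket
        // instead of hitting in O(1). (Modern JDKs treeify large bins,
        // which softens but doesn't eliminate the degradation.)
        System.out.println(set.contains(new BadKey("k9999"))); // true
    }
}
```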

Essentially this is a dirty word checker. There are 622 dirty words, and I ran the entire contents of my /usr/share/dict/words file (479,830 words) against them.

There is a very small subset, and once the data is loaded I do simple iteration:

The test code looks like:

and the passesDirtyWord looks like:

With the commented areas being the difference between the HashSet and the ArrayList.

Please forgive the swear word, it is a dirty word checker after all...

All other code in the test is just loading collections of Strings from files.
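Since the inline code blocks didn't survive in the post above, here is only a hypothetical sketch of what a checker along those lines might look like. The class name and fields are mine; only passesDirtyWord and the use of split come from the thread, and the commented-out line stands in for the "commented areas" that switch between the two variants.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DirtyWordChecker {
    private final Set<String> dirtySet;
    private final List<String> dirtySorted;  // must be pre-sorted for binarySearch

    public DirtyWordChecker(List<String> sortedDirtyWords) {
        this.dirtySorted = sortedDirtyWords;
        this.dirtySet = new HashSet<>(sortedDirtyWords);
    }

    // Returns true if the text contains no dirty words.
    public boolean passesDirtyWord(String text) {
        for (String word : text.split("\\s+")) {
            // HashSet variant:
            if (dirtySet.contains(word)) return false;
            // binarySearch variant (swap in for the line above):
            // if (Collections.binarySearch(dirtySorted, word) >= 0) return false;
        }
        return true;
    }
}
```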

I'm not sure if this information helps the analysis, but...

There is a very small subset, and once the data is loaded I do simple iteration:

Sorry, I should have pre-read. There is a very *small* code set.

Jayesh A Lalwani wrote:Did you use a HashMap or a HashSet? Just asking.

Another thing to consider when evaluating HashSet is the number of collisions in your strings that need to be searched. For a truly random set, the search time on a HashSet should be O(1). However, if all of your input data collides (very unlikely), the performance can degrade to O(n)

HashSet; the HashMap was a typo.

Jayesh A Lalwani wrote:How long does split take?

I can measure it, but it should be constant across both tests.

Shawn Smith wrote:

Jayesh A Lalwani wrote:How long does split take?

I can measure it, but it should be constant across both tests.

Doesn't matter. If it's part of your timing and not at least an order of magnitude smaller than what you're testing, it is invalidating your tests.

If A takes time 0.001 and B takes time 0.005, then A is 5 times faster than B, but if S takes time 1.0 and you're comparing S+A vs. S+B, you'll get 1.001 vs. 1.005 and conclude (erroneously) that A and B are about the same.

Shawn Smith wrote:I can certainly add the sort to the timing, though that's only going to affect the first iteration.

Essentially this is a dirty word checker. There are 622 dirty words, and I ran the entire contents of my /usr/share/dict/words file (479,830 words) against them.

I still haven't studied your test, and to be honest, I probably won't. Because, unless there's something funky in the implementations, I **know** how the two approaches behave. Now it certainly may be the case that the difference doesn't show up until we get to extremely large search sets, and that in itself is worth knowing. In general, though, hash search times **will** grow more slowly than binary search times. (Although, what could reverse the trend is when your number of buckets gets saturated, and then the O(n) linear search within the bucket will eventually surpass binary search as the buckets grow larger.)

Jeff Verdegan wrote:Now it certainly may be the case that the difference doesn't show up until we get to extremely large search sets, and that in itself is worth knowing.

He will never have extremely large search sets. Outside of college campuses (:p), the list of dirty words is going to be much smaller than the list of all possible words.

Jeff Verdegan wrote:No, that's just a constant, and its significance shrinks as we do more searches, which is what matters for this performance comparison. Comparing a single binary search to a single hash search will not give consistent results and is a meaningless comparison.

Errr yes of course. D'oh. I was still confused with just doing one search XD
