• Post Reply Bookmark Topic Watch Topic
  • New Topic

Problem with hashcode etc  RSS feed

 
james dunster
Greenhorn
Posts: 20
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have read the Java tutorial and as much as I can google on the above subject but I am struggling to understand the basic concept. Unfortunately my brain will not absorb any more information until the block is removed. The problem is this:

"you will calculate the same hashCode, and will be able to look up numerically with it. Of course there is always the possibility two different Strings will have the same digest hashCode. However, even then, all is not lost; it greatly narrows down the search, hence speeding it up. "

As I understand it collisions are put into an array list or linked list (not come to that yet, maybe that's my problem). But I can't see that in example code.

Under what circumstances does ' near enough is good enough ' apply. Reminds me of time a couple if years ago I was held at immigration because I had the same surname as someone they were seeking whose other names were totally different from mine. The reason given was, it might be me!
Is there a Ladybird book in simple words and large print available on this subject anyone can recommend?
 
Fred Hamilton
Ranch Hand
Posts: 686
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
hash tables are an integral part of chess programming. If you go to the following site and do some searching for hash, you will find a lot of information.

http://chessprogramming.wikispaces.com/

The following article in particular is quite good.

http://www.cis.uab.edu/hyatt/collisions.html

as long as you know the basics of chess, you should get some mileage out of these sites, and a different insight into hash collisions. These aren't java articles, but general principles still apply.
 
K. Tsang
Bartender
Posts: 3648
16
Firefox Browser Java Mac OS X
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Howdy James. Think of hashcode this way. Suppose you want to "hash" something say username and password. Now say the username is "james" and password is "abc". Another user username may be "tom" and his password is "def". To produce the hash may look like this:

Is it possible to produce the same hashcode using this way? If you answer yes... you are correct. At least I think so. Why? Ignoring database concepts of primary keys and such, what happens if another user also called james and have his username name "james" but his password something different? Eventually that new James may have an identical password as "abc" which is same as yours. And when that happens it will be impossible to identify that account as you or the new James.

IP addresses may be a better example. If you use the computer IP address produce the hashcode, you are bound to have trouble because a user can log in on different computers. Hope this makes some sense.
 
Fred Hamilton
Ranch Hand
Posts: 686
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
K. Tsang wrote:Howdy James. Think of hashcode this way. Suppose you want to "hash" something say username and password. Now say the username is "james" and password is "abc". Another user username may be "tom" and his password is "def". To produce the hash may look like this:

Is it possible to produce the same hashcode using this way? If you answer yes... you are correct. At least I think so. Why? Ignoring database concepts of primary keys and such, what happens if another user also called james and have his username name "james" but his password something different? Eventually that new James may have an identical password as "abc" which is same as yours. And when that happens it will be impossible to identify that account as you or the new James.

IP addresses may be a better example. If you use the computer IP address produce the hashcode, you are bound to have trouble because a user can log in on different computers. Hope this makes some sense.


which is why it can be so hard to get a user id that you like on sites like yahoo, etc. They don't allow duplicates, for the very reasons you describe. And all the good id's are already taken. Also a reason why so many sites use email address as user id.
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24217
38
Chrome Eclipse IDE Mac OS X
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi James,

I'm not sure any of the above does much to clarify the situation. Let's look at a concrete example. Imagine you are an instructor who teaches a class of 30 students, and they have taken an exam. You've marked the exam, and now you'd like to return their graded papers to them. Each paper, of course, has a mark on it.

Unfortunately, you must leave town, so you can't hand the papers out in class; you'd like to leave the papers in your office for the students to quickly pick up. You want to organize the papers so that the students can find their own paper quickly. You decide to use a sort of hashing algorithm to do this. The name on each paper is the "key" that we're going to hash, while the paper itself with the mark on it is the data that will come along for the ride.

You choose an extraordinarily simple hashing algorithm: the hashcode for a student's name is the first letter of their last name. There are therefore 26 possible hash values.

You make a row of piles: the first pile contains the papers for the students whose last names hash to 1 -- i.e., the names that start with "A". The second pile is for hash 2, and so on, up to the pile for hash 26.

Some of the piles are empty, and some of them have more than one paper in it. But all the piles are small, and the average pile size is 30/26 == 1.15 papers.

To find their paper, a student instantly identifies the appropriate pile by computing the hash value for their name. Then they do a very short linear search through that pile for their own paper by comparing the name on each paper to their own name. (Note that this is why the Javadocs make such a big deal about overloading both hashCode() and equals() together, so they're consistent. If you use hashCode() you almost certainly will need to use equals() in this way!)

The row of piles is a hash table -- that's all there is to it! This is exactly how java.util.HashMap (and its older, uglier cousin java.util.Hashtable) works.

Note that if the hashCode() function is more complex, so that there are an inconvenient number of possible values (in Java, hashCode() can return any integer, so this is always true at least in theory) then it's common to use the absolute value of the hash code modulo some arbitrary number of piles. In the above example, we could use 10 piles and use "hashCode() % 10" to sort the papers into ten piles.
 
Fred Hamilton
Ranch Hand
Posts: 686
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ernest Friedman-Hill wrote:Hi James,

...

To find their paper, a student instantly identifies the appropriate pile by computing the hash value for their name. Then they do a very short linear search through that pile for their own paper by comparing the name on each paper to their own name. (Note that this is why the Javadocs make such a big deal about overloading both hashCode() and equals() together, so they're consistent. If you use hashCode() you almost certainly will need to use equals() in this way!)



Ernest, I think your paragraph cuts to the essense of the original question quicker than my examples, which are rather complex.

In chess it is very simple. There is a hash code that represents a position, and a table of some kind of table that represents all possible positions in analysis to a certain move depth. It is very likely that the same position is possible through different move sequences, and the table is search to see if a position has occurred previously, so we don't have to evaluate the position more than once. The collision problem occurs when the same hash code can happen for two different positions, and considerable time and effort is devoted to dealing with this possibility. Another factor is whether or not it is practical to represent all possible factors which make a position unique. The speed of the algorithm is everything here.

In any event the issue hash collisions must be resolved. Very often it is a tradeoff. Which is more efficient? devoting extra effort to ensuring that hash codes are created uniquely? Or, as in your case I believe, taking extra steps after the fact to deal with the uniqueness thing.
 
David Newton
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
+1 for a nice explanation.
 
james dunster
Greenhorn
Posts: 20
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Fred, do I have to write some code to do the final sort if there is more than one item returned for a hash code? I can't see it in the examples I have seen. Thanks.
 
Fred Hamilton
Ranch Hand
Posts: 686
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
james dunster wrote:Fred, do I have to write some code to do the final sort if there is more than one item returned for a hash code? I can't see it in the examples I have seen. Thanks.


You know, I'm pretty good with the theory, but not so good with the implementation. I know the reasons for hash tables, and what sorts of things cause problems, but haven't yet mastered the actual details of how to do it. I was just trying to give a high level overview.

That being said, if there is a specific example from the links I provided, then please specify which example, and I will try to be of assistance.

Otherwise, someone else here could probably help you better.

regards.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
james dunster wrote:Fred, do I have to write some code to do the final sort if there is more than one item returned for a hash code? I can't see it in the examples I have seen. Thanks.


After the collection uses the hashcode to locate the "bucket", it uses the equals() method to find the right object in that bucket. Hence, it will never return two objects.

Henry
 
james dunster
Greenhorn
Posts: 20
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
After the collection uses the hashcode to locate the "bucket", it uses the equals() method to find the right object in that bucket. Hence, it will never return two objects.


so I have to override the equals function to that it compares everything used to generate the hashcode?
 
Vijitha Kumara
Bartender
Posts: 4002
42
Chrome Fedora Hibernate
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
james dunster wrote:
After the collection uses the hashcode to locate the "bucket", it uses the equals() method to find the right object in that bucket. Hence, it will never return two objects.


so I have to override the equals function to that it compares everything used to generate the hashcode?


In general you should use same set of attributes for both equals & hashcode. Check this for more details.
 
Campbell Ritchie
Marshal
Posts: 56584
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Yes, if you want collections framework classes to work correctly, you must override both equals() and hashCode().

Find the chapter in Bloch's book about equals(), or Google for "Angelika Langer Java equals" for details, because the equals method appears easy to implement, but it isn't at all easy.
 
Consider Paul's rocket mass heater.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!