• Post Reply Bookmark Topic Watch Topic
  • New Topic

HashSet ordering  RSS feed

 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
List list = Arrays.asList(10,5,10,20);
System.out.println(new HashSet(list));

The output is [20, 5, 10]

From what I can tell the hashcode() of an Integer is just its value, so I'm wondering how the 5 gets between the 20 and the 10.
 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Likes 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The order is probably not directly from the hashcode, but the hashcode modulus the set's capacity. So with a default capacity of 16, 20 % 16 = 4, 5 % 16 = 5 and 10 % 16 is 10. Or if the capacity is 4 (since the length of the list is four): 20 % 4 = 0, 5 % 4 = 1 and 10 % 4 is 2. This is one of the reasons why you should never consider the Set to have a consistent order - not only is it not defined, it could change if the values/capacity change.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:List list = Arrays.asList(10,5,10,20);
System.out.println(new HashSet(list));

The output is [20, 5, 10]

From what I can tell the hashcode() of an Integer is just its value, so I'm wondering how the 5 gets between the 20 and the 10.


First, there isn't a one to one correspondence between hash codes and buckets -- if there were, then a hash set would have 4 billion buckets, and eat up all of memory. To deal with this, there is a calculation to convert hash code to their bucket, based on the number of buckets. Second, the collection has re-hashing algorithms to rehash inefficient hash codes -- such as the simple (silly) hash code that is returned from the Integer class. Imagine if the application only dealt with small numbers, then without rehashing, it may only use the first few buckets.

The combination of these two tend to scatter the elements across all the buckets.

Henry
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The Integer class' hashCode method is simple, but why is it silly, Henry? In that case, what is wrong with returning the value? Particularly in view of Integer's caching behaviour, that is probably as good a hashCode method as you are going to get. It is consistent with equals (as is compareTo in this case) and returns a different value for different objects. Even small numbers will go into different buckets; Integer.valueOf(0) and Integer.valueOf(1) will produce exactly two instances however many times you call them, and will always go into different buckets.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I think Henry's small-number example is flawed, for the reasons Campbell gives. A better example is, what if you have a bunch of different numbers that are all multiples of some power of two? E.g if they all end in 000, they're multiples of 1000, which factors to 2*2*2*5*5*5. The 5's don't matter so much, but the three 2's mean that, without rehashing, seven out of every eight buckets in your hash table remain unused. The purpose of rehashing is that even if the input keys contain many powers of two, as long as they are unique somewhere they will probably go into different hash buckets.
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That is a better example, Mike. Thank you.
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16057
88
Android IntelliJ IDE Java Scala Spring
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Note that HashSet should be regarded as an unordered collection. It's just like a bag that contains things. The things are in the bag, but not in any particular order. If you stick your hand in and grab things from the bag, you don't know in which order you'll grab them.

Do not write programs that rely on the HashSet giving you things back in any particular order.
 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Steve Luke wrote:The order is probably not directly from the hashcode, but the hashcode modulus the set's capacity. So with a default capacity of 16, 20 % 16 = 4, 5 % 16 = 5 and 10 % 16 is 10. Or if the capacity is 4 (since the length of the list is four): 20 % 4 = 0, 5 % 4 = 1 and 10 % 4 is 2. This is one of the reasons why you should never consider the Set to have a consistent order - not only is it not defined, it could change if the values/capacity change.


So is "%" the default means to convert an Integer hashcode to its bucket location? The two lines of code I posted were the only two inside the main method of the program. Also what would happen if the values 32 and 64 were added?
 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Jesper de Jong wrote:Note that HashSet should be regarded as an unordered collection. It's just like a bag that contains things. The things are in the bag, but not in any particular order. If you stick your hand in and grab things from the bag, you don't know in which order you'll grab them.

Do not write programs that rely on the HashSet giving you things back in any particular order.


But it seems like HashSet does have a particular order its just very complicated to figure out what that order is. This is a question I got wrong in a mock exam because I specifically did not know what that order was. I thought it was [5,10,20] in the OP. So I'm trying to figure out how the heck I was supposed to know it would be [20, 5, 10].
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote: . . . So is "%" the default means to convert an Integer hashcode to its bucket location? . . .
No. In fact HashMap does not use %. It uses bitwise AND.
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:But it seems like HashSet does have a particular order its just very complicated to figure out what that order is...

And doubly so, because it actually further mangles the hashcode based on the Set's current capacity. A LinkedHashSet, on the other hand, DOES have a defined order (by default, insertion order), so your example above will almost certainly return:

[10, 5, 20]

HIH

Winston
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Likes 3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:
But it seems like HashSet does have a particular order its just very complicated to figure out what that order is. This is a question I got wrong in a mock exam because I specifically did not know what that order was. I thought it was [5,10,20] in the OP. So I'm trying to figure out how the heck I was supposed to know it would be [20, 5, 10].


The docs doesn't say what is the order -- except that you can't rely on the order. Basically, the order is implementation specific, and not documented. So, even if you can figure it all out, you can't write code that rely on it since it can change with future implementations.

Henry
 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Henry Wong wrote:
Tyson Lindner wrote:
But it seems like HashSet does have a particular order its just very complicated to figure out what that order is. This is a question I got wrong in a mock exam because I specifically did not know what that order was. I thought it was [5,10,20] in the OP. So I'm trying to figure out how the heck I was supposed to know it would be [20, 5, 10].


The docs doesn't say what is the order -- except that you can't rely on the order. Basically, the order is implementation specific, and not documented. So, even if you can figure it all out, you can't write code that rely on it since it can change with future implementations.



That doesn't tell me how to answer the exam question, but from what I'm gathering it seems like it was a very unreasonable question to begin with.

I'm trying to use the bitwise AND, using variations of the h & c - 1 formula to figure out bucket locations but its not working for values like 32 and 64 (I just get 0 for both). Does the formula change if the values change but not the capacity?

Also by implementation does that mean java version, or just how the programmer sets up the HashSet?
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:That doesn't tell me how to answer the exam question, but from what I'm gathering it seems like it was a very unreasonable question to begin with.

Yes, the author of that particular mock exam question was an idiot. Don't trust all mock exams you find on the internet. The real exam will not have a question like that.
 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:That doesn't tell me how to answer the exam question, but from what I'm gathering it seems like it was a very unreasonable question to begin with.

probably true, unless one of the answers was 'undefined' or something of the sort.

I'm trying to use the bitwise AND, using variations of the h & c - 1 formula to figure out bucket locations but its not working for values like 32 and 64 (I just get 0 for both). Does the formula change if the values change but not the capacity?

What is the problem of two values giving you the same result? It just means both would go into the same bucket.
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote: . . . not working for values like 32 and 64 (I just get 0 for both). . . .
What's wrong with that? That is working correctly. As with array indices, 0 is a valid bucket number.
Also by implementation does that mean java version, or just how the programmer sets up the HashSet?
Don't know.
 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Steve Luke wrote:
What is the problem of two values giving you the same result? It just means both would go into the same bucket.


List list = Arrays.asList(10,5,10,20,64,32);
System.out.println(new HashSet(list));

List list2 = Arrays.asList(10,5,10,20,32,64);
System.out.println(new HashSet(list));

my output is
[32, 64, 20, 5, 10]
[32, 64, 20, 5, 10]

So the values are either put into the same bucket or taken out of the bucket according to some order, so the original formula isn't telling the whole story. I think I'm about done trying to figure this out though, it all does seem a bit unnecessary.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:
Henry Wong wrote:Basically, the order is implementation specific, and not documented. So, even if you can figure it all out, you can't write code that rely on it since it can change with future implementations.


Also by implementation does that mean java version, or just how the programmer sets up the HashSet?

In this case, "implementation" refers to what version of HashSet.class and HashMap.class files are found on your CLASSPATH. This is generally determined by the java version, yes. You should be able to see the source code for these files, in the stc.zip file found in your JDK installation. Or any good IDE will find it for you automatically.

Oh, the order is also determined by the hashCode() implementation for whatever objects you're using as keys. For a String or Integer, this would also be in the classes installed in your JDK installation, determined by your jdk version. For a class you wrote yourself, it will depend on the hashCode() method you yourself have written.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:So the values are either put into the same bucket or taken out of the bucket according to some order, so the original formula isn't telling the whole story.

If you still care, you need to look at the code in HashMap.java, specifically the hash(Object) method, which does additional scrambling of the bits:

This is from the JDK 1.7 source code; you may find something different in other versions.
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:So the values are either put into the same bucket or taken out of the bucket according to some order, so the original formula isn't telling the whole story.

What formula? As I tried to tell you above there are at least two involved because, in addition to a simple 'mod', HashSet ALSO does some bit mangling on your hash - and I'm pretty sure it involves XOR.

However, that doesn't mean that the order won't be consistent - indeed, it MUST be consistent in order for HashSet to work - it simply means that it's extremely difficult to predict. And, as Henry said, it could change with any future release, so you'd better not rely on it.

Winston
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston Gutkowski wrote:HashSet ALSO does some bit mangling on your hash - and I'm pretty sure it involves XOR.
Pretty sure of that, yes. If only we could find the exact formula somewhere...
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I thought a hash set encapsulates a hash map with the same dummy object for the “range” part throughout. In which case it would use exactly the same formula for mangling the hashes.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That would be my point, yes.
 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Mike, how is the hashSeed value calculated?

I tried using the formula within the hash() with 0 and 1 for hashSeed but its not giving values that reflect the order when the HashSet is printed out.

 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:Mike, how is the hashSeed value calculated?

Depends. Even earlier in JDK 7 there was no hashSeed. In the latest releases hashSeed is either 0 or what appears to be a random value, depending on capacity.

What is most important is that the order of the printout is not just dependent on the hashcode, but also on the order of the objects in the buckets defined by the hashes. The discussion so far has been about how Objects find their way into a particular hash bucket, but multiple objects could end up in the same bucket. I think they are put into the bucket in a linked list type structure in the order in which they reach the bucket.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Likes 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:Mike, how is the hashSeed value calculated?

Steve's summary is good - the complete answer is unduly complicated, but it's essentially 0 in this particular use case, at least on my machine.

Tyson Lindner wrote:I tried using the formula within the hash() with 0 and 1 for hashSeed but its not giving values that reflect the order when the HashSet is printed out.

Well, I get the same order as you, but for me, the results do make sense, using hashSeed = 0. Here's a test I found useful:

The values h, h2, and h3 represent the different stages of the hash calculation. The final h3 is the one that determines iteration order. Note that both 5 and 20 have an h3 = 20, so they're in the same bucket. In this case, every time a new entry is put in the bucket, that new entry becomes the first entry in the bucket. So 20 appears before 5 because it was put in the HashMap later.

Note that I'm using JDK 1.7. I don't think you've said what version you're using, but you're getting the same order I am, so I'm going to assume my JDK 1.7 code is close enough unless we hear otherwise.
 
Tyson Lindner
Ranch Hand
Posts: 211
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks Mike!

I was using only h3 at first but then only h2. I was also using "h2 & (16-1)" for h3 but after looking some more that part doesn't seem to matter for the original numbers.
 
Mike Simmons
Ranch Hand
Posts: 3090
14
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tyson Lindner wrote:I was also using "h2 & (16-1)" for h3 but after looking some more that part doesn't seem to matter for the original numbers.

Yes, that formula is a little faster than h2 % 16, but the results are the same for positive h2; I just wanted to see the result. If you throw in negative numbers, the h2 (16-1) formula is needed for correctness. Good catch.
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Steve Luke wrote: . . . I think they are put into the bucket in a linked list type structure in the order in which they reach the bucket.
Yes. As far as I can remember, the Map.Entry implementation in HashMap has an instance field Map.Entry next (or similar), so it creates a singly‑linked list.
 
Campbell Ritchie
Marshal
Posts: 56525
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If the hash‑mangling algorithm uses a random seed, wouldn't that mean the iteration order would be different if you run it on a new instance or new JVM?
 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Campbell Ritchie wrote:If the hash‑mangling algorithm uses a random seed, wouldn't that mean the iteration order would be different if you run it on a new instance or new JVM?

Yeah, but it isn't likely you will get to the random seed. It will normally be 0 unless the capacity is greater than some threshold. The threshold can be set by a system property but defaults to Integer.MAX_VALUE.
 
Don't get me started about those stupid light bulbs.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!