Slow performance in hash-based collections

 
Afra Tamam
Greenhorn
Posts: 4
In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using different hash code distributions for the keys could help?
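
For reference, a bare-bones reproduction looks something like this (a sketch only; the Integer/String key and value types are placeholders, since the post doesn't show the real ones):

import java.util.HashMap;
import java.util.Map;

public class BigMapDemo {
    public static void main(String[] args) {
        // Default HashMap: capacity 16, load factor 0.75, so it will
        // resize repeatedly on the way to 26 million entries.
        Map<Integer, String> map = new HashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < 26_000_000; i++) {
            map.put(i, "value");
            // Print the elapsed time every million puts to see
            // whether the rate really degrades.
            if (i % 1_000_000 == 0) {
                System.out.printf("%,d entries after %.1f s%n",
                        i, (System.nanoTime() - start) / 1e9);
            }
        }
    }
}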
 
Saifuddin Merchant
Ranch Hand
Posts: 607

Afra Tamam wrote:In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using different hash code distributions for the keys could help?



Inserts (put) should run at a constant rate and should not depend on how many entries you already have (or so I think). Are you running into some memory conditions? Or are you using the default hash code function that comes with Java? For 26 million entries you would do well to write a domain-specific hash key.

P.S. If you override equals, you need to override hashCode.
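
For example, a typical pair looks something like this (a sketch only; the class and fields are made up, since we haven't seen the real key type):

import java.util.Objects;

public final class AccountKey {
    private final String region;
    private final long accountId;

    public AccountKey(String region, long accountId) {
        this.region = region;
        this.accountId = accountId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof AccountKey)) return false;
        AccountKey other = (AccountKey) o;
        return accountId == other.accountId
                && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        // Combines every field used in equals, so equal keys always
        // produce equal hash codes and unequal keys spread out.
        return Objects.hash(region, accountId);
    }
}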
 
Campbell Ritchie
Marshal
Posts: 65823

Afra Tamam wrote: . . . Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.
. . .

That means you have got a poor-quality hashCode method which returns the same hash code for many instances of the key type “K”. You can read about hash code methods in Joshua Bloch's Effective Java™ chapter 3, or google for Angelika Langer Java hashCode equals. Please show us your hashCode method.
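
For example, something like this is legal and honours the equals/hashCode contract, yet ruins a HashMap (a deliberately bad, hypothetical example):

@Override
public int hashCode() {
    return 42; // every instance hashes to the same bucket, so each
               // put and get degrades to a linear scan of that bucket
}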
 
Campbell Ritchie
Marshal
Posts: 65823

Saifuddin Merchant wrote: . . . or are you using the default hash code function that comes with Java? . . .

You mean the hashCode method inherited un‑overridden from java.lang.Object? Since that provides a different hash code for most objects, at least until you reach 2³² of them, that is very unlikely to be the problem.
 
Ulf Dittmer
Rancher
Posts: 43011

Campbell Ritchie wrote:

Afra Tamam wrote: . . . Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.
. . .

That means you have got a poor-quality hashCode method which returns the same hash code for many instances of the key type “K”.


I'm not sure that's necessarily the reason. We know nothing about the objects being inserted - if you insert 2 million objects each of a 1K size, there could be slowdowns in the memory system. Also, it doesn't sound as though the time an insertion takes is exactly what was measured - just that the overall rate of code execution slowed down a lot.
 
Saifuddin Merchant
Ranch Hand
Posts: 607

Campbell Ritchie wrote:

Saifuddin Merchant wrote: . . . or are you using the default hash code function that comes with Java? . . .

You mean the hashCode method inherited un‑overridden from java.lang.Object? Since that provides a different hash code for most objects, at least until you reach 2³² of them, that is very unlikely to be the problem.



Isn't that particularly bad for large inputs? Since each object gets mapped to its own bucket, wouldn't we end up creating a huge number of buckets?

Isn't a hash function bad
if it returns the same value 'K' for every input X, or
if it returns a different value 'Ki' for every input Xi?
 
Campbell Ritchie
Marshal
Posts: 65823

Ulf Dittmer wrote: . . . if you insert 2 million objects each of a 1K size there could be slowdowns in the memory system. . . .

But you don't insert objects into a Map. You insert references to them. You will have ceil(log₂((26000000 ÷ 0.75) ÷ 16)) = 22 rounds of rehashing and resizing the Map, unless it was given an initial capacity of 26000000. With a 75% load factor, that would make the Map expand to capacity 67108864 = 2²⁶, and there would be no rehashing. But the rehashes become less frequent as the Map enlarges, so that should not affect insertion speed.
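
That arithmetic can be checked with a short sketch (assuming the classic doubling behaviour: default capacity 16, load factor 0.75):

public class ResizeCount {
    public static void main(String[] args) {
        final double loadFactor = 0.75;
        final int entries = 26_000_000;
        int capacity = 16; // HashMap's default initial capacity
        int resizes = 0;
        // The Map doubles its capacity whenever the entry count
        // exceeds capacity * loadFactor.
        while (entries > capacity * loadFactor) {
            capacity <<= 1;
            resizes++;
        }
        // Prints: 22 resizes, final capacity 67108864
        System.out.println(resizes + " resizes, final capacity " + capacity);
    }
}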

If the hashCode method is correctly overridden and actually provides a good distribution of hashes, then you are right that the delay is caused by something independent of the Map. But we cannot tell unless we see the code.
 
Campbell Ritchie
Marshal
Posts: 65823

Saifuddin Merchant wrote: . . . Isn't that particularly bad for large inputs? Since each object gets mapped to its own bucket, wouldn't we end up creating a huge number of buckets?

Not at all. Your buckets are part of an array and arrays support random access; it is as fast to access element 9348793 as element 0, assuming the array is that large.

Isn't a hash function bad
if it returns the same value 'K' for every input X, or
if it returns a different value 'Ki' for every input Xi?

No. It is very bad always to return the same result. If you find the Joshua Bloch reference I gave you earlier, it says that reduces the performance of a Map from constant time to linear time, i.e. it behaves as a linked list. He says that can make the difference between an app that works and one that doesn't. A correct hash function will (as far as possible) return a different Ki for each Xi.

I am assuming the default load factor of 0.75f in all my calculations.
Are there performance problems after 2000000 insertions? At that stage the Map would have pow(2, ceil(log₂(2000000 ÷ 0.75))) capacity, which is 4Mi. On a 32‑bit machine that means it occupies about 16MiB in memory, so one would need 16MiB of contiguous heap space, and one would probably undergo a sequence of garbage collections, with more garbage collection cycles as the Map enlarges. There is also the possibility, as Ulf suggested, that those objects are occupying so much heap space that the JVM is running short and needs multiple garbage collections. The figures I worked out earlier suggest one requires about 256MiB of space at the maximum capacity.
We can only help if we know more details about heap space, size of objects in the Map, the hashCode method, etc.
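
Some of those heap figures can be read from inside the program with the standard Runtime API (a sketch; -verbose:gc on the command line gives fuller detail):

public class HeapCheck {
    public static void main(String[] args) {
        // Rough heap figures: how much of the allowed heap is in use.
        Runtime rt = Runtime.getRuntime();
        long usedMiB = (rt.totalMemory() - rt.freeMemory()) >> 20;
        long maxMiB = rt.maxMemory() >> 20;
        System.out.println("heap in use: " + usedMiB + " MiB of " + maxMiB + " MiB");
    }
}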
 
Campbell Ritchie
Marshal
Posts: 65823

Earlier, I wrote: . . . If you find the Joshua Bloch reference I gave you earlier, it says that reduces the performance of a Map from constant time to linear time, . . .

I think that wasn't quite accurate. I think Bloch simply says it reduces performance to that of a linked list.

That means linear time instead of constant time. It applies to get calls and to put calls alike, since a put must first search the whole bucket for an existing equal key before it can add a new entry.
 
Ulf Dittmer
Rancher
Posts: 43011

Campbell Ritchie wrote:

Ulf Dittmer wrote: . . . if you insert 2 million objects each of a 1K size there could be slowdowns in the memory system. . . .

But you don't insert objects into a Map. You insert their reference.


I guess I wasn't as clear as I should have been. I'm not talking about the performance of the hashing at all; I'm talking about having 2GB worth of objects. So when I said "if you insert 2 million objects each of a 1K size" I really meant "if you create 2 million objects each of a 1K size (references to which then get inserted into a HashMap)".
 
Campbell Ritchie
Marshal
Posts: 65823
Yes, I see what you mean. I think we do actually agree, but didn't explain it very well. 2000000 × 1k will overwhelm the heap space on most computers.
26000000 × 1k will probably not fit into an ordinary computer's available RAM, and would require paging.
 
Bartender
Posts: 10777

Campbell Ritchie wrote:...unless it was given an initial capacity of 26000000. With a 75% load factor, that will mean the Map expands to capacity 67108864 = 2²⁶, and there would be no rehashing.


Actually, unless they've corrected the code since version 6, there will be one rehash (a lengthy one that requires a fair bit of extra space) if you set the initial capacity to 26,000,000.

OTOH, if you set it to 35,000,000 (≈ 26M ÷ 0.75), there won't.

However, until Afra comes back with some answers to our questions, I suspect the point is moot.

Winston
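
The usual pre-sizing idiom along those lines would be something like this (a sketch; the Integer/String entry types are placeholders):

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expected = 26_000_000;
        // Keep expected <= capacity * 0.75 so no resize is ever needed.
        int initialCapacity = (int) (expected / 0.75f) + 1; // about 34.7 million
        Map<Integer, String> map = new HashMap<>(initialCapacity);
        // HashMap rounds the capacity up to the next power of two,
        // here 2^26 = 67,108,864, so insertion never triggers a rehash.
    }
}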
 
Campbell Ritchie
Marshal
Posts: 65823
Yes, I think you are correct there. Sorry for the mistake.
 
Rancher
Posts: 2759
I agree with Ulf here. Before we go blaming the hashing implementation, I would like to look at one basic thing: are you running out of memory? Remember that when any Java app is on the brink of an OutOfMemoryError it starts getting slower, mainly because the GC is constantly running, looking for objects to collect. Actually, in most cases, Java apps become slow long before they run out of memory. I've seen applications that have run for days at 10% of their normal throughput before they reported OOM.
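
One way to test that theory is to watch the collector counts while the program runs (a sketch using the standard java.lang.management API):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatch {
    public static void main(String[] args) {
        // If these numbers climb rapidly while throughput drops, the
        // app is probably thrashing the collector, not hashing badly.
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
    }
}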
 