
Load Factor Queries

 
Vaibhav Gargs
Ranch Hand
Posts: 116
I have some doubts regarding load factors in collections:

1. In the case of ArrayList, is the load factor 1? Why is that so?

2. In the case of HashMap etc., the load factor is 0.75. What is the reason? Why is it not 1?

3. An ArrayList's capacity is increased by a factor of 1.5, while a HashMap's is increased by a factor of 2. What is the rationale behind this?
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
Vaibhav Gargs wrote: 1. In the case of ArrayList, is the load factor 1? Why is that so?

Load factors are not applicable to lists. Lists simply get filled until they are full and then they are resized. Otherwise it would just be a waste of space.

2. In the case of HashMap etc., the load factor is 0.75. What is the reason? Why is it not 1?

Hash tables become very slow when they are almost full. Hash tables more or less "guess" a location to insert an item, and they have to keep looking if that location is already occupied*. The fuller the table is, the more time it takes to find a spot for the item. 0.75 is a trade-off between speed and efficient use of memory.

3. An ArrayList's capacity is increased by a factor of 1.5, while a HashMap's is increased by a factor of 2. What is the rationale behind this?

Where did you get this? Even if it were true, it would be an implementation detail and not part of the official specification.

*Reality is more complex. Hash tables are very sophisticated and use a lot of mathematical tricks to do what they do. I can't explain it in this post.
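
For illustration, here is a minimal sketch of that effect using a toy open-addressing table. Note that java.util.HashMap itself resolves collisions with per-bucket chains rather than by probing neighbouring slots, so the class below (the names and sizes are my own) only demonstrates the general point that inserts get slower as the table fills up:

import java.util.Random;

// Toy open-addressing table: the fuller it gets, the more slots have to be
// probed before a free one is found. This is an analogy, not how
// java.util.HashMap is actually implemented.
public class ProbeDemo {

    public static void main(String[] args) {
        int capacity = 1 << 16;                    // 65,536 slots
        boolean[] occupied = new boolean[capacity];
        Random random = new Random(42);
        int inserted = 0;

        for (int targetPercent = 50; targetPercent <= 95; targetPercent += 15) {
            long probes = 0;
            int insertedThisBand = 0;
            while (100L * inserted / capacity < targetPercent) {
                int slot = random.nextInt(capacity);   // the "guessed" location
                probes++;
                while (occupied[slot]) {               // keep looking while occupied
                    slot = (slot + 1) % capacity;
                    probes++;
                }
                occupied[slot] = true;
                inserted++;
                insertedThisBand++;
            }
            System.out.printf("filling up to %d%%: %.1f probes per insert%n",
                    targetPercent, (double) probes / insertedThisBand);
        }
    }
}

On a typical run the probes-per-insert figure climbs sharply once the fill level approaches 100%, which is the trade-off the 0.75 default is balancing against memory use.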
 
Vaibhav Gargs
Ranch Hand
Posts: 116
Stephan van Hulst wrote:
Hash tables become very slow when they are almost full. Hash tables more or less "guess" a location to insert an item, and they have to keep looking if that location is already occupied*. The fuller the table is, the more time it takes to find a spot for the item. 0.75 is a trade-off between speed and efficient use of memory.



Thank you Stephan!

Can you please explain this in more detail? Wouldn't it be better if the load factor were 1 or 0.9? Only when the HashMap is actually full, i.e. at a load factor of 1, should it do the rehashing, because rehashing is also a time-consuming process. What do we actually achieve by having a load factor of 0.75?
 
Campbell Ritchie
Marshal
Posts: 56570
Have you read the documentation for HashMap? I would have thought that explains the load factor. If that isn't clear to you, please ask again.
 
Vaibhav Gargs
Ranch Hand
Posts: 116
Campbell Ritchie wrote:Have you read the documentation for HashMap? I would have thought that explains the load factor. If that isn't clear to you, please ask again.


Thank you Campbell!

As per the Javadoc, "Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put)."

I am just wondering: if we have a default HashMap of capacity 16, then a load factor of 0.75 means it will rehash on the 13th entry, while we still have 4 empty buckets. So, if we performed rehashing only after filling all 16 buckets, would it really hurt performance that much, i.e. would those 4 buckets have such a huge impact on performance?
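
(The threshold being referred to comes from the documented rule that the table is rehashed when the number of entries exceeds the product of the load factor and the current capacity. A throwaway sketch of that arithmetic, with made-up class and method names:)

public class ThresholdDemo {

    // Entry count above which a table with the given capacity and load
    // factor is resized: one more entry than this triggers the rehash.
    static int resizeThreshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // Default HashMap settings: capacity 16, load factor 0.75
        System.out.println(resizeThreshold(16, 0.75f)); // 12, so the 13th entry causes a resize
        System.out.println(resizeThreshold(32, 0.75f)); // 24
    }
}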
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
Yes. If the hash codes are spread uniformly and the map has a size of 12 and a capacity of 16, the chance of 4 consecutive collisions is (12/16)^4, or about 32%. If we first double the capacity of the map, the chance of 4 consecutive collisions becomes (12/32)^4, or about 2%.

As you can see, before doubling there was a one-in-three chance of the map completely falling back to linear performance. Doubling the capacity of the map improved this situation drastically, allowing the map to keep performing in constant time.

Now, you can say that linear performance doesn't matter to you, that it's fast enough and that you don't want to waste space, but then why use a HashMap at all? In that case you need a TreeMap or a List.
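
Just to show where those percentages come from, the arithmetic as stated can be reproduced in a couple of lines (whether the model behind it is right is picked up further down the thread):

public class CollisionOdds {

    public static void main(String[] args) {
        // Chance that 4 uniformly "guessed" buckets in a row are all occupied,
        // first with 12 of 16 buckets in use, then after doubling to 32 buckets.
        System.out.printf("capacity 16: %.2f%%%n", Math.pow(12.0 / 16, 4) * 100); // about 31.64
        System.out.printf("capacity 32: %.2f%%%n", Math.pow(12.0 / 32, 4) * 100); // about 1.98
    }
}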
 
Campbell Ritchie
Marshal
Posts: 56570
I think you are going to have to do some profiling. Create 100,000 HashMap objects (in a List) with default initial capacity and load factor of 75%. Remember that the load factor is not 0.75 but 0.75f, so you will have precision issues in large Maps.
Start with the nanoseconds method. Create your 100,000 Map instances and see how long that takes. Now populate the Maps with not more than ten pairs each, clear them all, and populate them again, timing them the second time round.
The reason for using nanoTime thrice (not twice) is to allow for the duration of the nanoTime call. The reason for populating the Maps twice is to allow any just in time optimisations to kick in. You will have to run the program several times because the timings will be different each time. Try populating the Maps with 10, 11, 12 and 13 pairs, and see whether there is a large difference.
I do not know whether the Maps will be resized and rehashed before or after the 11th or 12th “putting”, but they will double their capacities; if you have overfilled them earlier they will have capacities of 32 and you won't see any effects of resizing.
Tell us what the figures look like.

I suggest you use the same pairs for every Map, so you won't run out of heap space.
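
A minimal sketch of that experiment might look like the following (the class name, the String keys and the plain nanoTime loop are my own choices; a proper harness such as JMH would give more trustworthy figures, but this is enough to compare 10, 11, 12 and 13 pairs):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough timing experiment along the lines described above; run it several
// times, since the figures will vary from run to run.
public class LoadFactorTiming {

    static final int MAP_COUNT = 100_000;

    public static void main(String[] args) {
        for (int pairCount = 10; pairCount <= 13; pairCount++) {
            // The same pairs are reused for every map, to save heap space.
            String[] keys = new String[pairCount];
            String[] values = new String[pairCount];
            for (int i = 0; i < pairCount; i++) {
                keys[i] = "key" + i;
                values[i] = "value" + i;
            }

            List<Map<String, String>> maps = new ArrayList<>(MAP_COUNT);
            long t0 = System.nanoTime();
            for (int i = 0; i < MAP_COUNT; i++) {
                maps.add(new HashMap<>(16, 0.75f));    // default capacity and load factor
            }
            long t1 = System.nanoTime();

            populate(maps, keys, values);              // first pass, lets JIT optimisations kick in
            maps.forEach(Map::clear);
            long t2 = System.nanoTime();
            populate(maps, keys, values);              // second pass, the one we report
            long t3 = System.nanoTime();

            System.out.printf("%d pairs: creating maps %d ms, second population %d ms%n",
                    pairCount, (t1 - t0) / 1_000_000, (t3 - t2) / 1_000_000);
        }
    }

    static void populate(List<Map<String, String>> maps, String[] keys, String[] values) {
        for (Map<String, String> map : maps) {
            for (int i = 0; i < keys.length; i++) {
                map.put(keys[i], values[i]);
            }
        }
    }
}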
 
Campbell Ritchie
Marshal
Posts: 56570
A few minutes ago, I wrote:. . . The reason for using nanoTime thrice (not twice) is to allow for the duration of the nanoTime call. . . .
I tried it, and three such nanoTime calls added up to −75 ns.
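
For anyone wanting to reproduce that, here is one way to take the three readings (this is only my reading of the measurement being described; timer granularity alone can make the difference come out slightly negative):

public class NanoTimeOverhead {

    public static void main(String[] args) {
        // Three consecutive readings bracket two empty intervals; comparing the
        // two gaps gives a feel for the cost and jitter of nanoTime itself.
        long t0 = System.nanoTime();
        long t1 = System.nanoTime();
        long t2 = System.nanoTime();
        System.out.println("gap 1: " + (t1 - t0) + " ns");
        System.out.println("gap 2: " + (t2 - t1) + " ns");
        System.out.println("gap 1 minus gap 2: " + ((t1 - t0) - (t2 - t1)) + " ns");
    }
}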
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
You can eliminate the rehashing from the equation by creating two sets of maps: One of new HashMap<>(32) and one of new HashMap<>(16, 1). Fill each map with 12 elements, and then benchmark how much time it takes to add 4 elements to both sets of maps. No resizing or rehashing will be done at all.
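
A sketch of that comparison, reusing the rough nanoTime approach from above (class and key names are mine). With 16 entries per map, neither variant ever exceeds its resize threshold (24 for the capacity-32 maps, 16 for the capacity-16 maps with load factor 1), so any timing difference is down to collisions rather than rehashing:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Compares adding 4 more entries to maps that already hold 12, where neither
// variant resizes: one set starts with a capacity of 32, the other with a
// capacity of 16 and a load factor of 1.
public class NoResizeComparison {

    static final int MAP_COUNT = 100_000;
    static final String[] KEYS = new String[16];
    static {
        for (int i = 0; i < KEYS.length; i++) {
            KEYS[i] = "key" + i;
        }
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> roomy = createAndPrefill(() -> new HashMap<>(32));
        List<Map<String, Integer>> crowded = createAndPrefill(() -> new HashMap<>(16, 1f));

        System.out.println("capacity 32:       " + timeAddingFour(roomy) / 1_000_000 + " ms");
        System.out.println("capacity 16, lf 1: " + timeAddingFour(crowded) / 1_000_000 + " ms");
    }

    static List<Map<String, Integer>> createAndPrefill(Supplier<Map<String, Integer>> factory) {
        List<Map<String, Integer>> maps = new ArrayList<>(MAP_COUNT);
        for (int i = 0; i < MAP_COUNT; i++) {
            Map<String, Integer> map = factory.get();
            for (int k = 0; k < 12; k++) {
                map.put(KEYS[k], k);          // 12 entries: below both resize thresholds
            }
            maps.add(map);
        }
        return maps;
    }

    static long timeAddingFour(List<Map<String, Integer>> maps) {
        long start = System.nanoTime();
        for (Map<String, Integer> map : maps) {
            for (int k = 12; k < 16; k++) {
                map.put(KEYS[k], k);          // 4 more entries: still no resizing in either variant
            }
        }
        return System.nanoTime() - start;
    }
}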
 
Piet Souris
Master Rancher
Posts: 2044
Stephan van Hulst wrote:Yes. If the hash codes are spread uniformly and the map has a size of 12 and a capacity of 16, the chance of 4 consecutive collisions is (12/16)^4, or about 32%. If we first double the capacity of the map, the chance of 4 consecutive collisions becomes (12/32)^4, or about 2%.

As you can see, before doubling there was a one-in-three chance of the map completely falling back to linear performance. Doubling the capacity of the map improved this situation drastically, allowing the map to keep performing in constant time. (...)

Is that correct? I'm certainly no HashMap expert, but for the map to show linear behaviour, the four items would all have to end up in the same bucket. Now, if I'm right, the chance is more like 12/16 * (1/16)^3, which is much less than 1/3.
 
Vaibhav Gargs
Ranch Hand
Posts: 116
Stephan van Hulst wrote:Yes. If the hash codes are spread uniformly and the map has a size of 12 and a capacity of 16, the chance of 4 consecutive collisions is (12/16)^4, or about 32%. If we first double the capacity of the map, the chance of 4 consecutive collisions becomes (12/32)^4, or about 2%.


Stephan, can you please elaborate on this calculation? I don't understand how 12/16^4 comes into the picture.
 
Campbell Ritchie
Marshal
Posts: 56570
Vaibhav Gargs wrote:. . . 12/16^4 . . .
(12/16)^4, please, or better (12 ÷ 16)⁴ (use &#x2074;).
When you put the first pair, there is no chance of a collision.
The second pair has a 1 in 16 chance of a collision.
The third pair has a 1 in 16 chance of being a second collision (three pairs in the same bucket) or a 1 in 8 chance of being a first collision.
The fourth pair has a 1 in 16² chance of being a third collision, a 1 in 16 chance of being a second collision, and a 1 in 16 chance of being a first collision.
etc. There must be a simpler way to calculate that statistically.
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
Ahh that's a fair point Piet. My calculation is indeed off.

I just thought of a different consideration: if you put 16 elements into a map with a capacity of 16 and a load factor of 1, the chance of filling each bucket with exactly one element is 16! / 16^16, or about 0.0001%. So even with a load factor of 1 you will almost certainly have wasted space, because at least one bucket will contain multiple elements and at least one bucket will be empty. I suppose there is a sweet spot where the expected number of buckets wasted by collisions equals the number of buckets left unused because of the load factor.
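
That figure is easy to check with a throwaway calculation:

public class PerfectSpreadOdds {

    public static void main(String[] args) {
        // Probability that 16 uniformly hashed keys land in 16 distinct buckets:
        // 16/16 * 15/16 * 14/16 * ... * 1/16, i.e. 16! / 16^16.
        double probability = 1.0;
        for (int freeBuckets = 16; freeBuckets >= 1; freeBuckets--) {
            probability *= freeBuckets / 16.0;
        }
        System.out.printf("%.7f%%%n", probability * 100);   // prints roughly 0.0001134%
    }
}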
 
Campbell Ritchie
Marshal
Posts: 56570
Maybe we shouldn't think of a bucket as being wasted because it contains a collision; if collisions are infrequent and buckets contain at most two or three pairs, nobody is going to notice that performance is slower.
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
I'm not saying buckets that contain more than one element are wasted, I'm saying buckets that contain no elements before the resize happens are wasted.
 
Campbell Ritchie
Marshal
Posts: 56570
Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaah!
That is an unavoidable result of using a load factor < 100%.
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
My point is that it's almost unavoidable even with a load factor of 100%.
 
Vaibhav Gargs
Ranch Hand
Posts: 116
Thank you Stephan & Campbell!

Stephan van Hulst wrote:My point is that it's almost unavoidable even with a load factor of 100%.


@Stephan, is that really feasible?
 
Paul Clapham
Sheriff
Posts: 22832
Stephan van Hulst wrote:My point is that it's almost unavoidable even with a load factor of 100%.


That's called a "perfect hash function" if I'm not mistaken. Wikipedia has a page about that -- apparently there must be applications where having a perfect hash function is important. I guess those applications must be using the hash function to such an extent that it's important for it to execute in constant time.
 