Hash Tables

 
John Lockheart
Ranch Hand
Posts: 117
I need to create my own HashTable class. I also need to read in about 40,000 words from a text file and store them in the hash table. For efficiency, should my table size be about double the number of words I need to store, so that it's only about half full? And if so, should I then search for the closest prime number and set the table to that size, so that all buckets can be used?

P.S. I use double hashing; the first function hashes to key % tableSize, the second to key % (tableSize - 1).
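
For concreteness, here is a minimal sketch of that scheme in Java (my illustration, not the poster's actual code). One caveat: key % (tableSize - 1) can come out to 0, which would make the probe loop spin in place, so this sketch uses the common textbook variant 1 + key % (tableSize - 1) to guarantee a nonzero step.

// Open addressing with double hashing. Assumes nonnegative keys, a prime
// tableSize, and a table that is never allowed to become completely full.
class LongHashTable {
    private final long[] slots;  // 0 marks an empty slot, so key 0 is not storable in this toy
    private final int tableSize;

    LongHashTable(int tableSize) {  // tableSize should be prime
        this.tableSize = tableSize;
        this.slots = new long[tableSize];
    }

    void put(long key) {
        int index = (int) (key % tableSize);            // first hash: bucket number
        int step  = (int) (1 + key % (tableSize - 1));  // second hash: step size, never 0
        while (slots[index] != 0 && slots[index] != key) {
            index = (index + step) % tableSize;         // probe the next slot
        }
        slots[index] = key;
    }

    boolean contains(long key) {
        int index = (int) (key % tableSize);
        int step  = (int) (1 + key % (tableSize - 1));
        while (slots[index] != 0) {
            if (slots[index] == key) return true;
            index = (index + step) % tableSize;
        }
        return false;
    }
}

To store words rather than numbers, a key could be derived with something like word.hashCode() & 0x7fffffff to keep it nonnegative.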
 
Kaydell
Ranch Hand
Posts: 694
Originally posted by John Lockheart:
I need to create my own HashTable class.

I don't understand why you can't use one of the hashing classes in the Java API.

I don't understand double-hashing.

Kaydell
 
John Lockheart
Ranch Hand
Posts: 117
I guess the point is to understand the inner workings, and not just to use something I don't understand. As for double hashing, you can look it up on Wikipedia; it's not complicated at all. As for my questions, are you able to give an answer?

1) Should the table size always be a prime number?
2) Should the table's size be double the amount of information being stored in it, so it is only half full?
3) Double hashing: two hash functions, the first for the bucket number, the second for the step size. When calculating the step size using modulo arithmetic, does it matter how you find the step size? Can I just copy the first function but subtract 1 from the table size?
 
Ranch Hand
Posts: 47
Why not look at the source code to see how the hash table works, instead of building it over again? :-)
 
John Lockheart
Ranch Hand
Posts: 117
Haha, alright! I get what everyone's saying, but I don't have a choice. That's the way I did it, and it's already done. Now, does anyone have answers to my questions?
 
Paul Clapham
Sheriff
Posts: 28407

Originally posted by John Lockheart:
1) Should the table size always be a prime number?

It depends. You want to arrange things so that your hashing algorithm will generate every number between 1 and the table size. So if you're generating an arbitrary number and taking it modulo the table size, it doesn't matter what the table size is.

2) Should the table's size be double the amount of information being stored in it, so it is only half full?

No, ideally the table should be completely full. But you probably won't achieve a perfect hash function.

3) Double hashing: two hash functions, the first for the bucket number, the second for the step size. When calculating the step size using modulo arithmetic, does it matter how you find the step size? Can I just copy the first function but subtract 1 from the table size?

Since the constant function "1" works perfectly well for the step size, I would say it doesn't matter how you find it.
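
To see why the interaction between step size and table size matters, here is a toy demonstration (my example, not from the thread): when the step shares a common factor with the table size, the probe sequence can only ever reach a fraction of the slots, whereas with a prime table size every step from 1 to tableSize - 1 covers the whole table.

public class ProbeCoverage {
    public static void main(String[] args) {
        printCoverage(12, 4);  // gcd(12, 4) = 4: only 3 of 12 slots are reachable
        printCoverage(13, 4);  // 13 is prime: all 13 slots are reachable
    }

    static void printCoverage(int tableSize, int step) {
        boolean[] visited = new boolean[tableSize];
        int index = 0;
        for (int i = 0; i < tableSize; i++) {
            visited[index] = true;
            index = (index + step) % tableSize;  // same probing rule as double hashing
        }
        int count = 0;
        for (boolean v : visited) {
            if (v) count++;
        }
        System.out.println("tableSize=" + tableSize + ", step=" + step
                + " -> " + count + " of " + tableSize + " slots reachable");
    }
}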
 
Jim Yingst
Wanderer
Posts: 18671
[John]: 2)Should the table's size be double the amount of information being stored in it so it is only half full?

[Paul]: No, ideally the table should be completely full. But you probably won't achieve a perfect hash function.


Hm, really? Seems like filling the hashtable to 100% would result in needing more and more probes as you go, since the slots would be increasingly likely to already be taken. The last few calls to put() would be essentially O(n) operations, and the same goes for get() on the same keys. I guess this is a trade-off between speed and memory, and one reason to use double hashing in the first place is to minimize memory, so maybe it makes sense in that context. Still, it seems a bit odd to me; I'd have expected half full to provide much better performance, and still give better memory usage than a regular hashtable.

Does it just depend on your priorities, or am I missing something?
 
Ulf Dittmer
Rancher
Posts: 43081
I think the only numbers I've ever seen about how many collisions are likely for a hash table filled to x% are from Wirth's "Algorithms and Data Structures", and that's about 30 years old now. I took away from it that even a hash table 90% full has on average just a few collisions, so that's the load factor I generally use.
 
Paul Clapham
Sheriff
Posts: 28407

Originally posted by Jim Yingst:
[Paul]: No, ideally the table should be completely full. But you probably won't achieve a perfect hash function.

Hm, really? Seems like filling the hashtable to 100% would result in needing more and more probes as you go, since the slots would be increasingly likely to already be taken.

A "perfect" hash function is one where you have N items to hash and N slots to hash them into, and each of the N items hashes directly into a different one of the N slots. You don't get one of those without a lot of up-front tinkering with your hash function, and you have to know in advance what the N items are. That isn't a common scenario. That's what I meant. I have no idea how much empty space you should budget for in a real-life hashtable, though I'm sure the phrase "it depends" applies.
 
Jim Yingst
Wanderer
Posts: 18671
Paul, thanks; I'd forgotten that definition of perfect.

Ulf, do you remember if that discussion was for double hashing, or single? Seems to me that at 90% load using double hashing, you'd typically have around 9 collisions for a put() or for a failed get(), assuming an imperfect but uniform hash function.
 
Ulf Dittmer
Rancher
Posts: 43081
I think it was for single hashing, and the number of collisions was just shy of 5, IIRC. Not sure if I still have the book, but I'll dig around.
 
Jim Yingst
Wanderer
Posts: 18671
Sedgewick gives some formulas for the expected number of probes using double hashing with an "independent" double hash function for a load factor a: 1/(1-a) for an unsuccessful search, and -ln(1-a)/a for a successful one. The first seems to agree with what I said earlier: for a = .9 we get an average of 9 collisions before the 10th probe returns null. I don't see where the formula for successful searches comes from (Sedgewick doesn't show the derivation), but assuming it's accurate, at a = .9 we get 2.56 probes on average. Not too bad, but things are considerably better at a = .5, and they're pretty much horrible as we approach a = 1. I'd prefer to be around a = .5 unless memory is really running short.
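
Plugging a few load factors into those formulas makes the trade-off concrete (a quick sketch of mine, using exactly the two formulas quoted above):

public class ProbeEstimates {
    public static void main(String[] args) {
        // Sedgewick's expected probe counts for double hashing at load factor a:
        // 1/(1-a) for an unsuccessful search, -ln(1-a)/a for a successful one.
        for (double a : new double[] {0.5, 0.75, 0.9, 0.99}) {
            double unsuccessful = 1.0 / (1.0 - a);
            double successful = -Math.log(1.0 - a) / a;
            System.out.printf("a=%.2f  unsuccessful=%6.1f  successful=%5.2f%n",
                    a, unsuccessful, successful);
        }
    }
}

At a = 0.5 an unsuccessful search averages 2 probes, at a = 0.9 it averages 10 (the 9 collisions mentioned above), and at a = 0.99 it averages 100.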
 
Ulf Dittmer
Rancher
Posts: 43081
I just noticed that Wirth's book "Algorithms and Data Structures" is available for free online. Chapter 5 on hashing includes the analysis I was referring to; it's an interesting read, and just 6 pages long.
 