Hey, I think I created the "almost" perfect hash function

 
Jeronimo Backes
Greenhorn
Posts: 29
I was fiddling with methods of validating the contents of Strings using checksums, and I ended up combining some things that already exist in the JDK. For those interested, here is the code:

I would also like to hear from you if you tested and tried to find any issue with it.

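(The original listing is not reproduced above. Purely as an illustration of the general idea discussed in this thread, namely combining pieces that already ship with the JDK such as String.hashCode() and java.util.zip.CRC32 into one 64-bit value, here is a minimal sketch. The class and method names are made up, and this is not necessarily the author's actual code.)

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public final class StringChecksum {

    // Illustrative only: packs String.hashCode() into the high 32 bits and a
    // CRC-32 of the UTF-8 bytes into the low 32 bits, giving a 64-bit value
    // that collides far less often than either 32-bit value on its own.
    public static long checksum(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return ((long) s.hashCode() << 32) | (crc.getValue() & 0xFFFFFFFFL);
    }
}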

I've run many tests with random strings of random length and could not find a collision yet. This is one of the outputs I saved:

After 50,576,157 loops, Checksum got 0 duplicates( 0.0%)
After 50,576,157 loops, Hash got 252,668 duplicates( 0.026834579244767346%)


The output using random strings of fixed length showed a few collisions:
After 50,000,000 loops, Checksum got 2 duplicates( 4.0E-6%)
After 50,000,000 loops, Hash got 289750 duplicates( 0.5795%)

My tests generated so many strings that I had to run them with -Xms12G -Xmx12G.
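(For context, a rough harness along the following lines can produce duplicate counts like the ones quoted above. It is a sketch, not the author's test code: the alphabet, lengths, and loop count are illustrative assumptions, and the hash under test is plain String.hashCode(), which you can swap for the combined checksum sketched earlier.)

import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.ToLongFunction;

public class CollisionTest {
    public static void main(String[] args) {
        // Hash under test: plain String.hashCode() here; swap in the combined
        // checksum from the sketch above (or CRC32, SHA-1, ...) to compare.
        ToLongFunction<String> hashUnderTest = s -> s.hashCode();

        Random rnd = new Random();
        Set<String> inputs = new HashSet<>();   // remembers generated strings
        Set<Long> hashes = new HashSet<>();     // remembers hash values seen so far
        long loops = 1_000_000;                 // scale up as heap allows
        long duplicates = 0;

        for (long i = 0; i < loops; i++) {
            int len = 1 + rnd.nextInt(30);                      // random length, 1..30
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + rnd.nextInt(26)));      // random lowercase chars
            }
            String s = sb.toString();
            if (!inputs.add(s)) {
                continue;   // same string generated twice: not a hash collision
            }
            if (!hashes.add(hashUnderTest.applyAsLong(s))) {
                duplicates++;
            }
        }
        System.out.printf("After %,d loops, got %,d duplicates (%f%%)%n",
                loops, duplicates, 100.0 * duplicates / inputs.size());
    }
}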
 
Pat Farrell
Rancher
Posts: 4803
What are your criteria for "perfect"?

If you are just trying to implement good values for hashCode(), using a CRC is serious overkill. An important criterion for most real-world applications is that the calculation of the hash has to be fast. If it takes too long, you could just sequentially search the list rather than hashing into it. Just because it's O(1) doesn't mean it's actually useful.
 
Jeronimo Backes
Greenhorn
Posts: 29

Pat Farrell wrote:What are your criteria for "perfect"?

If you are just trying to implement good values for hashCode(), using a CRC is serious overkill. An important criterion for most real-world applications is that the calculation of the hash has to be fast. If it takes too long, you could just sequentially search the list rather than hashing into it. Just because it's O(1) doesn't mean it's actually useful.



A perfect hash function is one that never maps two or more inputs to the same value, so there is no possibility of collisions. Such functions need to know the possible inputs in advance (e.g. EnumMap and EnumSet), which is not viable for arbitrary strings.
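(As an aside, a small illustration of that "known inputs in advance" point, not taken from the post: an EnumMap can index its entries directly by each constant's ordinal(), which is effectively a collision-free, i.e. perfect, mapping because the full key set is fixed at compile time.)

import java.util.EnumMap;

public class EnumMapDemo {
    enum Status { NEW, ACTIVE, CLOSED }

    public static void main(String[] args) {
        // The key set is known in advance, so each key maps to its own slot.
        EnumMap<Status, String> labels = new EnumMap<>(Status.class);
        labels.put(Status.NEW, "just created");
        labels.put(Status.ACTIVE, "in progress");
        System.out.println(labels.get(Status.ACTIVE));   // prints "in progress"
    }
}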

The usage of CRC in the code I've posted is limited to very short strings.

Creating hashes for 30,000,000 strings of random length (between 1 and 30) and random characters took 5 seconds using my method, whereas String.hashCode took 2.5 seconds.

String.hashCode is twice as fast, but produces many more collisions across a massive range of random inputs.
 
Pat Farrell
Rancher
Posts: 4803

Jeronimo Backes wrote:A perfect hash function is one that never maps two or more inputs to the same value, so there is no possibility of collisions.



What design space cares about this? A normal hashCode() implementation may produce identical values for different objects, and all the usual HashSet/Map/Tree implementations handle that cleanly. If the number of collisions is "small" relative to the number of entries in the set/map/tree, life is good.
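(A small illustration of that point, not from the post: even a deliberately terrible hashCode() that always collides still gives correct HashSet behaviour; it only costs lookup speed.)

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class CollidingKeys {

    static class Key {
        final String name;
        Key(String name) { this.name = name; }

        // Every instance collides on purpose.
        @Override public int hashCode() { return 42; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && Objects.equals(name, ((Key) o).name);
        }
    }

    public static void main(String[] args) {
        Set<Key> set = new HashSet<>();
        set.add(new Key("a"));
        set.add(new Key("b"));
        set.add(new Key("a"));            // duplicate, rejected via equals()
        System.out.println(set.size());   // prints 2: collisions handled, just more slowly
    }
}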

What is the point of your attempt at a perfect hash function?
 
Marshal
Posts: 28193

Jeronimo Backes wrote:I've run many tests with random strings of random length and could not find a collision yet.



Then that tells me that your test strings are not random, and neither are their lengths. The number of Strings is so much larger than the number of ints that it isn't possible to make a hash function for Strings which is anywhere near perfect.

The maximum length of a String is 2^31-1. So if you had chosen lengths randomly, then half of the lengths you chose should have been greater than half of that number -- which is 2^30-1. You didn't do that, did you? And likewise I'm willing to bet that the chars you chose randomly to put into your Strings weren't random chars at all -- remember that there are 2^16 possible values for a char.
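(To put some rough numbers on that, purely as an illustration: length-3 Strings alone outnumber the possible int hash codes by a factor of 65,536, so by the pigeonhole principle collisions are unavoidable.)

public class PigeonholeDemo {
    public static void main(String[] args) {
        double threeCharStrings = Math.pow(65536, 3);   // 2^48 possible length-3 Strings
        double intHashCodes = Math.pow(2, 32);          // 2^32 possible int values
        System.out.printf("%.3e length-3 Strings vs %.3e int hash codes%n",
                threeCharStrings, intHashCodes);
        // Roughly 2.815e+14 vs 4.295e+09, so many Strings must share a hash code.
    }
}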
 
Jeronimo Backes
Greenhorn
Posts: 29

What design space cares about this?


Dealing with huge sets/maps/hash tables where the number of collisions influences the performance of fetch operations.
Creating checksums to verify whether or not information has changed over time (see the sketch just below).
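(A hedged sketch of that second point, not from the post: keep the last checksum per record and compare it with a freshly computed one. The ToLongFunction parameter stands in for whichever 64-bit hash you choose, e.g. the combined checksum sketched earlier.)

import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

public class ChangeDetector {
    private final ToLongFunction<String> checksum;          // the 64-bit hash in use
    private final Map<String, Long> lastSeen = new HashMap<>();

    public ChangeDetector(ToLongFunction<String> checksum) {
        this.checksum = checksum;
    }

    // Returns true if the stored checksum for this record differs from the
    // checksum of the current contents, i.e. the record has (almost certainly) changed.
    public boolean hasChanged(String recordId, String contents) {
        long current = checksum.applyAsLong(contents);
        Long previous = lastSeen.put(recordId, current);
        return previous != null && previous != current;
    }
}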

What is the point of your attempt at a perfect hash function?


It is not possible to create a perfect hash function for strings, as the number of possible combinations exceeds any reasonable numeric representation. The perfect hash of a string would be the entire contents of the string itself.
My intention here is to find a more reliable method of creating hashes for Strings, one where collisions are extremely unlikely within the execution space of a program that has to deal with billions of different Strings.

The maximum length of a String is 2^31-1. So if you had chosen lengths randomly, then half of the lengths you chose should have been greater than half of that number -- which is 2^30-1. (...) there are 2^16 possible values for a char.


Yep, there is no way to create a unique number to represent each combination if we are limited to the range of a long. The intention here is to reduce the likelihood of producing the same number for different Strings during the execution of a program to almost zero.

The code I presented does that: over 50M strings it gets 2 collisions on average when using fixed-length strings, whereas hashCode alone produces almost 300K collisions. CRC32 alone was even worse than hashCode.
In many independent runs of 50M strings with length = Math.random() * 1000, I still couldn't get a single duplicate.

My method is not a "perfect" hash of course (that is impossible, as discussed), but it does a much better job of avoiding collisions than Java's hashCode and CRC32.
 
Pat Farrell
Rancher
Posts: 4803
Have you tried SHA1 as a hash? It's probably as slow as your CRC, but it's a cryptographically strong hash, meaning it's proven to have good randomness in its results. You can simply truncate the long result to whatever length you want to store.
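(A rough sketch of that suggestion using the JDK's MessageDigest; the method name is just for illustration.)

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hash {
    // SHA-1 produces a 20-byte digest; keep only the first 8 bytes as a long.
    public static long sha1AsLong(String s) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.wrap(digest).getLong();
    }
}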

I'm not seeing why anyone would want to use your hash function on 50 million strings. There are far more space- and performance-efficient data structures; this is a well-understood field. Many, many PhD theses cover this area.
 
Jeronimo Backes
Greenhorn
Posts: 29

Pat Farrell wrote:Have you tried SHA1 as a hash? It's probably as slow as your CRC, but it's a cryptographically strong hash, meaning it's proven to have good randomness in its results. You can simply truncate the long result to whatever length you want to store.



Yep, SHA1 is reliable and so far I could not get any collisions, but it is much slower: it took 25 seconds to create hashes for 30M strings. But thank you for the suggestion; it can be very useful.
 
Bartender
Posts: 10780

Jeronimo Backes wrote:My method is not a "perfect" hash of course (that is impossible, as discussed), but it does a much better job of avoiding collisions than Java's hashCode and CRC32.


Hate to poke another hole in your theory, but all hash collections actually mangle your hash again in order to return a 'bucket index' for the table's current size. Have you actually proved that it produces fewer bucket collisions? I suspect it may, but maybe not by as much as you think.
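(For reference, this is roughly what OpenJDK's HashMap has done since Java 8: the key's hashCode() is "spread" and then masked down to a bucket index, so mostly the low-order bits decide which bucket an entry lands in. The sketch below paraphrases that logic rather than quoting the actual source.)

public class BucketIndexDemo {

    // Paraphrase of java.util.HashMap.hash(): XOR the high half into the low half.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // The table length is always a power of two, so masking picks the bucket.
    static int bucketIndex(Object key, int tableLength) {
        return (tableLength - 1) & spread(key.hashCode());
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("hello", 16));   // bucket for "hello" in a 16-slot table
    }
}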

Personally, I'd prefer to see some metrics that actually time 50,000,000 Strings rammed into a HashSet.

Like the old mantra says: Good/Fast/Cheap - pick any two.

Winston
 