
What would be the best data structure to hold huge data?

 
Mohanraj Naidu
Greenhorn
Posts: 12
Hi All,
I want to design a data structure in Java to hold a lot of integer values. What would be the best data structure for this purpose?
My requirement is:
I want to accumulate the primary keys of all the tables in some sort of map arranged by table name. There are around 600-700 tables and each table has a huge number of records (30,000-40,000). I want to use something similar to a HashMap, but I suspect it will give me an out-of-memory problem.
Can a HashMap hold this amount of data? Is there any way to find out the limit? Is there a better data structure you can suggest?
Thanks in advance.
Mohan
 
Ilja Preuss
author
Sheriff
Posts: 14112
What do you want to do with that "map"?
 
Marc Haberkorn
Greenhorn
Posts: 10
Well, whether or not you run out of memory depends on the size of your Java heap, so the answer to this question depends on your runtime environment.
To answer the question about a HashMap holding a large amount of information: you can specify the initial capacity and load factor when you construct the object. A higher load factor limits the amount of unused space in the underlying hash table. Of course, restricting that unused space comes with a performance hit, since you will have more collisions in the hash table, but depending on what you're doing, that may or may not be an issue.
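For illustration, here is a minimal sketch of pre-sizing a map this way (the capacity and load factor values are assumptions for this example, not recommendations):

import java.util.HashMap;
import java.util.Map;

public class MapSizing {
    public static void main(String[] args) {
        // Pre-sizing avoids repeated rehashing as the map grows; a load
        // factor of 1.0f packs the table more densely than the default
        // 0.75f, trading extra collisions for a smaller footprint.
        Map keys = new HashMap(40000, 1.0f);
        keys.put("00001", "some value");
        System.out.println("entries: " + keys.size());
    }
}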
 
Jeroen Wenting
Ranch Hand
Posts: 5093
Given the amount of data, I'd be more concerned about lookup times than storage space (at least if the system it is intended to run on has ample RAM).
To improve that, it might be better to store the keys for each table in their own Map and do lookups on that, or to take a staged approach: break the keys up into smaller subsets and first determine, in some way, which subset to look in (perhaps storing a key to these subsets in its own Map).
 
John Smith
Ranch Hand
Posts: 2937
Originally posted by Jeroen Wenting:
To improve that, it might be better to store the keys for each table in their own Map and do lookups on that, or to take a staged approach: break the keys up into smaller subsets and first determine, in some way, which subset to look in.
The lookup operation is O(1) on the map, i.e. constant time no matter how large the collection is, assuming that the key objects have a decent hashing algorithm.
 
Ouaknin lionel
Greenhorn
Posts: 14
I would advise you to create your own data structure. We can discuss it further along, but first let me say why you should use your own structure:
- HashMap and Hashtable can't store a plain int; they have to store an Integer. This is bad for performance and memory footprint, because you have to do:

Hashtable hash = new Hashtable();
hash.put(key, new Integer(0));

and

int result = ((Integer) hash.get(key)).intValue();

What I advise you to do is create a hashtable specifically for ints, using a plain underlying array. To see how, take a closer look at the source code of Hashtable.
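As an illustration, here is a minimal sketch of such a structure (this is not the Hashtable source), assuming open addressing with linear probing. It omits resizing, so the table must be sized up front, and the EMPTY sentinel value can never itself be stored as a key:

import java.util.Arrays;

public class IntHashSet {
    private static final int EMPTY = Integer.MIN_VALUE; // assumed never used as a key
    private final int[] table;
    private int size;

    public IntHashSet(int expectedSize) {
        // size the table to at least twice the expected entries to keep probes short
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1;
        }
        table = new int[capacity];
        Arrays.fill(table, EMPTY);
    }

    private int indexOf(int key) {
        int h = key * 0x9E3779B9;              // spread the bits
        int i = (h ^ (h >>> 16)) & (table.length - 1);
        while (table[i] != EMPTY && table[i] != key) {
            i = (i + 1) & (table.length - 1);  // linear probing
        }
        return i;
    }

    public boolean add(int key) {
        int i = indexOf(key);
        if (table[i] == key) {
            return false;                      // already present
        }
        table[i] = key;
        size++;
        return true;
    }

    public boolean contains(int key) {
        return table[indexOf(key)] == key;
    }

    public int size() {
        return size;
    }
}

No Integer objects are ever created, so both the per-entry memory overhead and the boxing/unboxing cost disappear.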
 
Ouaknin lionel
Greenhorn
Posts: 14
If you build your own data structure right, there's no way to do anything faster. Take a look at Hashtable. You might also take a look at this bestseller: Java Performance Tuning by Jack Shirazi.
Amazing book....
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24217
Hi "HWBBH,"
Welcome to Java Ranch! We don't have many rules around these parts, but we do have a naming policy. Please check it out, then change your display name to something with a few more vowels. Thanks, and see you around the Ranch!
 
John Smith
Ranch Hand
Posts: 2937
Originally posted by Mohanraj Naidu:
I want to accumulate the primary keys of all the tables in some sort of map arranged by table name. There are around 600-700 tables and each table has a huge number of records (30,000-40,000). I want to use something similar to a HashMap, but I suspect it will give me an out-of-memory problem.

OK, let's see the memory utilization if we used the Map. Actually, in the implementation below, I used HashSets to store the info (one HashSet per table), and a HashMap to store all of the HashSets. So it's a HashMap of HashSets. Also, I specified only 400 records per table (as opposed to 40,000 records per table). Another assumption I made was that the primary key would be a String five characters long. I ran this test on my Windows 98 machine, JDK 1.4.2, with the options -Xms128M -Xmx128M (to prevent the heap from resizing).
Here is the complete code:
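(The original listing did not survive in this copy of the thread; the sketch below reconstructs it from the description above: 700 tables, 400 five-character String keys each, in a HashMap of HashSets, with repeated System.gc() calls before each memory reading. Class and constant names are illustrative.)

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MapMemoryTest {
    private static final int TABLES = 700;
    private static final int KEYS_PER_TABLE = 400;  // scaled down from 40,000

    public static void main(String[] args) {
        long before = usedMemory();

        Map tables = new HashMap();
        for (int t = 0; t < TABLES; t++) {
            Set keys = new HashSet();
            for (int k = 0; k < KEYS_PER_TABLE; k++) {
                // five-character String primary keys: "00000" .. "00399"
                keys.add(String.valueOf(100000 + k).substring(1));
            }
            tables.put("table" + t, keys);
        }

        long after = usedMemory();
        System.out.println("total memory used = "
                + ((after - before) / (1024 * 1024)) + " Mb");
        System.out.println("entries: " + tables.size());  // keep the map reachable
    }

    private static long usedMemory() {
        // a single System.gc() often doesn't reclaim everything,
        // so request it a few times before reading the heap counters
        for (int i = 0; i < 3; i++) {
            System.gc();
        }
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}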

Output:
total memory used = 22 Mb
Since the number of entries in all the HashSets was 700 * 400 = 280,000 and the original post was about storing 700 * 40000 = 28,000,000 entries, the expected memory usage would be around 22Mb * 100 ~ 2.2Gb
Time to order some memory chips!
 
Ouaknin lionel
Greenhorn
Posts: 14
Sorry for the wacky alias. It is a real name.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671

Hah, you too! People sometimes look at me strangely when I put repeated GC's in front of a measurement like this, but the fact is that I really have observed different results depending on whether I GC 1, 2, or even 3 times. Never observed any difference past 3, so far, but it wouldn't surprise me. "When control returns from the method call, the Java Virtual Machine has made a best effort to reclaim space from all discarded objects." Yeah, right. :roll:
Yeah, I know that elsewhere it's made clear that there aren't any guarantees for GC, but my position is that if you can do better by repeating something, then what you did the first time wasn't really your best effort.
 
John Smith
Ranch Hand
Posts: 2937
Originally posted by Jim Yingst:
Hah, you too! People sometimes look at me strangely when I put repeated GC's in front of a measurement like this, but the fact is that I really have observed different results depending on whether I GC 1, 2, or even 3 times. Never observed any difference past 3, so far, but it wouldn't surprise me.

I've seen GC kicking in as many as 5 times before it reclaimed everything. Generational garbage collectors have become too smart.
The best way to observe this is to start the server process with the "-verbosegc" option and invoke System.gc() in a loop, periodically. What you will normally see, depending on the JVM version and the startup parameters, is that as you push GC to do its job over and over, each invocation in the loop takes more and more time. Furthermore, the initial invocations will typically release only a small portion of the unreferenced objects, for performance reasons. On the console, you will see something like [GC: blah-blah]. But if you keep pushing, a full garbage collection kicks in, as indicated by [Full GC: blah-blah] on the console. This full GC may take 3 to 5 seconds or even longer.
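A minimal sketch of that experiment (class name and loop counts are illustrative; run it with java -verbosegc GcLoop and watch the console):

public class GcLoop {
    public static void main(String[] args) throws InterruptedException {
        // allocate some garbage so the collector has work to do
        for (int i = 0; i < 1000; i++) {
            byte[] garbage = new byte[1024 * 1024];
        }
        // request GC repeatedly; with -verbosegc you should see minor
        // [GC ...] lines at first and eventually a [Full GC ...] line
        for (int i = 0; i < 10; i++) {
            System.gc();
            Thread.sleep(1000);  // give the collector time between requests
        }
    }
}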
 
Kirk Pepperdine
Author
Ranch Hand
Posts: 71
Originally posted by Eugene Kononov:
Hah, you too! People sometimes look at me strangely when I put repeated GC's in front of a measurement like this, but the fact is that I really have observed different results depending on whether I GC 1, 2, or even 3 times. Never observed any difference past 3, so far, but it wouldn't surprise me.
I've seen GC kicking in as many as 5 times before it reclaimed everything. Generational garbage collectors have become too smart.
This full GC may take 3 to 5 seconds or even longer.

I always add a sleep after the GC. As you said, a full GC may take quite some time to complete, and you do want GC to be finished before taking any timings. In most cases, I've found this to be very effective, more effective than calling gc() multiple times.
 
Frank Carver
Sheriff
Posts: 6920
Originally posted by Mohanraj Naidu:
I want to design a data structure in Java to hold a lot of integer values. What would be the best data structure for this purpose?
...
I want to accumulate the primary keys of all the tables in some sort of map arranged by table name. There are around 600-700 tables and each table has a huge number of records (30,000-40,000).
...
Is there a better data structure you can suggest?

Is there already any pattern to which keys have been allocated? Are there a lot of "runs" of keys (e.g. 1234,1235,1236,1237,1238,...) in each table?
I can think of several ways to do this with much reduced memory usage, but the most important question is "What do you plan to do with these keys once you have them?"
Just loading a bunch of data into memory serves no useful purpose. For it to be a worthwhile effort you need to use the loaded data somehow. My guess is that you might plan to use it when you generate new primary keys to ensure that they are not already used, but you may have an entirely different use in mind.
If you just want to be able to generate a new unique key for a given table, all you really need to do is read all the keys of the table, and keep track of the largest one as you go.
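(The code listing that originally followed is missing here; below is a guessed sketch consistent with the max[] array the next paragraph refers to. The tableNames lookup is illustrative.)

public class KeyTracker {
    private final String[] tableNames;
    private final int[] max;   // one integer per table: the largest key seen

    public KeyTracker(String[] tableNames) {
        this.tableNames = tableNames;
        this.max = new int[tableNames.length];
    }

    private int indexOf(String table) {
        for (int i = 0; i < tableNames.length; i++) {
            if (tableNames[i].equals(table)) {
                return i;
            }
        }
        throw new IllegalArgumentException("unknown table: " + table);
    }

    // call once for every key read from the database
    public void read(String table, int key) {
        int t = indexOf(table);
        if (key > max[t]) {
            max[t] = key;
        }
    }

    // a new unique key is one past the largest seen so far
    public int nextKey(String table) {
        return ++max[indexOf(table)];
    }
}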

Then, when you need a new key, just increment max[] for that table and use the new top value. Total memory use, one integer for each table !
If you really need to reuse "gaps" in the key sequence, then you might consider "run-length encoding": for each table, store an array of (start, length) pairs. Updating this array as you read in each key is more complex than the above, but not impossible (a sketch follows the list):
  • if the key is both at the start of one run and at the end of another, merge the two runs (use the lowest start, and add the two lengths + 1)
  • if the key is at the start or end of an existing run, add the key by decreasing the start or increasing the length of the run
  • if the key is not adjacent to any runs, create a new run of length 1
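Here is a minimal sketch of those three cases, assuming keys arrive one at a time and are distinct (the linear scan is for clarity; a binary search over the run starts would be faster):

import java.util.ArrayList;
import java.util.List;

// Runs are kept sorted by start; each run covers [start, start + length - 1].
public class KeyRuns {
    private static class Run {
        int start, length;
        Run(int start, int length) { this.start = start; this.length = length; }
    }

    private final List runs = new ArrayList();

    public void add(int key) {
        // find the first run starting after the key
        int i = 0;
        while (i < runs.size() && ((Run) runs.get(i)).start <= key) {
            i++;
        }
        Run before = (i > 0) ? (Run) runs.get(i - 1) : null;
        Run after = (i < runs.size()) ? (Run) runs.get(i) : null;

        boolean extendsBefore = before != null && before.start + before.length == key;
        boolean extendsAfter = after != null && after.start == key + 1;

        if (extendsBefore && extendsAfter) {
            // key bridges two runs: merge them (lowest start, lengths + 1)
            before.length += after.length + 1;
            runs.remove(i);
        } else if (extendsBefore) {
            before.length++;               // key extends the end of a run
        } else if (extendsAfter) {
            after.start--;                 // key extends the start of a run
            after.length++;
        } else {
            runs.add(i, new Run(key, 1));  // isolated key: new run of length 1
        }
    }
}

For mostly-contiguous keys, each table collapses to a handful of runs, so the memory cost is a few pairs of ints rather than tens of thousands of entries.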


The bottom line for all of these solutions is that the need you describe sounds much more like a Set than a Map. Each key doesn't actually connect to another object; it just exists (or not) in the collection.
For good examples of "thinking outside the box" like this, I recommend the "Programming Pearls" books and articles by Jon Bentley.
Did any of this help, or have I missed the point?
[ September 28, 2003: Message edited by: Frank Carver ]
     