Best way to store a huge string array/list

 
Greenhorn
Posts: 2
I am running a Java program that has to read in about 250 million lines of strings and hold them in memory.
Each string contains 3 characters. I did the math, and in theory the memory requirement for holding all these strings
should be about (250M * 3 chars * 16 bits =) 1.5 GB.

I first attempted to read in each line sequentially as a String and put them in an ArrayList, but I was surprised
to find that every 10 million records consume about 450 MB of heap, so I can get nowhere near even 50 million records.

I know Java Strings carry quite a bit of overhead, so I also tried making my own "String" class (wrapped around a char[]),
but this still costs about 320 MB of heap per 10 million records.

I started to suspect my use of ArrayList, but I read somewhere that ArrayList has very little overhead,
and if that is true, I can't see any way of reading in such a huge amount of data.

A few days ago I managed to use a *super* computer with a total of 32 GB of RAM to test the program, with a massive
java -Xmx28000m Temp.java
It was aborted at about 100 million records, and the log shows that the maximum virtual memory used was 27 GB!

I may not totally understand how Java manages memory, but I was surprised by the overhead...
So here I am, asking if someone knows a solution to:

store the above-mentioned data within 24 GB (preferably even less than 12 GB, as I don't always have access to that *super* computer).
And yes, I have to hold all of them in memory at once. I know very few people do that, but I need to generate a minimal perfect hash
for all the keys (Strings), so every single key has to be in memory to process. Oh, and no two Strings in my case are equal.

Thanks in advance!

p.s. I attached my test code here if it helps:



[NK: Added code tags. Use Codetags while posting code.]
 
Greenhorn
Posts: 25
Hello,

You should implement a StringWrapper class (with a constructor taking a String argument) that converts the given string into a byte[] using the "ISO-8859-1" encoding (only 256 characters, so 8 bits instead of 16 bits per char). Then you can use your wrapper in the list/map instead of Strings.
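For what it's worth, a minimal sketch of that idea might look like this (the class name comes from the post, but the equals/hashCode details are my own; StandardCharsets needs Java 7+, on older JVMs you'd pass the charset name "ISO-8859-1" instead):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the suggested wrapper: stores the string as ISO-8859-1
// bytes (8 bits per char instead of UTF-16's 16 bits per char).
// equals/hashCode are implemented so it can be used as a map/set key.
public final class StringWrapper {
    private final byte[] bytes;

    public StringWrapper(String s) {
        this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof StringWrapper
                && Arrays.equals(bytes, ((StringWrapper) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}
```

Note this halves the per-character cost but still pays the object header and reference cost for every string, which matters at 250M entries.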
 
Author and all-around good cowpoke
Posts: 13078
6
If this were my problem, I would consider working entirely with bytes. If the characters are all ASCII, that is entirely feasible.

In any case, I would not try to have an object for each String. Get the whole monster into memory and provide access methods that calculate the location of the three characters of line N. If they can be bytes, so much the better.
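Assuming fixed-length 3-byte ASCII records, that "one big array plus access methods" idea might be sketched roughly like this (class and method names are purely illustrative):

```java
import java.nio.charset.StandardCharsets;

// Sketch: all strings live in one flat byte array, 3 bytes per record.
// For 250M records that is ~750 MB of data with a single object header,
// instead of 250M separate String/char[] objects.
public final class FlatStrings {
    private final byte[] data;
    private int count;

    public FlatStrings(int capacity) {
        this.data = new byte[capacity * 3];
    }

    // Append one 3-character ASCII string.
    public void add(String s) {
        for (int i = 0; i < 3; i++) {
            data[count * 3 + i] = (byte) s.charAt(i);
        }
        count++;
    }

    // Reconstruct the string at record n on demand.
    public String get(int n) {
        return new String(data, n * 3, 3, StandardCharsets.US_ASCII);
    }

    public int size() {
        return count;
    }
}
```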

Anyway, the above is a wild guess, because I don't know: what has to happen to the data next?

Fascinating problem anyway, let us know what you come up with.

Bill
 
Andrei Matyas
Greenhorn
Posts: 25
Yes indeed, if you don't need high-level methods you can forget the wrapper object (a useless reference costing 2*4 bytes on a 32-bit OS and 2*8 bytes on a 64-bit OS).
 
Saloon Keeper
Posts: 26768
190
Android Eclipse IDE Tomcat Server Redhat Java Linux
As a very rough guess, I'd estimate the fixed overhead for any Java object, whether it's a String or a build-your-own class, to be a minimum of 12 bytes, so the actual 3 bytes per string is paltry by comparison.

You can get rid of that by putting all the characters into one big character array, since the overhead for a single character array object is the overhead for a single object. But then the task of tracking where the individual strings are within that array becomes your job, not Java's. And if you want to support variable-length strings, there'd be additional overhead for tracking the start/end points of the strings. Plus you'd either have to size the array worst-case or manage cases where the array turns out to be too small and must be replaced by an array with more capacity.
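The variable-length bookkeeping described above might be sketched like this, assuming a simple doubling growth strategy (all names here are illustrative):

```java
import java.util.Arrays;

// Sketch: every character lives in one big char[]; offsets[i] marks where
// string i starts and offsets[i+1] where it ends. Only a handful of objects
// exist in total, no matter how many strings are stored.
public final class PackedStrings {
    private char[] chars = new char[16];
    private int[] offsets = new int[16]; // offsets[count] == chars used so far
    private int count;

    public void add(String s) {
        int used = offsets[count];
        // Grow the backing arrays (doubling) when they run out of room.
        if (used + s.length() > chars.length) {
            chars = Arrays.copyOf(chars, Math.max(chars.length * 2, used + s.length()));
        }
        if (count + 1 >= offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        s.getChars(0, s.length(), chars, used);
        count++;
        offsets[count] = used + s.length();
    }

    // Rebuild string n on demand from the shared buffer.
    public String get(int n) {
        return new String(chars, offsets[n], offsets[n + 1] - offsets[n]);
    }

    public int size() {
        return count;
    }
}
```

As the post says, the price is that the tracking (and resizing) becomes your job rather than Java's.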

For data sets of this magnitude, it's often better to process them as serial data streams or database objects instead of trying to keep everything in RAM. Of course, not everything works well done that way, so you have to tune the solution to match the problem.
 
Ranch Hand
Posts: 144
Every 3-character String (even a UTF-16 one) can be converted to a 64-bit signed integer that uniquely represents it (in most practical cases, such as alphanumeric strings, this can be done with a 32-bit int as well). That reduces your memory consumption to 64/8 * 250 million = ~2 GB of memory as a single long[] (or half that as an int[]). An added bonus is extremely fast searches (sort + binary search) on one of the most performance-efficient data types. Personally, I would not use byte arrays for this specific problem unless you need to manipulate or access individual characters within these strings. It's not a huge difference, but arrays of byte[4] are significantly less memory-efficient, and a byte[] is only preferable in specific situations like those mentioned above.
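A sketch of that packing, assuming 16 bits per char so any 3-character UTF-16 string fits in the low 48 bits of a long (the class and method names are mine):

```java
import java.util.Arrays;

// Sketch: each 3-char string packed into one long (16 bits per char),
// stored in a single long[], then sorted once for O(log n) lookups.
public final class PackedKeys {
    public static long pack(String s) {
        return ((long) s.charAt(0) << 32)
             | ((long) s.charAt(1) << 16)
             |  (long) s.charAt(2);
    }

    public static String unpack(long key) {
        // The (char) cast keeps only the low 16 bits of each field.
        return new String(new char[] {
            (char) (key >>> 32), (char) (key >>> 16), (char) key
        });
    }

    // Assumes sortedKeys has been sorted with Arrays.sort().
    public static boolean contains(long[] sortedKeys, String s) {
        return Arrays.binarySearch(sortedKeys, pack(s)) >= 0;
    }
}
```

Since the packing preserves uniqueness (no two distinct 3-char strings map to the same long), the sorted long[] can stand in for the whole string set.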

By the way, ArrayList is not your problem here (since you use the ArrayList(int capacity) constructor, it never has to be resized, so it basically takes about 2 GB from the start on 64-bit systems, with a relatively tiny bit of overhead). Your issue is that your char[] is itself an Object instance, which in and of itself requires at least 16 bytes (12, but rounded up to the nearest multiple of 8), and I vaguely remember the footprint for array types being slightly larger still in some VMs. Regardless of the solution you go for, you can safely assume that you must avoid creating an Object instance for every single string. That in turn means you are required to use a single array, or a limited number of arrays, to represent your data.

As others have mentioned it really depends on what you have to do with the data. There are very few scenarios where keeping this large a data set in memory is an absolute requirement on functional or performance grounds. Interesting stuff though, can you give us some insight into what exactly this is for?

 
Bartender
Posts: 2700
IntelliJ IDE Opera
Just a thought:

2^16 * 3 is about 200.000 combination. 200.000 * 3 * 16 / 8 is 1.2 MB
Why the 200K? Because the String pool should cache these Strings.

Or am I completely wrong? Of course there is an added cost of 250M * cost of a reference.
 
R van Vliet
Ranch Hand
Posts: 144

Wouter Oet wrote:Just a thought:

2^16 * 3 is about 200.000 combination. 200.000 * 3 * 16 / 8 is 1.2 MB
Why the 200K? Because the String pool should cache these Strings.

Or am I completely wrong? Of course there is an added cost of 250M * cost of a reference.



It's not 2^16 * 3 but (2^16)^3 to get all possible combinations. So yes, I'm afraid you're a bit wrong ;)

Either way, it's pointless to store the strings themselves, as I explained before. Simply generate a unique id for every possible combination and generate the actual strings on the fly when you need them. This saves both creating object instances on the heap and storing references to those instances, which are at least 8 bytes each.

 
Wouter Oet
Bartender
Posts: 2700
IntelliJ IDE Opera
You're right. As I said, it was just a thought; I already agreed with your solution.

It's also relevant what the range is. Say the range is a-z, A-Z, 0-9 and some other signs:
then the number of possibilities drops to about 1M, and you could store the unique id in an int.
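For exactly the a-z, A-Z, 0-9 alphabet, a base-62 sketch of that id might look like this (the alphabet ordering and names are my own; 62^3 - 1 = 238327, so everything fits easily in an int):

```java
// Sketch: base-62 encoding of a 3-character alphanumeric string.
// The id uniquely identifies the string, so only the int needs storing;
// the string itself can be regenerated on the fly.
public final class AlphaNumId {
    private static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static int encode(String s) {
        int id = 0;
        for (int i = 0; i < 3; i++) {
            id = id * 62 + ALPHABET.indexOf(s.charAt(i));
        }
        return id;
    }

    public static String decode(int id) {
        char[] out = new char[3];
        for (int i = 2; i >= 0; i--) {
            out[i] = ALPHABET.charAt(id % 62);
            id /= 62;
        }
        return new String(out);
    }
}
```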
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
6
We are still missing the most important information: what has to happen to the data when it is in memory?

Is it going to be used for some sort of look-up, sorted, or what?

The answer to that question will indicate the storage approach.

SO - Wei Lau - why do you have to have this in memory?

Bill
 
Rancher
Posts: 4803
7
Mac OS X VI Editor Linux

William Brogden wrote:SO - Wei Lau - why do you have to have this in memory?


Bill is on to something critical here.

The obvious optimization is to not store all the strings in memory at all. Use a structure such as a skip list to keep pointers in memory for fast access, and use the disk drive to actually hold the data. Use the data structure to get close to where you expect the data, read a few blocks in, and search those. Even a sequential search will be fast, and more sensible than having all this data in memory.

This is also a case where a good working-set cache will do wonders. In all real-world cases, access is clustered around a working set smaller than the whole dictionary.

Another idea is to not store the characters even as full bytes, if they belong to a restricted alphabet. Seven bits per character works for US-ASCII; six bits per character works well for upper-case letters plus the usual punctuation, digits, etc. Radix-50 works for many character sets.
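One way to sketch the on-disk part of this, assuming the fixed-width 3-byte ASCII records have already been written to a file in sorted order (the file layout and names are my assumptions, and I've used a plain binary search with seeks rather than a skip list for brevity):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Sketch: keep the sorted 3-byte records on disk and binary-search them
// with seeks, so almost nothing occupies the Java heap. An in-memory
// index (e.g. a skip list over block boundaries) would cut the number
// of seeks further.
public final class DiskLookup {
    private static final int RECORD = 3; // fixed record width in bytes

    public static boolean contains(RandomAccessFile file, String key) throws IOException {
        long lo = 0, hi = file.length() / RECORD - 1;
        byte[] buf = new byte[RECORD];
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD);
            file.readFully(buf);
            int cmp = new String(buf, StandardCharsets.US_ASCII).compareTo(key);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```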
 