Large Map / HashMap

 
T H Lim
Ranch Hand
Posts: 35
Hi,

I want to create a large Map with a couple of million Lists of Integers tied to millions of keys. I am looking at roughly 8 million keys, with a List of 10 Integers per key. I tried storing all of this in memory, and with just 100k keys it already took 667MB! Alternatively, I tried using a database, but it is too slow for my application and "too much" for my simple need. I haven't tried using a file because I don't know how; a Map is unlike a List, where you can read the Integers sequentially, for example. Any ideas or advice?

Thank you.

/lim/
 
Mike Simmons
Master Rancher
Posts: 4806
You haven't said anything about the keys -- perhaps first you should measure how much memory your 8 million keys would take up. Try just putting them all in a HashMap with the value set to null. How much space does that use? Is that acceptable? You may find that the keys are much bigger than the lists, or vice versa. But this will give you a better idea where to concentrate your efforts to reduce the memory usage.

Regarding the Lists of Integers, you can probably get better memory utilization using int[] arrays. 8 million different int[10] arrays should take a little over 300 MB of memory. I don't know if that's too much for you or not; it probably depends on how much space the keys take up. It may also be undesirable to expose a mutable object like an int[]. If so, you can write a class to wrap the int[] and make it behave like a List<Integer>, but with reduced memory usage.
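For illustration, a minimal sketch of such a wrapper (a sketch only; the class name and details are just an example):

import java.util.AbstractList;

// Exposes an int[] as a read-only List<Integer> without storing ten
// separate Integer objects per entry.
public class IntList extends AbstractList<Integer> {
    private final int[] values;

    public IntList(int[] values) {
        this.values = values.clone(); // defensive copy keeps the wrapper effectively immutable
    }

    public Integer get(int index) {
        return values[index]; // boxed on demand, not stored boxed
    }

    public int size() {
        return values.length;
    }
}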

However, I think your best bet may be to use a database combined with an in-memory cache of the most frequently used data. That way you get fast access to in-demand items without having to keep everything in memory. This works well assuming that some of the data is used much more than the rest - which is often the case, but not always. It really depends on the situation. What does the data mean, and how is it being used?
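As a rough illustration of the cache idea (a sketch only - the capacity, class name and value type are arbitrary), a LinkedHashMap in access order makes a simple LRU cache to put in front of the database:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Keeps only the most recently used entries in memory; anything evicted
// has to be re-read from the database on the next access.
public class RecordCache extends LinkedHashMap<String, List<Integer>> {
    private static final int MAX_ENTRIES = 100000; // tune to the memory budget

    public RecordCache() {
        super(16, 0.75f, true); // access-order iteration is what makes eviction LRU
    }

    protected boolean removeEldestEntry(Map.Entry<String, List<Integer>> eldest) {
        return size() > MAX_ENTRIES; // drop the least recently used entry
    }
}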
[ October 05, 2008: Message edited by: Mike Simmons ]
 
T H Lim
Ranch Hand
Posts: 35
The keys are Strings of approximately 20 chars each. I want to keep my app's memory footprint to at most 512MB if possible. Adding in the keys, the total comes close to 500MB, which is very close to my design memory footprint - and this is only for 8 million rows. As I need to process more and more data in the days and months to come, it will break my app's memory limit someday.

The more I think about it, the more I doubt a caching database will help. The map is temporary, i.e. it exists only during the life span of the application that creates it. So every time the app kicks off, it has to re-insert all the rows into the table, and the insertion takes quite a while to complete. I have yet to try batch insertion; that will probably shave some time off.

I am more interested in a simple file-based solution, i.e. a HashMap that stores part of its data on disk. I wonder if there is a library out there that can do that which I missed in my Google search.

/lim/
 
Ilja Preuss
author
Posts: 14112
It would be much easier to talk about sensible solutions if we knew what the data is and how it is going to be used...
 
T H Lim
Ranch Hand
Posts: 35
There are 4 or more CSV files, each with about 8 million records. The 1st file contains the customer's personal details, the 2nd file contains the customer's spending habits, and the 3rd and 4th files contain the subscriptions and services he has subscribed to. Each file is keyed by a customer id, but the records do not consistently exist in all the files, e.g. customer ID 0000 might appear in files 1, 2 and 4 while customer ID 2222 appears only in files 2 and 4, and so forth. You get the idea. My application has to tie all these records together into 1 record in 1 CSV file, e.g. customer ID 0000 with all his personal details, spending habits, subscriptions and subscribed services. Those "columns" not found or provided will be filled with NULL or a default value. I hope my explanation is clear; otherwise I can explain in more detail in those areas I didn't make clear.

Thanks

/lim/
[ October 06, 2008: Message edited by: TH Lim ]
 
Thomas Thevis
Ranch Hand
Posts: 87
Hey Lim,

A reliable file-based database solution could be Oracle Berkeley DB. It is in fact a key-value store that keeps its data in local files. I guess there are several other similar solutions out there.

Regards,
Thomas
 
Sheriff
Posts: 22783
I agree that a database would still be better. CSV files with 8 million rows are quite inefficient if you want to match or search.

If you have all that data in a database, you can use joins and ordering to get the results you need. In your case, a full outer join would be best - it will not filter out rows of one table (CSV file) that are not present in any other.
 
T H Lim
Ranch Hand
Posts: 35
Thank you guys for your input.

My initial thought was to use the db, but I decided against it and took a different approach. Instead of storing all the rows in the HashMap, I devised a simple mapping technique that maps the customer id to a file offset. In the 1st pass I keep 1 offset per row per file, so for 4 files I have 4 offsets, all mapped to a single customer id and stored in a HashMap in memory. In the 2nd pass I build the actual rows by reading the offsets from the HashMap and combine the rows from all the files into a single file. I thought that way I could manage to process all the CSV files with 512MB of memory. I guess I was wrong.
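For what it's worth, a rough sketch of that two-pass offset idea (a sketch only; it assumes the customer id is the first comma-separated column of each line):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class OffsetIndexSketch {

    // Pass 1: remember, for each customer id, the byte offset of its line in one CSV file.
    static Map<String, Long> buildIndex(RandomAccessFile csv) throws IOException {
        Map<String, Long> offsets = new HashMap<String, Long>();
        long pos = csv.getFilePointer();
        String line;
        while ((line = csv.readLine()) != null) {
            String id = line.substring(0, line.indexOf(','));
            offsets.put(id, pos);
            pos = csv.getFilePointer();
        }
        return offsets;
    }

    // Pass 2: seek straight to a stored offset and re-read the full line.
    static String readRecord(RandomAccessFile csv, long offset) throws IOException {
        csv.seek(offset);
        return csv.readLine();
    }
}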

I will try out Berkeley DB and see if I can accomplish what I want with it.

/lim/
 
Thomas Thevis
Ranch Hand
Posts: 87

I thought that way I could manage to process all the CSV files with 512MB of memory. I guess I was wrong.


What went wrong? Did you really exceed your 512M limit? Perhaps you can work on your keys. If your keys are the 'ID 2222' strings mentioned above, you could try to convert them into an integer/long index and use something like a two-dimensional array of offsets, where the first index is your ID key and the second the CSV file index.
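A minimal sketch of what that could look like, using the file offsets from the previous post as the stored values (sizes and names here are just placeholders):

public class OffsetTable {
    private static final int ID_COUNT = 8000000; // number of distinct numeric ids
    private static final int FILE_COUNT = 4;     // number of CSV files

    // offsets[id][file] = position of that id's row in that CSV file;
    // a sentinel such as -1 can mark "id not present in this file".
    private final long[][] offsets = new long[ID_COUNT][FILE_COUNT];

    public long getOffset(int id, int file) {
        return offsets[id][file];
    }

    public void setOffset(int id, int file, long offset) {
        offsets[id][file] = offset;
    }
}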

Regards,
Thomas
 
Ilja Preuss
author
Posts: 14112
Sounds to me like you were actually starting to implement your own small database system.

An alternative to Berkeley could be http://hsqldb.org/
 
Dmitri Bichko
Greenhorn
Posts: 15
Hi,

The database route is pretty straightforward. If performance is the main problem, you can try H2 (http://www.h2database.com/html/main.html); it's considerably faster than HSQLDB.

I think there's a simpler solution, though. Even though there are some missing records, I'm assuming most of the ids exist in all files? Why not just sort the files by id and iterate over them concurrently? I.e., loop over the records in the first file; on each iteration check the id of the next record in each of the other files and add its data (and advance that file's record pointer) only if it matches. You'll need to do a little bookkeeping for the case where a record is missing from the first file, but overall you'll barely need any memory at all.
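A minimal sketch of that sorted-merge idea for two files (a sketch only: it assumes both files are already sorted by id, the id is the first comma-separated column, there are no duplicate ids within a file, the file names are made up, and the NULL padding is simplified to a single column):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class SortedMerge {

    public static void main(String[] args) throws IOException {
        BufferedReader a = new BufferedReader(new FileReader("personal.csv"));
        BufferedReader b = new BufferedReader(new FileReader("spending.csv"));
        PrintWriter out = new PrintWriter("merged.csv");

        String lineA = a.readLine();
        String lineB = b.readLine();
        while (lineA != null || lineB != null) {
            String idA = (lineA == null) ? null : lineA.substring(0, lineA.indexOf(','));
            String idB = (lineB == null) ? null : lineB.substring(0, lineB.indexOf(','));

            int cmp;
            if (idA == null) {
                cmp = 1;                    // only file B has records left
            } else if (idB == null) {
                cmp = -1;                   // only file A has records left
            } else {
                cmp = idA.compareTo(idB);
            }

            if (cmp == 0) {                 // id present in both files: combine the rows
                out.println(lineA + "," + lineB.substring(lineB.indexOf(',') + 1));
                lineA = a.readLine();
                lineB = b.readLine();
            } else if (cmp < 0) {           // id missing from B: pad (simplified)
                out.println(lineA + ",NULL");
                lineA = a.readLine();
            } else {                        // id missing from A: pad (simplified)
                out.println(idB + ",NULL," + lineB.substring(lineB.indexOf(',') + 1));
                lineB = b.readLine();
            }
        }
        a.close();
        b.close();
        out.close();
    }
}

The same pattern extends to four files by keeping one "current line" per file and always advancing whichever files hold the smallest id.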

The limiting step (as far as performance and memory efficiency go) will be the sorting, but sorting is quite well understood.

You could actually do the whole thing with standard UNIX commands, though I wonder how well they'd cope with 8 million records.

If it does become a limitation, another way to go is to produce a non-redundant, sorted list of all the ids from all files (first pass; very quick), then individually sort just the ids from each file and store just their position (second pass) - since you can use the position in the global id list as the key, it's just an array of integers, so just 30MB per file. Load all those into memory, iterate over the global list and grab the data for each record with random file access, like you were planning to do (third pass). You could actually combine the first and second pass into one, with slightly more work.
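One reading of that id-index idea as a sketch (it assumes the global id list has already been built, de-duplicated and sorted; whether you store line numbers or byte offsets depends on how the third pass reads the files):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GlobalIdIndex {

    // For one CSV file: positions[i] = position (line number here) of the record
    // whose id sits at index i of the globally sorted id list, or -1 if that id
    // does not appear in this file.
    static int[] positionsFor(List<String> sortedGlobalIds, String csvFile) throws IOException {
        int[] positions = new int[sortedGlobalIds.size()];
        Arrays.fill(positions, -1);

        BufferedReader in = new BufferedReader(new FileReader(csvFile));
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            String id = line.substring(0, line.indexOf(','));
            int globalIndex = Collections.binarySearch(sortedGlobalIds, id);
            if (globalIndex >= 0) {
                positions[globalIndex] = lineNo;
            }
            lineNo++;
        }
        in.close();
        return positions;
    }
}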

Which way is better depends on what the data looks like (are the records mostly sorted already?) and whether the bigger limitation is memory size or disk speed.

Incidentally, this is exactly what the database engine will do; you can just be more efficient because you know more about the data.

Hope that helps.
 
Dmitri Bichko
Greenhorn
Posts: 15
Oh, and regardless of which way you go, it might be helpful to look into more efficient (and faster) list and map implementations (especially useful for primitives), like Colt (http://acs.lbl.gov/~hoschek/colt/) or GNU Trove (http://trove4j.sourceforge.net/).
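A quick sketch of the kind of thing Trove gives you (assuming Trove's TObjectIntHashMap; the exact package and generics support vary between Trove versions):

import gnu.trove.TObjectIntHashMap;

// Maps the ~20-character String ids to compact int indices without
// boxing the values into Integer objects.
public class IdIndexer {
    private final TObjectIntHashMap<String> idToIndex = new TObjectIntHashMap<String>();
    private int nextIndex = 0;

    public int indexFor(String id) {
        if (idToIndex.containsKey(id)) {
            return idToIndex.get(id);
        }
        idToIndex.put(id, nextIndex);
        return nextIndex++;
    }
}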
 
Ilja Preuss
author
Posts: 14112
Dmitri, good points!
 
T H Lim
Ranch Hand
Posts: 35

If it does become a limitation, another way to go is to produce a non-redundant, sorted list of all the ids from all files (first pass; very quick), then individually sort just the ids from each file and store just their position (second pass) - since you can use the position in the global id list as the key, it's just an array of integers, so just 30MB per file. Load all those into memory, iterate over the global list and grab the data for each record with random file access, like you were planning to do (third pass). You could actually combine the first and second pass into one, with slightly more work.



Is this something similar to the attempt I mentioned, mapping the id to the position (start & end pos) of the data in each file? I tried using 400k lines (approx.) from 4 files (each file is not exactly 100k lines). With HashMap<String, ArrayList> the JVM used approx. 667MB. Yesterday I tried HashMap<String, int[]> (10 ints per entry) and the JVM used approx. 450MB. Based on my calculation, it should roughly use about 24MB (400,000 lines x (20 bytes per key + 40 bytes per 10 ints)), plus perhaps another 10MB for the HashMap and other class overheads. I don't understand where the additional bytes are being consumed. Is my calculation correct?

All these memory consumption figures were obtained using the OS X "task manager". I invoked Runtime.getRuntime().freeMemory() to get the memory consumed by my application, but the values I got are way off from the one reported by the "task manager". How do I calculate the memory consumed programmatically?

Is there any way to determine the JVM memory usage without using any sophisticated memtools?

I am planning to use H2 and have been writing some simple tests to determine if it can solve my problem.

Thanks.
[ October 11, 2008: Message edited by: TH Lim ]
 
Dmitri Bichko
Greenhorn
Posts: 15

Originally posted by TH Lim:

All these memory consumption figures were obtained using the OS X "task manager". I invoked Runtime.getRuntime().freeMemory() to get the memory consumed by my application, but the values I got are way off from the one reported by the "task manager". How do I calculate the memory consumed programmatically?

Is there any way to determine the JVM memory usage without using any sophisticated memtools?

I am planning to use H2 and have been writing some simple tests to determine if it can solve my problem.

Thanks.

[ October 11, 2008: Message edited by: TH Lim ]



Your task manager will report the amount of memory that the JVM has allocated, not what it's actually using. To get the memory used: runtime.totalMemory() - runtime.freeMemory()
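For example (the figure is still approximate, since System.gc() is only a hint to the JVM):

public class MemoryCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // just a hint, so treat the number as a rough estimate
        long usedBytes = rt.totalMemory() - rt.freeMemory();
        System.out.println("Used heap: " + (usedBytes / (1024 * 1024)) + " MB");
    }
}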

And yes, the approach is similar to what you describe, except I would process the files individually and map the string ids to numeric indices (as someone has mentioned already), using an ObjectToInt map from one of the libraries I posted.
 
T H Lim
Ranch Hand
Posts: 35
Alright, after testing various configurations using HashMap<Long, Integer[]> and HashMap<String, Integer[]>, I finally gave up. This setup, with just 4 million simple entries, took 400~500MB of memory. Compounded with other classes and libraries and the full 8 million records, I foresee it will easily use more than 1GB of memory.

But all is not lost. One fine night, by accident, I discovered Apache Hadoop, http://hadoop.apache.org/, in JavaWorld. It solves my problem perfectly. It is fast, simple (as far as my requirements are concerned), uses much less memory, and scales to terabytes or even petabytes of key/value entries. Should I one day need to process tens of millions of CSV records, I have no worries.

/lim/
 
Marshal
Posts: 79179
Welcome to the Ranch sam gulhane.

It is very discourteous to the previous authors on that thread to post an unrelated question. We do not allow such hi-jacking, so I shall take the liberty (I hope Robert and Rob and Jesper don't mind) of deleting your post.
 