
create a faster index file for big number of value pairs

 
Roger Palacios
Greenhorn
Posts: 2
Hello.

This is my first post.

I am trying to process large text files that contain a lot of information (5 GB each).
The data has the following columns:

INDEX1|INDEX2|Data|Data|Data...

The INDEX1 and INDEX2 columns contain repeating values, and both are long numbers.

I need to count how many different INDEX1-INDEX2 pairs there are.

I tried HashMap<Long, ArrayList<Long>> and HashMap<Long, long[]> (with the JVM heap limit, Xmx, set to 3.5 GB), but there is not enough RAM.
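
For comparison, here is a minimal sketch of a more compact in-memory structure than the boxed maps above. It is not something from this thread: LongPairSet and countDistinctPairs are hypothetical names, and the technique is a plain open-addressing hash set that keeps both numbers of each pair in primitive long arrays, so no Long or ArrayList objects are allocated. Whether it fits in the heap still depends on how many distinct pairs the files contain.

```java
/** Hypothetical helper (not from the thread): an open-addressing hash set of
 *  (long, long) pairs kept in primitive arrays, so no Long objects are created. */
final class LongPairSet {
    private long[] firsts;
    private long[] seconds;
    private boolean[] used;
    private int size;

    LongPairSet(int expectedPairs) {
        int capacity = 16;
        // Keep the table at most half full; capacity stays a power of two.
        while (capacity < expectedPairs * 2L && capacity < (1 << 30)) capacity <<= 1;
        allocate(capacity);
    }

    private void allocate(int capacity) {
        firsts = new long[capacity];
        seconds = new long[capacity];
        used = new boolean[capacity];
    }

    /** Adds the pair; returns true if it was not already present. */
    boolean add(long a, long b) {
        if ((size + 1) * 2L > firsts.length) resize();
        int mask = firsts.length - 1;
        int i = hash(a, b) & mask;
        while (used[i]) {                       // linear probing
            if (firsts[i] == a && seconds[i] == b) return false;
            i = (i + 1) & mask;
        }
        used[i] = true;
        firsts[i] = a;
        seconds[i] = b;
        size++;
        return true;
    }

    int size() { return size; }

    // Doubles the table and re-inserts every stored pair.
    // Sketch only: no guard against growing past 2^30 slots.
    private void resize() {
        long[] oldFirsts = firsts, oldSeconds = seconds;
        boolean[] oldUsed = used;
        allocate(oldUsed.length * 2);
        size = 0;
        for (int i = 0; i < oldUsed.length; i++) {
            if (oldUsed[i]) add(oldFirsts[i], oldSeconds[i]);
        }
    }

    // Mixes both values into one int; the exact constants are not important.
    private static int hash(long a, long b) {
        long h = a * 0x9E3779B97F4A7C15L + b;
        h ^= h >>> 32;
        h *= 0xBF58476D1CE4E5B9L;
        h ^= h >>> 29;
        return (int) h;
    }

    /** Counts distinct pairs in lines shaped like INDEX1|INDEX2|Data|... */
    static int countDistinctPairs(Iterable<String> lines) {
        LongPairSet set = new LongPairSet(1024);
        for (String line : lines) {
            int p1 = line.indexOf('|');
            int p2 = line.indexOf('|', p1 + 1);
            set.add(Long.parseLong(line.substring(0, p1)),
                    Long.parseLong(line.substring(p1 + 1, p2)));
        }
        return set.size();
    }
}
```

Each slot costs 17 bytes (two longs plus a flag), and the table is kept at most half full, so the cost per stored pair is fixed and involves no per-object headers.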

I wrote a class that creates a binary index file:
INDEX1a | INDEX2a | (NEXT INDEX POSITION)
INDEX1b | INDEX2a | (NEXT INDEX POSITION)
INDEX1c | INDEX2b | (NEXT INDEX POSITION)
But it is too slow, because when I need to find an INDEX1-INDEX2 pair, I iterate through the file from the start until I find INDEX1, then use NEXT INDEX POSITION to jump from row to row until I find INDEX2.

Is there some way to implement a file-backed hash map, or something similar, to save memory?

Thanks in advance!
Sorry for my bad English.

Regards,

Nashuald

 
Mike Simmons
Ranch Hand
Posts: 3090
Hello, Roger/Nashuald. Welcome to JavaRanch.

I don't know of any good file-backed hashmap implementation in Java, though there may well be one somewhere. Something like it is probably used internally in some of the database products out there, but you probably won't get a direct API to a FileBackedHashMap. Instead, I would try some of the popular databases and see if they can solve your problem. Create a table with a composite key on the two indexes, and insert all the data you read from those files. See if the performance is acceptable; if not, talk to specialists in that database for tips on how to optimize it, or try another database.
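
To make the file-backed idea concrete, here is a rough sketch of what such a structure might look like using only java.io.RandomAccessFile. It is an illustration under assumptions, not an existing library: FilePairSet is a hypothetical name, the slot layout (one flag byte plus two longs) is made up for this example, and the capacity is fixed up front with no resizing. The point is that a pair is hashed straight to a slot offset, so a lookup seeks to one position and probes a few neighbours instead of scanning the file from the start.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

/** Hypothetical sketch: fixed-capacity, file-backed open-addressing set of (long, long) pairs. */
final class FilePairSet implements AutoCloseable {
    private static final int SLOT_BYTES = 1 + 8 + 8;   // used flag + INDEX1 + INDEX2
    private final RandomAccessFile file;
    private final long slots;
    private final ByteBuffer slotBuf = ByteBuffer.allocate(SLOT_BYTES);
    private long size;

    FilePairSet(File path, long slots) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.slots = slots;
        // Start from an empty file and zero-fill it explicitly, so every slot reads as unused.
        file.setLength(0);
        byte[] zeros = new byte[SLOT_BYTES * 1024];
        for (long remaining = slots * SLOT_BYTES; remaining > 0; remaining -= zeros.length) {
            file.write(zeros, 0, (int) Math.min(zeros.length, remaining));
        }
    }

    /** Adds the pair; returns true if it was not already in the file. */
    boolean add(long a, long b) throws IOException {
        // One slot always stays empty so that probing is guaranteed to stop.
        if (size >= slots - 1) throw new IllegalStateException("index file is full");
        long slot = Math.floorMod(hash(a, b), slots);
        while (true) {
            file.seek(slot * SLOT_BYTES);
            file.readFully(slotBuf.array());
            if (slotBuf.get(0) == 0) {                  // empty slot: the pair is new
                slotBuf.put(0, (byte) 1).putLong(1, a).putLong(9, b);
                file.seek(slot * SLOT_BYTES);
                file.write(slotBuf.array());
                size++;
                return true;
            }
            if (slotBuf.getLong(1) == a && slotBuf.getLong(9) == b) return false;
            slot = (slot + 1) % slots;                  // linear probing
        }
    }

    /** Number of distinct pairs added so far. */
    long size() { return size; }

    @Override
    public void close() throws IOException { file.close(); }

    // Mixes both values into one long; the exact constants are not important.
    private static long hash(long a, long b) {
        long h = a * 0x9E3779B97F4A7C15L + b;
        h ^= h >>> 32;
        h *= 0xBF58476D1CE4E5B9L;
        h ^= h >>> 29;
        return h;
    }
}
```

Every probe here is a random disk access, so performance depends heavily on how much of the file the operating system can keep cached. A database with a composite key does similar work internally, with far more tuning behind it.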
 