
Best way to compare file records

 
shaju joseph
Ranch Hand
Posts: 30
Hi,
I need to compare three files to check whether certain fields are present in all three of them. Any records where they are not need to be stored as error records. The number of records could be anywhere from 100,000 to 1,000,000. My question is: can I store these records in an ArrayList and do the comparisons there? Can an ArrayList handle this volume? What is the best way to do this?
Any help is appreciated.
Thx
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
That's going to take a lot of memory. I believe an ArrayList will use a minimum of four bytes per entry (one object reference each), so one million entries will take at least 4 MB. And that's just for the ArrayList itself, not the objects inside it. For those, it depends on what your record structure is like. The simplest possibility I can imagine is a record consisting of a single String, which will take at least 20 bytes per record. So now you're looking at 24 MB per file, times 3 because you're doing this for 3 files, right? That's 72 MB for the simplest, shortest possible records, and probably a lot more in practice. If you've got the memory available, it's possible, but I'd really try to find another way.
If the files are sorted somehow, you're in business - you can open three different readers, one per file, and read through all three files simultaneously, using the sorting to keep your readers in sync (so they're all looking at the same parts of each file). I describe something like this here. You may well find it's best to handle the files two at a time for simplicity. First compare file1 and file2, logging any differences - then compare file1 and file3 (or file2 and file3 if you prefer). Dealing with only two files at once will be much simpler to code and debug, I think - don't try to handle three files at once until you've got two working well.
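To make the two-files-at-a-time idea concrete, here is a minimal sketch of the lockstep read. It rests on assumptions that are not in the original question: each line holds just the key being compared, both inputs are sorted ascending in String.compareTo order, and no key repeats within a file. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SortedCompare {
    /**
     * Walks two sorted inputs in lockstep and reports every key that
     * appears in only one of them. Assumes one key per line, ascending
     * String.compareTo order, and no duplicate keys within an input.
     */
    public static List<String> mismatches(Reader first, Reader second) throws IOException {
        List<String> errors = new ArrayList<>();
        try (BufferedReader a = new BufferedReader(first);
             BufferedReader b = new BufferedReader(second)) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null && y != null) {
                int c = x.compareTo(y);
                if (c == 0) {            // key present in both: advance both readers
                    x = a.readLine();
                    y = b.readLine();
                } else if (c < 0) {      // x sorts first, so the second input lacks it
                    errors.add("only in first: " + x);
                    x = a.readLine();
                } else {                 // y sorts first, so the first input lacks it
                    errors.add("only in second: " + y);
                    y = b.readLine();
                }
            }
            // Whatever is left in the longer input has no partner.
            for (; x != null; x = a.readLine()) errors.add("only in first: " + x);
            for (; y != null; y = b.readLine()) errors.add("only in second: " + y);
        }
        return errors;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for two sorted files.
        List<String> errors = mismatches(new StringReader("A\nB\nD\n"),
                                         new StringReader("B\nC\nD\nE\n"));
        // Prints: only in first: A, only in second: C, only in second: E (one per line)
        errors.forEach(System.out::println);
    }
}
```

For real files you would pass something like Files.newBufferedReader(Paths.get("file1.txt")) instead of a StringReader. Only one line per input is held in memory at a time; if the number of error records could itself be large, writing them straight to an error file rather than collecting them in a list would keep memory flat.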
If the files are not sorted in advance, I think it will really be in your interest to sort them by some attribute (choose whatever's convenient), and then use the method described above. Sorting may be problematic for the same memory reasons described above, so I'd look for a sorting algorithm which allows you to make use of external storage (files) rather than keeping everything in RAM. A balanced k-way merge sort seems like a good candidate. This will probably take some time to do right; I'd think someone may well have already implemented this in Java somewhere, so I'd take some more time searching for existing implementations if you do need to sort the data. Good luck...
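For a rough idea of what sorting with files instead of RAM can look like, here is a simplified sketch. It is not a full balanced k-way merge sort: it writes sorted runs to temp files and then merges all of them in a single pass with a priority queue, which is usually enough when the number of runs is modest. The class name, the maxLinesInMemory parameter, and the one-record-per-line layout are assumptions for illustration only.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSort {
    /** One open run plus the line it is currently positioned on. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String line;
        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }
        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }
        @Override public int compareTo(Run other) { return line.compareTo(other.line); }
    }

    /** Sorts the lines of 'in' into 'out', holding at most maxLinesInMemory lines at once. */
    public static void sort(Path in, Path out, int maxLinesInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();
        try {
            // Phase 1: cut the input into chunks that fit in memory, sort each, write it out.
            try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
                List<String> chunk = new ArrayList<>();
                for (String line = r.readLine(); line != null; line = r.readLine()) {
                    chunk.add(line);
                    if (chunk.size() >= maxLinesInMemory) {
                        runs.add(writeRun(chunk));
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) runs.add(writeRun(chunk));
            }
            // Phase 2: merge all sorted runs in one pass.
            mergeRuns(runs, out);
        } finally {
            for (Path p : runs) Files.deleteIfExists(p);  // clean up temp files
        }
    }

    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    private static void mergeRuns(List<Path> runs, Path out) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>();  // ordered by each run's current line
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Path p : runs) {
                Run run = new Run(Files.newBufferedReader(p, StandardCharsets.UTF_8));
                if (run.line != null) heap.add(run); else run.reader.close();
            }
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();               // run holding the smallest line
                w.write(smallest.line);
                w.newLine();
                if (smallest.advance()) heap.add(smallest); else smallest.reader.close();
            }
        } finally {
            for (Run r : heap) r.reader.close();          // only non-empty if something failed
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ExternalSort unsorted.txt sorted.txt
        sort(Paths.get(args[0]), Paths.get(args[1]), 100_000);
    }
}
```

With one million records and 100,000 lines per chunk this produces about ten runs, so the merge keeps only about ten lines in memory at a time. The sort order here is plain String.compareTo on the whole line; sorting by a particular field would need a comparator that extracts that field, and the comparison step afterwards has to use the same ordering.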
 
shaju joseph
Ranch Hand
Posts: 30
Thank you so much for your insight.
 