This week's book giveaway is in the Kotlin forum.
We're giving away four copies of Kotlin in Action and have Dmitry Jemerov & Svetlana Isakova on-line!
See this thread for details.
Win a copy of Kotlin in Action this week in the Kotlin forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic

Comparing two huge files  RSS feed

 
Nischal Tanna
Ranch Hand
Posts: 182
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi Guys
I have 2 files (worst case - each 20 MB). Suppose file 1 : test1.txt and file 2 : test2.txt

Now, i need to read test1.txt and compare the entries in test2.txt. If text of test1.txt not found in test2.txt, i need to process else ignore.
I have been using a BufferedReader to read from the files and am building a ArrayList out of each file. My code is as below:



I am getting OutOfMemoryError. Now, i am thinking of reading test1.txt in parts, storing it in a java Object (may b ArrayList) and than having a seperate loop wherein i read test2.txt and compare the text with the ArrayList of test1.txt(partly read file). However i want to know how can i resume reading test1.txt from the location where i had stopped reading last time

pseudo functionality :

read partly from test1.txt
store this part in a java Object
while reading test2.txt, compare the text whether it is present in the java object from above.

resume reading next part of test1.txt
-- again open test2 to read.



Guys, i know this may not be an optimized way. Kindly help.
 
David Harkness
Ranch Hand
Posts: 1646
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is a pretty common problem. One fairly simple solution (minimize your effort -- usually a good thing) is to use an external utility to sort each file to a new file. If you're tight on RAM, do one at a time rather than in parallel. Once that's complete, performing a "diff" is a snap.

Here's an example of two sorted files, using single letters rather than whole lines though there is no difference.The algorithm is slightly different if you can make certain asumptions like "Some lines in F1 aren't in F2 (must be processed), but every line in F2 is in F1". In other words, F2 is a subset of F1. Once you understand the general case, however, handling assumptions is easy.

I'll put it into words instead of psuedocode to provide a challenge.

Logically, you have a pointer into each file to a "current line". They both start at the first line of their file. In code, this entails simply a BufferedReader to maintain the pointer and a currentLine to hold the line just read. I'll refer to the lines as L1 and L2.

Do the following until you run out of lines. Compare the two lines using String.compareTo(). [note: does case matter? whitespace?]
  • If they are equal, read the next line for both files and continue the loop.
  • If L1 < L2 (sorts before it), F2 doesn't contain L1. Process L1 and read the next line from F1 only.
  • By necessity L1 > L2, meaning F1 doesn't contain L2. Do whatever you feel is necessary to L2 (flog, shread, decompile) and then read the next line from F2.

  • Give that a shot, and let me know if I glossed over too much.
     
    Stan James
    (instanceof Sidekick)
    Ranch Hand
    Posts: 8791
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    The algorithm David mentioned for sorted inputs is one of my favorites dating back to my first excercise in training. In the days of tape data storage it was often used in master-update programs, one file is the master file, the other is transactions.

    If you cannot sort the files, say you are comparing Java sources or you have to detect MOVE as well as ADD or DELETE changes, you can look at the algorithm I used for Text Diff and see if you could find a way to NOT store every line in memory. Some thoughts ...

    Store hashcode, offset and length in the file for each line instead of storing the text for each line. This could get you some false positive matches so you might have to use random access to re-read the line and compare when you get hashcode matches.

    Build the data structures in a database instead of in memory. Sure to be much slower!

    ---
    Editing: Just remembered I have another compare algorithm in a drawer (20 year old listing) that brings some number of lines into memory at a time. It can detect ADD, MOVE, DELETE actions within "n" lines, but not on broader scales. If your data has relatively localized differences that might work. Let me know and I can go look for it.

    Let us know what you work out!
    [ May 04, 2005: Message edited by: Stan James ]
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    plz help
    i need to compair tow files without saving ,just know if thay equal or not files content dont need to be in same order??how could i??
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    there is a method to compair files:
    public int compareTo(File pathname)

    plz i dont know how to use it ?
     
    Steven Bell
    Ranch Hand
    Posts: 1071
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    That method only compares file names.

    You've been given a couple of solutions already. The horribly performing brute force method would be:
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    if the file content are equals but in diffrent order or contain spaces or "\n" dose it work?
     
    Steven Bell
    Ranch Hand
    Posts: 1071
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    Considering that is sudocode and I've put no definition on what counts as a matching line, yes it would work. However it is the worst possible implementation in terms of performance.
     
    nana Amr
    Ranch Hand
    Posts: 53
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    should i use FileReader/FileWriter and BufferedReader/BufferedWriter?
     
    Stan James
    (instanceof Sidekick)
    Ranch Hand
    Posts: 8791
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    Buffered fer sure. That does fewer and larger physical IOs.
     
    • Post Reply Bookmark Topic Watch Topic
    • New Topic
    Boost this thread!