File Comparison with 1.6 Million in Each

 
Jay Shukla
Greenhorn
Posts: 10
Hi All,

I have two big text files with approx. 16 lakh (1.6 million) records in each. Each line of both files contains a string.
I would like to compare the two files, find the discrepancies, and print them to a separate file.

Could anyone please suggest how I can achieve this in the fastest way?

I can find some ideas on Google, but I would like a faster approach.

Thanks in Advance
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37518
554
Eclipse IDE Java VI Editor
Jay,
What's in the file? Is it sorted? What algorithm are you thinking of that isn't fast enough? Can you use the operating system diff command on UNIX?
 
fred rosenberger
lowercase baba
Bartender
Posts: 12565
49
Chrome Java Linux
We need more details. What makes a discrepancy? Do the files have to be line by line identical? If file 'A' has

1
2
3

and file B has
1
3

what would you expect it to output?

Will you be doing this once, or something that has to be done over and over?

All of these - plus what Jeanne asks - are important factors in coming up with a solution.

Also..."Fastest way" isn't a good metric. There is always a faster way - but at some point, the cost becomes prohibitive. You need to decide AHEAD of time how fast it has to be.
 
Jay Shukla
Greenhorn
Posts: 10
Thanks for the reply.

The difference would be 2.
The diff command checks the position of each value. In my case it is fine if the position of a value changes between the files, as long as the value is present somewhere in the other file.

Each line is a string of length 32.

Thanks
 
fred rosenberger
lowercase baba
Bartender
Posts: 12565
49
Chrome Java Linux
again...details matter.

unless you can explain, in great detail, what exactly should happen in all situations, nobody here can help you. my question was only one possibility. what if the files were 1,2,3 and 1,3,2? or "1,3" and "1,2,3"? or "1,3,4,5,6....735,2" and "1,2,3,4....735"?

etc.

is it correct to say you want to print every line that is in A but not B, and in B but not A?
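
If the answer to that question is yes, and duplicate lines do not matter, one possible sketch in Java is to load each file into a HashSet and take the set difference in both directions. The class name LineSetDiff, the helper names, and the "< " / "> " output prefixes are made up for illustration, and the sketch assumes both files fit in memory (two files of 1.6 million 32-character strings usually will, heap settings permitting).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class: position-independent comparison of two line files.
class LineSetDiff {

    /** Returns the elements of 'left' that do not occur in 'right', in sorted order. */
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left); // sorted copy, inputs stay untouched
        result.removeAll(right);
        return result;
    }

    /** Reads every line of the file into a set; duplicate lines collapse to one entry. */
    static Set<String> readLines(Path file) throws IOException {
        return new HashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java LineSetDiff <fileA> <fileB> <outFile>");
            return;
        }
        Set<String> a = readLines(Paths.get(args[0]));
        Set<String> b = readLines(Paths.get(args[1]));

        List<String> report = new ArrayList<>();
        for (String line : onlyIn(a, b)) {
            report.add("< " + line); // present in A, missing from B
        }
        for (String line : onlyIn(b, a)) {
            report.add("> " + line); // present in B, missing from A
        }
        Files.write(Paths.get(args[2]), report, StandardCharsets.UTF_8);
    }
}
```

With the 1,2,3 versus 1,3 example above, the output file would contain the single line "< 2". Whether ignoring duplicates and ordering is acceptable is exactly the kind of detail that still needs to be pinned down.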
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Jay Shukla wrote:The difference would be 2.

Which is basically what diff will give you. Specifically, it gives you a minimal "patch" or ed script to convert file a to file b (or vice-versa).

If you're only interested in differences (ie, outer joins) where position is NOT important, then there is also comm; but it requires both files to be sorted first (I have written a version in awk that doesn't have this requirement which could probably be replicated using HashMaps in Java).
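
As a rough illustration of the comm-style approach (not comm's actual source, and not the awk version mentioned above), here is a minimal Java sketch that sorts both inputs and then walks them in step with two indexes. The class and method names are hypothetical, and the lines are assumed to be already loaded into lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a comm-like comparison over two sorted lists of lines.
class SortedMergeDiff {

    /**
     * Both lists must already be sorted in natural String order.
     * Returns a two-element list: [lines only in a, lines only in b].
     */
    static List<List<String>> diffSorted(List<String> a, List<String> b) {
        List<String> onlyA = new ArrayList<>();
        List<String> onlyB = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // same line in both: skip it on both sides
                i++;
                j++;
            } else if (cmp < 0) {    // a's line sorts first, so b cannot contain it
                onlyA.add(a.get(i++));
            } else {                 // b's line sorts first, so a cannot contain it
                onlyB.add(b.get(j++));
            }
        }
        while (i < a.size()) onlyA.add(a.get(i++)); // leftovers exist only in a
        while (j < b.size()) onlyB.add(b.get(j++)); // leftovers exist only in b
        return Arrays.asList(onlyA, onlyB);
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("3", "1", "2"));
        List<String> b = new ArrayList<>(Arrays.asList("1", "3"));
        Collections.sort(a); // the merge step is only valid on sorted input
        Collections.sort(b);
        List<List<String>> diff = diffSorted(a, b);
        System.out.println("only in A: " + diff.get(0)); // prints: only in A: [2]
        System.out.println("only in B: " + diff.get(1)); // prints: only in B: []
    }
}
```

Unlike a set-based comparison, this pairs duplicate lines one-to-one, so a line that appears twice in one file and once in the other is reported once. The cost is dominated by the two sorts.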

I should warn you that diff is not a simple algorithm, so I'd hesitate to roll my own. Google appears to have one though.

HIH

Winston