• Post Reply Bookmark Topic Watch Topic
  • New Topic

Fastest way  RSS feed

 
Arjunkumar Shastry
Ranch Hand
Posts: 986
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have to compare for some duplicate records if any from 23,000 records.I have those records in Excel file.Record consists of 3 strings say ,address,phone,city.
What will be the efficient way to compare these records for duplicates?
Two records are duplicate if their address,phone,city all match.
 
M Beck
Ranch Hand
Posts: 323
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Arjunkumar Shastry:
I have to compare for some duplicate records if any from 23,000 records.I have those records in Excel file. Record consists of 3 strings say address, phone, city.
What will be the efficient way to compare these records for duplicates?
Two records are duplicate if their address,phone,city all match.


the efficient way? import the Excel file into MS Access, specify the address, phone, and city fields as a composite key, and see if Access' import wizard complains about any import errors. twenty-three thousand records is more than enough data to justify using a database such as Access instead of further abusing Excel.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
How do you read from Excel? You can do a SQL query with JDBC/ODBC bridge, order by the three columns, read one line at a time and see if you get the same values two times in a row.

But I like the "real database" option, too.
 
Arjunkumar Shastry
Ranch Hand
Posts: 986
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I copied the data to text file.One program to read and format at into proper string.For example,row may look like "Thomas Watson 3038484 Beverly Hills".Read the string,put into ArrayList,sort the list Then sort again using comparator where you can override compare method.In this method if (String1.equals(String2)),print the message.
[ April 19, 2005: Message edited by: Arjunkumar Shastry ]
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Arjunkumar Shastry:
I copied the data to text file.One program to read and format at into proper string.For example,row may look like "Thomas Watson 3038484 Beverly Hills".Read the string,put into ArrayList,sort the list Then sort again using comparator where you can override compare method.In this method if (String1.equals(String2)),print the message.


Much easier is to put the strings into a HashSet and check the return value of the add method...
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
It would be interesting to compare speed and memory usage (you asked about fastest) to export in sorted order or use an external sort on the text file and just read the lines without putting them into any kind of in-memory Java structure. I don't know if you'd notice HashSet slowing down on 23000 lines or not but with enough millions of lines you'll blow the heap. Just reading lines could handle any number in linear time.

I guess I'm assuming the external sort (like the DOS sort command) would be faster than HashSet. Something to try and see.

Of course you could identify the dupes in Excel to start with . Define a new column:

=if( cola = colb, "Match!", "" )
 
Arjunkumar Shastry
Ranch Hand
Posts: 986
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Stan James:
How do you read from Excel? You can do a SQL query with JDBC/ODBC bridge, order by the three columns, read one line at a time and see if you get the same values two times in a row.


I've just today learned a cool SQL syntax to do that all in the DB:



That will give you all the duplicates.

I suspect that Excel won't understand that query, though...
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!