Win a copy of Head First Agile this week in the Agile forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic

Check duplication in a very large file  RSS feed

 
Vince Hon
Ranch Hand
Posts: 117
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have an excel file (or csv) that contains many records (e.g. e-mail address).
aaa@aaa.com
bbb@bbb.com
fff@fff.com
ccc@ccc.com
.
.
.
.
Some of the records may be duplicated and I want to delete the duplicate one and provide a new list that without duplications.
Personally, I first sorted the record alphabetically and then comparing each record one by one.
As the excel file contains thousounds of records, the processing time is critical.
Is my algorithm work ? or are there any better solutions (I will be happly if API or code are provided ) ?
Thanks
 
Loren Rosen
Ranch Hand
Posts: 156
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Sorting will work, but its O(n log n) and we can get a O(n) algorithm.
But first, how many thousands do you mean? For just say 5,000 records, the difference between n and n log n is pretty small and the time will be dominated by the time to read and write the input and output. On the other hand for very large numbers the data won't fit into memory at once and different algorithms might be needed.
I'll assume though that you're somewhere in-between. But I also bet you're doing this for a class assignment, so I'll just give a hint. As you read in the records, you want to stick them into some data structure. You want the data structure to have the property that if something is already in the structure, it won't get added again. Now, take a look at the classes in java.util and see if there's something that does what you need.
 
Dana Hanna
Ranch Hand
Posts: 227
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Having been 13 days since this post, I'll clarify what the previous author was saying:
Use a HashSet.

The resulting Set is garaunteed to contain unique values only if you traverse it (use the Iterator). If you'd like them to sort while they add, substitute TreeSet for HashSet!
 
Jon Strayer
Ranch Hand
Posts: 133
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Try
sort | uniq
If you aren't on a UNIXish box download cygwin. There is no need to write a single line of code to solve this problem.
 
Jon Strayer
Ranch Hand
Posts: 133
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Also remember that email addresses are not case sensitive. Both Java and sht unix 'sort' command are (by default).
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If you aren't on a UNIXish box download cygwin. There is no need to write a single line of code to solve this problem.
True, but if you don't already know Unix, it does take some time to learn - and while cygwin is cool, it takes some time to download, and the little differences between it and a "real" unix can be distracting. And we don't know if the solution needs to be easily portable to other platforms. Over in Java-land, HashSet or perhaps TreeSet are easy to use and well-suited to this problem, and if you're programming in Java at all there's a high incentive to learn to use collections effectively.
 
ravivedala
Greenhorn
Posts: 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If this is a one time task and NOT automated process which works for many excel files i have a simple solution for you.
Do this way :
Open the .csv in excel file.
suppose all the email ids are in column A starting from 2nd row.
Now go to B2 and write the following formula :
=IF(OR(EXACT(A2,A$2:A$9999)),"Duplicate","")
Thatz all !!
 
John Smith
Ranch Hand
Posts: 2937
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Over in Java-land, HashSet or perhaps TreeSet are easy to use and well-suited to this problem, and if you're programming in Java at all there's a high incentive to learn to use collections effectively.
Seconded that. A hash-based collection is the standard solution to the problem posted, -- there is no need to reinvent the wheel and come up with the suboptimal solutions. Let the professionals at Sun solve it for you.
 
Jon Strayer
Ranch Hand
Posts: 133
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Eugene Kononov:

Seconded that. A hash-based collection is the standard solution to the problem posted, -- there is no need to reinvent the wheel and come up with the suboptimal solutions.

Just for the record, there were two solutions provided. One was an Excel macro (given that the data was in an Excel spreadsheed that's not a bad idea). The other was the use of two common command line tools.
Neither was what I would call "reinventing the wheel".

Let the professionals at Sun solve it for you.
While that's not bad advice when it come to writting Java code. When there is no need to write the code it falls under the headding of "reinventing the wheel".
[ September 14, 2003: Message edited by: Jon Strayer ]
[ September 14, 2003: Message edited by: Jon Strayer ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Yeah, there are a lot of different ways to do this, depending on the exact requirements, and what tools the programmer is already familiar with. And the original questioner is long gone, so it's sort of a moot point.
When there is no need to write the code it falls under the headding of "reinventing the wheel".
Not really, IMO - the Java version is simple enough that I'd say just do whichever solution comes to mind first, unless requirements dictate otherwise. Advice like "just download cygwin" could be a lot more effort for someone, compared to just writing a really simple Java program. Though, like my advice about collections - it's probably good for them in the long run.
Also remember that email addresses are not case sensitive. Both Java and sht unix 'sort' command are (by default).
Easy to fix:

Or if you want it sorted:

Nothing wrong with the other solutions either, if the user already has and is sufficiently familiar with those technologies.
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!