Need an approach to reduce the number of iterations

 
Ranch Hand
Posts: 74
Hi All,

I need an approach to achieve the following requirement.

I have a file with 10 lakh (1 million) lines. I need to compare each line with the remaining lines in the file and find the lines that are similar.
A naive approach would iterate over the same file 10 lakh times, which will take a lot of time.
I thought of putting all 10 lakh lines into an ArrayList and then iterating over this list, but that would take a lot of heap space and may run out of memory.

Is there any better approach to achieve this?

Please reply with your suggestions.

Thanks in advance,
K.Chandra Sekhar
 
Ranch Hand
Posts: 266

chandra kambham wrote:...
and it may run out of memory.
...



Ah, "may"!
First try it and see what happens. You wouldn't be the first programmer to worry prematurely about a problem when a simple, easy-to-implement solution would have done the trick.
 
Marshal
Posts: 80666
Do you mean similar or the same?
Consider other types of Collection, which might be able to record mappings or remove duplicates.
 
chandra kambham
Ranch Hand
Posts: 74
Hi Campbell,

I want the lines with the same content. I tried to implement this using an ArrayList, and when I ran the program the CPU utilization was at its peak and the system hung for some time.
 
Greenhorn
Posts: 4
chandra,
Instead of throwing all the lines into a single Collection at once, you can group the lines by the character they start with. That will significantly reduce memory consumption.
If that still isn't enough, you could group them by the first two characters.
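A minimal sketch of this grouping idea (assuming the lines have already been read into a List; class and method names are illustrative, and each bucket could also be processed and discarded one at a time to keep the peak footprint small):

```java
import java.util.*;

public class BucketDuplicates {

    // Group lines by their first character, then spot duplicates within
    // each bucket. Each Set stays much smaller than one Set over the
    // whole file, and buckets can be processed independently.
    public static Map<Character, Set<String>> findDuplicates(List<String> lines) {
        Map<Character, Set<String>> seen = new HashMap<>();
        Map<Character, Set<String>> dups = new HashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) continue;
            char key = line.charAt(0);
            Set<String> bucket = seen.computeIfAbsent(key, k -> new HashSet<>());
            if (!bucket.add(line)) {  // add() returns false for a repeat
                dups.computeIfAbsent(key, k -> new HashSet<>()).add(line);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("apple", "banana", "apple", "cherry", "banana");
        System.out.println(findDuplicates(lines));
    }
}
```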
 
author
Posts: 23959

Piet Verdriet wrote:
Ah, "may"!
First try it and see what happens. You wouldn't be the first programmer to worry prematurely about a problem when a simple, easy-to-implement solution would have done the trick.



Totally agree. 10 lakh is a million lines, right? So if each line is 80 characters, then we are talking about 80 (or 160) megabytes of data, right?

Henry
 
Greenhorn
Posts: 28
Why don't you try adding each line you read into a Set? If the line is a duplicate entry, the add operation will return false, so you will also collect all the duplicate entries.
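A minimal sketch of this Set idea, relying on the fact that Set.add returns false when the element is already present (names are illustrative; in practice the lines would come from a BufferedReader rather than a List):

```java
import java.util.*;

public class DuplicateLines {

    // One pass: every line goes into 'seen'; when add() reports the line
    // was already there, it is recorded as a duplicate.
    public static Set<String> duplicates(Iterable<String> lines) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new LinkedHashSet<>(); // keeps first-seen order
        for (String line : lines) {
            if (!seen.add(line)) {
                dups.add(line);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        System.out.println(duplicates(Arrays.asList("a", "b", "a", "c", "b", "a")));
        // prints [a, b]
    }
}
```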
 
Sheriff
Posts: 28401
I would just throw all of the lines into a database table with one column, then let the database do the heavy-duty work it is designed to do.
 
chandra kambham
Ranch Hand
Posts: 74
Yes Henry,

The file size is about 100 MB.
 
chandra kambham
Ranch Hand
Posts: 74
Hi Paul,

My initial thought was to add all these lines to a database and then process them using DB commands.

But the problem is that I have some 50-60 files, each approximately 100 MB in size, and storing all these lines in the DB would be a waste of DB space and would also increase the load on the DB server.
 
Ranch Hand
Posts: 96
First, you have to define "similar".

That said, create a data structure which represents a line with the following:
- An integer hash of the line (not necessarily the object's hashCode).
- Another int representing the line number.

Also, create two comparators:
- One which sorts by hash, followed by line number.
- One which sorts by line number (the list using this comparator should be much smaller!)

Once you have this, you can create a pair of lists and use the collection sort method.

There's still quite a bit more you need to do after you have your sorted lists, but I think you have enough to get started!
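A rough sketch of the hash-and-line-number idea (assuming the lines are available as a List; names are illustrative). Note that equal hashes only mark candidate duplicates, since different lines can collide, so the actual lines still need to be re-checked afterwards:

```java
import java.util.*;

public class HashSortDuplicates {

    // Only two ints per line instead of the line itself.
    static final class LineRef {
        final int hash;
        final int lineNo;
        LineRef(int hash, int lineNo) { this.hash = hash; this.lineNo = lineNo; }
    }

    // Sort the refs by hash (then line number) and collect runs of equal
    // hashes: each run is a group of candidate-duplicate line numbers.
    public static List<List<Integer>> candidateGroups(List<String> lines) {
        List<LineRef> refs = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            refs.add(new LineRef(lines.get(i).hashCode(), i));
        }
        refs.sort(Comparator.comparingInt((LineRef r) -> r.hash)
                            .thenComparingInt(r -> r.lineNo));
        List<List<Integer>> groups = new ArrayList<>();
        int i = 0;
        while (i < refs.size()) {
            int j = i;
            while (j < refs.size() && refs.get(j).hash == refs.get(i).hash) j++;
            if (j - i > 1) {  // at least two lines share this hash
                List<Integer> group = new ArrayList<>();
                for (int k = i; k < j; k++) group.add(refs.get(k).lineNo);
                groups.add(group);
            }
            i = j;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("foo", "bar", "foo", "baz", "bar");
        System.out.println(candidateGroups(lines));
    }
}
```

A second pass over the file (seeking by line number) can then compare the actual text within each group, which is cheap because the groups are small.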
 
Campbell Ritchie
Marshal
Posts: 80666
478
No longer a "beginner's" question. Moving.
 
Piet Verdriet
Ranch Hand
Posts: 266

chandra kambham wrote:Hi Paul,

My initial thought was to add all these lines to a database and then process them using DB commands.

But the problem is that I have some 50-60 files, each approximately 100 MB in size, and storing all these lines in the DB would be a waste of DB space and would also increase the load on the DB server.



Well, don't you want to put your computer to work? Isn't that what a computer is supposed to do: a lot of things every second?
You also ignored, or didn't respond to, my first reply, so let me ask you again: have you tried implementing your solution in a straightforward fashion? Like I said, you wouldn't be the first programmer to look for a super-fancy optimized solution when something less complicated might have done the trick. If you have done so and concluded that the straightforward solution performed too poorly, I'd appreciate it if you said so: then I wouldn't keep nagging you about this. ; )
 
chandra kambham
Ranch Hand
Posts: 74
Hi Piet,

Initially I implemented this by putting all the data into a single ArrayList, but that took too much memory and too long to execute.
Then, with the idea of multiple lists based on the first character (suggested by Tom), I was able to implement this with much less memory consumption and execute the program within 4-5 seconds.

Thanks all for your valuable suggestions.
 