Parsing Huge Files - Size in 1-4 GB

 
Ranch Hand
We have a requirement where we will receive files ranging from 1 to 4 GB in size (this might increase in the future). We need to sort them based on certain criteria and then perform some operations on the records. What is the optimal approach to accomplish this?
 
Marshal
That's quite a vague question, so it's hard to make a good suggestion.

You say "sort them" but what does that mean? Sort them how? Sort them where? Then "perform some operations" means what?

If you're doing search and replace operations then perhaps a tool such as sed might be more appropriate.
 
Vaibhav Gargs
Ranch Hand
Hi Tim,

The file contains transaction records with fields such as Transaction Id, Date, Amount, Type, etc.

By sorting, we mean that we need to sort the transactions by their id and date.

By processing, I mean that we will apply some logic to those records before persisting them in the DB.

Regards.
 
Tim Cooke
Marshal
Ah OK, so it sounds like you want to parse the file line by line without pulling the whole thing into memory. Something like `Files.lines` would give you a Stream where each item is a line from the file, which you can then consume as you wish.
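The code from the original post did not survive, so here is a minimal sketch of the idea rather than the poster's exact snippet. It assumes a UTF-8 text file with one record per line; the class name `LineStreamer` and the method `forEachLine` are hypothetical names chosen for illustration. `Files.lines` reads lazily, so only the line currently being handled (plus the reader's buffer) needs to be in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.stream.Stream;

final class LineStreamer {

    // Streams the file lazily, passes each line to the handler,
    // and returns the number of lines that were handled.
    // Assumption: the file is UTF-8 text with one record per line.
    static long forEachLine(Path file, Consumer<String> handler) throws IOException {
        long[] count = {0};
        // try-with-resources closes the underlying file handle when the stream is done
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            lines.forEach(line -> {
                handler.accept(line);
                count[0]++;
            });
        }
        return count[0];
    }
}
```

A call such as `LineStreamer.forEachLine(path, line -> process(line))` would then handle one record at a time, where `process` stands in for whatever logic you apply to each record.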
 
Marshal
Won't you overwhelm the available heap memory with a 4GB file? Not if you process one element at a time, as Tim suggested.
If you have a database, let the database sort the output for you; databases are very good at handling large datasets like that.
 
Tim Cooke
Marshal
You could probably achieve this with very little memory usage at all.
Your memory usage will not exceed the size of a single record, assuming you do it sequentially in a single thread.
 
Saloon Keeper
If you REALLY need to sort the data before processing and saving it to the database (if records depend on earlier records), you need to use an external sorting algorithm, such as a K-way merge sort.
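To make the external sorting suggestion concrete, here is a rough sketch of a two-phase external merge sort, not a tuned implementation. It assumes one record per line and that a chunk of `maxLinesPerRun` lines fits comfortably in the heap; `ExternalSorter` and all of its method names are hypothetical. Phase 1 writes sorted "runs" to temporary files, and phase 2 merges all runs at once using a priority queue that holds one line per run.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class ExternalSorter {

    // Phase 1: read the input in chunks of at most maxLinesPerRun lines,
    // sort each chunk in memory, and write it to its own temporary run file.
    static List<Path> writeSortedRuns(Path input, int maxLinesPerRun,
                                      Comparator<String> order) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> chunk = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == maxLinesPerRun) {
                    runs.add(flush(chunk, order));
                }
            }
        }
        if (!chunk.isEmpty()) {
            runs.add(flush(chunk, order));
        }
        return runs;
    }

    // Sorts the chunk, writes it to a new temp file, and empties the chunk.
    private static Path flush(List<String> chunk, Comparator<String> order) throws IOException {
        chunk.sort(order);
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    // One open run together with the line currently at its head.
    private static final class Cursor {
        final BufferedReader reader;
        String head;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    // Phase 2: K-way merge. The heap holds one cursor per non-empty run,
    // ordered by head line, so memory use is about one line per run.
    static void mergeRuns(List<Path> runs, Path output,
                          Comparator<String> order) throws IOException {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> order.compare(a.head, b.head));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run);
                readers.add(reader);
                Cursor cursor = new Cursor(reader);
                if (cursor.head != null) heap.add(cursor);
            }
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();
                writer.write(smallest.head);
                writer.newLine();
                smallest.head = smallest.reader.readLine();
                if (smallest.head != null) heap.add(smallest);
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
    }

    // Convenience wrapper: split into sorted runs, then merge them.
    static void sort(Path input, Path output, int maxLinesPerRun,
                     Comparator<String> order) throws IOException {
        mergeRuns(writeSortedRuns(input, maxLinesPerRun, order), output, order);
    }
}
```

For the id and date ordering mentioned earlier in the thread, and purely as an assumption about the file layout (comma-separated, id in the first field, an ISO-formatted date in the second), a comparator could look like `Comparator.comparing((String l) -> l.split(",")[0]).thenComparing(l -> l.split(",")[1])`. That compares ids as strings, so numeric ids would need to be parsed first.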
 