Parsing Huge Files - Size in 1-4 GB

Vaibhav Gargs
Ranch Hand
We have a requirement where we will be receiving files ranging in size from 1 to 4 GB (this may increase in the future). We need to sort them based on certain criteria and then perform some operations on the records. What is the optimal approach to accomplish this?
 
Tim Cooke
Sheriff
That's quite a vague question, so it's hard to make a good suggestion.

You say "sort them", but what does that mean? Sort them how? Sort them where? And what does "perform some operations" mean?

If you're doing search-and-replace operations, then perhaps a tool such as sed might be more appropriate.
 
Vaibhav Gargs
Ranch Hand
Hi Tim,

The file contains transaction records with fields such as Transaction Id, Date, Amount, Type, etc.

By sorting, we mean that we need to sort the transactions by their id and date.

By processing, I mean that we will apply some logic to those records before persisting them in the DB.

Regards.
 
Tim Cooke
Sheriff
Ah, OK, so it sounds like you want to parse the file line by line without pulling the whole thing into memory. Something like this would give you a Stream where each item is a line from the file, which you can then consume as you wish.
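The snippet Tim alludes to didn't survive in this copy of the thread; a minimal sketch using `Files.lines` might look like the following (the file name `transactions.txt` is a placeholder, not from the original post):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LineStream {
    public static void main(String[] args) throws IOException {
        // Hypothetical input file; substitute the real transaction file path.
        Path file = Path.of("transactions.txt");

        // Files.lines reads lazily: only one line (plus the reader's buffer)
        // is held in memory at a time, so a 4 GB file is no problem.
        try (Stream<String> lines = Files.lines(file)) {
            lines.map(String::trim)
                 .filter(line -> !line.isEmpty())
                 .forEach(System.out::println); // process each record here
        }
    }
}
```

The try-with-resources block matters: the stream holds an open file handle that must be closed when processing finishes.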
 
Marshal
Won't you overwhelm the available heap memory with a 4 GB file? Not if you process one element at a time, as Tim suggested.
If you have a database, let the database sort the output for you; databases are very good at handling large datasets like that.
 
Tim Cooke
Sheriff
You could probably achieve this with very little memory usage at all. Your memory usage need not exceed the size of a single record, assuming you process them sequentially in a single thread.
 
Bartender
If you REALLY need to sort the data before processing and saving it to the database (for example, if records depend on earlier records), you will need an external sorting algorithm, such as a K-way merge sort: sort chunks that fit in memory, write each sorted chunk to disk, then merge the chunks.
 