Parsing Huge Files - Size in 1-4 GB

 
Vaibhav Gargs
Ranch Hand
Posts: 221
We have a requirement where we will be receiving files ranging in size from 1 to 4 GB (this might increase in future). We need to sort them based on certain criteria and then perform some operations on the records. What is the optimal approach to accomplish this?
 
Tim Cooke
Sheriff
Posts: 4389
That's quite a vague question, so it's hard to make a good suggestion.

You say "sort them", but what does that mean? Sort them how? Sort them where? And what does "perform some operations" mean?

If you're doing search-and-replace operations, then perhaps a tool such as sed might be more appropriate.
 
Vaibhav Gargs
Ranch Hand
Posts: 221
Hi Tim,

The file contains transaction records with fields such as Transaction Id, Date, Amount, Type, etc.

By sorting, I mean that we need to sort the transactions based on their id & date.

By processing, I mean that we will be applying some logic to those records before persisting them in the DB.

Regards.
 
Tim Cooke
Sheriff
Posts: 4389
Ah OK, so it sounds like you want to parse the file line by line without pulling the whole thing into memory. Something like this would give you a Stream where each item is a line from the file; then you can consume it as you wish.
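For example, a minimal sketch (the file name and the per-record processing below are just placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class TransactionReader {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("transactions.txt"); // placeholder path

        // Files.lines reads the file lazily, so only a small buffer
        // is held in memory at any one time.
        try (Stream<String> lines = Files.lines(input)) {
            lines.forEach(line -> {
                // parse the line into a transaction record and process it here
            });
        }
    }
}

Keeping the Stream in a try-with-resources block matters, because it holds the underlying file handle open until it is closed.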
 
Marshal
Posts: 58830
Won't you overwhelm the available heap memory with a 4 GB file? Not if you process one element at a time, as Tim suggested.
If you have a database, let the database sort the output for you; databases are very good at handling large datasets like that.
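For example, if you load the raw records into a staging table first, reading them back in sorted order is just a query. A sketch only, assuming a hypothetical "transactions" table; the column names are made up:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public class SortedTransactionProcessor {

    // Sketch: the "transactions" table and its columns are hypothetical.
    public static void processSorted(DataSource dataSource) throws SQLException {
        String query = "SELECT txn_id, txn_date, amount, type "
                     + "FROM transactions ORDER BY txn_id, txn_date";
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                // rows arrive already ordered by id and date;
                // apply the business logic to each one here
                String txnId = rs.getString("txn_id");
                java.sql.Date txnDate = rs.getDate("txn_date");
            }
        }
    }
}

An index on (txn_id, txn_date) helps here, since the database can then return the rows in order without an extra sort step.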
 
Tim Cooke
Sheriff
Posts: 4389
You could probably achieve this with very little memory usage at all.
Your memory usage will not exceed the size of a single record, assuming you do it sequentially in a single thread.
 
Saloon Keeper
Posts: 8763
If you REALLY need to sort the data before processing and saving it to the database (for example, because records depend on earlier records), you need to use an external sorting algorithm, such as a K-way merge sort.
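The idea is to read the input in memory-sized chunks, sort each chunk and spill it to a temporary file, then merge the sorted chunk files with a priority queue. A rough sketch, assuming one record per line; the chunk size and the comparator (which here orders records by their first comma-separated field) are placeholders to adapt to your format:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class ExternalSort {

    // Placeholder: tune to the available heap.
    private static final int MAX_RECORDS_IN_MEMORY = 1_000_000;

    // Placeholder comparator: orders records by their first comma-separated field.
    private static final Comparator<String> RECORD_ORDER =
            Comparator.comparing((String line) -> line.split(",")[0]);

    public static void sort(Path input, Path output) throws IOException {
        mergeChunks(splitIntoSortedChunks(input), output);
    }

    // Phase 1: read memory-sized chunks, sort each one, spill it to a temp file.
    private static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= MAX_RECORDS_IN_MEMORY) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    private static Path writeSortedChunk(List<String> records) throws IOException {
        records.sort(RECORD_ORDER);
        Path chunk = Files.createTempFile("sort-chunk-", ".txt");
        Files.write(chunk, records);
        return chunk;
    }

    // Phase 2: K-way merge. Keep one line per chunk in a priority queue,
    // always write out the smallest, then refill from the chunk it came from.
    private static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Map.Entry<String, Integer>> queue =
                new PriorityQueue<Map.Entry<String, Integer>>(
                        (a, b) -> RECORD_ORDER.compare(a.getKey(), b.getKey()));
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            for (int i = 0; i < chunks.size(); i++) {
                BufferedReader reader = Files.newBufferedReader(chunks.get(i));
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    queue.add(new AbstractMap.SimpleEntry<>(first, i));
                }
            }
            while (!queue.isEmpty()) {
                Map.Entry<String, Integer> smallest = queue.poll();
                writer.write(smallest.getKey());
                writer.newLine();
                String next = readers.get(smallest.getValue()).readLine();
                if (next != null) {
                    queue.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
            for (Path chunk : chunks) {
                Files.deleteIfExists(chunk);
            }
        }
    }
}

With K chunk files, the merge phase keeps only K lines in memory at once, so peak memory usage is governed by MAX_RECORDS_IN_MEMORY during the chunking phase.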
 