• Post Reply Bookmark Topic Watch Topic
  • New Topic

Parsing huge CSV file/ Time to load in the memory  RSS feed

 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi,

Will I face any issues in parsing a huge CSV file (in GBs) ? If yes what should be the plan of action?

I would be applying logic row by row. How would this affect the performance ? Is the usage of core java for the parsing csv file using it efficient ?

Thanks,
Aditya
 
Ulf Dittmer
Rancher
Posts: 42970
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The size of the file to be parsed isn't as important as the amount of processing done per row, and the amount of data retained in memory per row. If both are limited, then there should be no problem; it will still take some time, though, purely because of the amount of data. It should be easy to write the parsing code and verify that it works with a file of that size.
 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Is is really necessary to load the entire file into memory? Usually, you handle the file line-by-line - you read a line, you process it, you discard it again. That way, at most one line of the file is in memory.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks a lot Ulf and Rob,

I would be having the csv file in my disk. I've tried opencsv and hence i'm using the class CSVReader(FileReader(D:\\Sample.csv)

So , I would put this in a loop and read every row , put the data of the row in variables or a structure and process that iteration. Then move to the next row and repeat.

This would continue for the entire csv file.

So how can I make this efficient considering that the csv resides in the physical drive ( hard drive)

thanks,
Aditya
 
Ulf Dittmer
Rancher
Posts: 42970
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I think CSVReader builds an in-memory model of the complete CSV, so that might not be a good approach for parsing very large files. That would be easy to try and find out, of course.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks a lot Ulf,

Any other suggestion which might help in this ? opencsv seems to be simple to parse the CSV file and get data into our local variables.

Using something else would be difficult

~Aditya
 
steve souza
Ranch Hand
Posts: 862
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I recently did something similar, and it was pretty simple, efficient and surprisingly fast. On my macbook I could read about 10 MB of the file a second. I would read in each row, process it and discard it. Reading and processing was about 8 mb a second (of course that will vary depending on what you do). Here is the basic idea. In your case the tokens would be the characters between each ',' in the csv.


 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ulf Dittmer wrote:I think CSVReader builds an in-memory model of the complete CSV, so that might not be a good approach for parsing very large files. That would be easy to try and find out, of course.

If we're talking about opencsv's CSVReader, it only stores the last read line when you use the readNext() method. And when I say store, I mean store temporarily, until the method ends. Apart from a BufferedReader for reading the file line by line, it only has a few primitive fields. If we're talking about storage, it's quite efficient.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks Steve and Rob,

I tried parsing a 6 GB file using CSVReader and it took 31 minutes to parse that file. I needto process the data further now for every row read.

Can this 31 minutes be reduced somehow ? How will increasing the RAM help ?

CSVReader uses BufferedReader hence the reading would be efficient. Can I somehow decrease the reading time? Is there any optimization technique here?

Thanks,
Aditya
 
Ulf Dittmer
Rancher
Posts: 42970
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Is it the reading of the file that takes up most of the time, or the processing of each row?
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi Ulf ,

Its just the reading of the file that is taking 31 minutes. Somehow I want to optimize it more so that it reduces the time significantly.

Thanks,
Aditya
 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You could try using a BufferedReader with a larger buffer size, but in the end you will need to read through all 6GB of data. And that will simply take some time, even in native programs.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hey Rob ,

Will that help me reduce the time atleast by 10 mins ? Also will increasing the RAM help ?

How efficient will Perl be in parsing the file in terms of time ? ( Though I dont know Perl )

Thanks,
Aditya
 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I doubt you will be able to trim as much as 10 minutes of your time, regardless of the technique. File I/O is slow, and 6GB simply will take a long time.

Still, feel free to try it out. You may surprise me.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I shall try increasing the buffer size of the buffered reader and make it double. Hoping to see atleast 5 mins efficiency after that

Thanks,
Aditya
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hello,

Just increased the buffersize of the Buffered Reader class to 32,768. No significant improvement found in the time reduced

Thanks,
Aditya
 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Just as I feared. It's not your program that is causing the long processing time but the hard disk. That's usually the bottleneck.
 
Ulf Dittmer
Rancher
Posts: 42970
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Do you mean 32768 bytes? That's almost nothing. Try something in the megabyte range. But i agree with Rob that you're unlikely to see substantial improvements.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hello,

Yeah its 32,768 bytes. I guess the default is 8192 bytes right ? I'll try putting something in Megabytes then.

Thanks,
Aditya
 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Adi Kulkarni wrote:I guess the default is 8192 bytes right ?

Correct.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hello all ,

One good news . I removed the System out statements for the rows and the time has reduced to 13.999 mins with 32,768 bytes as buffer size.

Thanks,
Aditya
 
Rob Spoor
Sheriff
Posts: 20903
81
Chrome Eclipse IDE Java Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That makes sense. System.out.println() again are I/O based system calls.

As for the buffer size, don't make it too large. I've been playing around with a 2GB VMware hard disk file using BufferedInputStream, but performance got worse when hitting mega bytes. In the end, the file system's page size would be ideal.
 
Adi Kulkarni
Ranch Hand
Posts: 86
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi Rob/Ulf,

Thanks a lot for your continued support.

Following were the results.
Time required to parse the file : 13.99 minutes Buffer size: 32,768 Bytes
Time required to parse the file: 13.77 minutes Buffer size: 2097156 Bytes
Time required to parse the file: 13.2215 minutes Buffer size: 16777216 Bytes

So no significant change in time.

Thanks,
Aditya
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!