Reading a CSV file --> fastest way

 
Jhakda Velu
Ranch Hand
Hi
I have to read a CSV file and an Excel file (these are my two input files). The CSV file contains data about individuals and the Excel file contains mapping info. The CSV file is expected to have about 300 rows of 100 columns each, and the Excel file about 50 rows of 10 columns each.
The CSV file will contain a header row as its first row.
I have to pick the values from the CSV file, see if a matching mapping exists in the Excel mapping file, and do some calculations. Not all 100 columns of the CSV file will be used; which ones depends on the mapping file (which can contain different mappings). I have to find all those values in the CSV file which have a mapping in the Excel file and do the calculations for those values. This has to be done for each row of the CSV file.
Now the question is: do I read each line of the CSV myself, use StringTokenizer to split it, and do the processing, or do I use one of the APIs such as Ostermiller's? Speed is of utmost importance to me. (Of course, if I use StringTokenizer, then after reading the first (header) row I will save which useful column is at which index, so that I need not search for the columns again in subsequent rows.)
I read a comparative study of opencsv which said that using StringTokenizer may be faster.
I am also OK with reading the entire file in one shot (saving the I/O) and then doing the processing, but I'm a bit wary about the size of the input CSV file, which may be around 1 MB.
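For illustration, a minimal sketch of that read-line-and-split approach (the file name and the "age" column are hypothetical):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvScan {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("input.csv"));
        try {
            // Read the header row once and remember where each column lives.
            String[] header = in.readLine().split(",");
            Map<String, Integer> colIndex = new HashMap<String, Integer>();
            for (int i = 0; i < header.length; i++) {
                colIndex.put(header[i].trim(), i);
            }
            int ageCol = colIndex.get("age"); // hypothetical mapped column

            String line;
            while ((line = in.readLine()) != null) {
                // NB: split(",", -1) keeps empty cells; StringTokenizer would
                // silently drop them and shift all following column indexes.
                String[] cells = line.split(",", -1);
                String age = cells[ageCol];
                // ... calculations for this row ...
            }
        } finally {
            in.close();
        }
    }
}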

Expert comments are needed.

Thanks in advance

Jhakda
 
Ulf Dittmer
Rancher
1 MB isn't much. I'd start by using one of the CSV libraries and see if that isn't fast enough for your purposes. If it isn't, you can always code up something else, although I'd guess that these libraries aren't particularly slow. (Note that reading CSVs isn't all that trivial once you consider newline characters, cells containing commas or semicolons, and quoted content.)

For reading XLS files, check out the Apache POI library.
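In outline (a sketch only; it assumes opencsv's old au.com.bytecode package name, an .xls mapping file, and made-up file names):

import java.io.FileInputStream;
import java.io.FileReader;
import java.util.Iterator;

import au.com.bytecode.opencsv.CSVReader;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public class ReadBoth {
    public static void main(String[] args) throws Exception {
        // CSV via opencsv: quoting, embedded commas etc. are handled for us.
        CSVReader csv = new CSVReader(new FileReader("input.csv"));
        String[] row;
        while ((row = csv.readNext()) != null) {
            // row[i] is the i-th cell of this line
        }
        csv.close();

        // XLS via Apache POI.
        HSSFWorkbook wb = new HSSFWorkbook(new FileInputStream("mapping.xls"));
        HSSFSheet sheet = wb.getSheetAt(0);
        for (Iterator<?> rows = sheet.rowIterator(); rows.hasNext(); ) {
            HSSFRow r = (HSSFRow) rows.next();
            // r.getCell(...) gives the individual mapping cells.
        }
    }
}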
 
Joe
Bartender
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.)
When to optimize
 
Jhakda Velu
Ranch Hand
Hi
Thanks Ulf and Joe for the really prompt replies.
Well, I've been assured that the input CSV file will contain mostly numeric values, without commas or other special characters, so I don't need the extra functionality the CSV libraries provide.
And reading with a BufferedReader won't be a problem, at least.
What worries me the most is speed.
Correct me if I'm wrong: won't these libraries also be building on BufferedReader to read the CSV? If so, won't using BufferedReader directly be the faster approach, given that I don't need the extra functionality?

Thanks

Jhakda

 
Ulf Dittmer
Rancher
Since it's so easy to test both ways, it hardly seems worthwhile spending a lot of thought on this.

But heed Joe's advice, and only start thinking about this once you've made sure that you need to worry about performance.
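For instance, something as crude as this would do (a sketch; parseWithSplit stands in for one of the two parsers being compared, and the loop count is arbitrary):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvTiming {
    public static void main(String[] args) throws IOException {
        parseWithSplit("input.csv"); // warm-up, so JIT compilation doesn't skew the result

        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            parseWithSplit("input.csv");
        }
        System.out.println("split: " + (System.currentTimeMillis() - start) + " ms");
        // Then run the same loop around the library-based parser and compare.
    }

    private static void parseWithSplit(String file) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            line.split(",", -1);
        }
        in.close();
    }
}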
 
Marshal

Originally posted by Jhakda Velu:
Correct me if I'm wrong: won't these libraries also be building on BufferedReader to read the CSV?

The library I looked at once for CSV processing didn't use that, or anything else. It took Strings that I passed it and parsed them into tokens as per the CSV rules. Which is as it should be. CSV parsers shouldn't be involved in reading from disks or URLs or anything else.

But I agree with Ulf. It's time to stop speculating now. You have plenty of information to work with.
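That separation might look like this (a sketch of a parser that is handed Strings and never touches a Reader; it deliberately ignores quoting):

import java.util.ArrayList;
import java.util.List;

public class SimpleCsvParser {
    // Takes a String, not a Reader: the caller decides whether lines
    // come from a file, a URL, an upload stream, or a unit test.
    public List<String> parseLine(String line) {
        List<String> cells = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == ',') {
                cells.add(line.substring(start, i)); // empty cells are kept
                start = i + 1;
            }
        }
        return cells;
    }
}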
 
Jhakda Velu
Ranch Hand
Hi All
Thanks for the advice. I will try out one of the approaches today and let you all know whether I meet the performance criterion, which happens to be 9 seconds. That includes the time for uploading the two files, processing them at the server, producing the output file, and sending it back to the client, so I have roughly 2-3 seconds for the processing itself. And yes, I'll be reading from a stream at the server.
I'm likely to go with the read line --> split string approach.

Thanks

Jhakda
 
Jhakda Velu
Ranch Hand
Hi All
I have adopted the read-each-line-of-the-file approach instead of using a library like opencsv, the major reason being that I didn't need any of the fancy additional functionality they provide.
I am able to upload the input CSV file and the mapping XLS file, read them at the server, do the processing, and return the output file in 6-7 seconds. I used the JExcelApi to read/write the Excel file.
My input file had data about users, each row representing the data for one user. For each user there were about 100 columns of data, not all of which are used to generate the output file. From the mapping file I get the meaning of the columns in the input file, and the columns in the input file are not in the same order as in the mapping file. For each user's data I end up generating around 70 rows in the output file. This could change if the mapping file is updated. For these reasons I did not use the VO (value object) approach, as that would mean updating the code every time the mapping or input file changes.
The approach I used was to read the first row of the input file, which contains the header info; this info is matched against the mapping file. As soon as I read the first row, I added the column headings and their respective positions as key-value pairs in a HashMap. For subsequent rows I need not loop to find a column's position.
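In outline, that looks something like this (a sketch; it assumes the mapping sheet's first column holds the CSV column names, which is a guess about the file layout):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import jxl.Sheet;
import jxl.Workbook;

public class MappingDriven {
    public static void main(String[] args) throws Exception {
        // 1. Read the mapped column names out of the XLS file with JExcelApi.
        Workbook wb = Workbook.getWorkbook(new File("mapping.xls"));
        Sheet sheet = wb.getSheet(0);
        Set<String> mapped = new HashSet<String>();
        for (int r = 0; r < sheet.getRows(); r++) {
            mapped.add(sheet.getCell(0, r).getContents()); // column 0 assumed
        }
        wb.close();

        // 2. Read the CSV header once and record the index of each mapped column.
        BufferedReader in = new BufferedReader(new FileReader("input.csv"));
        String[] header = in.readLine().split(",");
        Map<String, Integer> colIndex = new HashMap<String, Integer>();
        for (int i = 0; i < header.length; i++) {
            if (mapped.contains(header[i].trim())) {
                colIndex.put(header[i].trim(), i);
            }
        }
        // 3. Every subsequent row is then a straight index lookup -- no searching.
        in.close();
    }
}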

Thanks
Jhakda
 