which is better - array or arraylist??

 
Ravinder Rana
Greenhorn
Posts: 19
Hi all,

I am working on a project and facing some performance problems. I need to parse CSV data and store it in various tables in a database. Some of the fields (at most 5-8) in the CSV correspond to the same database fields, so before inserting them into the DB I need to hold them in a list or array.

There are more than 10,000 rows in the CSV, so I need to optimize this. Which is the better choice here: a plain array or an ArrayList?

Any help, suggestions, or links are welcome.

Thanks
 
Ulf Dittmer
Rancher
Posts: 43081
How do you know that you need to optimize? Have you coded it up, found it too slow, and determined that your choice of data structure makes all the difference in terms of performance? If the answer to any of those three is no, then it is too early for this kind of optimization.

Also, 10,000 does not sound like a big number in this context (unless, perhaps, each row contains hundreds of cells).

See EnterprisePerformance for some hints and pointers about optimization in general.
[ April 16, 2007: Message edited by: Ulf Dittmer ]
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
Based on my extensive experience with text parsing, the array versus ArrayList question is minor compared to String handling and the database operations.

Sure - if you know in advance the largest array you will ever need, reusing a single String[] will save some object creation, but I bet the savings will be small.
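Something along these lines is what I mean by reusing the array (just a sketch; the field limit and method names here are assumptions):

import java.util.StringTokenizer;

class RowParser {
    // One buffer reused for every row instead of a fresh array per row.
    private final String[] rowBuffer;

    RowParser(int maxFields) {
        rowBuffer = new String[maxFields];
    }

    // Fills the shared buffer and returns how many fields this row had.
    int parse(String line) {
        StringTokenizer st = new StringTokenizer(line, ",");
        int count = 0;
        while (st.hasMoreTokens() && count < rowBuffer.length) {
            rowBuffer[count++] = st.nextToken();
        }
        return count;
    }

    String field(int i) {
        return rowBuffer[i];
    }
}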

How much of this CSV data needs to be in memory at any one time? Can you handle it line by line?

If I had to bet, I would bet that by far the slowest part will be the database write operations.

Bill
 
Ravinder Rana
Greenhorn
Posts: 19
Thanks Ulf and William for your comments. Ulf, I have finished the coding and found that it works fine for CSV data containing 5,000 rows with 50 cells each, but as the number of rows increases the application slows down and sometimes throws an OutOfMemoryError. I think it's because I am creating a lot of objects. Here is the relevant part of the code I have written:
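Roughly like this (a sketch, not the exact code; the field names, column positions, and delimiter are assumptions):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class CsvImport {
    void load(String csvString) {
        String[] lines = csvString.split("\n");               // the whole CSV arrives as one big String
        for (String line : lines) {
            List<String> customInfo = new ArrayList<String>();     // new objects for every row
            Map<String, String> phoneNos = new HashMap<String, String>();
            StringTokenizer st = new StringTokenizer(line, ",");
            int col = 0;
            while (st.hasMoreTokens()) {
                String cell = st.nextToken();
                if (col >= 5 && col <= 8) {                    // the handful of fields that map to one DB table
                    phoneNos.put("phone" + (col - 4), cell);
                } else {
                    customInfo.add(cell);
                }
                col++;
            }
            String[] info = customInfo.toArray(new String[customInfo.size()]);
            // ... info and phoneNos are collected for a later bulk insert ...
        }
    }
}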



So I am creating customInfo and phoneNos objects for each row of CSV data. Can you tell me how to avoid this?

William, you are right that the database write operations are going to be the slowest. To mitigate this I am using a bulk-insert approach, which reduces the number of database calls.

Regards,
 
Ulf Dittmer
Rancher
Posts: 43081
I'm not convinced that the choice of data structure will make much of a difference. As William suggests, you can avoid memory problems by not keeping all the data in memory, but writing it to the DB intermittently.

By the way, the CSV code has problems: it doesn't look like it handles line breaks or quote characters inside quoted text.
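For example, an input like the following (made-up data) will be mangled by a plain split("\n"):

public class CsvEdgeCase {
    public static void main(String[] args) {
        // One header row plus one record whose quoted field contains a comma,
        // an escaped quote ("") and a line break.
        String csv = "name,notes,phone\n"
                   + "\"Smith, John\",\"said \"\"hello\"\"\nthen left\",555-1234\n";

        String[] lines = csv.split("\n");
        // Prints 3, even though there are only 2 logical CSV records:
        // the line break inside the quoted field is treated as a record separator.
        System.out.println(lines.length);
    }
}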
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

String[] lines = csvString.split("\n");



Does that mean you have read the entire file into a String?
If you don't absolutely need the whole data set in memory at one time, why not read and process it line by line - for example, with the readLine() method of BufferedReader?
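If the source were a file or an InputStream, that would look something like this (the file name and the processing step are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineByLine {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("data.csv")); // placeholder file name
        try {
            String line;
            while ((line = in.readLine()) != null) {
                processRow(line);        // parse and insert this one row, then forget it
            }
        } finally {
            in.close();
        }
    }

    private static void processRow(String line) {
        // placeholder for the parsing and the database insert
    }
}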

What exactly is the source of this CSV data? A file, or a stream of characters?

Bill
 
Ravinder Rana
Greenhorn
Posts: 19
Thanks Ulf, I have updated my code to handle line breaks and quotes inside quoted text.

William, I have read the whole data at once because the data is not in a CSV file; I am getting this from some other program and I have no control over it. So can you suggest some String handling optimization techniques for this particular case?

Thanks
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

I am getting this from some other program and I have no control over it.



Is this other program handing you a Java String or is there a character stream involved?

If you are stuck with a String, you might use it to create a StringReader which can then be used to create a LineNumberReader or BufferedReader. Both of those have a readLine() method that will give you the lines one at a time. With your present code, your memory has to hold the original String plus the same contents duplicated as String[].
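A minimal sketch of that approach (the processRow step is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StringLineReader {
    static void process(String csvString) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(csvString));
        String line;
        while ((line = in.readLine()) != null) {
            processRow(line);    // handle one row at a time; no String[] copy of the whole data
        }
    }

    private static void processRow(String line) {
        // placeholder: parse the row and write it to the database
    }
}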

Bill
 
Ravinder Rana
Greenhorn
Posts: 19

create a StringReader which can then be used to create a LineNumberReader or BufferedReader.



But the String will still remain in memory, so what's the benefit of doing this?
 
Jim Yingst
Wanderer
Posts: 18671
One option (probably the simplest, though not the best) is to just increase the amount of memory your JVM is allowed to use, with the -Xmx option.
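For example (the heap size and class name here are placeholders):

java -Xmx512m com.example.CsvImport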

Since memory is your problem, and that initial String is a big part of it, I would try to get rid of that String as soon as possible. My first choice would be to change the outside code to give you a Reader rather than a String. It's a bad API that forces you to keep all the data in memory at once.

If that's not possible (as you seem to indicate), then in your own method, you could write the String to a file as soon as you get it. (Assuming that the OutOfMemoryError doesn't occur before your own code starts.) Then read it back, one line at a time, and process each line as you get it. Writing and reading a file is probably slower than keeping everything in memory, but it avoids the OutOfMemoryError. It also may not be slower, as it means garbage collection won't have to work as hard. Also the performance of writing and reading a local file may well be irrelevant compared to the database. The most important thing is to avoid the OutOfMemoryError. Fix that, then see if the performance is acceptable, and if not, identify where the slowness is really coming from.
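A sketch of that write-then-read-back idea (the temp-file handling details are assumptions):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class SpillToDisk {
    static void process(String csvString) throws IOException {
        // 1. Write the big String to a temp file so it can be garbage collected.
        File tmp = File.createTempFile("csv-import", ".tmp");
        Writer out = new BufferedWriter(new FileWriter(tmp));
        try {
            out.write(csvString);
        } finally {
            out.close();
        }
        csvString = null;    // drop this method's reference (the caller must drop its copy too)

        // 2. Read it back one line at a time and process each line as it comes.
        BufferedReader in = new BufferedReader(new FileReader(tmp));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                processRow(line);
            }
        } finally {
            in.close();
            tmp.delete();
        }
    }

    private static void processRow(String line) {
        // placeholder for the parsing and the database insert
    }
}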

Incidentally, CSV parsing can have a number of gotchas that are not initially obvious. It's usually simpler to use an existing library. Google "java csv parser" for a number of options.
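For instance, with the opencsv library it might look like this (a sketch; the package name is from the old au.com.bytecode.opencsv releases, so check whichever version you actually use):

import java.io.StringReader;

import au.com.bytecode.opencsv.CSVReader;

public class WithOpenCsv {
    static void process(String csvString) throws Exception {
        CSVReader reader = new CSVReader(new StringReader(csvString));
        try {
            String[] row;
            while ((row = reader.readNext()) != null) {
                processRow(row);   // quotes and embedded line breaks are already resolved
            }
        } finally {
            reader.close();
        }
    }

    private static void processRow(String[] row) {
        // placeholder for the database insert
    }
}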

In the code shown, you don't actually do anything with customInfo or phoneNos after you gather their data, and they become available for garbage collection every time you move to a new line. Is there more code inside the loop that you're not showing?

You said: "Some of the fields (at most 5-8) in the CSV correspond to the same database fields, so before inserting them into the DB I need to hold them in a list or array." Does this refer to the ArrayList and HashMaps shown in the code here, or are you putting these records into another big list? If you can send each record off to the database as you read it, rather than putting it in a big list of some sort, that will also help your memory usage quite a bit.
 
Jim Yingst
Wanderer
Posts: 18671
[Bill]: With your present code, your memory has to hold the original String plus the same contents duplicated as String[].

Well, split() and the Strings created by the StringTokenizer aren't quite that bad, as they don't exactly duplicate their content. They both ultimately rely on substring() which creates a new String which shares the same backing char[] array as the original String. So yeah, each new String does take some more memory - typically 20 bytes I think - but not as much as if it had copied the content into a new char[]. Still, that's 20 bytes for each little phone number, plus the memory usage of the HashMaps (I'm assuming the ArrayList is GC'd after conversion to String[]). It does all add up.
[ April 18, 2007: Message edited by: Jim Yingst ]
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

They both ultimately rely on substring() which creates a new String which shares the same backing char[] array as the original String



Oh yeah, that bit me once, because the substring hangs on to the big backing string even after the last direct reference to it is gone. I had been under the impression that this was changed, but I see it is still true in the 1.5 String code.

However, I'm not sure that the expression being used - split("\n") - is in fact using substring(). It looks like it invokes a regex Pattern split, in which the new Strings are created with subSequence( index, index ).toString(), and it does a whole lot more besides.

Still, that looks like a lot of extra String creation to me, so use readLine() to get one String at a time.

Bill
 
Jim Yingst
Wanderer
Posts: 18671
[Bill]: I had been under the impression that this was changed but I see it is still true in the 1.5 String code.

Me too, but it was just pointed out here that I was mistaken. I think they only removed char[] sharing between String and StringBuffer, because of threading / memory-model issues when a mutable and an immutable object share the same char[]. They couldn't reliably fix those without adding synchronization or a transient field to the String class, which would have slowed it down unnecessarily for all the more common cases. For substring() this isn't an issue, because both the parent string and the substring are immutable; the char[] array is now held in a final field, avoiding the threading issues.

looks like that invokes a regex Pattern split in which the new Strings are created with subSequence( index, index ).toString()

The implementation of subSequence() in String for JDK 5 and 6 just calls substring(), and toString() on a String just returns the String itself.
 
Ranch Hand
Posts: 862
In general it is not a good idea to let your program consume memory without bound. What happens if next month you have 50,000 rows?

I would build into my code the ability to read a set amount (say 1,000 rows) and then write that batch to the database.
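A sketch of that chunked approach using a JDBC batch (the table and column names are made up, and the parsing is simplified):

import java.io.BufferedReader;
import java.io.StringReader;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class ChunkedInsert {
    private static final int BATCH_SIZE = 1000;

    static void load(Connection con, String csvString) throws Exception {
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO customer_info (name, phone) VALUES (?, ?)"); // made-up table and columns
        BufferedReader in = new BufferedReader(new StringReader(csvString));
        try {
            String line;
            int pending = 0;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",");   // simplified; a real parser should handle quoting
                ps.setString(1, cells[0]);
                ps.setString(2, cells[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {      // flush every 1000 rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();                  // flush the remainder
            }
        } finally {
            in.close();
            ps.close();
        }
    }
}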
 
Ravinder Rana
Greenhorn
Posts: 19
Thanks everyone for helping me out. All the comments helped me understand the concepts better. I have now changed my design a little, and I think it is scalable performance-wise.

Once again, thanks a lot.
 