Handle Huge Data

 
Greenhorn
Posts: 3
Requirement:

1. Read a huge amount of data (1.1 million records) from the database and write it to a flat CSV file.

Existing Implementation:
1. I have a DAOImpl class that connects to the database, executes a query, and stores the resulting records in a List.
2. I iterate through the list and write it to a flat CSV file.

Issue in Existing System:
1. Since the number of records I am trying to store in the List is huge, the system fails with an OutOfMemoryError.


Proposed Solution I: Fetch the records in chunks and write each chunk to the file (it took 24 minutes to write one report of 850,000 records to the CSV file).

Code Implementation:

1. I created two methods: one stores a chunk of data from the result set into a list, and the other processes that list and writes it to the file (the chunk size can be anywhere from 1 to 100,000). With this approach the system successfully wrote the huge data set (850,000 records) to the flat CSV file in 24 minutes (the CSV output file is 45 MB). Here is the method I used to implement this scenario.
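(The posted [code] block did not survive in this copy of the thread, so what follows is only a minimal sketch of the kind of chunked DAO method described above; the class, query, and column names are illustrative, not the original code.)

[code]
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ReportDAOImpl {

    // Reads at most chunkSize rows from an already-open ResultSet into a List.
    // Returns an empty list once the ResultSet is exhausted.
    public List<String[]> getReportDataChunk(ResultSet rs, int chunkSize) throws SQLException {
        List<String[]> chunk = new ArrayList<String[]>(chunkSize);
        while (chunk.size() < chunkSize && rs.next()) {
            // Illustrative columns only; the real report has its own layout.
            String[] row = new String[3];
            row[0] = rs.getString("ORDER_ID");
            row[1] = rs.getString("CUSTOMER");
            row[2] = rs.getString("AMOUNT");
            chunk.add(row);
        }
        return chunk;
    }
}
[/code]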

Client Class Implementation for Proposed Scenario I:
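(Again, the posted client code is missing here; a hypothetical caller for the sketch above could loop like this, writing each chunk through a BufferedWriter and keeping only one chunk in memory at a time. The connection URL, query, and file name are placeholders.)

[code]
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;

public class ReportClient {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details and query.
        Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pwd");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT ORDER_ID, CUSTOMER, AMOUNT FROM ORDERS");

        ReportDAOImpl dao = new ReportDAOImpl();
        BufferedWriter out = new BufferedWriter(new FileWriter("report.csv"));
        try {
            List<String[]> chunk = dao.getReportDataChunk(rs, 100000);
            while (!chunk.isEmpty()) {
                for (String[] row : chunk) {
                    out.write(row[0] + "," + row[1] + "," + row[2]);
                    out.newLine();
                }
                chunk = dao.getReportDataChunk(rs, 100000);   // fetch the next chunk
            }
        } finally {
            out.close();
            rs.close();
            stmt.close();
            con.close();
        }
    }
}
[/code]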

Question:
1. If I modify my getReportDataChunk method to get the fetch size of the ResultSet and call rs.setFetchSize(100000), the same report is generated in 2:13 minutes. What is happening in the result set fetch? Also, when I debugged the result set while writing it to the file, I noticed that even after the initial fetch of 100,000 rows had been written to the file, the m_rows field of the ResultSet implementation was still holding the previous rows in memory. Am I wrong in my implementation? If you could give me a solution to achieve better performance results, that would help me a lot.

Thanks
Gandhi

[added [code] tags - Jim]
[ November 11, 2005: Message edited by: Jim Yingst ]
 
Ranch Hand
Posts: 1923
Thank you for using code tags; that makes it much more readable.

(Updated to say 'thank you'.)

[ November 10, 2005: Message edited by: Stefan Wagner ]
[ November 11, 2005: Message edited by: Stefan Wagner ]
 
Ranch Hand
Posts: 305
Of course this may be beside the point, but why would you want to store more than 1 million records in a flat file?

Also, you might look into multiple, smaller files and multi-threading to speed up your queries.

You might also want to look into a CachedRowSet implementation in order to free up your database connection while processing the results.
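(A minimal sketch of the CachedRowSet idea; the RowSetProvider factory shown here is the standard way to obtain one in later JDKs, and the query is a placeholder. Note that a CachedRowSet keeps all of its rows in memory, so for a result set as large as the one in this thread it would still have to be combined with chunking.)

[code]
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

public class CachedRowSetExample {

    // Copies the query result into a disconnected rowset so the
    // database connection can be released before the rows are processed.
    public static CachedRowSet fetchAndDisconnect(Connection con, String sql) throws Exception {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(sql);

        CachedRowSet cached = RowSetProvider.newFactory().createCachedRowSet();
        cached.populate(rs);

        rs.close();
        stmt.close();
        con.close();   // connection freed; 'cached' can still be iterated
        return cached;
    }
}
[/code]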
 
Bartender
Posts: 9626

Originally posted by Jeffrey Hunter:
. . . multi-threading to speed up your queries.



Threads are useful for making a program appear to do two things at once, like update a progress bar while processing data. Unless one is on a multiple-CPU system, these two threads of execution must share the CPU and, due to the overhead of bookkeeping and context switching, will actually run SLOWER than a single-threaded app.
Typical Database Bottlenecks:
1. Memory = In order to execute a query, a database must bring the records involved in the query into memory. This can cause quite a lot of churn as the contents of physical memory are swapped out to disk to make room for more records. If your server doesn't have enough memory, it is wasting time swapping when it could be searching.
2. Other Hardware = A database server is only as fast as its hardware (CPU, disk array and so on). If you aren't using enterprise-class hardware, forget about enterprise-class performance.
3. Network = If the database is on a remote server, the amount of data you can receive is limited by network throughput. If everybody is pummeling your 10 Mbps network downloading wares, forget about getting those 1 million records anytime soon.
4. Local Database = If the database is on the same machine as your app, the DB and the application can end up fighting for resources. This multiplies the effects of the problems above unless you have multiple CPUs and many gigs of memory.
Gandhi, what you need to do is figure out where your bottleneck is and address the problem. Our guessing isn't going to help you. The Java Platform Performance book is a good start on how to measure and improve performance, but unfortunately it doesn't cover JDBC.
[ November 11, 2005: Message edited by: Joe Ess ]
 
Author and all-around good cowpoke
Posts: 13078
Seems to me that multithreading could help if either the database or the file being written is local to the machine doing the work, since the network and local file system requests can run independently. If both processes need network bandwidth, it may be another story.
Real measurements will be required.
Bill
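(A minimal sketch of that pipelined idea, assuming the database is remote and the file is local: one thread pulls rows over the network while another drains them to disk, so the two waits overlap. The column layout and queue size are made up.)

[code]
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.ResultSet;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedExport {

    private static final String END_OF_DATA = "\u0000EOF";   // sentinel marking the last row

    public static void export(final ResultSet rs, String fileName) throws Exception {
        // Bounded queue so the reader cannot run arbitrarily far ahead of the writer.
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        // Reader thread: network-bound, turns each row into a CSV line.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    while (rs.next()) {
                        queue.put(rs.getString(1) + "," + rs.getString(2));
                    }
                    queue.put(END_OF_DATA);
                } catch (Exception e) {
                    // in real code, also signal the writer so it does not block forever
                    throw new RuntimeException(e);
                }
            }
        });
        reader.start();

        // Writer (current) thread: disk-bound, drains the queue to the local file.
        BufferedWriter out = new BufferedWriter(new FileWriter(fileName));
        try {
            String line;
            while (!(line = queue.take()).equals(END_OF_DATA)) {
                out.write(line);
                out.newLine();
            }
        } finally {
            out.close();
        }
        reader.join();
    }
}
[/code]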
 
Jeffrey Hunter
Ranch Hand
Posts: 305
I've just implemented a multi-threaded solution to speed up a slow query. Had a query running against time series data, which contained >55 million rows. Our final data set, once retrieved, was about 20MB.

I found that splitting the query up into three threads, each charged with retrieving an interval of the overall time series, cut the query time by 30%. I haven't done substantial testing on this, but I think that with some more tweaks we can cut it down even more. Of course, beyond some point the more threads you use, the less likely you are to save time, and it may end up costing you time; but in this case, three threads shaved off about 30%.
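(A rough sketch of that interval-splitting approach; the table, columns, and URL are invented, each thread uses its own connection, and the merge at the end simply concatenates the slices.)

[code]
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIntervalQuery {

    // One task per time slice, each on its own connection.
    static Callable<List<String>> slice(final String url, final long from, final long to) {
        return new Callable<List<String>>() {
            public List<String> call() throws Exception {
                List<String> rows = new ArrayList<String>();
                Connection con = DriverManager.getConnection(url);
                try {
                    PreparedStatement ps = con.prepareStatement(
                            "SELECT TS, VALUE FROM TIME_SERIES WHERE TS >= ? AND TS < ?");
                    ps.setLong(1, from);
                    ps.setLong(2, to);
                    ResultSet rs = ps.executeQuery();
                    while (rs.next()) {
                        rows.add(rs.getLong(1) + "," + rs.getString(2));
                    }
                    rs.close();
                    ps.close();
                } finally {
                    con.close();
                }
                return rows;
            }
        };
    }

    public static List<String> query(String url, long start, long end, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long step = (end - start) / threads;
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (int i = 0; i < threads; i++) {
            long from = start + i * step;
            long to = (i == threads - 1) ? end : from + step;   // last slice absorbs the remainder
            futures.add(pool.submit(slice(url, from, to)));
        }
        List<String> all = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            all.addAll(f.get());   // blocks until that slice has finished
        }
        pool.shutdown();
        return all;
    }
}
[/code]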
 
Ranch Hand
Posts: 71
As you have already discovered by setting a larger fetch size, most of your time was being spent in the network round trips between the Java app and the database (you reduced the run time from 24 minutes to about 2 minutes).
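(To illustrate that round-trip point, a minimal sketch; the table and the value 10,000 are placeholders. Many drivers default to a small fetch size, so every handful of rows costs a network round trip; a larger fetch size batches thousands of rows per trip, but the driver then buffers that many rows client-side, which is likely the m_rows buffer you saw holding rows in the debugger.)

[code]
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {

    public static void export(Connection con) throws Exception {
        Statement stmt = con.createStatement();
        stmt.setFetchSize(10000);   // hint: pull up to 10,000 rows per network round trip
        ResultSet rs = stmt.executeQuery("SELECT ORDER_ID, AMOUNT FROM ORDERS");
        while (rs.next()) {
            // write each row straight to the CSV file here instead of collecting it in a List
        }
        rs.close();
        stmt.close();
    }
}
[/code]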

What database are you using, and is the result set type scrollable?
 