Handle Huge Data

 
Greenhorn
Posts: 3
Requirement:

1. Read a huge amount of data (1.1 million records) from the database and write it to a flat CSV file.

Existing Implementation:
1. I have a DAOImpl class that connects to the database, executes a query, and stores the resulting records in a List.
2. I iterate through the list and write it to a flat CSV file.

Issue in Existing System:
1. Since the number of records I am trying to store in the List is huge, the system fails with an OutOfMemoryError.


Proposed Solution I: Fetch the records in chunks and write each chunk to the file (it took 24 minutes to write one report of 850,000 records to the CSV file).

Code Implementation:

1. I created two methods: one stores a chunk of data from the result set into a list, and the other processes that list and writes it to the file (the chunk size can be anywhere from 1 to 100,000). With this approach the system successfully wrote the huge data set (850,000 records) to the flat CSV file in 24 minutes (the CSV output file is 45 MB). Here is the method I used to implement this scenario.
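(The posted [code] block did not survive in this copy of the thread, so what follows is only a minimal sketch of the kind of chunked DAO method described above; the class, query, and column names are illustrative, not the original code.)

[code]
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ReportDAOImpl {

    // Reads at most chunkSize rows from an already-open ResultSet into a List.
    // Returns an empty list once the ResultSet is exhausted.
    public List<String[]> getReportDataChunk(ResultSet rs, int chunkSize) throws SQLException {
        List<String[]> chunk = new ArrayList<String[]>(chunkSize);
        while (chunk.size() < chunkSize && rs.next()) {
            // Illustrative columns only; the real report has its own layout.
            String[] row = new String[3];
            row[0] = rs.getString("ORDER_ID");
            row[1] = rs.getString("CUSTOMER");
            row[2] = rs.getString("AMOUNT");
            chunk.add(row);
        }
        return chunk;
    }
}
[/code]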

Client Class Implementation for Proposed Scenario I:
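(Again, the posted client code is missing here; a hypothetical caller for the sketch above could loop like this, writing each chunk through a BufferedWriter and keeping only one chunk in memory at a time. The connection URL, query, and file name are placeholders.)

[code]
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;

public class ReportClient {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details and query.
        Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pwd");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT ORDER_ID, CUSTOMER, AMOUNT FROM ORDERS");

        ReportDAOImpl dao = new ReportDAOImpl();
        BufferedWriter out = new BufferedWriter(new FileWriter("report.csv"));
        try {
            List<String[]> chunk = dao.getReportDataChunk(rs, 100000);
            while (!chunk.isEmpty()) {
                for (String[] row : chunk) {
                    out.write(row[0] + "," + row[1] + "," + row[2]);
                    out.newLine();
                }
                chunk = dao.getReportDataChunk(rs, 100000);   // fetch the next chunk
            }
        } finally {
            out.close();
            rs.close();
            stmt.close();
            con.close();
        }
    }
}
[/code]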

Question:
1. If I modify my getReportDataChunk method to get the fetch size of the ResultSet and call rs.setFetchSize(100000), the same report is generated in 2:13 minutes. What is happening in the result set fetch? Also, when I debugged the result set while writing it to the file, I noticed that even after the initial fetch of 100,000 rows had been written to the file, the m_rows field of the ResultSet implementation was still holding the previous rows in memory. Am I wrong in my implementation? If you could give me a solution to achieve better performance results, that would help me a lot.

Thanks
Gandhi

[added [code] tags - Jim]
[ November 11, 2005: Message edited by: Jim Yingst ]
 
Ranch Hand
Posts: 1923
Thank you for using code tags; that makes it much more readable.

(Updated to say 'thank you'.)

[ November 10, 2005: Message edited by: Stefan Wagner ]
[ November 11, 2005: Message edited by: Stefan Wagner ]
 
Ranch Hand
Posts: 305
Of course this may be beside the point, but why would you want to store more than 1 million records in a flat file?

Also, you might look into multiple, smaller files and multi-threading to speed up your queries.

You might also want to look into a CachedRowSet implementation in order to free up your database connection while processing the results.
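(A minimal sketch of the CachedRowSet idea; the RowSetProvider factory shown here is the standard way to obtain one in later JDKs, and the query is a placeholder. Note that a CachedRowSet keeps all of its rows in memory, so for a result set as large as the one in this thread it would still have to be combined with chunking.)

[code]
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

public class CachedRowSetExample {

    // Copies the query result into a disconnected rowset so the
    // database connection can be released before the rows are processed.
    public static CachedRowSet fetchAndDisconnect(Connection con, String sql) throws Exception {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(sql);

        CachedRowSet cached = RowSetProvider.newFactory().createCachedRowSet();
        cached.populate(rs);

        rs.close();
        stmt.close();
        con.close();   // connection freed; 'cached' can still be iterated
        return cached;
    }
}
[/code]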
 
Bartender
Posts: 9626

Originally posted by Jeffrey Hunter:
. . . multi-threading to speed up your queries.



Threads are useful for making a program appear to do two things at once, like update a progress bar while processing data. Unless one is on a multiple-CPU system, these two threads of execution must share the CPU and, due to the overhead of bookkeeping and context switching, will actually run SLOWER than a single-threaded app.
Typical Database Bottlenecks:
1. Memory = In order to execute a query, a database must bring the records involved in the query into memory. This can cause quite a lot of churn as the contents of physical memory are swapped out to disk to make room for more records. If your server doesn't have enough memory, it is wasting time swapping when it could be searching.
2. Other Hardware = A database server is only as fast as its hardware (CPU, disk array and so on). If you aren't using enterprise-class hardware, forget about enterprise-class performance.
3. Network = If the database is on a remote server, the amount of data you can receive is limited by network throughput. If everybody is pummeling your 10 Mbps network downloading wares, forget about getting those 1 million records anytime soon.
4. Local Database = If the database is on the same machine as your app, the DB and the application can end up fighting for resources. This multiplies the effects of the problems above unless you have multiple CPUs and many gigs of memory.
Gandhi, what you need to do is figure out where your bottleneck is and address the problem. Our guessing isn't going to help you. The Java Platform Performance book is a good start on how to measure and improve performance, but unfortunately it doesn't cover JDBC.
[ November 11, 2005: Message edited by: Joe Ess ]
 
Author and all-around good cowpoke
Posts: 13078
Seems to me that multithreading could help if either the database or the file being written is local to the machine doing the work, since the network and local file system requests can run independently. If both processes need network bandwidth, it may be another story.
Real measurements will be required.
Bill
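(A minimal sketch of that pipelined idea, assuming the database is remote and the file is local: one thread pulls rows over the network while another drains them to disk, so the two waits overlap. The column layout and queue size are made up.)

[code]
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.ResultSet;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedExport {

    private static final String END_OF_DATA = "\u0000EOF";   // sentinel marking the last row

    public static void export(final ResultSet rs, String fileName) throws Exception {
        // Bounded queue so the reader cannot run arbitrarily far ahead of the writer.
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        // Reader thread: network-bound, turns each row into a CSV line.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    while (rs.next()) {
                        queue.put(rs.getString(1) + "," + rs.getString(2));
                    }
                    queue.put(END_OF_DATA);
                } catch (Exception e) {
                    // in real code, also signal the writer so it does not block forever
                    throw new RuntimeException(e);
                }
            }
        });
        reader.start();

        // Writer (current) thread: disk-bound, drains the queue to the local file.
        BufferedWriter out = new BufferedWriter(new FileWriter(fileName));
        try {
            String line;
            while (!(line = queue.take()).equals(END_OF_DATA)) {
                out.write(line);
                out.newLine();
            }
        } finally {
            out.close();
        }
        reader.join();
    }
}
[/code]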
 
Jeffrey Hunter
Ranch Hand
Posts: 305
I've just implemented a multi-threaded solution to speed up a slow query. Had a query running against time series data, which contained >55 million rows. Our final data set, once retrieved, was about 20MB.

I found that splitting the query up into three threads, each charged with retrieving an interval of the overall time series, cut the query time by 30%. I haven't done substantial testing on this, but I think that with some more tweaks we can cut it down even more. Of course, beyond some point the more threads you use, the less likely you are to save time, and it may end up costing you time; but in this case, three threads shaved off about 30%.
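(A rough sketch of that interval-splitting approach; the table, columns, and URL are invented, each thread uses its own connection, and the merge at the end simply concatenates the slices.)

[code]
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIntervalQuery {

    // One task per time slice, each on its own connection.
    static Callable<List<String>> slice(final String url, final long from, final long to) {
        return new Callable<List<String>>() {
            public List<String> call() throws Exception {
                List<String> rows = new ArrayList<String>();
                Connection con = DriverManager.getConnection(url);
                try {
                    PreparedStatement ps = con.prepareStatement(
                            "SELECT TS, VALUE FROM TIME_SERIES WHERE TS >= ? AND TS < ?");
                    ps.setLong(1, from);
                    ps.setLong(2, to);
                    ResultSet rs = ps.executeQuery();
                    while (rs.next()) {
                        rows.add(rs.getLong(1) + "," + rs.getString(2));
                    }
                    rs.close();
                    ps.close();
                } finally {
                    con.close();
                }
                return rows;
            }
        };
    }

    public static List<String> query(String url, long start, long end, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long step = (end - start) / threads;
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (int i = 0; i < threads; i++) {
            long from = start + i * step;
            long to = (i == threads - 1) ? end : from + step;   // last slice absorbs the remainder
            futures.add(pool.submit(slice(url, from, to)));
        }
        List<String> all = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            all.addAll(f.get());   // blocks until that slice has finished
        }
        pool.shutdown();
        return all;
    }
}
[/code]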
 
Ranch Hand
Posts: 71
As you have already discovered by setting a larger fetch size, most of your time was being spent in the network round trips between the Java app and the database (you reduced the run time from 24 minutes to about 2 minutes).
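(To illustrate that round-trip point, a minimal sketch; the table and the value 10,000 are placeholders. Many drivers default to a small fetch size, so every handful of rows costs a network round trip; a larger fetch size batches thousands of rows per trip, but the driver then buffers that many rows client-side, which is likely the m_rows buffer you saw holding rows in the debugger.)

[code]
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {

    public static void export(Connection con) throws Exception {
        Statement stmt = con.createStatement();
        stmt.setFetchSize(10000);   // hint: pull up to 10,000 rows per network round trip
        ResultSet rs = stmt.executeQuery("SELECT ORDER_ID, AMOUNT FROM ORDERS");
        while (rs.next()) {
            // write each row straight to the CSV file here instead of collecting it in a List
        }
        rs.close();
        stmt.close();
    }
}
[/code]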

What database are you using, and is the result set type scrollable?
 