I am working on an interface that does DB interaction between two systems. As part of my work, I query the source DB by invoking a stored procedure, receive the data in a reference cursor, and populate the destination DB. Because the data volume can be huge, I use multi-threading on the destination side to invoke the procedure. For example, if a total of 1 million entries is to be loaded, then on the destination DB the procedure is invoked, say, 10 times with 100K records each. This arrangement works fine except when the data volume at the source DB is very large (for example, more than 2 million entries). I have allocated around 20 GB of heap space for processing the records, but my program is failing with a heap memory error. I want to know whether there is a way to query the data from the source DB in parallel or in chunks (for example, assuming a total of 2 million records is returned by the source stored procedure, my program would first fetch a subset of the records and then move on to the next subset, and so on). One solution I have proposed is to have the DB side send the records in this manner, but I want to know if there is a better alternative. Please suggest.
Is the data move part of an ETL job?
>> Yes.. but the current system does not use any ETL tool
Where was the multi-threading achieved? Through multiple parallel JDBC calls to a procedure?
Where was 20GB of heap allocated? On a Java program?
If this is part of an ETL job, you are much better off using an ETL tool. The tool will have already solved many of the problems you face now, and ones that you have not even begun to consider. There are open-source/free ETL tools out there. Use one of them to extract the data instead of a custom Java program.
I've used ETL tools before to load millions of rows into tables and I've never once worried about RAM.
nitinram agarwal wrote:Hello Bala,
Thanks for your response. Unfortunately the client does not have the budget for ETL products, so for the moment we have to continue with the Java-based program.
Some of these ETL products are open source. You can use them for free and pay for support only when you need it. If you want to stick with the Java program and not run out of memory, I'd recommend limiting the threads through a thread pool, and using a profiler to keep an eye on memory and on what consumes the most of it. Running out of 20GB of heap space is abnormal even for a program that loads tons of rows into a database.
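To make the thread-pool suggestion concrete, here is a minimal sketch of capping concurrency with a fixed-size pool, so that no matter how many load jobs arrive at once, only a few actually run while the rest queue up. The class name, pool size, and `runJobs` helper are illustrative, not from the original program:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedLoadRunner {
    // Hard cap on concurrency: only POOL_SIZE jobs run at once (value is illustrative).
    private static final int POOL_SIZE = 4;

    // Submits jobCount copies of the given job and waits for them all to finish.
    // Returns the number of jobs that completed.
    public static int runJobs(int jobCount, Runnable job) {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < jobCount; i++) {
            pool.submit(() -> {
                job.run();                 // e.g. one 100K-record load into the destination DB
                completed.incrementAndGet();
            });
        }
        pool.shutdown();                   // no new jobs; queued ones still run
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```

With a pool like this, launching "more than 10" jobs at once no longer means more than 10 sets of buffers live in memory at the same time; excess jobs simply wait in the queue.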
The out-of-memory issue happens when multiple jobs of a similar nature are invoked at the same time (more than 10). I have profiled my code to make sure any memory leak is fixed. On my desktop I have a heap size of 1 GB, and some of these jobs fail with a heap error there too. The jobs cannot be executed one after another, as some of the processing is SLA-bound. I agree that the process might need a relook, but I was trying to see whether it is technically possible to fetch the data in parallel, and whether I can get some control over how the multiple jobs are invoked.
Why do you keep the records in memory? If the process just copies records from one connection to another (possibly manipulating some of them on the way), there should be no need to keep them in RAM. Such a process should need next to nothing in terms of memory, even if it uses several threads at once.
(Sometimes the JDBC driver does some caching. Make sure you're using forward-only, read-only result sets.)
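To illustrate, here is a minimal sketch of such a copy loop: a forward-only, read-only result set with a modest fetch size on the source, and batched inserts on the destination, so that only one batch of rows is ever in memory. The table and column names are made up for the example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingCopy {
    static final int FETCH_SIZE = 1_000;  // rows the driver buffers per round trip
    static final int BATCH_SIZE = 1_000;  // rows per INSERT batch on the destination

    // True whenever a full batch has accumulated and should be flushed.
    static boolean flushDue(long rowsSeen) {
        return rowsSeen > 0 && rowsSeen % BATCH_SIZE == 0;
    }

    // Copies rows from src to dst without ever holding the full result set in memory.
    public static long copy(Connection src, Connection dst) throws SQLException {
        dst.setAutoCommit(false);          // commit per batch, not per row
        long rows = 0;
        try (Statement sel = src.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             PreparedStatement ins = dst.prepareStatement(
                 "INSERT INTO dest_table (id, payload) VALUES (?, ?)")) {
            sel.setFetchSize(FETCH_SIZE);  // stream instead of caching everything
            try (ResultSet rs = sel.executeQuery(
                     "SELECT id, payload FROM source_table")) {
                while (rs.next()) {
                    ins.setLong(1, rs.getLong("id"));
                    ins.setString(2, rs.getString("payload"));
                    ins.addBatch();
                    if (flushDue(++rows)) {
                        ins.executeBatch();
                        dst.commit();
                    }
                }
            }
            ins.executeBatch();            // flush the final partial batch
            dst.commit();
        }
        return rows;
    }
}
```

Note that some drivers need extra settings before they actually stream (for example, MySQL's Connector/J only streams when the fetch size is set to `Integer.MIN_VALUE`), so check your driver's documentation.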
Why can't you use a batch size of less than 10K?
Is it a 32-bit or 64-bit system? On a 32-bit system you will not be able to allocate that much memory without running into swap-space errors.
Is it on Unix or Windows?
Stack-size tuning can help resolve memory issues.
Can you implement a check on the source DB so it provides the data in smaller chunks?
Is your code scalable, so that you can run multiple Java processes?
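On the last two points: if the source rows have a numeric key, one way to get the data in smaller chunks is to split the key space into ranges up front and hand each range to its own worker or process, each running a bounded query such as `SELECT ... WHERE id >= ? AND id < ?`. A small sketch of the range-splitting part, assuming a dense key starting at 0 (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlanner {
    // Splits [0, total) into consecutive [start, end) ranges of at most chunkSize rows.
    // Each range can then be fetched independently, e.g. with
    //   SELECT ... FROM source_table WHERE id >= ? AND id < ? ORDER BY id
    public static List<long[]> chunkRanges(long total, long chunkSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < total; start += chunkSize) {
            ranges.add(new long[] { start, Math.min(start + chunkSize, total) });
        }
        return ranges;
    }
}
```

For 2 million rows with 100K-row chunks this yields 20 ranges, each small enough to fetch, load, and discard before moving on to the next.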
nitinram agarwal wrote:Where was 20GB of heap allocated? On a Java program?
Then, unless the units of data you're processing are huge, it sounds to me like you do have a memory leak. Even allowing for some overhead, running out of 20GB of memory on 2 million entries suggests that each one is using around 10KB of space, and I've never seen a table with rows that big (and I've worked on some pretty big databases).
How big is the database itself?