
Reading a huge file and dumping in DB

 
Gavi Raaghav
Ranch Hand
Posts: 82
I have a file with around 10 million rows. Currently a SQL Server utility (Bulk Insert) reads the file and places it in a temp table. Another stored procedure (SP) reads the temp table data, massages it, and updates it in the final destination table.

The same functionality needs to be implemented using Java. Following is the approach I am thinking of:

1. A Java program reads the file, which consists of 10 million rows.
2. The program will read every row and convert it into a Java object.
3. Business logic will be applied to the object, and if the object satisfies the validation it will be placed in a collection (a rough sketch of steps 2 and 3 appears after this list).
4. After every row has been read and the collection is ready, it will be passed to another Java program which will fire executeBatch and insert the rows (objects) into the database.
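
A rough sketch of steps 2 and 3, assuming a comma-separated file; the field names, delimiter, and validation rule here are invented for illustration:

// Hypothetical value object for one input row.
public class RowRecord {
    private final String id;
    private final String name;
    private final double amount;

    public RowRecord(String id, String name, double amount) {
        this.id = id;
        this.name = name;
        this.amount = amount;
    }

    // Step 2: convert one comma-separated line from the file into an object.
    public static RowRecord fromLine(String line) {
        String[] f = line.split(",");
        return new RowRecord(f[0], f[1], Double.parseDouble(f[2]));
    }

    // Step 3: stand-in for the real business validation.
    public boolean isValid() {
        return id != null && !id.isEmpty() && amount >= 0;
    }

    public static void main(String[] args) {
        RowRecord r = RowRecord.fromLine("42,widget,19.95");
        System.out.println(r.isValid());   // prints "true" for this sample line
    }
}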

There are a few questions regarding the above approach:

Q1. Performance-wise, is the original approach or the Java approach better?
Q2. What could be an alternative approach using Java or any other technology (probably Unix)?
Q3. In the Java approach described above, can multiple threads update the DB simultaneously by taking x rows each from the collection?
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
If I read you literally ...

You will create 10 million objects in a collection. There's some chance that won't fit in memory. You may want to work with a tunable number of "n" objects at a time.

You mention passing the collection to another Java "program". If you mean another program started from the command line "java program2" there's no way to pass a collection directly from one to another. If you really meant another Java class, I'm with you.

Given this task, I'd probably put "n" rows at a time into the collection, or maybe right into a SQL statement. That way I get a commit every "n" rows. Back in my mainframe days, programs like this had "restart" capability. If the first 1000 batches worked and batch 1001 failed, you could fix up the data or the code and tell the next run to skip (read by) the first thousand batches.

Do you have any idea how long it will run? Probably longer than the native DB load?
 
Gavi Raaghav
Ranch Hand
Posts: 82
James, thanks for the reply.
I was aware of the issue of creating 10 million objects in the collection. Can you please explain what you mean by a tunable number "n" of objects?
Also, will it be OK if I insert the rows into the database one by one?

Also, how do you put restart capability in the class?

Overall, is this approach sane?
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
The batch mode of Statement is made for things like this. The simplest solution, inserting one row at a time and committing only at the very end, would be relatively slow and would only save to the database if the whole input file worked.

A batched version would hit the database and commit only every "n" rows instead of every row, which should be significantly faster. Experiment to see how much faster and to find the best batch size.
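
A minimal sketch of that batched version, assuming a comma-separated input file; the JDBC URL, table and column names, and the validation rule are all placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        int batchSize = 1000;                           // the tunable "n"
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "password");
             BufferedReader in = new BufferedReader(new FileReader("rows.txt"))) {
            con.setAutoCommit(false);
            PreparedStatement ps = con.prepareStatement(
                    "insert into destination_table (col1, col2) values (?, ?)");
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");      // whatever the real row format is
                if (!isValid(fields)) continue;         // apply the business validation here
                ps.setString(1, fields[0]);
                ps.setString(2, fields[1]);
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch();                  // send the batch to the database
                    con.commit();                       // commit every "n" rows
                }
            }
            ps.executeBatch();                          // flush the final partial batch
            con.commit();
        }
    }

    private static boolean isValid(String[] fields) {
        return fields.length == 2;                      // stand-in for the real rules
    }
}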

As far as a restart goes, you might display or save somewhere the number of rows or batches that have been successfully committed. Then just put a skip loop before the other one.
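
A minimal sketch of such a skip, assuming the number of rows already committed is saved somewhere (a file, a log line, a console message) between runs; the class and method names are made up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RestartSkip {
    // Opens the input file and skips the rows a previous run already committed,
    // returning a reader positioned at the first row that still needs loading.
    static BufferedReader openAt(String file, long rowsAlreadyCommitted) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        for (long i = 0; i < rowsAlreadyCommitted; i++) {
            if (in.readLine() == null) {
                break;                                  // file is shorter than the checkpoint says
            }
        }
        return in;
    }
}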

That's awfully abbreviated. Let me know if it made sense.
 
Stefan Wagner
Ranch Hand
Posts: 1923
I had to insert about 10 million rows into a PostgreSQL database, and couldn't insert more than 100,000 rows uncommitted.
So I used an 'if (rows % 90000 == 0)' check to commit from time to time.

I used threads to split the work:
a) read source
b) parse and do some calculation
c) insert into database

This way I started filling the database before all the rows had been read.
 
Gavi Raaghav
Ranch Hand
Posts: 82
Hi Stefan,
The approach you stated is exactly what I want to do.
So if I understand correctly, you had three threads running simultaneously, say T1, T2 and T3:
T1 keeps reading rows and putting them into a temporary collection; T2 reads from that collection, applies the validation and puts the results into another temporary collection; and T3 picks them up from there and inserts the processed rows into the database.
Is the above understanding correct?

What was the performance of the system in terms of elapsed time, and what architecture did you follow?
I would also appreciate it if you could throw some more light on the implementation.
Also, how did you handle potential deadlocks when using threads?
 
Guy Allard
Ranch Hand
Posts: 776
You have a total row count at the start, yes?

Then I would suggest that you make 'rows processed per thread' a tunable, perhaps specified in a .properties file.

So, you might have 2 Threads, you might have 100.
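
A small sketch of reading such tunables, assuming a hypothetical loader.properties file with keys named threads and rowsPerThread:

import java.io.FileInputStream;
import java.util.Properties;

public class LoaderConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("loader.properties")) {
            props.load(in);
        }
        // The key names and defaults here are made up.
        int threads = Integer.parseInt(props.getProperty("threads", "4"));
        int rowsPerThread = Integer.parseInt(props.getProperty("rowsPerThread", "50000"));
        System.out.println(threads + " threads, " + rowsPerThread + " rows per thread");
    }
}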

I use this to good effect in some image processing functionality.

You may still have memory issues ......

Regards, Guy
 
Stefan Wagner
Ranch Hand
Posts: 1923
Originally posted by Gavi Raaghav:

Is the above understanding correct?

What was the performance of the system in terms of elapsed time, and what architecture did you follow?
I would also appreciate it if you could throw some more light on the implementation.
Also, how did you handle potential deadlocks when using threads?


Your understanding is correct.
My whole program is available under the GNU license here:
http://home.arcor.de/hirnstrom/freecddb/index.html

Its aim is to read the CD database file from freedb, which consists of 1 million CDs and about 10 million tracks.
I calculated a new hash code for every CD between reading and writing.

The program was running on an x86 laptop (1 GHz, 256 MB RAM) and reading from a local file.
The database was sitting on a second x86 PC (1 GHz, 1 GB RAM) running PostgreSQL on Linux, with no other users connected.

I don't remember the time needed, but it was at least 2 hours, probably more.

I used Vectors as the collections to store my objects, waited on a vector until it had a value to return when it was empty, and accessed it in a synchronized way.
I didn't run into any problems with this technique.
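
A minimal sketch of that kind of three-stage hand-off, using a BlockingQueue (which blocks on take() when empty) in place of the synchronized Vectors with waiting that Stefan describes; the queue sizes, sentinel, and row format are invented:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // Identity sentinel telling a consumer that the producer is done.
    private static final String POISON = new String("<end-of-data>");

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> rawRows = new ArrayBlockingQueue<>(10_000);
        BlockingQueue<String> validRows = new ArrayBlockingQueue<>(10_000);

        Thread reader = new Thread(() -> {
            try {
                // a) read the source file here, one line per put()
                rawRows.put("example,row,1");
                rawRows.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread parser = new Thread(() -> {
            try {
                // b) parse, validate, and do the per-row calculation
                for (String row = rawRows.take(); row != POISON; row = rawRows.take()) {
                    validRows.put(row.toUpperCase());
                }
                validRows.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread writer = new Thread(() -> {
            try {
                // c) addBatch() here, executeBatch() and commit every "n" rows
                for (String row = validRows.take(); row != POISON; row = validRows.take()) {
                    System.out.println("would insert: " + row);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); parser.start(); writer.start();
        reader.join(); parser.join(); writer.join();
    }
}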
 
Rick O'Shay
Ranch Hand
Posts: 531
Off the top of my head, my solution would be as follows:

Use JDK 1.4 NIO package
 
Rick O'Shay
Ranch Hand
Posts: 531
Originally posted by Rick O'Shay:
Off the top of my head, my solution would be as follows:

Use JDK 1.4 NIO package


Sorry, accidentally hit the Enter key. As I was saying:

Use JDK 1.4 or greater with the NIO package for memory-mapped files. That will take care of your memory constraints on the input. Second, use a prepared statement to perform batch inserts/updates of large blocks of processed records. PreparedStatement supports batch updates via addBatch().
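
A minimal sketch of the memory-mapped reading side, assuming the file fits in a single mapping (under 2 GB) and uses a single-byte encoding; the file name and the process() hook are placeholders:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileReader {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("rows.txt", "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            StringBuilder line = new StringBuilder();
            while (buf.hasRemaining()) {
                char c = (char) buf.get();          // assumes ASCII/Latin-1 data
                if (c == '\n') {
                    process(line.toString());
                    line.setLength(0);
                } else if (c != '\r') {
                    line.append(c);
                }
            }
            if (line.length() > 0) {
                process(line.toString());           // last line without a trailing newline
            }
        }
    }

    private static void process(String row) {
        // parse, validate, and hand the row to the batch inserter here
    }
}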
 