My project has a high-priority requirement that would require bulk data processing, be more specific, we need to parse an Excel file with 500,000 rows by POI and insert these 500,000 rows into database. Since this would be a time-consuming task for the application server. My strategy is
1) Use JMS queue to process the JDBC operation, so once file upload is completed. An event will be sent to the listener and put the database operation into JMS queue
2) Instead of regular JDBC operation, batch update will be applied when actually updating the database
3) An email will be sent to the user when the update is completed
The question I have is
1) Will POI 3.2 be able to able to handle 500,000 rows at a time?
2) More important, is there any other alternative design what you could recommend to process bulk data from Excel to database?
In a somewhat similar situation I had to parse and load large XML files and store certain values in an Oracle database, roughly over 10 million rows per night. This was in a JBoss ESB environment. Some files were small, several KB, but some of them were large, for instance 20GB and over. Instead of reading and sticking so much data in the JMS queue, I simply had the process "lock" the XML file (rename it) and only sent the file name in the queue. Then the receiving action class would pick up that file and parse it with WoodStox and use Hibernate to do the DB inserts. This worked quite well and very fast, and with multiple threads (multiple files at once) that was configurable in the ESB settings.
Does the data have to be in Excel format? Can't it be in CSV? Most servicing systems that generate large amounts of data have an easier time generating CSV because it's plain text. Also, most databases provide utilities that allow you to bulk load CSVs into the database. The database provided utility is generally orders of magnitude faster than doing JDBC batch update operations yourself. You can make your JDBC batch updates as fast as the database utility, but it would require quite a bit of reinventing of the wheel to get there. By using database utility, you are tying yourself to the database. However, IMO, it's worth it.
If you can push back on the requirement and go for CSVs instead of Excel, you might not need background processing. We routinely load 300-400K worth of records using sqlldr within 10s. It;s not a good idea to run the sqlldr on the web server in the long run. It would be better to offload that processing to another node. However, if you are looking at something quick and dirty, or doing a proof of concept, you can always run sqlldr on the web server today, and then build the logic to run it on a different node later.
Yes, it would indeed be a good idea to see which loading or inserting tools you have with your database. You may also be able to tune your database further to perform faster on bulk inserts. For instance for Oracle, there are big performance differences in which tools or inserting methods you are using. Check this article out: http://dba-oracle.com/t_optimize_insert_sql_performance.htm
In the project I am working on , Excel file is the only choice they have at this point, and database is managed by another company so there will be political obstacle that has to be overcome in order to run SQL Loader directly on Oracle server. In this case, the solution has to be constrained to Excel file on J2EE container itself. Any choice other than JMS queue?