Until a couple of days ago I was new to Spring Batch, and I have done some reading since.
We have a Spring Batch process that is supposed to read close to a million rows (a single column) from a table and write them to a text file.
It points to an Oracle DB, and the fetch size is left at the driver default (10).
Driver used is oracle.jdbc.driver.OracleDriver.
The ItemReader we use is Spring's JdbcCursorItemReader, and our writer implements ItemWriter, ItemStream, StepExecutionListener and InitializingBean.
We also have a RowMapper.
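For reference, a reader set up the way I described might look roughly like this in XML configuration (the bean names, SQL and fetchSize value here are illustrative placeholders, not our actual code):

```xml
<!-- Sketch of a JdbcCursorItemReader definition; names and SQL are placeholders -->
<bean id="itemReader" class="org.springframework.batch.item.database.JdbcCursorItemReader">
    <property name="dataSource" ref="dataSource"/>
    <property name="sql" value="SELECT some_column FROM some_table"/>
    <!-- Overrides the Oracle driver's default fetch size of 10 -->
    <property name="fetchSize" value="1000"/>
    <property name="rowMapper" ref="rowMapper"/>
</bean>
```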
The whole read/write process takes 5 hours every day, which is not acceptable to the client.
When I check the logs I see that, for each record, only 1 or 2 milliseconds elapse between the row being mapped and it being written.
However, the next row is mapped only after about 15 milliseconds. What is Spring doing in between? We want to optimize this.
Where should I start?
Any suggestions would be greatly appreciated!
However, I realized that my understanding of commitInterval was completely wrong: I thought it was related only to DB commits.
After your post I tried tuning the commitInterval, and it has definitely made the process faster. How much faster I am still measuring; I will let you know.
I am assuming that if I set the commitInterval too high, it will keep too many items in memory, which means heap issues and the like. Right?
Thanks for that great help.
One more question: if I am fetching only one column from the DB, do I still need a RowMapper?
I think I can save some time if I can get rid of the mapper somehow.
I tried removing it, and Tomcat throws an error saying the mapper is mandatory.
I am digging more into this, but would appreciate if you had any ideas.
If you do that, then you don't need the RowMapper. While I doubt you will see a great increase in performance, it is still very easy to write quickly.
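Another option, if writing a custom reader feels like too much: Spring's JDBC module ships a SingleColumnRowMapper, so the reader keeps its mandatory mapper but you don't have to write one yourself. A sketch (the requiredType and surrounding bean are assumptions about your setup):

```xml
<!-- Inside the JdbcCursorItemReader bean definition -->
<property name="rowMapper">
    <!-- Built-in mapper that returns the single column value directly -->
    <bean class="org.springframework.jdbc.core.SingleColumnRowMapper">
        <constructor-arg value="java.lang.String"/>
    </bean>
</property>
```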
Mustafa, as for commitInterval: the way Spring Batch works is called chunk processing. Within a step, the read and process phases happen item by item; the write portion then receives a collection whose size is the commit interval (chunk size). If I set that to 1000, it reads 1000 rows one by one, passing each row through the RowMapper; after each read/process, that item is placed into the collection, and when the collection reaches 1000 items it is sent to the writer. The writer writes the 1000 items in a batch, then commits the transaction. Then we do the reading/processing again for the next 1000 items. The reader does some caching of data until the transaction is committed. So with your Oracle fetch size of 10, the reader would contact the database 100 times per chunk, since the commit interval here is set to 1000. I would make the chunk size and the database fetch size the same number; 1000 would be good, and 10,000 would be fine too.
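The chunk-processing flow described above can be sketched in plain Java. This is only a simulation of the read/buffer/write cycle, not actual Spring Batch code, and the row counts are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of Spring Batch chunk processing: items are read
// and mapped one by one, buffered into a chunk, and the "writer" is
// invoked (with one commit) once per full chunk.
public class ChunkSimulation {

    public static void main(String[] args) {
        int commitInterval = 1000;   // chunk size
        int totalRows = 10_000;      // rows in the simulated table

        List<Integer> chunk = new ArrayList<>();
        int writeCalls = 0;

        for (int row = 0; row < totalRows; row++) {
            chunk.add(row);                       // read + map one item
            if (chunk.size() == commitInterval) { // chunk is full
                writeCalls++;                     // write batch + commit once
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {                   // flush a trailing partial chunk
            writeCalls++;
        }

        // 10,000 rows at a commit interval of 1000 means 10 writes/commits
        System.out.println("chunks written: " + writeCalls);
    }
}
```

With a commit interval of 1, the same loop would do 10,000 writes and commits, which is where the per-row overhead comes from.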
The reader is buffered, and the number of records read from the database in one go is most likely independent of the commit count; it depends on the driver, the ResultSet settings and other JDBC-related settings.
So why does a commit interval of one take longer? There are two reasons:
a) As Mark explained below, Spring follows a chunk-oriented approach, so at the end of every commit interval it saves the resource being written to (saves the file, commits the database transaction). This takes some time.
b) Spring also saves metadata at the end of every commit interval, namely the STEP_EXECUTION_CONTEXT and the number of rows read and processed. It needs this for restartability of the batch job. In your case this accounts for the major chunk of the time.
Commit intervals of up to 1,000 are safe; above that you hardly notice any performance gains. The Spring folks recommend a commit interval in the range of 100.
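For anyone following along, the commit interval is typically set on the chunk element of the step configuration. A sketch, with placeholder bean and job names:

```xml
<!-- Spring Batch XML namespace; ids and bean refs are placeholders -->
<job id="exportJob" xmlns="http://www.springframework.org/schema/batch">
    <step id="exportStep">
        <tasklet>
            <chunk reader="itemReader" writer="itemWriter" commit-interval="1000"/>
        </tasklet>
    </step>
</job>
```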
I got commitInterval added to the code.
After looking a little more at our Tomcat's memory settings (1 GB max), and given that this process runs every night at 2:00 AM when the server load is low, I concluded that I did not need to worry much about a high commitInterval overflowing the heap, and set the value to 10,000.
The process completed in 44 minutes.
Should I lower this value? (I am sorry I didn't read Sam's latest post carefully about the ideal value of commitInterval).
Thanks again mates.
Mark Spritzler wrote:You can always test it out. Can you set up a staging-like server with the same settings talking to a temp database, and just try different intervals to see how they affect it?
Been there, done that! 1,000 or 10,000 makes no big difference to the time. Anything over 10,000 might slow things down (memory tradeoff).
Mustafa Garhi wrote:For one million records I found a difference of only a few seconds between 1,000 and 10,000.
We decided to stick to 10,000.
CommitInterval has done the trick.
Thanks everybody !!
Cool. Also consider this, in case it has any effect: if a step fails and you restart, the job restarts at the chunk that caused the failure and has to redo that entire chunk. That could be 10,000 or 1,000 records depending on your chunk size, so there is a trade-off between higher and lower values. Just pick the happy medium between successful completion time and restart cost.