I have a requirement to perform a clean insert (delete + insert) of a huge number of records (close to 100K) per request. For testing purposes, I'm running my code with 10K records. Even with 10K, the operation runs for 30 seconds, which is not acceptable. I'm doing some level of batch inserts as provided by Spring Data JPA, but the results are not satisfactory.
My code looks like below
I have also set the following properties
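(The actual properties aren't shown in this thread, but the usual Hibernate batch-insert settings look roughly like this; the batch size of 40 matches the figure mentioned below, and the property names follow Spring Boot / Hibernate conventions.)

```properties
# Enable JDBC batching in Hibernate (batch size of 40, per the question)
spring.jpa.properties.hibernate.jdbc.batch_size=40
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
# Note: for sequence-generated IDs, batching only helps if the
# @SequenceGenerator allocationSize is > 1; otherwise every insert
# still costs a sequence round trip to the database.
```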
Looking at the logs, I noticed that the insert operation takes around 3-4 seconds to save 2000 records, while the iteration itself takes very little time. So I believe reading through the stream is not the bottleneck; the inserts are. I also checked the logs and confirmed that Spring is doing batches of 40 inserts, as per the property set.
I'm trying to see if I can improve performance by using multiple threads (say 2) that read from a blocking queue and, once they have accumulated say 2000 records, call save. In theory, this might give better results. The problem is that, as I've read, Spring manages transactions at the thread level, and a transaction cannot propagate across threads. But I need the whole operation (delete + insert) to be atomic. I've looked into a few posts about Spring transaction management but could not find a clear direction.
Is there a way I can achieve this kind of parallelism using Spring transactions? If Spring transactions are not the answer, are there other techniques I could use?
I used a Person class as an example here, but the actual entity looks like the one below.
A unique constraint is defined on PERSON_DT and PERSON_ID, and a sequence is used for the ID.
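The accumulate-then-save idea from the question (gather ~2000 records, then call save once per chunk) can be sketched without any threading at all, which sidesteps the cross-thread transaction problem entirely. This is a minimal, illustrative helper; the class and method names are mine, not from the original code:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch: partitions a large list into fixed-size chunks so
 * each chunk can be handed to a single repository.saveAll(...) call.
 * The 2000-element chunk size mirrors the figure mentioned in the question.
 */
public class BatchChunker {

    public static <T> List<List<T>> chunk(List<T> source, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < source.size(); i += size) {
            // subList is a view; copy it if the source list will be modified
            chunks.add(source.subList(i, Math.min(i + size, source.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) records.add(i);

        List<List<Integer>> batches = chunk(records, 2000);
        System.out.println(batches.size()); // prints 5

        // In a Spring service, the whole loop would sit inside one
        // @Transactional method so delete + all inserts stay atomic, e.g.:
        //   repository.deleteAllInBatch();
        //   batches.forEach(repository::saveAll);
    }
}
```

Flushing and clearing the persistence context between chunks (`entityManager.flush(); entityManager.clear();`) is the usual companion trick, since it keeps the first-level cache from growing across 100K managed entities.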
Stephan van Hulst
posted 3 months ago
Not saying that any of these is necessarily wrong, but I'm just curious:
Why do you declare a sequence generator / generated value, instead of relying on just @Id?
Why do you have both a surrogate ID and a person ID?
Why do you set the batch size to 40 while creating lists of 2000 elements?
Anyway, I don't think multiple threads are going to help you (note that it's very iffy to create new threads in a web application in the first place). Maybe you can use a network analyzer to see whether the individual batches take a long time for the database to process. At any rate, hopefully a database wizard on these boards can help you out.
Just to add that nothing runs faster than a well-written stored procedure running inside the database. I'm not sure how this would work with your existing code, though.
I've also worked on systems in which information is marked as "deleted", typically with a boolean flag column in the table. This excludes the information from any future queries. At a later date, all rows that have been marked for deletion are deleted and/or moved into an archive. Of course, every database query then needs to consider this flag.
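The soft-delete pattern described above can be sketched in SQL like this (table and column names are illustrative, not from the original schema):

```sql
-- One-time schema change: add the flag column
ALTER TABLE person ADD COLUMN is_deleted BOOLEAN DEFAULT FALSE;

-- "Delete" becomes a cheap flag update instead of a physical delete
UPDATE person SET is_deleted = TRUE WHERE person_dt = :dt;

-- Every query must now exclude flagged rows
SELECT * FROM person WHERE is_deleted = FALSE;

-- Periodic purge/archive job, run off-peak
DELETE FROM person WHERE is_deleted = TRUE;
```

This trades query complexity (every read must filter on the flag) for a much cheaper "delete" step in the hot path.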
I want to caution about stored procedures. They can indeed be very effective, but they do tend to lock you into a particular vendor's database (and if you think that isn't a problem, you don't work for Amazon; they moved their entire enterprise off Oracle). You can also end up having to "treasure hunt" for code split between apps and the database, and source management for stored procedures historically wasn't as good as for regular code. So use stored procedures wisely.
Most likely the real bottleneck here is that when you add many records at a time, the database server goes crazy indexing them all. Adding the data itself is fairly fast, but the indexes are often balanced trees (B-trees or something similar), and the DB server ends up doing massive amounts of I/O rebalancing the trees with each record added.
I learned this the hard way almost the first day of my professional career and have been re-reminded occasionally ever since on a variety of platforms.
Common industry practice is to drop whatever indexes you have, load the table, then re-create the indexes. That way, instead of death by a thousand cuts, you take one big hit whose duration is much shorter than the sum of all the little indexing operations.
Of course, these days, a database might be continuously online and performance for queries would suffer while the indexes were removed, but traditionally you didn't dump that much data in while online systems were running.
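The drop-load-recreate pattern described above looks roughly like this (index and table names are illustrative; the unique index matches the PERSON_DT + PERSON_ID constraint mentioned in the question):

```sql
-- 1. Drop the index before the bulk load
DROP INDEX person_dt_id_uq;

-- 2. ... perform the bulk delete + insert here ...

-- 3. Rebuild the index in one pass afterwards
CREATE UNIQUE INDEX person_dt_id_uq ON person (person_dt, person_id);
```

One caveat: while the unique index is dropped, the database is no longer enforcing uniqueness, so the load itself must guarantee there are no duplicates, or the CREATE in step 3 will fail.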
It's also worth noting that almost all DBMS's have a vendor-supplied utility that can do batch loading. It may be faster than attempting the operation in Java code.
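For example, PostgreSQL's built-in COPY command loads a file server-side in a single statement (file path and column names here are illustrative); Oracle has SQL*Loader and MySQL has LOAD DATA INFILE for the same job:

```sql
COPY person (person_dt, person_id, name)
FROM '/tmp/person.csv' WITH (FORMAT csv, HEADER true);
```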
During a Spring web presentation I heard about [yet another] new Spring project, Spring Data R2DBC - it seems to be what you were asking for.
I don't know much about this new project, and this new "reactive" [buzz] word was new to me, but from the docs:
Part of the answer as to why R2DBC was created is the need for a non-blocking application stack to handle concurrency with a small number of threads and scale with fewer hardware resources. This need cannot be satisfied by reusing standardized relational database access APIs - namely JDBC - as JDBC is a fully blocking API. Attempts to compensate for blocking behavior with a ThreadPool are of limited use.