Slow performance of bulk inserts into large MongoDB collection

I have data in JSON format, containing millions of records, that I want to insert into a MongoDB database. I wrote a Java program that reads the JSON file, parses it, and bulk inserts the documents into a MongoDB collection using the insertMany() method. Each bulk insert contains 10 000 documents, and the average document size is 13 kB. After roughly 300 000 documents have been inserted into the collection, the inserts start to slow down progressively. There are no indexes on the collection apart from the default one provided by MongoDB.
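The batching described above can be sketched roughly as follows. This is a minimal illustration, not the actual program from this post: the class name BatchInserter, the method insertInBatches and the sink parameter are hypothetical, and the sink only stands in for a call such as insertMany() so that the example runs without a database.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.IntStream;

// Hypothetical helper: groups parsed documents into fixed-size batches.
public class BatchInserter {

    /**
     * Drains docs into lists of at most batchSize elements and hands each
     * list to sink. Returns the total number of documents passed on.
     */
    public static <T> long insertInBatches(Iterator<T> docs, int batchSize,
                                           Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        long total = 0;
        while (docs.hasNext()) {
            batch.add(docs.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);               // one bulk insert per full batch
                total += batch.size();
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {                   // flush the final partial batch
            sink.accept(batch);
            total += batch.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in sink: records batch sizes instead of writing to MongoDB.
        List<Integer> batchSizes = new ArrayList<>();
        long total = insertInBatches(
                IntStream.range(0, 25).boxed().iterator(), 10,
                b -> batchSizes.add(b.size()));
        System.out.println(total + " documents in batches of " + batchSizes);
        // prints "25 documents in batches of [10, 10, 5]"
    }
}
```

With the MongoDB Java driver, the sink would presumably be something like collection::insertMany on a MongoCollection<Document>, with a batch size of 10 000 as in this post. A loop of this shape issues only insert commands.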

I looked into mongod.log to diagnose the problem. It appears that once the collection contains about 300 000 documents, every subsequent bulk insert is followed by an aggregate command that performs a COLLSCAN over the entire collection. By the time the collection contained 3 000 000 documents, the COLLSCAN took about 30 seconds. The duration of the bulk insert operation itself does not change; it stays at about 200 ms per 10 000 documents on average.

The complete log file from MongoDB can be found here: https://pastebin.com/STDZTJJU

The following log entry, extracted from mongod.log, is an example of the aggregate command that is executed after every insert. Here the COLLSCAN took more than 6 seconds.

Is there anything I can do to avoid the collection scans after every bulk insert?

I COMMAND  [conn2] command diploma.patent command: aggregate {
   aggregate: "patent",
   pipeline: [
       { $match: {}
       },
       { $group: {
           _id: null,
           n: { $sum: 1
               }
           }
       }
   ], cursor: {},
   $db: "diploma",
   $readPreference: { mode: "primaryPreferred" }
}
planSummary: COLLSCAN
keysExamined: 0
docsExamined: 2453599
cursorExhausted: 1
numYields: 19422
nreturned: 1
reslen: 123
locks: {
   Global: {
       acquireCount: {
           r: 19424
       }
   },
   Database: {
       acquireCount: {
           r: 19424
       }
   },
   Collection: {
       acquireCount: {
           r: 19424
       }
   }
} protocol:op_msg 6274ms