Originally posted by Edward Chen: What collection type (Vector, ArrayList?) do we need to use?
Most likely you'd want to use no collection at all. A 4GB file would most likely take even more RAM than that once loaded as Java objects, so it's unlikely you could hold the whole thing in the Java heap (unless you have a 64-bit JVM, and even then storing the whole thing wastes space). It would make more sense to read, say, 50 records, batch up the JDBC inserts, and commit them; then go back and do 50 more.
As far as file I/O: be sure to use a BufferedReader.
If the conversions only apply to the data within one single line (i.e. lines are not related), then you should just read the file line-by-line and convert each line. You could then write that converted line to the database.
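A minimal sketch of that line-by-line approach, assuming each line can be converted on its own. The class name LineByLineConverter and the convert/sink parameters are made up for illustration; the sink stands in for whatever writes to the database.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: streams a text source one line at a time.
class LineByLineConverter {

    /**
     * Reads the source line by line, converts each line, and hands the
     * result to the sink. Only the current line is held in memory.
     * Returns the number of lines processed.
     */
    static long convertAll(Reader source,
                           Function<String, String> convert,
                           Consumer<String> sink) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                sink.accept(convert.apply(line));
                count++;
            }
        }
        return count;
    }
}
```

For a real file you would pass something like new FileReader(path) as the source; memory use then depends on the longest line, not on the file size.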
You certainly don't want to read the whole file into a Collection!
I'm no JDBC expert, but I suspect you could improve performance by holding off on writing to the database until you have read a few lines, then writing them as a batch. Presumably, you would also use precompiled statements (PreparedStatement) for the writes.
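Here is a rough sketch of what that batching could look like with plain JDBC. The class name BatchLoader, the table "records" and its column "payload" are invented for the example, and the whole line is stored unconverted; the batch size is something you would want to tune by measurement.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical loader: inserts lines in batches and commits per batch.
class BatchLoader {

    // Assumed schema, for illustration only; substitute your own.
    private static final String INSERT_SQL =
            "INSERT INTO records (payload) VALUES (?)";

    /** Returns the number of lines inserted. */
    static long load(BufferedReader in, Connection conn, int batchSize)
            throws IOException, SQLException {
        conn.setAutoCommit(false); // commit once per batch, not per row
        long total = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            String line;
            while ((line = in.readLine()) != null) {
                ps.setString(1, line);
                ps.addBatch();
                total++;
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                conn.commit();
            }
        }
        return total;
    }
}
```

Whether batching actually reduces round trips depends on the JDBC driver, so it is worth timing a few batch sizes against your own database.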
posted 12 years ago
Thanks for your reply.
I am wondering whether there is a performance comparison between Java, C++ and C# based on: 1. the same database; 2. the same file size, 4GB; 3. the same job: read, parse, convert and save into the database; 4. any technology (e.g., NIO) allowed, including third-party libraries.
You can read it into some type, but none of the collections, since they are poorly designed. Instead you'll have to write your own type that is "lazily evaluated" (I have written many such types released under the CPL) - since actually, how you read your file is dependent entirely on what you do with that file.
Does this mean that the entire String exists in memory when this function is itself evaluated? Absolutely not. You can replicate exactly this behaviour in Java and in some ways, the core API has done so even if not explicitly stated and more often than not, in a horribly contrived manner.
In short, you'll have to provide the case for what you are actually going to do with the file to provide a more thorough answer, but until then, the answer is "read it into a lazily evaluated structure (of course!)".
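To make "lazily evaluated" a bit more concrete, here is one possible reading of it in plain Java: an iterator that only pulls the next line from the reader when somebody asks for it. This is just an illustrative sketch, not one of the types mentioned above; the class name LazyLines is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical lazy view of a text source: lines are read on demand.
class LazyLines implements Iterator<String> {
    private final BufferedReader in;
    private String next;   // line read ahead, or null if none is pending
    private boolean done;  // true once end of input has been reached

    LazyLines(BufferedReader in) {
        this.in = in;
    }

    @Override
    public boolean hasNext() {
        if (next == null && !done) {
            try {
                next = in.readLine(); // the only place input is consumed
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (next == null) {
                done = true;
            }
        }
        return next != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String line = next;
        next = null;
        return line;
    }
}
```

Nothing beyond the reader's own buffer and the current line is held in memory, and a caller that stops iterating early never touches the rest of the file.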
As far as comparing Java and C++: if the code in both languages is well-written, you'll see no difference at all. The performance of your disk I/O (i.e., the OS itself) and database access (i.e., the database engine, and communications with it) will totally swamp any computational overhead.
As far as a "well-written" example goes, there are plenty of them out there; just the simple ones in Sun's I/O tutorial are fine. There are a few simple principles to adhere to:
- Use buffered I/O. Just wrapping a FileInputStream in a BufferedInputStream makes an enormous difference.
- Don't read just one byte at a time, but rather a decent-sized array full.
- Don't read 4GB using BufferedReader.readLine(), because creating all those Strings will kill you! Instead, try to process the data without creating any objects at all, if you can.
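As a small illustration of those principles, the sketch below reads into one reusable byte array and parses the data in place, without creating a String per line. It assumes a made-up file format (one non-negative decimal integer per line, ASCII or UTF-8) purely to have something to parse; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical scanner that parses bytes directly from a reusable buffer.
class ByteScanner {

    /** Sums one non-negative decimal integer per line, allocating no Strings. */
    static long sumOfIntegers(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024]; // one buffer, reused for every read
        long sum = 0;
        long current = 0; // value of the line being parsed; survives buffer refills
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                byte b = buf[i];
                if (b >= '0' && b <= '9') {
                    current = current * 10 + (b - '0');
                } else if (b == '\n') {
                    sum += current;
                    current = 0;
                }
                // any other byte (for example '\r') is ignored
            }
        }
        return sum + current; // the last line may lack a trailing newline
    }
}
```

You would typically call it with new BufferedInputStream(new FileInputStream(path)). Real record formats need more state than a single running number, but the pattern of carrying parse state across buffer refills stays the same.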
As far as splitting up the file: if that's a possibility, then it might be worth a try; multiple threads might be able to process data while others are waiting on I/O. You could simply use RandomAccessFile and start from N different locations within the file; figuring out what are valid start points might be tricky. As you say, NIO's asynchronous I/O capabilities are another possible option, although "let's use NIO" is not the magic speed bullet many people seem to think it is -- remember that all the FileReader/FileInputStream/RandomAccessFile/etc classes have been reimplemented on top of NIO in recent JDKs.
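For the "valid start points" problem, one simple approach (assuming records are newline-terminated and the encoding is ASCII or UTF-8, so a '\n' byte always ends a record) is to jump to an approximate offset and scan forward to just past the next newline. The class and method names below are invented for this sketch.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: finds line-aligned offsets for splitting a file.
class ChunkBoundaries {

    /**
     * Returns 'parts' start offsets. Chunk i covers bytes from starts[i]
     * up to starts[i + 1] (or end of file for the last chunk). Every
     * start after the first sits immediately after a '\n' or at end of file.
     */
    static long[] lineAlignedStarts(RandomAccessFile file, int parts)
            throws IOException {
        long length = file.length();
        long[] starts = new long[parts];
        starts[0] = 0;
        for (int i = 1; i < parts; i++) {
            // Approximate split point, never before the previous start.
            long guess = Math.max(length * i / parts, starts[i - 1]);
            file.seek(guess);
            int b;
            // Byte-at-a-time reads are slow, but this only scans one line.
            while ((b = file.read()) != -1 && b != '\n') {
                // keep scanning
            }
            starts[i] = file.getFilePointer();
        }
        return starts;
    }
}
```

Each worker thread would then open its own RandomAccessFile (or stream), seek to its start offset, and stop when it reaches the next chunk's start. If a single line is longer than a chunk, some chunks come out empty, which the workers need to tolerate.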