nikolaus heger

Greenhorn
since Aug 27, 2004

Recent posts by nikolaus heger

I just looked into NIO and did some performance tests. In my tests NIO wasn't faster, and it may also cause memory problems.

Therefore, I would forget about it and just use BufferedInputStream.

I found that BufferedInputStream is about 3 times as fast as NIO, and that calling read(buf) with your own buffer directly on FileInputStream is about 3 times faster again. (All of this was measured after the file was already in the system cache. The difference will be much less dramatic when reading a file for the first time, which is what you will be doing.)
So if you just read your large file sequentially, scan for your XML begin and end tags, and store and write out what's between them, it should go pretty much as fast as your disk can deliver the data, which is the maximum speed you can achieve under any circumstances.
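To make the read(buf) approach concrete, here is a minimal sketch of a bulk sequential read. The class name BulkRead, the method countBytes, and the 64 KB buffer size are my own placeholders, not anything from the tests described above; it only counts bytes, where real code would scan them.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BulkRead {
    // Reads the stream to the end in chunks of up to bufferSize bytes
    // and returns the total number of bytes seen.
    static long countBytes(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        // read(buf) returns the number of bytes placed in buf, or -1 at end of stream
        while ((n = in.read(buf)) != -1) {
            total += n; // a real scanner would process buf[0..n) here
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // The file path comes from the command line (an assumption for this sketch)
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(countBytes(in, 64 * 1024) + " bytes");
        }
    }
}
```

The buffer size is worth experimenting with on your own hardware; I have not measured which value is best.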
I wouldn't really recommend plugging in a SAX parser, because it will parse all of the XML (doing a lot of work) for absolutely no reason, wasting lots of time. You can detect begin/end tags much more cheaply with simple character matching, and since you have a very special situation at hand, you would be able to do a lot of optimization, too. I imagine a scan-as-you-go algorithm...
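A rough sketch of what such a scan-as-you-go pass could look like follows. The names TagScanner and extract are made up for illustration, and the tags in the usage note are placeholders for whatever your file actually uses. It assumes the tags are non-empty plain ASCII strings in which '<' appears only as the first character (true for simple tags without attributes), which is what lets the matcher restart cheaply after a mismatch. It also ignores comments and CDATA sections.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TagScanner {
    // Copies everything between each beginTag/endTag pair to out, one newline
    // after each extracted section, and returns the number of sections found.
    static int extract(InputStream in, OutputStream out,
                       String beginTag, String endTag) throws IOException {
        byte[] begin = beginTag.getBytes(StandardCharsets.US_ASCII);
        byte[] end = endTag.getBytes(StandardCharsets.US_ASCII);
        byte[] buf = new byte[64 * 1024];
        boolean inside = false; // true while between a begin tag and its end tag
        int matched = 0;        // bytes of the current target tag matched so far
        int count = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                byte b = buf[i];
                byte[] target = inside ? end : begin;
                if (b != target[matched]) {
                    // A partial end-tag match turned out to be ordinary content
                    if (inside) out.write(target, 0, matched);
                    matched = 0;
                }
                if (b == target[matched]) {
                    matched++;
                    if (matched == target.length) {
                        if (inside) { out.write('\n'); count++; }
                        inside = !inside;
                        matched = 0;
                    }
                } else if (inside) {
                    out.write(b);
                }
            }
        }
        return count;
    }
}
```

The match state lives outside the buffer loop, so a tag that straddles two reads is still found. For example, calling extract with "<r>" and "</r>" on the input `junk<r>one</r>mid<r>a<b</r>tail` writes `one` and `a<b` on separate lines. Since content is written a byte at a time, wrapping the destination in a BufferedOutputStream is probably a good idea.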

The reason I'm responding is that you will most definitely have a problem if you parse these large files on Windows XP, because XP has a terrible caching problem. If you read in 1 GB of file data, it will swap all your other apps out of memory to make room for a huge disk cache. I don't know of any way to turn this off. I run into this all the time and it slows the system to a crawl... This may or may not affect you, just keep it in mind. You can see the behavior if you watch the System Cache figure in the Windows Task Manager. It will grow totally out of proportion when doing operations on large files, at the cost of all other apps and even the system. It's mad.

Win2000 and others do not have this problem, AFAIK, and Linux is probably even better.
20 years ago