
nio package and memory mapped interface

 
Bhasker Reddy
Ranch Hand
Posts: 176
How do I use the nio package to split a file? I need to split a large file, on the order of gigabytes, which in turn contains multiple XML files. Someone told me to use the nio package and its memory-mapping interface, but I am not sure
what the memory-mapping interface is or how to use the nio package.
Does anyone have an idea? Basically, I need to split a large file containing multiple XML files, each of which starts with <?xml version. I currently use the regular io package, and it is very slow. Would nio be much faster? Please let me know.
Thanks
 
Joe Ess
Bartender
Posts: 9406
A MappedByteBuffer is a portion of a file mirrored to memory. There are some examples on NIO here. The current implementation of java.io uses java.nio behind the scenes, so I don't know if you can count on a big increase in performance just from switching over. It sounds to me like you have a throughput problem that probably won't be solved by mapping the file to memory. If the file is large, you may cause other problems by mapping it. Check out this chapter from Java Platform Performance on IO Performance for some more general IO advice.
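To make "a portion of a file mirrored to memory" concrete, here is a minimal sketch that maps a file one window at a time instead of all at once. A single call to FileChannel.map() cannot cover more than Integer.MAX_VALUE bytes, so a multi-gigabyte file has to be mapped in pieces anyway. The class name, method name, and the choice of counting a byte value are made up for illustration; this is not code from the linked examples.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class MappedWindowSketch {

    /** Counts how often 'target' occurs in the file, mapping windowSize bytes at a time. */
    static long countByte(Path file, byte target, int windowSize) throws IOException {
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += windowSize) {
                // The last window may be shorter than windowSize.
                long len = Math.min(windowSize, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == target) {
                        count++;
                    }
                }
                // There is no explicit unmap call; each mapping stays valid
                // until its buffer is garbage collected.
            }
        }
        return count;
    }
}
```

Note that a multi-byte marker could straddle two windows, which is one reason a mapped approach to splitting gets more complicated than it first looks.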
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
In what way does the one big file "contain" multiple XML files? Are the multiple files simply appended one after another to form a single file? Does the big file contain well-formed XML? Only one top-level element is allowed in XML, so to concatenate multiple files you'd need to introduce a higher-level element to bind them all together.

Or is the big file a zipfile or some other structure which has a special format for including subfiles?

If you can work out where each file begins and ends, it's relatively easy to copy contents into a new file. I recommend using FileChannel's transferTo() or transferFrom() methods.
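A minimal sketch of that copy step, assuming the start and end byte offsets of one embedded file are already known. The class and method names are hypothetical; transferTo() is the real FileChannel method, and it may transfer fewer bytes than requested, hence the loop.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class RangeCopySketch {

    /** Copies bytes [start, end) of source into target, replacing any existing target. */
    static void copyRange(Path source, long start, long end, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = start;
            while (pos < end) {
                long n = in.transferTo(pos, end - pos, out);
                if (n <= 0) {
                    break; // nothing more to transfer, e.g. 'end' lies past the end of the source
                }
                pos += n;
            }
        }
    }
}
```

Usage would be one call per embedded file, for example copyRange(bigFile, offsets[i], offsets[i + 1], partFile), where offsets is whatever list of boundaries the scanning step produced.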

Regarding using a MappedByteBuffer - does your system have enough memory to store the entire contents of the big file in memory at once? If not, then it's going to be difficult for you to use memory mapping, mostly because you can't reliably release memory used in a MappedByteBuffer except by letting the MBB get garbage collected. (Even then, there are no guarantees.) There are ways to do this with memory mapping, but I think it will be complex.

Another issue: do you know what character encoding is used in the file(s)? Are all the characters ASCII, or are there some characters from languages other than English? Parsing ASCII would be simpler, but realistically you may have to allow for the possibility that other encodings are used. If a variable-length encoding like UTF-8 is used, you may have more difficulty determining exactly where (in terms of bytes) one file ends and another begins. (Assuming the files have simply been concatenated, one after another.)
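One quick way to see how much the encoding matters for a byte-level search is to compare the bytes of the marker under different charsets. The sketch below is only an illustration with made-up names, and it assumes the marker being searched for is the ASCII text <?xml. In ASCII-compatible encodings such as ISO-8859-1 and UTF-8 the marker's bytes are the same as in ASCII, while in UTF-16 they are not, so a plain byte search would need a different pattern there.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class MarkerBytesSketch {

    /** True if 'marker' encodes to the same bytes in 'cs' as it does in US-ASCII. */
    static boolean sameBytesAsAscii(String marker, Charset cs) {
        return Arrays.equals(marker.getBytes(StandardCharsets.US_ASCII), marker.getBytes(cs));
    }

    public static void main(String[] args) {
        String marker = "<?xml"; // assumed start of each embedded file
        Charset[] candidates = {
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16LE
        };
        for (Charset cs : candidates) {
            System.out.println(cs + ": " + sameBytesAsAscii(marker, cs));
        }
    }
}
```

In UTF-8 specifically, bytes below 0x80 only ever encode ASCII characters, so an ASCII marker cannot be matched in the middle of a multi-byte character. The harder case is a file that mixes encodings or uses UTF-16.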
 
Bhasker Reddy
Ranch Hand
Posts: 176
My big file has the following structure:
xmlfile1
xmlfile2
xmlfile3
xmlfile4
It has XML files one after another. As you said, the entire file would be loaded into memory. That could be a problem in my case, because the
files are on the order of 4 to 7 gigs, and I cannot afford to load them into memory. Do you think memory mapping is helpful in this case?
Thanks
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
"Do you think memory mapping is helpful in this case?"

Probably not. First concentrate on locating where to split the files, without worrying about how to optimize it with NIO. I'd probably try a SAX parser or regular expressions. The point at which the SAX parser fails on illegal XML syntax (for example, a second XML declaration in the middle of the input) is where the second file starts.
 
nikolaus heger
Greenhorn
Posts: 1
I just looked into NIO and did some performance tests. NIO isn't faster, and it may also cause memory problems.

Therefore, I would forget about it and just use BufferedInputStream.

I found that BufferedInputStream is about 3 times as fast as NIO, and that a custom read(buf) directly on FileInputStream is 3 times faster than that (all after the file is already in the system cache; the difference will be much less dramatic when reading a file for the first time, which is what you will be doing).
So if you just read your large file sequentially, scan for your XML begin and end tags, and write out what's between them, it should go pretty much as fast as your disk does, which is the maximum speed you can achieve under any circumstances.
I wouldn't really recommend plugging in a SAX parser, because it will parse all the XML (doing a lot of work) for absolutely no reason, wasting lots of time. You can detect begin/end tags much more cheaply with simple character matching, and since you have a very special situation at hand, you would be able to do a lot of optimization too. I imagine a scan-as-you-go algorithm...
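A scan-as-you-go splitter along those lines might look like the sketch below. It is only one possible reading of the idea, under loud assumptions: every embedded file starts with the bytes <?xml, that sequence never occurs inside a file (for example in a comment or CDATA section), and the encoding is ASCII-compatible. The class name and the part-N.xml output names are invented for the example. It reads through a BufferedInputStream for simplicity; reading into a byte array with read(buf) would be the next optimization.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical splitter, for illustration only.
final class XmlSplitSketch {

    // Assumption: each embedded file begins with this prefix and it appears nowhere else.
    // The restart logic below relies on '<' occurring only at the start of the marker.
    private static final byte[] MARKER = "<?xml".getBytes(StandardCharsets.US_ASCII);

    /** Writes outDir/part-1.xml, part-2.xml, ... and returns the number of parts. */
    static int split(Path source, Path outDir) throws IOException {
        int parts = 0;
        int matched = 0;         // number of marker bytes matched by the most recent input
        OutputStream out = null; // current part; null until the first marker is seen
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source), 1 << 16)) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == MARKER[matched]) {
                    if (++matched == MARKER.length) {
                        // Full marker seen: finish the current part and start the next.
                        if (out != null) {
                            out.close();
                            out = null;
                        }
                        parts++;
                        out = new BufferedOutputStream(
                                Files.newOutputStream(outDir.resolve("part-" + parts + ".xml")),
                                1 << 16);
                        out.write(MARKER);
                        matched = 0;
                    }
                    continue;
                }
                // Mismatch: the partially matched bytes were ordinary content.
                if (out != null) {
                    out.write(MARKER, 0, matched);
                }
                matched = 0;
                if (b == MARKER[0]) {
                    matched = 1; // this '<' may begin a new marker
                } else if (out != null) {
                    out.write(b); // bytes before the first marker are dropped
                }
            }
            if (out != null) {
                out.write(MARKER, 0, matched); // flush a trailing partial match
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return parts;
    }
}
```

Because the input is read once, front to back, and each byte is written at most once, memory use stays constant regardless of the size of the big file.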

The reason I respond is that you will most definitely have a problem parsing these large files on Windows XP, because XP has a terrible caching problem. If you read in 1 GB of file data, it will swap all your other apps out of memory to make room for a huge disk cache. I know of no way to turn this off. I run into this all the time and it slows the system to a crawl... This may or may not affect you, just keep it in mind. You can see the behavior if you watch the system cache section of the Windows Task Manager. It will grow totally out of proportion when doing operations on large files, at the cost of all other apps and even the system. It's mad.

Win2000 and others do not have this problem, AFAIK, and Linux is probably even better.
 