
Chunking an overly large xml file?

 
Ranch Hand
Posts: 47
I have a very large XML file (about 300,000 KB) that I need to parse. The size can vary and could be larger.

I need to look for a certain start tag and then its corresponding close tag, and then process that chunk before moving on to the next one.

My first thought was that I could use Smooks' splitting and routing, but I have been unable to find a way to make it work. You can split, but the only options for routing seem to be file, JMS, or database. I really just want to route the chunk to a class/method so that I can check whether the chunk qualifies and then decide what to do with it.

I have also experimented with Readers and FileChannels, but there doesn't seem to be an easy and fast way to accomplish this task.

Any ideas? I'm not overly familiar with IO as I don't often code it, so I'm hoping that I'm just overlooking an obvious solution.

Thanks
 
Marshal
Posts: 28226
I don't know anything about Smooks but when I looked at their home page it actually said "JMS, File, Database etc". So perhaps the "etc" part would cover your requirement?

I don't see how Readers and Channels would help at all. You need to parse the XML so you need an XML parser, which is at a higher level than the low-level choice of file access methods. So you can't do anything until you choose your parser.
 
Sheriff
Posts: 3063
I don't know anything about Smooks either, but it seems to me that what you want to use is a SAX parser. Unlike DOM parsers, which force you to load an entire XML file into memory before working with it, a SAX parser lets you parse as you go, if you get my meaning. You don't have to do anything clever to split the file into chunks. Just open the file with a normal buffered stream and hand it to the SAX parser; the parser reads it block by block and calls your handler as each element goes by.
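
To make that concrete, here is a minimal sketch of the SAX approach. The class name `SaxChunker`, the `order` tag, and the sample XML are all made up for illustration; the original poster's real tag names aren't given. It only collects the text inside each chunk (not the markup) and hands it to a callback, so treat it as a starting point rather than a finished solution.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the text of each <chunkTag> element and
// passes it to a callback as soon as the matching close tag is seen.
class SaxChunker extends DefaultHandler {
    private final String chunkTag;
    private final Consumer<String> chunkSink;
    private StringBuilder current; // non-null only while inside a chunk
    private int depth;             // nesting depth inside the current chunk

    SaxChunker(String chunkTag, Consumer<String> chunkSink) {
        this.chunkTag = chunkTag;
        this.chunkSink = chunkSink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && qName.equals(chunkTag)) {
            current = new StringBuilder(); // a new chunk starts here
            depth = 0;
        }
        if (current != null) {
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && --depth == 0) {
            chunkSink.accept(current.toString().trim()); // chunk is complete
            current = null;                               // let it be garbage collected
        }
    }

    // Convenience wrapper for small inputs; for a huge file, pass your own
    // Consumer instead of collecting everything in a list.
    static List<String> chunksOf(InputStream in, String tag) throws Exception {
        List<String> out = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(in, new SaxChunker(tag, out::add));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order><id>1</id></order><order><id>2</id></order></orders>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(chunksOf(in, "order")); // prints [1, 2]
    }
}
```

Only one chunk is held in memory at a time, so the size of the whole file shouldn't matter much. For the real file you would pass something like a `BufferedInputStream` over a `FileInputStream` instead of the in-memory sample.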
 
Ranch Hand
Posts: 2187
Both the Simple API for XML (SAX) and the Document Object Model (DOM) are programming APIs used to write XML-processing applications. Neither one of them is an actual XML parser.

SAX enables you, the programmer, to write code that receives data, i.e. method calls, from a parser. SAX is a low-level API as it communicates directly with a SAX-compliant XML parser. You write the application based on the SAX API.

DOM is a higher-level API that builds an object model from an XML instance; implementations commonly use the SAX API internally to do so. You, the programmer, then write your application against the DOM API, not the SAX API.

Apache Xerces is the most popular XML parser, and a reference implementation was added to Java SE some time ago.

With regard to chunking in a Java-based application, you would certainly need to write to the SAX API. If you are planning to pass each XML fragment to a method, make sure the chunks are small enough that they don't exceed the memory available to your JRE instance.

A good alternative would be to write the chunking code in Perl and then read the chunks with a SAX or DOM application.
 
Sheriff
Posts: 22784
There are (at least) two more alternatives to SAXParser and DocumentBuilder: XMLEventReader and XMLStreamReader.
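
For what it's worth, here is a rough sketch of how the `XMLStreamReader` (StAX) route might look. With StAX your code pulls events from the parser instead of receiving callbacks. Everything named here (`StaxChunker`, `forEachRecord`, the `order`/`id`/`total` elements) is invented for the example, and it assumes each chunk is a flat record whose children contain only text; nested chunks would need more work.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events from the parser and turns each
// matching element into a small name -> text map.
class StaxChunker {

    // Expects the reader to be positioned on the record's START_ELEMENT.
    // Assumes every child element contains text only (a flat record).
    static Map<String, String> readRecord(XMLStreamReader r) throws XMLStreamException {
        Map<String, String> fields = new LinkedHashMap<>();
        // nextTag() skips whitespace; it returns END_ELEMENT when the record closes
        while (r.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String name = r.getLocalName();
            fields.put(name, r.getElementText()); // leaves reader on the child's end tag
        }
        return fields;
    }

    // Streams through the document and hands each record to the sink.
    static void forEachRecord(Reader in, String tag, Consumer<Map<String, String>> sink)
            throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals(tag)) {
                    sink.accept(readRecord(r));
                }
            }
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<orders>"
                + "<order><id>1</id><total>9.50</total></order>"
                + "<order><id>2</id><total>120.00</total></order>"
                + "</orders>";
        // Decide per chunk whether it qualifies; here, only orders over 100.
        forEachRecord(new StringReader(xml), "order", rec -> {
            if (Double.parseDouble(rec.get("total")) > 100) {
                System.out.println(rec); // prints {id=2, total=120.00}
            }
        });
    }
}
```

The pull style tends to make the "find start tag, read until close tag, decide what to do" logic read top to bottom, which may be closer to what the original poster had in mind than SAX callbacks.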
 