Best way to process large/complex XML/schema ?

 
Mdri Na
Greenhorn
Posts: 9

Thanks for reviewing this thread.

I would like to figure out a way to process a large, complex XML file and push its data to a flat file or a database.

Here is the high-level view.



I did see an old thread in this forum, posted 13 years ago.

how about process large XML file(bigger than 1GB) in Java?

https://coderanch.com/t/203004/java/process-large-XML-file-bigger


The links in that thread are broken. There may be a better way to address this now, given how the technology has advanced.

Is Python the right choice for this, for performance reasons?

What options do we have?

Thanks for your guidance.
 
Saloon Keeper
Posts: 6969
164
Welcome to the Ranch.

Java is perfectly able to handle large XML files - IF you use the right API. A DOM-type approach that builds an in-memory representation of the entire document is likely to fail, but there are other APIs specifically designed to be more memory-conscious. StAX is one of them, and it's part of the Java API.
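
As a rough illustration of the StAX approach mentioned above, here is a minimal sketch that walks a document event by event and counts start tags, without ever building a tree. The class name `StaxElementCounter` and the sample document in `main` are made up for this example; only the `javax.xml.stream` calls come from the standard Java API.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, named for this sketch only.
class StaxElementCounter {

    /** Counts start tags by local name; memory use depends on the number of distinct names, not on file size. */
    static Map<String, Integer> countElements(Reader input) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Defensive settings: do not process DTDs or external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        Map<String, Integer> counts = new TreeMap<>();
        XMLStreamReader reader = factory.createXMLStreamReader(input);
        try {
            // Pull one event at a time; only the current event is held in memory.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    counts.merge(reader.getLocalName(), 1, Integer::sum);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input; a real run would pass a FileReader or InputStream instead.
        String xml = "<orders><order id='1'><item/><item/></order><order id='2'><item/></order></orders>";
        System.out.println(countElements(new StringReader(xml)));
    }
}
```

For a file of 1 GB or more, the same loop should work with a buffered file reader in place of the `StringReader`; the loop itself does not change.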
 
Saloon Keeper
Posts: 23689
161
There are basically 2 ways to process an XML file: as a 2-dimensional document in memory (tree) or as a data stream.

The DOM (Document Object Model) approach gives you the 2-dimensional view. Its virtue is complete access to all parts of the document with no extra I/O required (excluding virtual memory paging, of course).

A variation of the DOM approach is provided by services such as the Apache Digester, which, instead of loading into a generic tree framework, allows you to load into a graph of JavaBeans (POJOs).


The other option is stream processing. It does not allow random access to XML elements, but instead deals with them sequentially. The advantage here can be substantial memory savings. There are 2 primary frameworks to support this: SAX and StAX.

SAX is a low-level push parser: it calls back into your handler for every XML token as it comes through linearly, which gives you the most direct access to those tokens. StAX is a pull parser: your code asks for the next event, which makes it easier to be more discriminating about what you process.

In actuality, SAX often gets involved behind the scenes: Digester is built on top of it, and DOM builders are commonly driven by a similar event-based parser underneath. StAX, by contrast, is a separate pull-parsing API rather than a layer over SAX.
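
To make the push style of SAX a little more concrete, here is a small sketch of a handler that collects the text of every `title` element. The element name `title` and the class `SaxTitleCollector` are assumptions for illustration only; the parser and handler classes are the standard `javax.xml.parsers` and `org.xml.sax` ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: the parser pushes events into these callbacks.
class SaxTitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so qName carries the tag name.
        if ("title".equals(qName)) {
            inTitle = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several pieces, so append rather than assign.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString().trim());
            inTitle = false;
        }
    }

    /** Parses the given XML text and returns the trimmed content of each title element, in document order. */
    static List<String> collectTitles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SaxTitleCollector handler = new SaxTitleCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }
}
```

Note that the handler has to keep its own state (`inTitle`, `text`) between callbacks. With StAX the equivalent state usually lives in the local variables of the reading loop.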


If I were to parse XML and load it into a database, I probably wouldn't need a 2-dimensional structure and therefore I'd likely use either SAX or StAX.

Although in actuality, I'd be using an ETL tool like Pentaho where the software was already coded and debugged and I only had to link together blocks in the GUI designer.
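
For the hand-coded route (streaming parse, then write out flat records), a sketch along the following lines might be a starting point. It assumes a made-up record layout of `<customer id="...">` elements with `name` and `city` children, and it writes CSV to any `Writer`; a database load would replace the `csv.write` call with a JDBC batch insert. None of these element names come from the original poster's file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical converter for an assumed <customer> record layout.
class XmlToCsv {

    /** Streams customer records to CSV rows and returns the number of data rows written. */
    static int convert(Reader xml, Writer csv) throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader r = factory.createXMLStreamReader(xml);

        int rows = 0;
        String id = null;
        String name = null;
        String city = null;
        csv.write("id,name,city\n");
        try {
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (r.getLocalName()) {
                        case "customer":
                            // New record: read the attribute and reset the fields.
                            id = r.getAttributeValue(null, "id");
                            name = null;
                            city = null;
                            break;
                        case "name":
                            name = r.getElementText(); // text-only element assumed
                            break;
                        case "city":
                            city = r.getElementText();
                            break;
                        default:
                            break;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "customer".equals(r.getLocalName())) {
                    // Record complete: emit one flat row and forget it.
                    csv.write(quote(id) + "," + quote(name) + "," + quote(city) + "\n");
                    rows++;
                }
            }
        } finally {
            r.close();
        }
        return rows;
    }

    /** Quotes a CSV field, doubling embedded quotes; a missing value becomes an empty field. */
    static String quote(String v) {
        if (v == null) {
            return "";
        }
        return "\"" + v.replace("\"", "\"\"") + "\"";
    }
}
```

Only one record's worth of fields is held at a time, which is why the 2-dimensional structure is not needed for this kind of load.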
 
Mdri Na
Greenhorn
Posts: 9
Thanks for your reply and suggestion.

I am attaching a sample complex XML message file.

I am not an expert in SAX and StAX. Are you able to point me to a coding sample of SAX and StAX to process the attached complex XML message file?

I am not able to upload my attachment (839 KB).

Thanks for your guidance.
 
Paul Clapham
Marshal
Posts: 26596
81

Mdri Na wrote: I am not an expert in SAX and StAX. Are you able to point me to a coding sample of SAX and StAX to process the attached complex XML message file?



I searched the web for tutorials about SAX and StAX. Here are some:

Parsing an XML File Using SAX Parser

Parsing an XML File Using StAX
 
Mdri Na
Greenhorn
Posts: 9
Thanks for sharing.

What are the env set under Windows for SAX/StAX?

Are there any IDEs to use with SAX/StAX?

Thanks for sharing.
 
Paul Clapham
Marshal
Posts: 26596
81

Mdri Na wrote: What are the env set under Windows for SAX/StAX?



Sorry, I don't know what an "env set" is.

Are there any IDEs to use with SAX/StAX?



They are simply Java code, as you will see if you examine the tutorials linked to. So any Java IDE would be just fine.
 
Mdri Na
Greenhorn
Posts: 9
Thanks for your reply.

I meant environment settings. Many of the SAX/StAX items I find via Google appear to be Linux-oriented.

I am attaching an .xml file extracted from the big XML as an illustration. It has a complex nested structure, and it has its own .xsd as well.

The header portion looks like




Are you able to point me to a code snippet/template to process this XML?

Thanks for your guidance.
 
Matthew Bendford
Ranch Foreman
Posts: 112
4
I'm not part of the number-crunching scene, but whenever one has to deal with more than a couple of KiB of data it comes down to this: what do you actually need to have in memory at a given time? If the overall process is linear, then you're fine with just the current block. If you have a process that parallelizes well, it does make sense to have as many data blocks in memory as you can process at once.

As a somewhat extreme example: with modern systems it's rather easy to read all the source of the current Linux kernel into memory at once, and one might well do so to perform some analysis and optimization. But in the end your goal is to compile the source into something executable, which can only be done in chunks even on very powerful systems.
But the question comes up: is there any benefit in trying to compile all the source at once? Is it really faster? Is it even possible to do so?

You see: even if your data is TiB in size, if you can only process a couple of hundred MiB at once, just streamline the process and operate in chunks/blocks.
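
The idea of operating in chunks can be sketched in Java as follows: records are streamed one at a time and handed to a consumer in fixed-size batches, so only one batch is in memory at once. The class `ChunkedXmlReader`, the `forEachBatch` method, and the assumption that each record is a text-only element are all invented for this example.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams records and hands them over in bounded batches.
class ChunkedXmlReader {

    /**
     * Reads the text of every element named recordName and passes it to the sink
     * in batches of at most batchSize. Returns the number of batches delivered.
     */
    static int forEachBatch(Reader xml, String recordName, int batchSize,
                            Consumer<List<String>> sink) throws XMLStreamException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && recordName.equals(r.getLocalName())) {
                    batch.add(r.getElementText()); // assumes text-only records
                    if (batch.size() == batchSize) {
                        sink.accept(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
                        batch.clear();
                        batches++;
                    }
                }
            }
            if (!batch.isEmpty()) {
                sink.accept(new ArrayList<>(batch)); // final partial batch
                batches++;
            }
        } finally {
            r.close();
        }
        return batches;
    }
}
```

The batch size is the knob that trades memory for throughput. If the work per batch parallelizes well, the sink could submit each batch to an executor instead of processing it inline.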
 
Paul Clapham
Marshal
Posts: 26596
81

Mdri Na wrote: I meant environment settings. Many of the SAX/StAX items I find via Google appear to be Linux-oriented.



I don't know what that refers to; SAX and StAX are part of the standard Java API so if you already have Java running then you don't need to do anything else, your environment is already set up.
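
One quick way to convince yourself that nothing extra needs installing is to ask the JDK for its built-in parsers. The sketch below (the class name `XmlApiCheck` is made up) just instantiates the standard SAX and StAX factories and reports which implementation classes it got; it should behave the same on Windows and Linux.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;

// Hypothetical sanity check: no third-party jars or environment variables involved.
class XmlApiCheck {

    /** Returns a short report naming the SAX parser and StAX factory classes in use. */
    static String describe() throws Exception {
        String sax = SAXParserFactory.newInstance().newSAXParser().getClass().getName();
        String stax = XMLInputFactory.newInstance().getClass().getName();
        return "SAX parser: " + sax + System.lineSeparator() + "StAX factory: " + stax;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println(describe());
    }
}
```

If this compiles and runs in your IDE or from the command line, the environment is ready for the tutorials.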

Are you able to point me to a code snippet/template to process this XML?



The web is full of examples of how to process XML documents. But you aren't going to find a tutorial which deals with exactly that document, of course. What you have to do is to take one or more of those tutorials and get some understanding of how to process XML documents. Then use that understanding to write code to deal with your XML document.

If you have questions as you go along, then the forum is here for you. Or if you have specific questions now, go ahead and ask them.
 
Mdri Na
Greenhorn
Posts: 9
Thanks for your reply and suggestion. I appreciate your help with this exercise.

As I mentioned, this is a big XML file. If I go through it element by element with explicit navigation using SAX, it will be a hard task to pull off.

We may have 45K to 50K XML elements to traverse this way. How do we handle a case like this?

My understanding is that SAX does parallel parsing.

Is there a way to pull elements using XPath in SAX?



Thanks for your guidance.
 
Paul Clapham
Marshal
Posts: 26596
81
Well look, Matthew Bendford pointed out earlier that Java actually supports a large amount of memory. So if you already have a plan about how to process the XML using the XML DOM, I'd suggest you try it out to see if you really have a problem. I agree that SAX and StAX don't understand XPath -- although it's possible that somebody has built some open-source code to fix that. But you've decided without evidence (I think) that your XML is too big. So like Matthew says, give DOM a chance first.
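
Giving DOM a chance could look roughly like the sketch below: load the whole document with the standard `DocumentBuilder`, then pull elements out with the `javax.xml.xpath` API. The class `DomXPathDemo`, the sample document, and the expression `/orders/order/total` are placeholders, not anything taken from the poster's schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo: whole document in memory, then XPath queries against it.
class DomXPathDemo {

    /** Parses the XML into a DOM tree and returns the text content of every node matching the expression. */
    static List<String> select(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample; for a real file, pass a File or InputStream to builder.parse instead.
        String xml = "<orders><order><total>10</total></order><order><total>25</total></order></orders>";
        System.out.println(select(xml, "/orders/order/total"));
    }
}
```

Running this against the real file with a generous heap is a cheap experiment: if it finishes, the XPath route is open; if it runs out of memory, that is the evidence for switching to a streaming parser.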
 
Matthew Bendford
Ranch Foreman
Posts: 112
4
Short addition: One may have to tweak the max stack size with -Xss, along with the initial heap size (-Xms) and max heap size (-Xmx), as the default max stack size is rather small. This can lead to a StackOverflowError even with a couple of GiB assigned to the VM.
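
To see what the VM was actually started with, something like the following sketch can help. It prints the effective max heap and the JVM arguments, and shows the alternative of giving a single worker thread a larger stack through the `Thread` constructor instead of raising -Xss globally. The class name, the thread name, and the sizes are arbitrary choices for illustration, as is an invocation such as `java -Xms512m -Xmx4g -Xss16m JvmMemorySettings`.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for inspecting memory settings; the sizes used are illustrative only.
class JvmMemorySettings {

    /** Max heap the VM will attempt to use, in MiB (reflects -Xmx when it is set). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Runs the task on a thread with the requested stack size; the JVM may treat the size as a hint. */
    static void runWithStack(Runnable task, long stackBytes) throws InterruptedException {
        Thread worker = new Thread(null, task, "big-stack-worker", stackBytes);
        worker.start();
        worker.join();
    }

    /** Deliberately recursive, to stand in for deeply nested processing. */
    static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> flags = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + flags);
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        // Deep recursion on a thread with a 64 MiB stack request.
        runWithStack(() -> System.out.println("Reached depth " + depth(50_000)),
                64L * 1024 * 1024);
    }
}
```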
 