How to parse 100mb xml file

 
Greenhorn
Posts: 11
Hi all,

I am using JDOM to parse a 100 MB XML file, but I get an OutOfMemoryError even when I start the JVM with the -Xmx1024m parameter. What else can I use to parse it?

Thanks in advance.
 
Author and all-around good cowpoke
Posts: 13078
Building a DOM in memory always takes much more memory than just the space the file would take as a String. Any particular reason to use JDOM instead of the standard library parser?

Further advice depends on what you need to do with the data in the document - please elaborate.

You might want to read a good tutorial on the variety of parsers available in Java. Try a Google search for "sun xml tutorial" to find one.

Bill
 
Rajab Davudov
Greenhorn
Posts: 11
I am using JDOM because it is easy and it is the only parser I have ever used.
I need to extract the values of some tags and create an Excel file.

Someone suggested that I use Xerces, but I don't want to write handler classes. I would prefer something simple like JDOM, where you use getElement()-style methods.
 
Marshal
Posts: 26910
From the JDOM FAQ:

JDOM is not an XML parser, like Xerces or Crimson. It is a document object model that uses XML parsers to build documents. JDOM's SAXBuilder class for example uses the SAX events generated by an XML parser to build a JDOM tree. The default XML parser used by JDOM is the JAXP-selected parser, but JDOM can use nearly any parser.

So: you can use Xerces with JDOM. You can also use Xerces as a SAX parser, which is what you are really talking about when you mention "handler classes". Don't confuse the parser (whose job is to convert an XML document into an internal form) with JDOM or DOM (whose job is to allow you to manipulate that internal form).
 
Rajab Davudov
Greenhorn
Posts: 11
I only need to read data from the XML, not manipulate it. So I have to use a SAX parser and write these handler classes. Is that right?
 
Rajab Davudov
Greenhorn
Posts: 11
Can I read data from the "internal form" directly, or do I still have to use something like DOM?
 
Paul Clapham
Marshal
Posts: 26910
A DOM is an internal form, whether it's the DOM that is implemented by the standard Java API or the DOM implemented by JDOM. And if you need to use SAX, then the stream of SAX events (the calls to the handler's methods) constitutes the internal form.

I have seen estimates that a DOM uses 5 to 10 times as much memory as the size of the document it was built from. At the upper end, that means about 1 gigabyte of memory for 100 megabytes of XML document. If you don't have that much memory to spare, and if you are able to extract the data as a SAX parser reads it, then using SAX would work just fine.
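
To illustrate the SAX approach described above, here is a minimal sketch that uses the standard JAXP SAX parser to count elements as they stream past, without building any tree. The class name SaxCountSketch, the countElements helper, and the sample XML are made up for illustration; only the javax.xml.parsers and org.xml.sax calls are standard API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original thread.
public class SaxCountSketch {

    /** Counts elements with the given qualified name without building a tree. */
    static int countElements(InputStream in, String wanted) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per opening tag as the parser reads the stream.
                if (qName.equals(wanted)) {
                    count[0]++;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data; a real run would open a FileInputStream instead.
        String xml = "<songs><song songID=\"1\"/><song songID=\"2\"/></songs>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "song"));
    }
}
```

Because the handler keeps only a counter, memory use should not grow with the size of the document, which is the point of choosing SAX over a DOM here.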
 
Rajab Davudov
Greenhorn
Posts: 11
I have now written a handler class, and it seems to work. Here is the Java source. Thanks for the help; it was not so hard.

import java.util.ArrayList;
import java.util.HashMap;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MuyapHandler extends DefaultHandler {

    // Name of the field whose text is expected next, or null
    private String item = null;
    // ID of the performerDisplay element whose text is expected next, or null
    private String performerId = null;
    // Maps performer ID to performer name
    private HashMap<String, String> performerMap = new HashMap<String, String>();
    // One String[5] per song: ID, name, performer, producer, label
    private ArrayList<String[]> songList = new ArrayList<String[]>();
    private String[] values = null;
    private int index = -1;

    public void startDocument() throws SAXException {
        // Reset all state so the handler can be reused for another document
        performerMap = new HashMap<String, String>();
        songList = new ArrayList<String[]>();
        item = null;
        performerId = null;
        values = null;
        index = -1;
    }

    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equals("song")) {
            String songID = attributes.getValue("songID");
            values = new String[5];
            values[0] = songID;
        } else if (qName.equals("songName")) {
            item = "songName";
            index = 1;
        } else if (qName.equals("songPerformerID")) {
            item = "songPerformerID";
            index = 2;
        } else if (qName.equals("producer")) {
            item = "producer";
            index = 3;
        } else if (qName.equals("label")) {
            item = "label";
            index = 4;
        } else if (qName.equals("performerDisplay")) {
            performerId = attributes.getValue("performerDisplayID");
        }
        super.startElement(uri, localName, qName, attributes);
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals("song")) {
            songList.add(values);
        }
        super.endElement(uri, localName, qName);
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        // Assumes the whole text of an element arrives in a single call
        if (item != null) {
            if (item.equals("songPerformerID")) {
                // Replace the performer ID with the name recorded earlier
                values[index] = performerMap.get(new String(ch, start, length));
            } else {
                values[index] = new String(ch, start, length);
            }
            item = null;
        } else if (performerId != null) {
            String performerName = new String(ch, start, length);
            performerMap.put(performerId, performerName);
            performerId = null;
        }
    }

    public ArrayList<String[]> getSongList() {
        return songList;
    }

}
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
Watch out for this trap in the characters() method:

Your current code assumes characters() will only get called once for the text content of a field. This is true sometimes but NOT all the time, because the parser deals in blocks of characters in a buffer. You may get only one character on the first call if it is the last character in the block.

Your code must provide for collecting characters from multiple calls.

Any good tutorial on SAX will show code for this and the problem is frequently discussed in this forum.

Bill
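
For readers following along, the fix Bill describes is usually done by appending every characters() call to a buffer and reading the buffer only in endElement(). The following is a minimal sketch of that pattern, not a rewrite of MuyapHandler: the class name BufferingHandlerSketch, the parseSongNames helper, and the sample XML are hypothetical, and it collects only the songName field to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original thread.
public class BufferingHandlerSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private final List<String> songNames = new ArrayList<>();
    private boolean inSongName = false;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if (qName.equals("songName")) {
            inSongName = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append each chunk.
        if (inSongName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("songName")) {
            // Only now is the element's text known to be complete.
            songNames.add(text.toString());
            inSongName = false;
        }
    }

    public List<String> getSongNames() {
        return songNames;
    }

    /** Parses the given XML string and returns the collected song names. */
    static List<String> parseSongNames(String xml) throws Exception {
        BufferingHandlerSketch handler = new BufferingHandlerSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getSongNames();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data for illustration.
        String xml = "<songs><song songID=\"1\">"
                + "<songName>Tom &amp; Jerry</songName></song></songs>";
        System.out.println(parseSongNames(xml));
    }
}
```

The same idea extends to the other fields in MuyapHandler: set the target index in startElement(), append in characters(), and store the buffered text into values[index] in endElement().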
 