
XML parsers, encoding and byte order marks

 
Kelly Dolan
Ranch Hand
Posts: 109
I have an XML file that contains the following declaration, preceded by a UTF-8 BOM (byte order mark):

(BOM)<?xml version="1.0" encoding="UTF-8"?>...

I need to run this file through an XML parser without modifying the file and am currently using xerces.jar (v2.6.2).

When I attempt the following, I get the exception shown below. If I uncomment the getBOMEncoding(bis) line, the parser succeeds. Basically, the getBOMEncoding(bis) method moves the input stream to the first byte *after* the BOM (i.e., it skips it). My assumption: the parser doesn't recognize, or doesn't like, the existence of the BOM before the XML declaration.

My questions: Am I doing something wrong? Is there a parser or version that recognizes a BOM (although I really need to use the one I'm using)? Is it documented somewhere that the parser I'm using does not support this?

Any suggestions are welcome!

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(validate);
DocumentBuilder builder = factory.newDocumentBuilder();
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(input), 5);
// getBOMEncoding(bis); // this will skip over the BOM if present; our parsers do not handle the existence of a BOM.
InputSource is = new InputSource(new InputStreamReader(bis, encoding));
is.setSystemId(input.getParentFile().toURL().toString());
result = builder.parse(is);

Exception:

org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed.
    at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
    at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:626)
    at org.apache.xerces.framework.XMLDocumentScanner$XMLDeclDispatcher.dispatch(XMLDocumentScanner.java:809)
    at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
    at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:172)
    at scratch.FileOpsNewParser.parseContent(FileOpsNewParser.java:151)
    at scratch.FileOpsNewParser.main(FileOpsNewParser.java:386)

(The same exception and stack trace repeat four more times, differing only in the last frame: FileOpsNewParser.java:387, 391, 392 and 396.)
 
Madhav Lakkapragada
Ranch Hand
Posts: 5040

// getBOMEncoding(bis); // this will skip over the BOM if present; our parsers do not handle the existence of a BOM.


I believe this is an in-house method for your ContentHandler. Did I understand that correctly?
Thanks.

- m
[ September 30, 2004: Message edited by: Madhav Lakkapragada ]
 
Madhav Lakkapragada
Ranch Hand
Posts: 5040
Unless I am missing something, your best bet is to do what the commented-out getBOMEncoding(bis) line already does. Having any characters before the prolog violates a well-formedness constraint, hence the fatal exception.
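For anyone without such a helper, here is a minimal sketch of the skip-the-BOM idea. It is not the original getBOMEncoding method (whose source is not shown in this thread); the class name BomSkipper and the method skipUtf8Bom are made up for illustration, and it only handles the UTF-8 BOM (bytes EF BB BF).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not the poster's getBOMEncoding(): strips a leading UTF-8 BOM.
public class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: push the bytes back so nothing is lost
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints "<"
    }
}
```

The returned stream can then be wrapped in an InputStreamReader as in the original code, so the reader never hands the parser a U+FEFF character ahead of the XML declaration.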

Alternatively (academically speaking), you could override the fatalError method itself.

http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/helpers/DefaultHandler.html#fatalError ( org . xml . sax . SAXParseException )

(I added spaces all over so that UBB will allow me to post this link. They say it's Maps fault!)


Thanks.

- m
[ September 30, 2004: Message edited by: Madhav Lakkapragada ]
 
Kelly Dolan
Ranch Hand
Posts: 109
Thanks for the reply.

I looked at the XML spec (http://www.w3.org/TR/2004/REC-xml-20040204/) and what it says about well-formed XML documents, and you are correct that a document must start with the prolog.

Unfortunately, I can no longer find the web page I was originally reading about BOMs, Unicode and XML files. However, what do you make of Appendix F of the XML spec? That section discusses auto-detection of encodings and mentions BOMs. Would I be correct in saying that the XML working group recognizes the use of BOMs, and that they would precede the prolog, but that supporting them is not a required feature, which is most likely why the parsers I've been playing with don't take them into account?
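To make the auto-detection idea concrete, here is a rough sketch of mapping the leading bytes of a file to an encoding name, based on the BOM patterns that Appendix F lists. The class BomSniffer and its method are hypothetical names, the sketch covers only the with-BOM cases (not the no-BOM heuristics), and it reports Java's "UTF-32" charset names for what the appendix calls UCS-4.

```java
// Hypothetical helper: guesses an encoding from a leading BOM only.
public class BomSniffer {

    private static boolean startsWith(byte[] head, int... pattern) {
        if (head.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if ((head[i] & 0xFF) != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns an encoding name suggested by a leading BOM, or null if no BOM is recognized. */
    public static String encodingFromBom(byte[] head) {
        // The 4-byte patterns must be tested first: FF FE 00 00 also starts with FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return "UTF-32BE";
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return "UTF-32LE";
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null; // no BOM: fall back to the XML declaration or a default
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(encodingFromBom(utf8)); // prints "UTF-8"
    }
}
```

A real parser does more than this (it also reads the encoding declaration and reconciles the two), so treat this only as an illustration of the byte patterns involved.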
 
Kelly Dolan
Ranch Hand
Posts: 109
I just learned that the following works. If I simply pass the File object to the parse() method (rather than going through stream and reader objects so that I can specify the encoding), the parser recognizes the BOM and successfully parses the document.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(validate);
DocumentBuilder builder = factory.newDocumentBuilder();
result = builder.parse(input);
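A likely explanation, offered as my reading rather than anything from the Xerces docs: when the parser is handed raw bytes it can do its own encoding detection and consume the BOM, whereas an InputStreamReader decodes the UTF-8 BOM into a U+FEFF character that the parser then sees ahead of the XML declaration. The sketch below contrasts the two paths using the parser bundled with a current JDK, not xerces.jar 2.6.2; the class and method names are made up for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Demo only: compares a byte-stream InputSource with a character-stream InputSource.
public class BomParseDemo {

    /** Builds a small UTF-8 document preceded by the UTF-8 BOM (EF BB BF). */
    public static byte[] sampleWithBom() {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[xml.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, out, 3, xml.length);
        return out;
    }

    /** Byte-stream source: the parser sees the raw bytes and deals with the BOM itself. */
    public static String parseBytes(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new ByteArrayInputStream(data)));
        return doc.getDocumentElement().getNodeName();
    }

    /** Character-stream source: returns true if the parser rejects the decoded BOM character. */
    public static boolean parseCharsFails(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        try {
            builder.parse(new InputSource(reader));
            return false;
        } catch (SAXException e) {
            return true; // the parser reported a fatal error for content before the prolog
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = sampleWithBom();
        System.out.println("byte stream root element: " + parseBytes(data));
        System.out.println("character stream rejected: " + parseCharsFails(data));
    }
}
```

If a system ID is still needed for resolving relative references, InputSource.setSystemId can be called on the byte-stream source just as in the original code.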
 
Madhav Lakkapragada
Ranch Hand
Posts: 5040
Glad you mentioned it. Thanks for the tip on Appendix F; I had never looked closely at the appendices. I learnt something new today.

- m
 