I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of this document is as follows:
This program works on the same document when the default namespace is removed, in which case the "ns" prefix in the XPath statements (lines 6-7) is not needed either. In the past I also used "org.apache.xerces.parsers.SAXParser" to successfully retrieve the content of <a> from the same document without the default namespace.
I would like to achieve the following objectives if possible:
( i ) Exclude the DTD and the namespace in order to simplify the parsing process. How could this be done?
( ii ) If this is not possible, how should the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?
( iii ) Would changing from "org.apache.xerces.parsers.SAXParser" to "org.ccil.cowan.tagsoup.Parser" make any difference as far as using XPath is concerned?
( iv ) If the DTD cannot be excluded, how can the lookup of a PUBLIC DTD be changed to a local SYSTEM one, so that a local copy of the DTD is used for reference?
I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5 and TagSoup 1.2 on Windows XP.
I can confirm that XPath using Saxon with the TagSoup parser ("org.ccil.cowan.tagsoup.Parser") works with the default namespace. I made the mistake of assuming that the XML document produced by TagSoup was identical to the one I got from light_html2xml in the past.
Consequently, what is still outstanding, not critical but nice to have, is ( i ) to exclude the DTD from the XML file or, if that is not possible, ( iv ) to set up a local SYSTEM EntityResolver in this JDOM environment.
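For anyone reading along, here is a minimal sketch of how an "ns" prefix can be bound to a default namespace so that XPath finds <a>. It uses the JDK's built-in javax.xml.xpath rather than the JDOM/Saxon setup from the original program, and it assumes the default namespace is the XHTML one (http://www.w3.org/1999/xhtml). The class name, method name and sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespacedXPathSketch {
    // Assumption: the document's default namespace is the XHTML namespace.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Parses the XML text namespace-aware and evaluates the expression,
    // with the prefix "ns" bound to the XHTML namespace.
    public static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "ns" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("ns").iterator()
                        : Collections.<String>emptyList().iterator();
            }
        });
        // Returns the string value of the first matching node, or "" if none.
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"x.html\">link text</a></body></html>";
        System.out.println(evaluate(xml, "//ns:a")); // prints "link text"
        System.out.println(evaluate(xml, "//a"));    // prints an empty line
    }
}
```

The unprefixed "//a" finds nothing here because, in XPath 1.0, a name without a prefix only matches elements that are in no namespace.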
Below is an example of what I am trying to achieve in ( iv ) in a DOM environment:
Would anyone be able to give me some idea on how to do this?
It looks to me like org.jdom.input.SAXBuilder has the setValidation() and setEntityResolver() methods you need. Is there some reason these don't work?
Thanks for responding to this question.
Below is where the SAX parser is defined:
line 1. SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser", false);
line 2. saxBuilder.setValidation(false);
line 3. saxBuilder.setEntityResolver(???);
( a ) Are you referring to the boolean parameter in line 1 and the call in line 2? Are they equivalent? This setting appears to work, as no Internet connection is needed to parse the XML file. However, I am wondering whether it is possible to exclude the DOCTYPE from the converted XML document altogether during parsing/conversion. Otherwise, how can line 3 be used to set a local SYSTEM DTD? I am looking for something like setEntityResolver(false), so that I could open the document without it referencing the PUBLIC DTD.
( b ) I would also like to exclude the namespace during parsing/conversion in order to simplify my XPath searches. Is that possible?
TagSoup lets you disable namespaces by setting the standard SAX feature "http://xml.org/sax/features/namespaces" to false. Unfortunately for you, JDOM will turn it back on before parsing. You might need to use a standard DOM instead; Java 5 and Java 6 have built-in XPath support.
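A rough sketch of the standard DOM route, using only classes that ship with the JDK. It assumes the input is already well-formed XML, so feeding TagSoup's output into the DOM is not shown. The class and method names are hypothetical. Because the DocumentBuilderFactory is left namespace-unaware, the xmlns declaration is treated as an ordinary attribute and a plain "//a" matches.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PlainDomXPathSketch {
    // Returns the text of the first <a> element, ignoring namespaces.
    public static String firstLinkText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // false is already the default; set explicitly to make the intent clear.
        factory.setNamespaceAware(false);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//a", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"x.html\">link text</a></body></html>";
        System.out.println(firstLinkText(xml)); // prints "link text"
    }
}
```

The trade-off is that a namespace-unaware DOM cannot distinguish elements from different namespaces, which is fine for a document that only uses one.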
You can set the same EntityResolver with saxBuilder.setEntityResolver(...) as in your previous sample using DocumentBuilder, or use the one from the Apache XML Commons Resolver library, to use a local file.
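As an illustration, a resolver along these lines could be written with the standard org.xml.sax.EntityResolver interface. This is only a sketch: the class name, the register() helper and the local path in the usage note are hypothetical, and returning an empty stream for unregistered DTDs is one possible way to avoid the network lookup, not the only one. The parse() helper uses the JDK's DocumentBuilder so that the example runs without JDOM on the classpath.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalDtdResolver implements EntityResolver {
    // Maps a PUBLIC identifier to the URI of a local copy of the DTD.
    private final Map<String, String> localCopies = new HashMap<String, String>();

    // Hypothetical helper: registers a local copy for a PUBLIC identifier.
    public LocalDtdResolver register(String publicId, String localUri) {
        localCopies.put(publicId, localUri);
        return this;
    }

    public InputSource resolveEntity(String publicId, String systemId) {
        String localUri = localCopies.get(publicId);
        if (localUri != null) {
            // Redirect the lookup to the local SYSTEM location.
            InputSource source = new InputSource(localUri);
            source.setPublicId(publicId);
            return source;
        }
        // Unregistered entity: hand back an empty DTD so nothing is fetched.
        return new InputSource(new StringReader(""));
    }

    // Parses XML text with the given resolver installed.
    public static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

With JDOM, the same object would be passed as saxBuilder.setEntityResolver(new LocalDtdResolver().register("-//W3C//DTD XHTML 1.0 Strict//EN", "file:///C:/dtd/xhtml1-strict.dtd")), where the file path is a made-up example. Note that with an empty DTD, entities such as &nbsp; that the real DTD declares would no longer be defined.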