Parsing HTML Table using Java XPath

 
Jack Bush
Ranch Hand
Posts: 235
Hi All,

I am trying to extract the content of the following nested HTML Patient table using the JDK 6.1 XPath class, without success:



Below is the content of XPathEvaluator.java, which I used to parse the above HTML file:



This code worked on catalogue.xml from the tutorial at http://www.onjava.com/lpt/a/5554, but it produced the following error when parsing the above HTML file:

[Fatal Error] :1:78: White spaces are required between publicId and systemId.

Am I using the wrong tool? I have used the htmlparser in the past but could not achieve the same objective.

Any suggestions would be appreciated.

Thanks,

Jack
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
Looks like the program is choking on the first line of the HTML file. HTML is frequently ill-formed XML.

If this were my problem, I would read the entire file into a String, locate the <table and </table tags, and create a new source from that chunk of text alone. Hopefully this would avoid the ill-formed HTML at the start of the file.
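
A minimal sketch of that idea, using only the JDK's own DOM and XPath classes. The class and method names (TableExtractor, extractTable, cellTexts) and the sample HTML are made up for illustration, since the original file is not shown here. It assumes the markup between the first <table and the last </table> is itself well-formed XML (closed tags, quoted attributes, no entities such as &nbsp;) and that the page has a single outermost table; otherwise the parser will still fail.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original XPathEvaluator.java.
public class TableExtractor {

    private static final Pattern OPEN =
            Pattern.compile("<table\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLOSE =
            Pattern.compile("</table\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns the text from the first "<table" through the last "</table>". */
    static String extractTable(String html) {
        Matcher open = OPEN.matcher(html);
        if (!open.find()) {
            throw new IllegalArgumentException("No <table> start tag found");
        }
        int start = open.start();

        // Keep the end offset of the last closing tag so nested tables stay inside.
        Matcher close = CLOSE.matcher(html);
        int end = -1;
        while (close.find()) {
            end = close.end();
        }
        if (end < start) {
            throw new IllegalArgumentException("No matching </table> end tag found");
        }
        return html.substring(start, end);
    }

    /** Parses the table chunk as XML and returns the trimmed text of every td element. */
    static List<String> cellTexts(String tableXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(tableXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells = (NodeList) xpath.evaluate("//td", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < cells.getLength(); i++) {
            texts.add(cells.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample page; the DOCTYPE line is skipped by extractTable.
        String html = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\">\n"
                + "<html><body><table><tr><td>Name</td><td>Jane Doe</td></tr>"
                + "</table></body></html>";
        System.out.println(cellTexts(extractTable(html)));
    }
}
```

Because "//td" also matches cells of nested tables, the text of an outer cell that contains an inner table will include the inner table's text as well; a more specific XPath expression would be needed to separate them.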

Bill
 
Jack Bush
Ranch Hand
Posts: 235
Hi William,

I was hoping for a cleaner method than this, but I will use it if no better solution is available.

Thank you very much for your suggestion.

Jack
 
Jack Bush
Ranch Hand
Posts: 235
Hi William,

Would you recommend the JTidy toolkit, which can cope with ill-formed HTML and create a DOM for an HTML document?

Thanks,

Jack
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
Unless the table's HTML markup is ill-formed, I would still go with extracting the table part of the document and treating that as XML.

If the table itself is ill-formed, then JTidy is the next thing to try, but I don't know how it will treat the start of the file that the parser could not handle.

It is an interesting problem; let us know what you end up with.

Bill
 