
Parsing an HTML document. Charset problem.

 
Daniel Cote
Greenhorn
Posts: 9
Hi, I have a question.

I have this code to parse an HTML file/stream:

URL url = new URL(urlStr);
content = url.openStream();
in = new BufferedReader(new InputStreamReader(content));
ParserDelegator parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = new HtmlParser(htmlContentTempBean);
parser.parse(in, callback, false);

I'm parsing pages from the same website. Everything works fine, but with some pages I get this exception:

javax.swing.text.ChangedCharSetException
at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
......


I've traced the error, and it seems to be caused by this line in the HTML documents:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

What puzzles me is that this line also appears in HTML documents for which the parsing works fine.

I've changed the line 'parser.parse(in, callback, false);' to 'parser.parse(in, callback, true);' in my code to ignore the charset, and now it works fine with all pages.

Could someone explain why this happens? I don't understand it at all.
 
Harald Kirsch
Ranch Hand
Posts: 37
You open a stream from the URL, and I guess this contacts some web server. The web server tells you in the HTTP header which charset is used for the page content. The stream's decoder is then adapted to the charset and you start parsing.
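One caveat on the decoder, as far as I know: `new InputStreamReader(content)` with no charset argument decodes with the platform default charset, so the charset from the HTTP header only takes effect if the code passes it in explicitly. Below is a minimal sketch of pulling the charset out of a Content-Type value. The class name `ContentTypeCharset` and the method `fromContentType` are hypothetical helpers for illustration, not part of the JDK.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper: extracts the charset parameter from a Content-Type value
// such as "text/html; charset=iso-8859-1".
class ContentTypeCharset {

    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // malformed or unsupported charset name: keep the fallback
                    return fallback;
                }
            }
        }
        return fallback; // header present, but without a charset parameter
    }
}
```

With the code from the question, this could be used by calling `url.openConnection()`, passing the connection's `getContentType()` to the helper, and handing the resulting charset to `new InputStreamReader(connection.getInputStream(), charset)`.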

The parser then encounters

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

and this most probably does not match the charset declared in the HTTP header. Now what should your parser believe?

You could say this is a misconfiguration of the server, but I rather think it is a bug in the parser implementation. If I write a static web page x.html with a charset called hutzlivutzli, the parser had better adapt to this. But maybe the server would actually be required to extract the meta from the text and copy it to the header instead of just sending off a default header entry (I don't know).
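If the goal is to honor the charset declared in the meta tag rather than ignore it, one common approach is to catch the ChangedCharSetException, read the declared charset from it, and parse the same bytes again with ignoreCharSet set to true. The sketch below assumes the page has already been read into a byte array; the class `CharsetRetryParser` and its methods are hypothetical names for illustration. `getCharSetSpec()` and `keyEqualsCharSet()` are the accessors on ChangedCharSetException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Locale;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: parse once, and re-decode if the document declares a charset.
class CharsetRetryParser {

    /** Returns the charset that was used for the pass that ran to completion. */
    static Charset parse(byte[] html, Charset initial, HTMLEditorKit.ParserCallback callback)
            throws IOException {
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(html), initial)) {
            // ignoreCharSet = false: a charset declaration in a <meta> tag aborts the parse
            new ParserDelegator().parse(in, callback, false);
            return initial;
        } catch (ChangedCharSetException e) {
            Charset declared = declaredCharset(e, initial);
            // Second pass over the same bytes. Note that the callback has already
            // received the events that preceded the <meta> tag in the first pass.
            try (Reader in = new InputStreamReader(new ByteArrayInputStream(html), declared)) {
                new ParserDelegator().parse(in, callback, true);
            }
            return declared;
        }
    }

    /** Turns the spec carried by the exception into a Charset, or returns the fallback. */
    static Charset declaredCharset(ChangedCharSetException e, Charset fallback) {
        String name = e.getCharSetSpec();
        if (!e.keyEqualsCharSet()) {
            // The spec is a full content value such as "text/html; charset=iso-8859-1"
            int i = name.toLowerCase(Locale.ROOT).indexOf("charset=");
            if (i < 0) {
                return fallback;
            }
            name = name.substring(i + "charset=".length());
            int semi = name.indexOf(';');
            if (semi >= 0) {
                name = name.substring(0, semi);
            }
        }
        try {
            return Charset.forName(name.trim().replace("\"", ""));
        } catch (IllegalArgumentException ex) {
            return fallback; // malformed or unsupported charset name
        }
    }
}
```

Buffering the page as bytes keeps the retry simple, because a stream from `url.openStream()` cannot be rewound; the alternative would be to open the URL a second time.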
 
Daniel Cote
Greenhorn
Posts: 9
Thanks for your explanation. I really don't understand much about this.
 