
Parsing an HTML doc

 
Bill Compton
Ranch Hand
Posts: 186
I'm working on a project that requires parsing an HTML stream. Rather than write the parser myself, I hoped to use the HTML / HTMLDocument classes. I expected to write something roughly like:
HTMLDocument h = new HTMLDocument("http://page.domain.com");
or maybe ...= new HTMLDocument( aStreamReader );
Followed by a (possibly recursive) set of calls like:
Element elem = h.getNextElement();
However, I can find nothing like this (or any other approach) in the Java documentation. Can anyone offer a suggestion or (even better!) a code snippet?
 
Carl Trusiak
Sheriff
Posts: 3341
Check out the javax.swing.text.html package here. It might be just what you need.
Hope this helps
 
Bill Compton
Ranch Hand
Posts: 186
Thanks, Carl! The article is a goldmine. Here's the distilled code. I plagiarized the getConnection method from some Paul Wheaton code.

[This message has been edited by Bill Compton (edited November 30, 2000).]
 
Jarvis Ka
Greenhorn
Posts: 1
Given the sample code above, could anyone tell me how I can search the "data"? For example, how would I find out whether a web page contains the word "JavaBean"?
Could anybody help me out, or point me in the right direction?
Thank you!
Jarvis
[ January 03, 2002: Message edited by: Jarvis Ka ]
 
Bill Compton
Ranch Hand
Posts: 186
Sure - the type of the variable 'data' is char[], which can be used to instantiate a String:
String s = new String( data );
Then the String methods like indexOf() can easily be used to find particular words you care about. In my case, I built a simple parser to break the input stream into tokens that could be "understood". The parser received String types created in this way.
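For readers without the original listing, here is a minimal sketch of the approach described above. It is not Bill's code: the class name HtmlTextSearch and the helpers extractText and containsWord are made up for illustration, and the HTML is read from a String rather than a URL connection. It assumes the usual ParserDelegator / HTMLEditorKit.ParserCallback combination, where handleText receives each text run as a char[].

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original thread's code.
class HtmlTextSearch {

    // Parses the HTML and concatenates every text run reported by the parser.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // 'data' is a char[], so it can be turned into a String directly.
                text.append(new String(data)).append(' ');
            }
        };
        try {
            // The third argument is ignoreCharSet; false is the strict setting.
            new ParserDelegator().parse(new StringReader(html), callback, false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Case-sensitive substring search over the extracted text.
    static boolean containsWord(String html, String word) {
        return extractText(html).indexOf(word) >= 0;
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Intro</h1>"
                + "<p>A JavaBean is a reusable component.</p></body></html>";
        System.out.println(containsWord(page, "JavaBean"));
        System.out.println(containsWord(page, "Servlet"));
    }
}
```

Because only text runs are collected, a word that appears solely inside a tag name or attribute value is not matched. For a live page, a Reader over the connection's input stream would replace the StringReader.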
 
Joshua
Greenhorn
Posts: 1
I tried running that code on different websites (www.cnn.com & www.nytimes.com) and I'm getting an error. Any ideas?
Exception in thread "main" javax.swing.text.ChangedCharSetException
    at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
    at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
    at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
    at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
    at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
    at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
    at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
    at ParseURL.main(ParseURL.java:31)
 
Sudharsan Govindarajan
Ranch Hand
Posts: 319
Hi!
Does the code fail for every site you try, or only for some specific sites?
 
Tomer Libal
Greenhorn
Posts: 3
Try setting ignoreCharSet to true in the call to ParserDelegator.parse().
The problem for me is that when using the HTMLEditor, I have no place to set this flag to true.
Does anybody know of an HTML parser that returns a document whose elements include all attributes, including CSS (both external and inline)?
Thanks,
Tomer
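As a rough illustration of the ignoreCharSet suggestion above, the sketch below parses a small page whose meta tag declares a charset, once with the flag off and once with it on. The class name CharsetTolerantParse, the parses helper, and the sample page are all invented for this example. It assumes the failure on the news sites came from such a Content-Type meta tag, which matches the handleEmptyTag frame in the stack trace but is not confirmed in the thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of the original thread's code.
class CharsetTolerantParse {

    // Sample page (made up) with a meta tag that declares a charset.
    static final String PAGE =
            "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=ISO-8859-1\"></head>"
            + "<body><p>Headline</p></body></html>";

    // Returns true if parsing finishes, false if the parser aborts with
    // ChangedCharSetException because of a charset declaration.
    static boolean parses(String html, boolean ignoreCharSet) {
        // A do-nothing callback is enough to observe the parser's behavior.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, ignoreCharSet);
            return true;
        } catch (ChangedCharSetException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ignoreCharSet=false: " + parses(PAGE, false));
        System.out.println("ignoreCharSet=true:  " + parses(PAGE, true));
    }
}
```

Passing true simply skips the charset check, so the Reader must already be decoding the stream with a suitable encoding. An alternative would be to catch the exception, read its getCharSetSpec() value, and reopen the stream with that charset.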
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
You've already done the hard work, so this comes a bit late. I use the Quiotix HTML Parser that parses into a DOM. The DOM supports the visitor pattern that could be used to check the text of each node or do lots of other cool stuff.
 