I'm working on a project that requires parsing an HTML stream. Rather than write the parser myself, it seemed promising to use the HTML / HTMLDocument classes. I expected to write something roughly like: HTMLDocument h = new HTMLDocument("http://page.domain.com"); or maybe ... = new HTMLDocument( aStreamReader ); followed by a (possibly recursive) set of calls like: Element elem = h.getNextElement(); However, I can find nothing like this (or any other approach) in the Javadoc. Can anyone offer a suggestion or (even better!) a code snippet?
Given the sample code above, could anyone tell me how I can search the "data"? For example, how would I find out whether a web page contains the word "JavaBean"? Could anybody help me out, or point me in the right direction? Thank you! Jarvis [ January 03, 2002: Message edited by: Jarvis Ka ]
Sure - the type of the variable 'data' is char[] (a char array), which can be used to instantiate a String: String s = new String( data ); Then String methods like indexOf() can easily be used to find the particular words you care about. In my case, I built a simple parser to break the input stream into tokens that could be "understood". The parser received Strings created in this way.
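To make the search concrete, here is a minimal sketch. It assumes the earlier sample used javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback whose handleText(char[] data, int pos) method receives the text. The class name TextSearch, the helper methods, and the sample HTML string are made up for illustration; in real use you would wrap the URL's input stream in an InputStreamReader instead of using a StringReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original sample.
class TextSearch {

    // Collects every text run the parser reports into one String.
    static String extractText(Reader in) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // data is a char[], so it can be turned into a String directly.
                text.append(new String(data)).append(' ');
            }
        };
        // Third argument is ignoreCharSet; true skips charset directives in the page.
        new ParserDelegator().parse(in, callback, true);
        return text.toString();
    }

    // True if the visible text of the page contains the given word.
    static boolean containsWord(Reader in, String word) throws IOException {
        return extractText(in).indexOf(word) >= 0;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>All about the JavaBean model</p></body></html>";
        System.out.println(containsWord(new StringReader(html), "JavaBean"));
    }
}
```

Note that indexOf() is case-sensitive, so a search for "javabean" would need both strings lower-cased first.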
I tried running that code on different websites (www.cnn.com and www.nytimes.com) and I'm getting an error. Any ideas?
Exception in thread "main" javax.swing.text.ChangedCharSetException
    at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
    at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
    at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
    at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
    at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
    at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
    at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
    at ParseURL.main(ParseURL.java:31)
Try passing true for the ignoreCharSet argument of ParserDelegator.parse(). The problem for me is that when using the HTMLEditor, I have no place to set this flag to true. Does anybody know of an HTML parser that returns a document whose elements include all attributes, including CSS (outer and inner)? Thanks, Tomer
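For anyone hitting the same ChangedCharSetException, here is a rough sketch of what passing ignoreCharSet looks like. ParserDelegator.parse(Reader, HTMLEditorKit.ParserCallback, boolean) takes the flag as its third argument. The class name IgnoreCharsetDemo, the startTags helper, and the sample page with a Content-Type meta tag are assumptions made up for this example, not taken from the original ParseURL code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of the original ParseURL sample.
class IgnoreCharsetDemo {

    // Returns the names of the start tags the parser reports, in order.
    static List<String> startTags(Reader in) throws IOException {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString());
            }
        };
        // ignoreCharSet = true: the parser does not stop on a charset
        // declaration found in a <meta> tag.
        new ParserDelegator().parse(in, callback, true);
        return tags;
    }

    public static void main(String[] args) throws IOException {
        // Sample page that declares a charset, as many real sites do.
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
                + "</head><body><p>hello</p></body></html>";
        System.out.println(startTags(new StringReader(html)));
    }
}
```

With the flag set to true the characters are decoded using whatever encoding the Reader was created with, so if the page's declared charset matters, it is worth opening the InputStreamReader with that charset explicitly.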
You've already done the hard work, so this comes a bit late. I use the Quiotix HTML Parser that parses into a DOM. The DOM supports the visitor pattern that could be used to check the text of each node or do lots of other cool stuff.
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi