
HTML Parser unrecognized tags

 
Fritz Gaschler
Greenhorn
Posts: 4
Hi!

I'm trying to write a kind of web crawler that fetches the content of a web page, parses it, and prints out the information between tags or links.

The following class handles the callbacks triggered by the HTML parser. Unfortunately, the parser reports errors even when parsing a very simple web page.
The strange thing is that I looked up the HTML.Tag class in the Java source and saw that tags like <h1> should be recognized by the parser. As you can see, it accepts <html> and <body> by default, so why can't it handle the others?

Maybe somebody has an idea. I thought about handling those tags myself by implementing the logic in the start and end methods of the callback class, but that would be quite crappy.


Cheers


The HTML source
<html>
<body bgcolor="#CCFFFF">
<h1>It works!</h1>
Google
</body>
</html>



The output of the parser
html
invalid.tagattbgcolorbody?
tag.unrecognized h1??
end.unrecognizedh1??
tag.unrecognized a??
invalid.tagatthrefa?
end.unrecognizeda??
body bgcolor = #CCFFFF
It works!
Google
body
html



The parser class (don't worry, I instantiate it the right way)


The parser's callback class, which handles the tags

 
Bear Bibeault
Author and ninkuma
Marshal
Posts: 66207
Hardly an HTML question. Moved.
 
Ulf Dittmer
Rancher
Posts: 42972
I'm not sure what's going on, but the javax.swing.text.html and javax.swing.text.rtf classes haven't received any attention in years.

If this were my problem, I'd use a library like NekoHTML or TagSoup instead; that's what everybody seems to be using for parsing HTML.
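
For anyone who would rather stay within the JDK, here is a minimal sketch, assuming the page is parsed through ParserDelegator rather than a hand-constructed parser. As far as I know, ParserDelegator sets up the JDK's bundled default DTD itself, which may be why tags like <h1> and <a> are recognized this way, but I haven't verified that this is what went wrong in the original code. The class name TagDump and the event strings are made up for illustration; the original poster's classes are not shown in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records every parser callback as a short string.
class TagDump {

    static List<String> parse(String html) throws IOException {
        final List<String> events = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start " + t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end " + t);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("simple " + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text " + new String(data));
            }

            @Override
            public void handleError(String errorMsg, int pos) {
                events.add("error " + errorMsg);
            }
        };

        // The third argument tells the parser to ignore any charset
        // declared inside the document; we already pass decoded characters.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return events;
    }

    public static void main(String[] args) throws IOException {
        // The href value is a placeholder; the original page's link target is not shown.
        String html = "<html><body bgcolor=\"#CCFFFF\"><h1>It works!</h1>"
                + "<a href=\"http://example.com/\">Google</a></body></html>";
        for (String event : parse(html)) {
            System.out.println(event);
        }
    }
}
```

The parser may also report implied tags (for example a head element that is not in the source), so the printed list can contain more entries than the markup suggests.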
 
Fritz Gaschler
Greenhorn
Posts: 4
Hmm, seems to be that way! Well, thanks to all! I'll try the library!

See you
 