
HTML parser

 
Maha Hassan
Ranch Hand
Posts: 133
Hi all,
I am using the HTML parser, but it has a problem: it sometimes extracts JavaScript code as part of the text of the HTML page.
Do you know of a better parser?

For example, when I tried it with "http://www.google.ca/ig?hl=en", it produced this as part of the text:
"'; _gel('t6').innerHTML = htmlmsg; } function tarot6() { var prefs = new _IG_Prefs(6); var sign = prefs.getString("sign"); "

Thanks
Maha
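
For reference, the symptom described above (script source leaking into the extracted text) can usually be avoided by dropping whole script and style elements before stripping the remaining tags. The following is only a rough sketch using the JDK's regex classes, not HTMLParser's own API; the class name ScriptStripper and the method visibleText are made up for illustration. A regex approach like this will misfire on scripts that contain a literal closing script tag inside a string, or on page text that contains a bare "<".

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // Matches a whole <script>...</script> or <style>...</style> element,
    // case-insensitively and across line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Matches runs of whitespace so they can be collapsed to one space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the text of the page with script/style bodies and all tags removed. */
    public static String visibleText(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p>"
                + "<script>var sign = prefs.getString(\"sign\");</script>"
                + "<p>world</p></body></html>";
        System.out.println(visibleText(html));
    }
}
```

A real HTML parser that exposes script elements as distinct nodes is the more robust route; the sketch only shows the idea of removing those elements before collecting text.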
 
Ulf Dittmer
Rancher
Posts: 42968
Which HTML parser do you mean?
 
Maha Hassan
Ranch Hand
Posts: 133
This is HTMLParser.
[ September 13, 2006: Message edited by: Maha Hassan ]
 
Ulf Dittmer
Rancher
Posts: 42968
Don't know about that one, but JTidy, NekoXNI and TagSoup seem to be more widely used.
 
Maha Hassan
Ranch Hand
Posts: 133
I am now using JTidy.
I want to extract the text within the tags. The problem is that it does not handle things like the copyright sign, "-", " " and other special characters, and changing the encoding does not make things better.

Any ideas?
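
One possible cause, assuming the special characters arrive in the extracted text as undecoded character references (for example a named reference for the copyright sign or the non-breaking space), is that they need to be decoded after extraction. The sketch below is not JTidy's API; it is a plain-JDK illustration (Java 9 or later) with a deliberately tiny, hypothetical entity table, and the class name EntityDecoder is invented here. If the real problem is that the page bytes are being decoded with the wrong charset, this will not help.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Deliberately small table for illustration; the full HTML entity list is much longer.
    private static final Map<String, String> NAMED = Map.of(
            "copy", "\u00A9", "ndash", "\u2013", "mdash", "\u2014", "nbsp", "\u00A0",
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"");

    // Matches &name;, &#123; and &#x7B; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    /** Replaces known character references; unknown or invalid ones are left as-is. */
    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: keep the original reference
            try {
                if (body.startsWith("#x") || body.startsWith("#X")) {
                    int codePoint = Integer.parseInt(body.substring(2), 16);
                    replacement = new String(Character.toChars(codePoint));
                } else if (body.startsWith("#")) {
                    int codePoint = Integer.parseInt(body.substring(1));
                    replacement = new String(Character.toChars(codePoint));
                } else {
                    replacement = NAMED.getOrDefault(body, replacement);
                }
            } catch (IllegalArgumentException e) {
                // Out-of-range or unparsable number: leave the reference untouched.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Non-breaking spaces are turned into ordinary spaces after decoding.
        String text = decode("&copy;&nbsp;2006 &ndash; A&amp;B").replace('\u00A0', ' ');
        System.out.println(text);
    }
}
```

Printing the result correctly also depends on the console or output file using an encoding that can represent these characters, such as UTF-8.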
 