
Remove JavaScript from HTML web page

 
asit dhal
I need to remove all tags (HTML tags and JavaScript code) from a web page.

Can somebody tell me how to do this?
 
Winston Gutkowski
asit dhal wrote: I need to remove all tags (HTML tags and JavaScript code) from a web page.

Can somebody tell me how to do this?

I suggest you look at a SAX or DOM parser; Java has implementations of both. SAX is generally the easier of the two to use, and I'm pretty sure it will do what you want; however, you may need to convert the HTML to XHTML first. For that, there is a utility called JTidy, which I believe has its own SAX-like parser built in, but I've never used it, so I have no idea how easy it is.
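To give a rough idea of the SAX route, here is a minimal sketch using only the parser that ships with the JDK. It assumes the page has already been converted to well-formed XHTML (for example by a tool such as JTidy, which is not shown) and that it carries no DOCTYPE the parser would try to fetch. The class name `TagStripper` and the method `stripTags` are made up for illustration. The handler keeps character data and drops everything inside `script` and `style` elements.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: extracts the visible text from well-formed XHTML.
class TagStripper {

    static String stripTags(String xhtml) {
        StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Greater than zero while we are inside <script> or <style>.
            private int skipDepth = 0;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (isSkipped(qName)) {
                    skipDepth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (isSkipped(qName)) {
                    skipDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Tags never reach this callback; only text content does.
                if (skipDepth == 0) {
                    text.append(ch, start, length);
                }
            }

            private boolean isSkipped(String qName) {
                return qName.equalsIgnoreCase("script")
                        || qName.equalsIgnoreCase("style");
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Typically means the input was not well-formed XML.
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }

        // Collapse runs of whitespace left behind by the removed markup.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling `TagStripper.stripTags(pageSource)` on a converted page returns just its text. Ordinary HTML from the wild will usually fail to parse here, which is why the conversion step matters.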

Tip: DON'T think about a regex-based solution if there is any "awareness" of the document's structure required. Regexes are very powerful, but not well suited to hierarchical logic such as nested tags.

Winston
 