
Removing all kinds of tags

 
sahar eb
Ranch Hand
Posts: 38
Hi,
I am working on a project that collects a lot of data from the web (saved in .80 files). So I have every possible form of data in these pages: HTML, XML, HTML inside XML, and CSS, as well as pages like this one: http://www.usustatesman.com/se/the-statesman-rss-1.544390

I need to remove ALL the tags (of ANY kind) from the content of these pages and get pure text.

Is there any parser that can do this for me? Or is there any other way to remove these tags?

Thank you so much!
 
Tim Cooke
Marshal
Posts: 4051
I would probably forget about using Java altogether and just use sed with the search expression <[^<>]*> and a delete action. That will get rid of all <tags> <that> </look> </like> <this> and leave everything else as is.
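
If the stripping does have to stay inside the Java project, the same regex idea can be sketched with java.util.regex. This is only an illustration of the approach above, not a full parser: the class name TagStripper and the method stripTags are hypothetical, and the pattern used here is <[^<>]*> (a '<', any run of characters other than '<' and '>', then a '>').

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class TagStripper {
    // A '<', then any characters that are neither '<' nor '>', then a '>'.
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");

    // Returns the input with every tag-shaped substring deleted.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "<tags> <that> </look> </like> <this> and leave everything else";
        System.out.println(stripTags(sample));
    }
}
```

Like the sed version, this only deletes the tags themselves. Text that sits between tags, such as the CSS inside a style element or the code inside a script element, is left in place, and character entities such as &amp;amp; are not decoded, so those would need separate handling.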
 