Converting HTML into text

 
Greenhorn
Posts: 7
Hi,

I wrote a script which gets a webpage and dumps the entire thing into a file.

Is there any way to remove all the HTML and formatting stuff so I'm left with just the text?

Also, how do I extract the URLs in the file?

Thank you in advance
 
author
Posts: 77
Parsing HTML, unless it's extremely simple HTML, is tricky business. You're probably best off using a Java library, like HTMLEditorKit (in javax.swing.text.html), for that. If you do a Google search for "HTMLEditorKit extract text from html", you'll come up with some examples.
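
For what it's worth, here is a minimal sketch of that approach using the parser callback that ships with Swing. The class name HtmlToText and the method extractText are made up for illustration, and joining text runs with a single space is an assumption; real pages will likely need extra handling for line breaks, scripts, and styles.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class HtmlToText {

    // Collects the text runs the Swing HTML parser reports and joins them with spaces.
    static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Assumption: one space between runs is good enough for plain output.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }
}
```

To use it on the dumped file, read the file into a String first (for example with Files.readString on Java 11+) and pass that to extractText.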

As for identifying URLs in a file, you can use a regex for that, though it can get ugly too. Here's a JavaRanch thread with an example: https://coderanch.com/t/382015/Java-General/java/regex-find-url
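
A rough sketch of the regex route, in case it helps. The pattern is deliberately simple and is an assumption, not the one from the linked thread: it only finds absolute http and https URLs, stops at whitespace, quotes, and angle brackets, and may pick up trailing punctuation such as a period at the end of a sentence. UrlExtractor and extractUrls are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class UrlExtractor {

    // Scheme, then everything up to the next whitespace, quote, or angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every match in the order it appears in the text.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

Relative links such as href="/about" won't match this pattern; for those, walking the anchor tags with an HTML parser is probably the safer bet.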

Have fun,
Dave
 