
Extracting html from webpages?

 
Ranch Hand
Posts: 40
Is there a built-in function to extract HTML off of webpages? Say, for instance, I wanted to extract all of the "plain text" off of the javaranch.com website — is there a simple way to go about this?
Thank you...
Nick Ueda
 
Ranch Hand
Posts: 1873
By 'text' you mean removing all the HTML tags and keeping the remaining part, right?
Well, as far as I know there is not a simple way of doing that.
regards
maulin
 
Nick Ueda
Ranch Hand
Posts: 40
Well, what about just getting the HTML file off of a webpage?
 
Greenhorn
Posts: 21
I'm not 100% sure what you are looking for. However, if you want to remove all the HTML tags from the source file, I suggest using regular expressions.
I wrote a very simple (and not all-inclusive) Perl script that removes tags:
<code> $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//g; </code>
With JDK 1.4, you can use regular expressions in Java. I was able to replicate the script above. Check out java.util.regex and build off the snippet above.
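In Java it comes out something like this (a sketch with java.util.regex; the class name and sample input are just for illustration):

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Same idea as the Perl one-liner: eat everything between < and >,
    // while allowing quoted attribute values to contain a '>'
    private static final Pattern TAG =
        Pattern.compile("<(?:[^>'\"]*|(['\"]).*?\\1)*>");

    public static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p class=\"intro\">Hello <b>world</b></p>"));
        // prints: Hello world
    }
}
```

Like the Perl version, it is not all-inclusive — comments, scripts, and badly broken markup will trip it up.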
 
Ranch Hand
Posts: 51
Not simple, this solution, but it is precise and it is a Java solution.
While this is a lovely site, the markup is not well-formed. However, if you download JTidy, you can run this tool in Java and it will give you a well-formed XHTML representation of a page. Then you can use a simple XPath expression in an XSL stylesheet that selects all the text from the <body> tag downwards: <xsl:value-of select="//body" />
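A minimal stylesheet for that last step might look like this (a sketch; the local-name() test is there because tidied XHTML usually carries the XHTML namespace, which a bare //body would not match):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- the string value of <body> is all of its descendant
         text, with every tag dropped -->
    <xsl:value-of select="//*[local-name()='body']"/>
  </xsl:template>
</xsl:stylesheet>
```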
I know - a complicated option. But still an option.
peter
 
(instanceof Sidekick)
Ranch Hand
Posts: 8791
JTidy sounds cool. I have also used the Quiotix HTML Parser. It builds a DOM and provides a Visitor interface for walking the DOM and some sample visitors.
Was that the original question, or were you trying to get the HTML from a server in the first place? Here's an example of doing that with URL:

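Something along these lines (a sketch; the class name and the hardcoded URL are placeholders):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

public class PageFetcher {

    // Read everything from a Reader, line by line
    static String slurp(Reader reader) throws IOException {
        BufferedReader in = new BufferedReader(reader);
        StringBuffer page = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
        }
        in.close();
        return page.toString();
    }

    // Fetch a page as one big String
    public static String fetch(String address) throws IOException {
        URL url = new URL(address);
        return slurp(new InputStreamReader(url.openStream()));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("http://www.javaranch.com/"));
    }
}
```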
You have to know the URL you're after, so it won't automatically grab all the content of a site. You could grab a page, parse it, look for links, grab linked pages, parse them, etc. Watch for circular links and watch for a ticked off webmaster who doesn't appreciate you taking expensive mips and bandwidth from the regular customers while copying copyrighted material.
Some sites that WANT you to do this use RSS publishing. Neat trend.
[ July 03, 2003: Message edited by: Stan James ]
 
Author and ninkuma
Marshal
Posts: 66307
Well, what about just getting the HTML file off of a webpage?

Check out the URL.openConnection() method.
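For instance (a sketch; unlike openStream(), openConnection() lets you look at the response headers before committing to reading the body):

```java
import java.net.URL;
import java.net.URLConnection;

public class ConnectionDemo {

    // Summarize a page's headers without downloading the body
    public static String describe(URL url) throws Exception {
        URLConnection conn = url.openConnection();
        // conn.getInputStream() would then read the HTML itself
        return conn.getContentType() + ", " + conn.getContentLength() + " bytes";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(new URL("http://www.javaranch.com/")));
    }
}
```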
hth,
bear
 
Nick Ueda
Ranch Hand
Posts: 40
Originally posted by Bear Bibeault:

Check out the URL.openConnection() method.

Thanks, I will do that.

[ July 03, 2003: Message edited by: Nick Ueda ]
 