
how to parse html webpage

 
naga raaju
Greenhorn
Posts: 29
Hi guys, can anybody give me an idea of how to parse an HTML web page from a live URL using Java?

I have some code, but the output is still in the form of raw HTML tags.
How can I split out the tags to get at the content? Any ideas, friends?

import java.io.*;
import java.net.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://finance.yahoo.com");

        // try-with-resources closes both streams; closing the writer also
        // flushes it, so sample.txt is not left empty or truncated
        try (BufferedReader in = new BufferedReader(new InputStreamReader(yahoo.openStream()));
             BufferedWriter wr = new BufferedWriter(new FileWriter("sample.txt"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                wr.write(inputLine);
                wr.newLine(); // readLine() strips the line terminator
            }
        }
    }
}
bye
Naga
 
Ulf Dittmer
Rancher
Posts: 42968
There are many things you might want to accomplish with a downloaded web page. You need to tell us what you're trying to do with it.

If you want to extract the text, I'd start by converting the HTML into well-formed XML; libraries like NekoXNI, JTidy and TagSoup can do this for you.
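
Once the page is well-formed XML, the standard JAXP and XPath classes can pull the text out. Here is a minimal sketch, assuming the conversion has already been done by one of those libraries (that step is not shown) and that the text of interest sits in p elements. The class name XhtmlText and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original thread.
public class XhtmlText {

    // Returns the text of every <p> element in an already well-formed document.
    public static List<String> paragraphs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//p", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() concatenates the text of the element and its children
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for the cleaned-up output of a converter
        String xml = "<html><body><p>Price: <b>42</b></p><p>Volume: 1000</p></body></html>";
        for (String p : paragraphs(xml)) {
            System.out.println(p);
        }
    }
}
```

The XPath expression "//p" is only a placeholder; which expression fits depends on the layout of the page you are reading.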
 
naga raaju
Greenhorn
Posts: 29
Hi,
thanks for your reply.
I need to extract some text from the web pages. What should I do?


Do I have to depend on a third-party API, or is this possible with plain Java coding?


bye
Naga
 
Joe Ess
Bartender
Posts: 9300
There is an HTML parser provided in the Java API. As Ulf says, it depends on your exact requirements whether it will fit the bill or not.
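
Assuming this refers to the parser in javax.swing.text.html.parser (the post does not name the package), a minimal sketch of text extraction with it could look like the following. The class name TextExtractor and the sample markup are made up for illustration. Note that this parser targets old HTML versions, so it may struggle with modern pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original thread.
public class TextExtractor {

    // Collects the text chunks the parser reports and drops the tags.
    public static String extractText(Reader html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags
                text.append(data).append(' ');
            }
        };
        // true = ignore any charset declaration inside the document
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample; a real program would pass a reader over the URL stream
        String html = "<html><body><h1>Quotes</h1><p>Price: <b>42</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

In the original URLReader program, the same method could be handed the InputStreamReader over yahoo.openStream() instead of a StringReader.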
 
Ulf Dittmer
Rancher
Posts: 42968
That depends on the specifics. Are you talking about one particular page on one particular site? Several pages? Several sites? Is the layout of the page(s) predictable? Are there ID tags on which you can rely?

You will need to do some coding, but the libraries I mentioned will help you get started.
 
Randi Randwa
Greenhorn
Posts: 7

You can also use biterscripting (free download at biterscripting.com) for parsing HTML. It works great.

They have a sample script posted at http://www.biterscripting.com/SS_URLs.html . This script extracts referenced URLs from a page. Another sample script http://www.biterscripting.com/SS_SearchURL.html will search a page for specific search words. The sample script http://www.biterscripting.com/SS_SearchWeb.html is de facto your own search engine.

You can get started with these scripts.

If you come up with new HTML parsing scripts of your own, could you please post them for the rest of us? Thanks.

Randi
 
Campbell Ritchie
Sheriff
Posts: 49411
Welcome to JavaRanch, Randi, but please don't resurrect 10-month-old threads. Have a look at this FAQ.
 