scanning the webpage

 
Greenhorn
Posts: 1
I have to write a program that is supplied with the URL of a web page; it should then scan the page for all the links to other pages and print them. I am new to this, so if someone could give me an idea of how to handle it, that would be great.
Thanks
Regards
Harbir
 
Ranch Hand
Posts: 245
Should I reply here or over here?

I'd start with something like Apache Commons HttpClient to fetch the page. After that you'll likely want to use some regular expressions to look for "<a" tags and their href attributes. This part isn't trivial - there are very few well-written web pages out there, and you will still miss links that get created in JavaScript.
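For illustration, here is a minimal sketch of the regex approach. The pattern only catches double-quoted href attributes (unquoted or single-quoted ones will slip past it), and the sample HTML string is made up for the demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefGrep
{
    public static void main(String[] args)
    {
        // Deliberately simple pattern: an <a ...> tag with a double-quoted href.
        // Messy real-world markup and links built in JavaScript will get missed.
        Pattern href = Pattern.compile("<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);

        String html = "<html><body><a href=\"http://example.com/\">example</a></body></html>";
        Matcher m = href.matcher(html);
        while (m.find())
        {
            System.out.println(m.group(1));   // the captured URL
        }
    }
}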
 
Java Cowboy
Posts: 16084
You don't really need the Apache Jakarta HttpClient library if all you want to do is read a web page. You can use class java.net.URL for that:
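A minimal sketch of that (the URL here is just a placeholder):

import java.io.*;
import java.net.*;

public class ReadPage
{
    public static void main(String[] args) throws Exception
    {
        // Open a stream to the page and print it line by line
        URL url = new URL("http://www.javaranch.com/");   // placeholder URL
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null)
        {
            System.out.println(line);
        }
        in.close();
    }
}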

If you want to parse HTML, you could use a library like this one: http://htmlparser.sourceforge.net/
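With that library, extracting the links might look roughly like this sketch, which follows the Parser/LinkTag usage from its documentation (again, the URL is a placeholder):

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class ExtractLinks
{
    public static void main(String[] args) throws Exception
    {
        // Parse the page and collect every <a> tag
        Parser parser = new Parser("http://www.javaranch.com/");   // placeholder URL
        NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < links.size(); i++)
        {
            LinkTag link = (LinkTag) links.elementAt(i);
            System.out.println(link.getLink());
        }
    }
}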
 
Greenhorn
Posts: 3
This program will ask for a URL. Enter a valid URL and press Enter, and it will print the page's contents.

import java.net.*;
import java.io.*;

public class URLRead
{
    public static void main(String[] args) throws Exception
    {
        // Read the URL from standard input
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter Your URL :");
        String s = br.readLine();

        // Open a stream to that URL and echo the page line by line
        URL url = new URL(s);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)
        {
            System.out.println(inputLine);
        }

        in.close();
    }
}
 
(instanceof Sidekick)
Posts: 8791
Once you have the page as a String, it would be good to have a solid HTML parser to dig out the links. I like the Quiotix Parser because it has a neat Visitor interface.
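The same callback-style idea can be sketched with the HTML parser that ships in the JDK (this is HTMLEditorKit, not the Quiotix API, so it only illustrates the pattern; the URL is a placeholder):

import java.io.*;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkCallback
{
    public static void main(String[] args) throws Exception
    {
        // The parser calls this back for every start tag it encounters
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback()
        {
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos)
            {
                if (tag == HTML.Tag.A)
                {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null)
                    {
                        System.out.println(href);
                    }
                }
            }
        };

        Reader reader = new InputStreamReader(
                new URL("http://www.javaranch.com/").openStream());   // placeholder URL
        new ParserDelegator().parse(reader, callback, true);
        reader.close();
    }
}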
 
Scott Dunbar
Ranch Hand
Posts: 245
Great ideas. The only reason I suggested HttpClient is for the non-trivial cases - SSL, cookies, basic and form-based auth, etc. Harbir, a lot of it depends on your requirements.
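For those non-trivial cases, a fetch with Commons HttpClient 3.x might look roughly like this sketch (the host, port, and credentials are hypothetical):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

public class FetchWithHttpClient
{
    public static void main(String[] args) throws Exception
    {
        HttpClient client = new HttpClient();

        // Basic auth credentials, if the site needs them (hypothetical values)
        client.getState().setCredentials(
                new AuthScope("www.example.com", 443),
                new UsernamePasswordCredentials("user", "secret"));

        GetMethod get = new GetMethod("https://www.example.com/");
        try
        {
            int status = client.executeMethod(get);
            if (status == 200)
            {
                System.out.println(get.getResponseBodyAsString());
            }
        }
        finally
        {
            get.releaseConnection();   // always return the connection
        }
    }
}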
 