
Java search engine query

Jeremy Quartey
Posts: 11
I have tried to compose a suitable answer to the following question, but I don't seem to understand the practical concepts very well. Could anyone please point me in the right direction?
"A Web crawler is a program which wanders around the World-wide Web looking for specific documents required by the user of the crawler, for example the user may want the URLs of all the documents which contain the words ‘Java’ and ‘compiler’.
What classes in Java do you think would be used in such a program? In answering this question you will need to explain your choice of classes with respect to the functions required of the Web crawler."

My incomplete answer is:
To search for and cache the URLs of all the documents which contain the words ‘Java’ and ‘compiler’, you must first make a connection to the web documents, set up input streams to read from the URLs, and then process the text using StreamTokenizer. You can then check (with the aid of an algorithm) whether the tokens in the document include the words searched for. If they do, the URL can be flagged and cached (added to a Hashtable).
You would need to use the following classes for the related functions:
Web classes in java.net: their main function is to give you access to Web documents, for example by letting you read them.
The URL and URLConnection classes (Web classes) allow you to access Web documents via their URLs (Uniform Resource Locators), which are pointers to "resources" on the World Wide Web.
The URL class identifies a Web document and can open a connection to it. A URL constructor which has a single string parameter treats that string as the URL, e.g.
URL oldWeb = new URL("http://info.cern.ch/hyper/old.tex");
sets up a URL object associated with the World Wide Web protocol (http), the host info.cern.ch, the directory hyper and the file old.tex.
All URL constructors throw MalformedURLException if the format of the URL is incorrect, if no protocol is specified, or an unknown protocol is found.
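To make the parts of a URL and the exception concrete, here is a minimal sketch. The class name UrlParts and the method describe are made up for illustration; only URL, its getters and MalformedURLException come from java.net. (Recent Java versions deprecate the URL constructors in favour of URI.toURL(), but the constructor still works.)

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class, not part of any library.
class UrlParts {
    // Returns "protocol|host|path" for a well-formed URL, or null if it cannot be parsed.
    static String describe(String spec) {
        try {
            URL url = new URL(spec);
            return url.getProtocol() + "|" + url.getHost() + "|" + url.getPath();
        } catch (MalformedURLException e) {
            // Thrown, for example, when the string has no protocol or an unknown one.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("http://info.cern.ch/hyper/old.tex"));
        System.out.println(describe("info.cern.ch/hyper/old.tex")); // no protocol, so null
    }
}
```

Constructing a URL object does not contact the host; it only parses the string, so this runs without a network connection.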
The abstract class URLConnection is the superclass of all classes that represent a communications link between the application and a URL. Instances of this class can be used both to read from and to write to the resource referenced by the URL.
In order to process the text behind a URL object, the simplest strategy is to use the method openStream() defined in URL. This opens an input stream connected to the resource, so that the input stream methods can be used to read from it, e.g.
try {
    URL queryWeb = new URL("http://infor.cern.ch/oldFin/lemor.txt");
    InputStream is = queryWeb.openStream();
    DataInputStream ds = new DataInputStream(is);
    // code for processing the data input stream ds: the HTML and text behind the URL
    // can be read with the stream's methods, e.g. read(); note that readChar() in
    // DataInputStream reads two bytes per character, which rarely matches Web text
} catch (IOException e) {
    System.out.println("Problem with connection");
}
You will need to import the java.io package (and java.net for URL) in order to carry out the input/output operations above.
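For reading a page as text, wrapping the stream in an InputStreamReader and a BufferedReader is a common alternative to DataInputStream. The sketch below assumes the page is UTF-8 encoded (a real crawler would look at the Content-Type header); PageReader and readAll are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PageReader {
    // Reads the whole resource behind the URL as text, assuming UTF-8.
    static String readAll(URL url) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("usage: java PageReader <url>");
            return;
        }
        try {
            System.out.println(readAll(new URL(args[0])));
        } catch (IOException e) { // also covers MalformedURLException
            System.out.println("Problem with connection: " + e.getMessage());
        }
    }
}
```

The same method works for file: URLs, which makes it easy to try out on a local HTML file before pointing it at a live site.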
StreamTokenizer class:
StreamTokenizer splits the characters read from an input stream into tokens. The constructor for this class has one argument, which is the stream to read from (the version taking a Reader is preferred; the one taking an InputStream is deprecated):
StreamTokenizer st = new StreamTokenizer(is);
sets up a StreamTokenizer based on the InputStream or Reader object is. This means that tokens can be read from this stream, with words being delimited by white space (the default settings also treat quotes, '/' comments and numbers specially).
The main method used to extract tokens is nextToken(), which returns an integer describing the type of token that has been read.
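As a rough illustration of how nextToken() could drive the keyword check, here is a sketch that scans a Reader for a set of words. KeywordScanner and containsAll are invented names. The sketch resets the tokenizer's default syntax, because the defaults are aimed at Java-like source text: '/' would start a comment and an apostrophe would start a quoted string, which does not suit HTML.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class KeywordScanner {
    // Returns true if every keyword occurs as a word in the text (case-insensitive).
    static boolean containsAll(Reader text, Set<String> keywords) throws IOException {
        Set<String> missing = new HashSet<>();
        for (String k : keywords) {
            missing.add(k.toLowerCase());
        }

        StreamTokenizer st = new StreamTokenizer(text);
        st.resetSyntax();          // drop the defaults (comments, quotes, numbers)
        st.wordChars('a', 'z');    // only runs of letters form a word token
        st.wordChars('A', 'Z');
        st.lowerCaseMode(true);    // word tokens arrive in lower case in st.sval

        // nextToken() returns the token type; TT_EOF marks the end of the stream.
        while (!missing.isEmpty() && st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                missing.remove(st.sval);
            }
        }
        return missing.isEmpty();
    }

    public static void main(String[] args) throws IOException {
        Set<String> words = new HashSet<>(Arrays.asList("Java", "compiler"));
        System.out.println(containsAll(new StringReader("<p>A Java compiler.</p>"), words));
    }
}
```

Tag names such as p or b also come through as word tokens here; stripping markup properly is a separate problem that this sketch does not attempt.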
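The StringTokenizer class allows an application to break a string into tokens, e.g.
StringTokenizer st = new StringTokenizer(s, " ");  // s is the String to split; a space is assumed as the delimiter
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}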

I am very interested to know how this implementation would work. I am unsure how it can search through links; perhaps with a never-ending while loop? I am also not sure how StreamTokenizer actually functions and whether it would really solve part of the problem.

I can't think of any other related classes. Have I missed something?
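On the question of the loop: one common shape is a work queue plus a set of URLs already seen, so the loop ends when the queue is empty or a page limit is reached rather than running forever. The sketch below is only an outline under assumptions: CrawlLoop, Page and crawl are invented names, fetching is passed in as a function so the loop can be tried without a network, and the keyword check is a plain substring test (so "JavaScript" would count as containing "Java").

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not part of any library.
class CrawlLoop {
    // A fetched page reduced to what the loop needs: its text and its outgoing links.
    static final class Page {
        final String text;
        final List<String> links;
        Page(String text, List<String> links) { this.text = text; this.links = links; }
    }

    // Visits pages breadth-first from start; returns the URLs whose text contains every keyword.
    // fetch is expected to return null for a page that cannot be loaded.
    static List<String> crawl(String start, Function<String, Page> fetch,
                              List<String> keywords, int maxPages) {
        List<String> matches = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.add(start);
        seen.add(start);
        int visited = 0;
        while (!toVisit.isEmpty() && visited < maxPages) {
            String url = toVisit.remove();
            visited++;
            Page page = fetch.apply(url);
            if (page == null) continue;
            if (containsAll(page.text, keywords)) matches.add(url);
            for (String link : page.links) {
                if (seen.add(link)) toVisit.add(link); // add() is false for URLs already queued
            }
        }
        return matches;
    }

    // Simplified check: case-insensitive substring match for every keyword.
    static boolean containsAll(String text, List<String> keywords) {
        String lower = text.toLowerCase();
        for (String k : keywords) {
            if (!lower.contains(k.toLowerCase())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A three-page in-memory "web" with a cycle between a and b.
        Map<String, Page> web = new HashMap<>();
        web.put("a", new Page("Java intro", Arrays.asList("b", "c")));
        web.put("b", new Page("A Java compiler", Arrays.asList("a")));
        web.put("c", new Page("cooking", Arrays.asList("b")));
        System.out.println(crawl("a", web::get, Arrays.asList("Java", "compiler"), 10)); // prints [b]
    }
}
```

The seen set is what stops pages that link to each other from being visited again and again; the matches list plays the role of the Hashtable of flagged URLs.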
Rikard Qvarforth
Ranch Hand
Posts: 107
Hi, after reading your post I think you should check out the regex package (java.util.regex); it is a very interesting package when you want to mine data from a page. You should also look at recursion, i.e. calling the same method again, like looping through a file system or a system of linked HTML pages on the Internet.
Hope it helps!
If it doesn't, don't read it.
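One way the regex suggestion might look for pulling links out of a page is sketched below. LinkExtractor and links are made-up names, and the pattern is deliberately simple: it only finds quoted href attributes and will miss unquoted ones, so treat it as a starting point rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class LinkExtractor {
    // Matches href="..." or href='...'; group 1 is the quote character, group 2 the link target.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the link targets in the order they appear in the HTML.
    static List<String> links(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/x\">A</a> <A HREF='b.html'>B</A>";
        System.out.println(links(html));
    }
}
```

Relative targets such as b.html still need to be resolved against the address of the page they were found on, for example with the two-argument constructor new URL(baseUrl, target), before they can be fetched.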