Retrieving information about a website ...

 
Ranch Hand
Posts: 1585
Hi,

I want to develop an application that returns information about a website given as input.

For example, if a user enters a website address, the application should return information about that website.

Let's say the user enters www.google.com; in that case the application has to return some brief information about Google.

Now the question is: is it possible to do this with Java, so that I can open a socket to the site and read that information?

Are there any techniques that let me do so?

Thanks in advance ...
 
Sheriff
Posts: 21974
What kind of information do you want?

URLConnection and HttpURLConnection can help you out a bit:
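
A minimal sketch of what that could look like, assuming the address includes its scheme (for example http://www.google.com rather than www.google.com). The class name SiteHeaders and its two helper methods are made up for illustration; only the java.net calls are real JDK API. It sends a HEAD request and prints whatever response headers the server returns.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any library.
public class SiteHeaders {

    // Sends a HEAD request and returns the response headers.
    // The address must be absolute, e.g. "http://www.google.com".
    public static Map<String, List<String>> fetchHeaders(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no response body
        conn.setConnectTimeout(5000);    // timeouts are arbitrary choices
        conn.setReadTimeout(5000);
        try {
            conn.getResponseCode();      // forces the request to be sent
            return conn.getHeaderFields();
        } finally {
            conn.disconnect();
        }
    }

    // Renders the headers one per line. HttpURLConnection stores the
    // status line under a null key, so that entry gets a readable label.
    public static String format(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = e.getKey() == null ? "(status line)" : e.getKey();
            sb.append(name).append(": ")
              .append(String.join(", ", e.getValue()))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "http://www.google.com";
        System.out.print(format(fetchHeaders(address)));
    }
}
```

Expect mostly transport details here (server, content type, caching, cookies) rather than anything that describes the site.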
 
Vassili Vladimir
Ranch Hand
Posts: 1585
The information I want to retrieve is some kind of textual description of the website itself.

Are there any HTTP headers sent back by the web server that I can read?

Thanks ...
 
Bartender
Posts: 9615
I don't think HTTP headers are going to be useful to you.
You could try parsing the META tags from the HTML source. By convention they supply keywords to search engines. Of course, that means they may be just a list of keywords and not necessarily meaningful.
Your best bet may be to parse out the title tag, since it's human readable. Of course, both of these solutions rely on the page author obeying convention, which you can't really count on. For example, google.com has no meaningful META tags and the page title is simply "Google".
[ March 04, 2008: Message edited by: Joe Ess ]
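
As a rough illustration of the title/META idea, here is a sketch that pulls both out of an HTML string with regular expressions. The class PageSummary and its methods are hypothetical names, and regex matching is a deliberately naive stand-in for a real HTML parser: it assumes quoted attribute values and no ">" inside them, so it will miss or misread messy pages.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a naive regex scan, not a real HTML parser.
public class PageSummary {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches name="value" or name='value' pairs inside a single tag.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([\\w-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the trimmed text of the first <title> element, if any.
    public static Optional<String> title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Returns the content attribute of the first <meta> tag whose name
    // attribute equals wantedName (case-insensitive), if any.
    public static Optional<String> meta(String html, String wantedName) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2) : attr.group(3);
                if (attr.group(1).equalsIgnoreCase("name")) name = value;
                if (attr.group(1).equalsIgnoreCase("content")) content = value;
            }
            if (wantedName.equalsIgnoreCase(name) && content != null) {
                return Optional.of(content);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample page, not fetched from any real site.
        String html = "<html><head><title>Example Site</title>"
                + "<meta name=\"description\" content=\"A small demo page\">"
                + "</head><body></body></html>";
        System.out.println(title(html).orElse("(no title)"));
        System.out.println(meta(html, "description").orElse("(no description)"));
    }
}
```

Both methods return an empty Optional when the tag is absent, which, as noted above, will be the common case on many sites.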
 
Vassili Vladimir
Ranch Hand
Posts: 1585
Is there a parser to read the contents of the meta and title tags ?

Thanks ...
 
Rancher
Posts: 43016

Is there a parser to read the contents of the meta and title tags ?


You could use something like HtmlTidy to clean up the HTML and hand it to you as XML (which makes it much easier to extract the parts you're interested in). A library like jWebUnit makes this even simpler.

But ultimately it's probably going to be fruitless, as the information you're looking for is just not there, or rarely there.
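
To show why the XML route makes extraction easier, here is a sketch of the second half only: it assumes some cleanup tool has already turned the page into well-formed XML, and then reads the title and description with the JDK's built-in XPath support. No tidy library is called here. The class XhtmlSummary and its methods are hypothetical names, and the sample document is made up; real tidied output may carry a DOCTYPE or namespace declaration that needs extra handling.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; assumes the input is already well-formed XML.
public class XhtmlSummary {

    // Parses well-formed XHTML text into a DOM document.
    public static Document parse(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Evaluates an XPath expression and returns its trimmed string value
    // (an empty string when nothing matches).
    public static String text(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample, standing in for the output of a cleanup step.
        String xhtml = "<html><head><title>Example Site</title>"
                + "<meta name=\"description\" content=\"A small demo page\"/>"
                + "</head><body/></html>";
        Document doc = parse(xhtml);
        System.out.println(text(doc, "/html/head/title"));
        System.out.println(text(doc, "/html/head/meta[@name='description']/@content"));
    }
}
```

Once the page is a DOM tree, each piece of information is a one-line XPath query instead of a hand-written pattern.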
 