web services for scraping/extracting data from travel/airline sites

 
Greenhorn
Posts: 1
I am interested in exploring what web services are available for making requests to, and scraping/extracting data from, multiple airline ticket/travel sites such as Expedia, Travelocity, etc.

I know there are many meta-search sites out there and that the applications they run are complex (there are even companies that build the software and sell it to travel sites, which then integrate it into their own sites), but I am curious whether anyone understands how these work.

Regards,
Chris
 
Author and all-around good cowpoke
Posts: 13078
I would only apply the term web service to APIs that are published by the travel, etc. sites themselves.

"Screen scraping" data out of pages created by travel sites makes them very unhappy, since you are bypassing their money-making features, and as I recall it has resulted in lawsuits and legal tangles.

If you want to fake a browser session, look into the HttpClient toolkit in the Apache Commons project. For interpreting the returned HTML, consider the JTidy package. Parsing HTML is made difficult by the fact that much commercial HTML is badly formed from an XHTML point of view.
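
A minimal sketch of the "fake a browser session" idea follows. To stay dependency-free it uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than the Apache Commons HttpClient mentioned above, and it does not cover the JTidy step. The class name, the User-Agent string, and the timeouts are all assumptions made for illustration, and whether a given site permits automated requests is a separate question (see the legal point above).

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class BrowserLikeFetch {

    // Assumed, purely illustrative User-Agent value.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/0.1)";

    /** Builds a client that keeps cookies across requests and follows redirects, as a browser session would. */
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    /** Builds a GET request carrying browser-like headers. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .timeout(Duration.ofSeconds(20))
                .GET()
                .build();
    }

    /** Sends the request and returns the page body, failing on any non-200 status. */
    static String fetchHtml(HttpClient client, String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java BrowserLikeFetch.java <url>");
            return;
        }
        String html = fetchHtml(newSessionClient(), args[0]);
        System.out.println(html.length() + " characters received");
    }
}
```

Reusing one client for every request is what makes this behave like a session: cookies set by an earlier response are sent back on later requests.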
Bill
 
Bartender
Posts: 2968

Originally posted by William Brogden:
Parsing HTML is made difficult by the fact that much commercial HTML is badly formed from a XHTML point of view.


Swing contains an HTML parser that works like a SAX parser.
With more modern web pages you may instead have to find the XmlHttp service that feeds the page's JavaScript and call it directly.
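
For anyone who has not used the Swing parser, here is a minimal sketch of the callback style it offers. It relies only on javax.swing.text.html classes that ship with the JDK; the class name SwingLinkExtractor and the sample markup are made up for illustration, and the sample is deliberately sloppy (unquoted attribute, unclosed tags).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingLinkExtractor {

    /** Returns the href of every <a> start tag the parser reports, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        // The parser pushes events to this callback, much as SAX does with a ContentHandler.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // Third argument true: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical, intentionally malformed markup.
        String messy = "<p>Fares <a href=/flights>Flights</a><p><a href=\"/hotels\">Hotels";
        System.out.println(extractLinks(messy));
    }
}
```

Overriding handleText, handleEndTag, or handleSimpleTag in the same callback gives access to the text and the remaining tags. The parser is built around an old HTML DTD, so how well it copes with any particular real-world page is something to test rather than assume.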
[ November 01, 2006: Message edited by: Peer Reynders ]
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

Swing contains an HTML parser that works like a SAX parser.


I have never used that one - how does it do when fed the badly formatted HTML typical of so many pages?

The use of XmlHttp and similar techniques is certainly going to make scraping even trickier in the future.
Bill
 