web services for scraping/extracting data from travel/airline sites

 
Chris Tokar
Greenhorn
Posts: 1
I am interested in exploring what web services are available for making requests to, and scraping/extracting data from, multiple airline ticket/travel sites such as Expedia, Travelocity, etc.

I know there are many meta-search sites out there and that the applications they run are complex (there are even companies that build this software and sell it to travel sites, which then integrate it into their own sites), but I am curious whether anyone understands how these work.

Regards,
Chris
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
I would only apply the term web service to APIs that are published by the travel, etc. sites themselves.

"Screen scraping" data out of pages created by travel sites makes them very unhappy, since you are bypassing their money-making features, and, as I recall, it has resulted in lawsuits and legal tangles.

If you want to fake a browser session, look into the HttpClient toolkit in the Apache Commons project. For interpreting the returned HTML, consider the JTidy package. Parsing HTML is made difficult by the fact that much commercial HTML is badly formed from an XHTML point of view.
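To give a rough idea of what "faking a browser session" involves, here is a minimal sketch. Note that it swaps in the JDK's built-in java.net.http.HttpClient (Java 11 and later) instead of Apache Commons HttpClient, so it runs without extra jars. The class name, method names, and User-Agent string are made up for illustration, and it should only be pointed at sites whose terms allow automated access.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class BrowserSession {

    // Hypothetical User-Agent value, chosen only for this example.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/0.1)";

    // A client that keeps cookies between requests and follows redirects,
    // which is most of what a "session" means to the server.
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    // A GET request carrying the headers a browser would normally send.
    static HttpRequest browserLikeGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java BrowserSession.java <url>");
            return;
        }
        HttpClient client = newSessionClient();
        HttpResponse<String> response =
                client.send(browserLikeGet(args[0]), HttpResponse.BodyHandlers.ofString());
        System.out.println("status: " + response.statusCode());
        System.out.println(response.body().length() + " characters of HTML received");
    }
}
```

Reusing the same client for the follow-up requests is what carries the cookies along; the returned body would then go to whatever HTML parser you pick.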
Bill
 
Peer Reynders
Bartender
Posts: 2968
Originally posted by William Brogden:
Parsing HTML is made difficult by the fact that much commercial HTML is badly formed from an XHTML point of view.

Swing contains an HTML parser that works like a SAX parser.
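As a rough sketch of what that looks like: ParserDelegator pushes events into an HTMLEditorKit.ParserCallback, much as a SAX parser pushes events into a handler. The class and method names below (SwingLinkExtractor, extractLinks) and the sample markup are made up for illustration; the markup deliberately leaves some tags unclosed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingLinkExtractor {

    // Collects the href value of every <a> start tag, in document order.
    static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup with an unclosed <p>, an unclosed <a>, and no </html>.
        String html = "<html><body><p>Fares<a href=\"/flights?from=SFO\">SFO</a>"
                + "<a href='/flights?from=JFK'>JFK</body>";
        System.out.println(extractLinks(html));
    }
}
```

Other callbacks such as handleText, handleEndTag, and handleError can be overridden in the same way if you need the text content or want to see what the parser complains about.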
With more modern web pages, you may instead have to find and call the XMLHttpRequest service that feeds the page's JavaScript, since the data is not in the HTML itself.
[ November 01, 2006: Message edited by: Peer Reynders ]
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
Originally posted by Peer Reynders:
Swing contains an HTML parser that works like a SAX parser.

I have never used that one - how does it do when fed the badly formatted HTML typical of so many pages?

The use of XMLHttpRequest and similar techniques is certainly going to make scraping even trickier in the future.
Bill