Scrape Stock Brokerage Site with Groovy?

 
Siegfried Heintze
Ranch Hand
Posts: 417
I'd like to write a little app that would tally the value of multiple brokerage accounts. I'm thinking of sites like E*TRADE or Charles Schwab, where one might have an account for trading stocks.

Should I use the java.net.URL class or Apache's HttpClient?

Should I use java.net.PasswordAuthentication, as exemplified at http://www.java2s.com/Code/Java/Network-Protocol/javanetPasswordAuthenticationPasswordAuthenticationStringuserNamecharpassword.htm, or just pass the username and password as POST parameters?

I assume these issues are independent of the choice between Java and Groovy.

Is there a nice GUI tool that will give me the XPath for a desired piece of data in the raw HTML that I scrape?

Thanks,
Siegfried
 
Joe Ess
Bartender
Posts: 9428
Is what you propose permitted by the End User Agreements of the sites you plan to scrape?
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
Back in the days before "web 2.0" - when you could expect output like what you are talking about to be composed as a single HTML page - you could expect to capture the response from a single URL and get content you could "screen scrape." (A term which dates back to mainframes and terminals.)

This has not been the case for quite a while - these days what looks like a simple page could be composed from dozens of separate requests. I suggest you use something like the Firebug add-on for Firefox and take a close look at the captured sequence of requests and responses that creates the page.

Your sites may already provide for SOAP or RESTful requests to get formatted data you can easily use. Services like Amazon or Google have been exposing these interfaces for years.

Bill
 
Ulf Dittmer
Rancher
Posts: 42972
A library like HtmlUnit would be much easier to use than HttpClient or java.net.URL. It provides high-level methods for accessing page elements, including XPath.
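
To give a rough idea of what XPath-based extraction looks like, here is a minimal sketch. Note that it does not use HtmlUnit: to stay runnable without extra dependencies it uses the JDK's own javax.xml.xpath on a small, well-formed snippet. The markup, the "accounts" table id, the "value" class, and the class and method names are all made up for illustration. A real brokerage page will be structured differently and is usually not well-formed XML, which is where a library like HtmlUnit helps.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical, well-formed markup standing in for a scraped account page.
    static final String SAMPLE =
        "<html><body><table id='accounts'>"
        + "<tr><td class='name'>Brokerage A</td><td class='value'>1200.50</td></tr>"
        + "<tr><td class='name'>Brokerage B</td><td class='value'>799.50</td></tr>"
        + "</table></body></html>";

    // Parses the markup as XML and totals every cell marked class='value'.
    static double totalValue(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // XPath 1.0 sum() converts each matching node's text to a number and adds them.
        String expr = "sum(//table[@id='accounts']//td[@class='value'])";
        return (Double) xpath.evaluate(expr, doc, XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(totalValue(SAMPLE));
    }
}
```

The same XPath expression style carries over to HtmlUnit, but the parsing step above only works because the sample is valid XML. The code should run unchanged from Groovy as well, since it only uses JDK classes.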
 