
Help with webcrawling .aspx pages!

 
Devasia Manuel
Ranch Hand
Posts: 57
Hey Guys,

I've been using Java for a couple of years and have no problem crawling ordinary HTML pages, but I recently stumbled upon a website that seems impossible to crawl. Here's the link:

http://stock-forecasting.com/RealtimeDemo.aspx

My final aim is to write a program that captures the table that appears at the bottom of the page after the 'Predict' button is clicked.

No matter what I do, I just can't figure out how the value selected in the drop-down box is communicated back to the server. I tried both GET and POST requests, but got no results!

It's an ASPX page, so I guess it doesn't use GET or POST (I've got no clue how ASP.NET works and I'm just assuming).

Furthermore, I used that handy little Firefox tool called 'Tamper Data', which lets you see all communication from the client to the server, and tried to identify any request that would relay the selected stock symbol back to the server. But to no avail!

I'm really desperate, will someone please help me?

Thanks in advance,
Devasia Manuel

 
Joe Ess
Bartender
Posts: 9429
Devasia Manuel wrote:
It's an ASPX page, so I guess it doesn't use GET or POST (I've got no clue how ASP.NET works and I'm just assuming).

If it has HTTP in the URL, it is using the Hypertext Transfer Protocol, which means GET or POST (less commonly PUT, DELETE, etc.).
Many sites protect their resources from being crawled by using sessions or user agent detection. I'll bet that's what you've encountered.
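
To test both suspicions from Java, here is a minimal sketch. It assumes the page is an ordinary ASP.NET Web Forms page; such pages usually keep their state in hidden inputs (for example __VIEWSTATE) that have to be sent back with the POST, but I have not verified that this particular page does so. The class name, the helper names and the User-Agent string are made up for illustration, and the regex parsing is only good enough for a quick experiment (a real HTML parser such as jsoup is sturdier).

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PostbackSketch {
    // Hypothetical value: copy the exact string your own browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0";

    // Attribute order inside a tag varies, so the tag is matched first and
    // name/value are then pulled out of it separately.
    private static final Pattern HIDDEN_INPUT = Pattern.compile(
            "<input\\b[^>]*\\btype=[\"']hidden[\"'][^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME = Pattern.compile(
            "\\bname=([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);
    private static final Pattern VALUE = Pattern.compile(
            "\\bvalue=([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Makes HttpURLConnection store cookies and send them back on later requests. */
    static void enableCookies() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    /** Prepares a request with browser-like headers; nothing is sent until the caller connects. */
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
        return conn;
    }

    /** Collects the name/value pairs of all hidden inputs, in document order. */
    static Map<String, String> hiddenInputs(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tag = HIDDEN_INPUT.matcher(html);
        while (tag.find()) {
            Matcher name = NAME.matcher(tag.group());
            if (!name.find()) {
                continue; // an input without a name is never submitted
            }
            Matcher value = VALUE.matcher(tag.group());
            fields.put(name.group(2), value.find() ? value.group(2) : "");
        }
        return fields;
    }

    /** Encodes fields as an application/x-www-form-urlencoded body (needs Java 10+). */
    static String formEncode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

A plausible sequence: call enableCookies() once, GET the page through open(...), read the body, pass it to hiddenInputs(...), add the name/value pairs of the drop-down and the button exactly as they appear in the page source, then POST the result of formEncode(...) with Content-Type application/x-www-form-urlencoded. If the response still lacks the data, compare your request field by field with what the browser sends.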
 
Devasia Manuel
Ranch Hand
Posts: 57
Thanks for the reply Joe.

Well, I was using Tamper Data (it's a handy Firefox tool that intercepts GET and POST requests) and I couldn't find any session info encoded in the POST, nor did the site set any cookies on my PC that could carry the data.

As for the user agent, I configured the request header of my Java program to use the same User-Agent value as Firefox.

But that still didn't seem to do the trick. Any ideas?

P.S. I know that crawling this site can be done because I posted this exact problem on another forum a long while back and a very helpful programmer wrote the whole code for me! (I'm not trying to suggest anything here.) Anyway, he handed me the compiled class files, which are long lost, and I can't seem to reach him again to ask how he did it.
 
Ulf Dittmer
Rancher
Posts: 42972
That page loads a bunch of other resources in turn: various JavaScript files (including one that indicates that Flash is involved), and some ".ashx" files, whatever those are. Does your crawler know how to handle all that?
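
To see what a crawler would have to deal with, a rough sketch like the one below lists the script URLs a page references and turns them into absolute URLs. For what it's worth, ".ashx" is the extension ASP.NET uses for generic HTTP handlers, so those URLs are fetched over plain HTTP like anything else. The class and method names are invented for illustration, and the regex only catches script tags with a quoted src attribute.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResourceLister {
    // Matches <script ... src="..."> or src='...'; inline scripts have no src and are skipped.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc=([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the src values of external scripts in the order they appear. */
    static List<String> scriptSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    /** Resolves a possibly relative reference against the URL of the page it came from. */
    static String resolve(String pageUrl, String reference) {
        // URI.create throws IllegalArgumentException on malformed input such as unescaped spaces.
        return URI.create(pageUrl).resolve(reference).toString();
    }
}
```

Running scriptSources(...) on the raw page body and resolve(...) on each hit gives the list of URLs to inspect by hand before deciding whether the crawler needs any of them.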
 
Devasia Manuel
Ranch Hand
Posts: 57
Hey Ulf,

Frankly speaking, I haven't built the crawler yet. I just messed around a bit with GET and POST requests, and when I realized that didn't work, I knew I could rely on help from JavaRanch.com. I'm hoping to pick up some tips that will help me crawl this "uncrawlable" site.

As I said before, I'm not interested in the Flash files and JavaScript it loads, only in the table that appears at the bottom of the screen, which is just plain text (as far as I know).
 
Ulf Dittmer
Rancher
Posts: 42972
Start by figuring out how that table gets created. If it's part of the initial page then things are relatively easy. It's possible that it gets created by some JavaScript code, though, in which case you will have to get interested in that.
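
A quick way to tell the two cases apart is to look for table rows in the raw HTML your program receives, as opposed to what the browser shows after scripts have run. The sketch below does that with regular expressions; the class and method names are hypothetical, and it does not cope with nested tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableProbe {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the text of every table cell found in the markup, grouped by row. */
    static List<List<String>> rows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Strip any markup inside the cell and keep only its text.
                cells.add(TAG.matcher(cell.group(1)).replaceAll("").trim());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }
}
```

If rows(...) comes back empty for the response to the 'Predict' request while the browser clearly shows a table, that suggests the table is built on the client by JavaScript, and the next step is to find the request that script makes.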
 