
Using a Java app for web page scraping.

 
Darrin Smith
Ranch Hand
Posts: 276
I'm gathering some data from a site using the HTMLEditorKit. The first page works fine, but from that page I get URLs that lead me to other pages that require a login.

Normally my browser (Firefox) handles this so I don't have to sign in each time, but how do you do this using what Java has available?

In other words, how do I cache the sign-in information the way IE and Firefox both do, so that when I go to the protected links I don't have to sign in again and again?

Thanks.
 
Ulf Dittmer
Rancher
Posts: 42970
You could use a library like jWebUnit (on SourceForge).
 
Darrin Smith
Ranch Hand
Posts: 276
Thanks.

I'll check into that!
 
Darrin Smith
Ranch Hand
Posts: 276
I checked it out but couldn't find any reference to doing what I need to do (although it seems like a nice tool for testing).

Do you know of any documentation that shows how to use it for what I describe?

I could find very little information about how to use it.

Thanks.
 
Ulf Dittmer
Rancher
Posts: 42970
The Quick Start page gives an overview of what's possible. For logging in, check out the section "Working With Forms" (assuming that's how the login works). If the login sets a cookie, jWebUnit will remember it and use it from then on.

If you need the complete page source (as opposed to specific elements), you can get it with the getPageSource method in the class that extends WebTestCase. You can also use saveAs to save the last accessed page to a file. The getElementTextByXPath and getElementAttributByXPath methods help you get at particular parts of the page.

And, yes, the library is meant for testing web apps, but it's superb for automating any kind of access to a web app by any Java client.
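
For anyone who would rather stay with what the JDK provides, as the original question asked, the same "remember the cookie" behaviour can be sketched with java.net.CookieManager. This is only a minimal sketch under assumptions: the class name CookieJarSketch, the example.com URLs and the SESSIONID cookie are made up for illustration, and the cookie jar is fed by hand instead of through a real login request.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of jWebUnit or the JDK.
final class CookieJarSketch {

    // One cookie jar shared by every request in the scraping run.
    private final CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

    // Stores the Set-Cookie headers of a response that came from the given URL.
    void remember(String url, String... setCookieHeaders) {
        try {
            jar.put(URI.create(url), Map.of("Set-Cookie", List.of(setCookieHeaders)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the Cookie header value the jar would send to the given URL ("" if none).
    String cookieHeaderFor(String url) {
        try {
            Map<String, List<String>> headers = jar.get(URI.create(url), Map.of());
            return String.join("; ", headers.getOrDefault("Cookie", List.of()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieJarSketch sketch = new CookieJarSketch();
        // Pretend the (assumed) login page answered with a session cookie.
        sketch.remember("http://example.com/login", "SESSIONID=abc123; Path=/");
        // A later request to a protected page on the same host gets the cookie back.
        System.out.println(sketch.cookieHeaderFor("http://example.com/members/page1"));
    }
}
```

In a real scraper, calling CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL)) once at startup should let HttpURLConnection (and so URL.openStream()) store and resend cookies on its own. Submitting the login form is still up to you.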
 
Darrin Smith
Ranch Hand
Posts: 276
Well, it looks like the tool does log me in: I can display the cookie using dumpCookies() and see that my nickname is correct. But when I try to click a protected link, it sends me back to the login page, just as if I had never logged on.

Is there perhaps some undocumented setting I need to change to get jWebUnit to use the cookie?

BTW, here is what I do:



When clickLinkWithText gets executed, the login page is returned (I can tell by inspecting the source in the debugger).
 
Ulf Dittmer
Rancher
Posts: 42970
I'm not sure, but it could be that "beginAt" starts a new conversation, and dumps any existing information, including cookies. Can you programmatically navigate to that page by following links or buttons?
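
To picture what that guess would mean, here is a plain-JDK analogy, not jWebUnit code. It assumes that a "new conversation" amounts to starting over with an empty cookie jar. The class ConversationSketch, its method names, the example.com URLs and the SESSIONID cookie are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical model of a browsing "conversation"; it does not use jWebUnit.
final class ConversationSketch {

    private CookieManager jar = newJar();

    private static CookieManager newJar() {
        return new CookieManager(null, CookiePolicy.ACCEPT_ALL);
    }

    // Models starting a new conversation: the old cookie jar is thrown away.
    void beginNewConversation() {
        jar = newJar();
    }

    // Models a successful login response that sets a session cookie.
    void login(String loginUrl, String sessionId) {
        try {
            jar.put(URI.create(loginUrl),
                    Map.of("Set-Cookie", List.of("SESSIONID=" + sessionId + "; Path=/")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if a request to the given URL would carry the session cookie.
    boolean wouldSendSessionTo(String url) {
        try {
            Map<String, List<String>> headers = jar.get(URI.create(url), Map.of());
            return headers.getOrDefault("Cookie", List.of()).stream()
                    .anyMatch(c -> c.startsWith("SESSIONID="));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ConversationSketch c = new ConversationSketch();
        c.login("http://example.com/login", "abc123");
        // Same conversation: the protected page still sees the session.
        System.out.println(c.wouldSendSessionTo("http://example.com/protected"));
        c.beginNewConversation();
        // Fresh conversation: the session cookie is gone.
        System.out.println(c.wouldSendSessionTo("http://example.com/protected"));
    }
}
```

If the library behaves like this model, following links inside the same conversation keeps the login, while starting over loses it.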
 
Darrin Smith
Ranch Hand
Posts: 276
Bingo!

I would have thought they'd offer an option to keep the same conversation alive, since using beginAt is so convenient, but at least it can be worked around.

Thanks!
[ August 19, 2007: Message edited by: Darrin Smith ]
 
Victor Ionescu
Greenhorn
Posts: 1
I resolved this issue by attaching the session cookie to the test context. I've created a function performLogin() in some base class of all my tests and in each test I call the login procedure. The login procedure calls the function makeConversationSessionAware(). Note that I misspelled the function getKookie because the correct name was not accepted by the posting engine. Here's the code:
 