How do I crawl web pages using JSoup or HtmlUnit?

 
Ramakrishna Udupa
Ranch Hand
Hi All,

I'm using the Jsoup HTML parser in my Java application, and I want to crawl Ajax pages for review content. See THIS URL. When I crawl the reviews on that page, Jsoup gives me only the first five reviews; I'm not getting all of the reviews. Can anyone help me get the rest?
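(For context, a minimal sketch of the kind of plain Jsoup fetch being described; the URL and selector are placeholders, not taken from the actual page. A fetch like this only sees the HTML the server sends initially, which is why reviews loaded later by Ajax never show up.)

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StaticFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and selector -- a plain GET returns only the initial HTML,
        // so only the reviews present in that HTML (the first five) are found.
        Document doc = Jsoup.connect("http://www.example.com/restaurant-page")
                .userAgent("Mozilla/5.0")
                .get();
        for (Element review : doc.select("div.review")) {
            System.out.println(review.text());
        }
    }
}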

Thanks:
Ramakrishna K.C
 
Maneesh Godbole
Bartender
If you don't own the site, I suspect there is not much you can do.
Check out https://developers.google.com/webmasters/ajax-crawling/docs/getting-started to understand what the problem is and one possible solution (which needs to be implemented on the server side).
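(In short, the scheme described there maps "hash-bang" URLs to a special query parameter that the server must understand. A generic illustration, not the URL from this thread, and it only helps if the site has implemented the scheme:)

// Hash-bang URL as the browser sees it (hypothetical example):
String ajaxUrl = "http://www.example.com/page#!state=reviews";

// The "ugly" URL a crawler may request instead -- only useful if the
// server actually serves a rendered snapshot for _escaped_fragment_ requests:
String crawlerUrl = ajaxUrl.replace("#!", "?_escaped_fragment_=");
// -> http://www.example.com/page?_escaped_fragment_=state=reviews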
 
Ramakrishna Udupa
Ranch Hand
Hi Maneesh,

Thanks for the reply. I read that content and tried replacing # with ?_escaped_fragment_=, but it didn't work. I'm still confused. When I open the page with Firebug under "Console -> All", I can see the URL, and I can see that the next 5 reviews are coming back in Firebug. But how do I get that content? I don't know how to fetch it. Is there any way to fetch it with the Jsoup HTML parser?

Thanks:
Ramakrishna K.C
 
Maneesh Godbole
Bartender
Please re-read the article one more time. Just rewriting the URL does not work; you need a supporting component on the server side to render the data.
Did you try hitting the URL you see in Firebug? If that works, you will have to
1) Crawl the initial page
2) Identify the "get next" URL
3) Crawl the new data
4) Repeat
(a rough sketch of this loop follows below)
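(A rough Jsoup-based sketch of that loop, under the assumption that the "get next" URL can actually be found in the page; the URL and selectors here are hypothetical:)

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PagedReviewCrawler {
    public static void main(String[] args) throws Exception {
        String nextUrl = "http://www.example.com/reviews";        // 1) the initial page (placeholder)
        while (nextUrl != null) {
            Document doc = Jsoup.connect(nextUrl).userAgent("Mozilla/5.0").get();
            for (Element review : doc.select("div.review")) {      // 3) crawl the new data (placeholder selector)
                System.out.println(review.text());
            }
            Element next = doc.select("a.load-more").first();      // 2) identify the "get next" URL (placeholder selector)
            nextUrl = (next != null) ? next.absUrl("href") : null; // 4) repeat until there is no next link
        }
    }
}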
 
Ulf Dittmer
Rancher
IIRC, JSoup deals with static content (at least last time I looked into it, a few years back). You may want to try a library that has good support for executing embedded JavaScript, like HtmlUnit. If HtmlUnit can't do it, then there is probably no way (besides coding it yourself, of course).
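(To give a feel for it, a minimal HtmlUnit sketch with JavaScript enabled; the URL is a placeholder and the exact option methods depend on the HtmlUnit version:)

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();                          // default browser emulation
        webClient.getOptions().setJavaScriptEnabled(true);              // let embedded JavaScript run
        webClient.getOptions().setThrowExceptionOnScriptError(false);   // tolerate script errors on real-world pages

        HtmlPage page = webClient.getPage("http://www.example.com/");   // placeholder URL
        webClient.waitForBackgroundJavaScript(10000);                   // give Ajax calls time to finish
        System.out.println(page.asText());                              // text content after scripts have run

        webClient.closeAllWindows();                                    // cleanup (method name in the 2.x releases of that era)
    }
}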
 
Ramakrishna Udupa
Ranch Hand
Regarding the second step, i.e.
2) Identify the "get next" URL: there is no such URL on the page. In Firebug I just see the text "LOAD MORE 75 COMMENTS". When I click on it, 5 extra comments come in and the text changes to "LOAD MORE 70 COMMENTS".

The next URL I find in Firebug is "http://www.zomato.com/php/social_load_more.php". If you open this page directly, you'll find only this much

Contents:

{"page":2,"left_count":70,"html":"
\n

In Firebug, the contents are the next 5 comments. For every click of LOAD MORE COMMENTS, the same URL is fired again and again. Jsoup is not extracting those page ids, classes, etc. Can you please elaborate on this step from the article:
"Set up your server to handle requests for URLs that contain _escaped_fragment_."?
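(For what it's worth, a heavily hedged sketch of hitting such a JSON endpoint directly and parsing the embedded HTML fragment with Jsoup. Opening the URL without parameters returns only the stub shown above, so the real request parameters would have to be copied from what Firebug records; they are omitted here. The sketch also assumes the org.json library is on the classpath:)

import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoadMoreFetch {
    public static void main(String[] args) throws Exception {
        // The request parameters Firebug shows for this endpoint are NOT included here;
        // without them the endpoint returns only the stub response quoted above.
        String body = Jsoup.connect("http://www.zomato.com/php/social_load_more.php")
                .userAgent("Mozilla/5.0")
                .ignoreContentType(true)             // the response is JSON, not HTML
                // .data("page", "2")                // hypothetical parameter, to be taken from Firebug
                .execute()
                .body();

        JSONObject json = new JSONObject(body);       // e.g. {"page":2,"left_count":70,"html":"..."}
        Document fragment = Jsoup.parse(json.getString("html"));
        System.out.println(fragment.text());          // the next batch of review comments
    }
}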

Thanks:
Ramakrishna K.C
 
Ramakrishna Udupa
Ranch Hand
Okay, I'll try that, Ulf Dittmer.

Thanks:
Ramakrishna K.C
 
Ramakrishna Udupa
Ranch Hand
@Ulf Dittmer
I tried HtmlUnit. I'm getting a lot of exceptions, even though I included all the required JAR files on my classpath. The exception is



My source code is




I can't figure out how to solve this. The exception mentions log4j (which I marked with red text). I included log4j and all the other JARs on my classpath. Any idea why this exception occurs?

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher
The log4j message is just a warning, not an error - you can ignore it.

That missing class is part of the Rhino library - it's strange that it doesn't ship with HtmlUnit. You can get it here.
 
Ramakrishna Udupa
Ranch Hand
I downloaded Rhino, extracted it, and copied js.jar and js-14.jar. Now it is showing

Exception:



I don't know what this exception is about.

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher
Are you sure you added all libraries that come with HtmlUnit correctly, and didn't make a typo somewhere? That class is in xml-apis-1.4.01.jar, which comes with HtmlUnit.
 
Ramakrishna Udupa
Ranch Hand
Hey, I got HtmlUnit set up, thanks. But when I run it against the URL, it is showing this exception


Source code:


Exception:
 
Ramakrishna Udupa
Ranch Hand
Hi All,

I think HtmlUnit will not work, but I'm not sure. I also posted my source code in my previous posts (please see what the problem is, if possible). So are there any other solutions for crawling dynamic content from a web page?

@Maneesh Godbole: Can you please send some sample code for getting dynamic content from websites (if possible)?

Thanks:
Ramakrishna K.C
 
Ramakrishna Udupa
Ranch Hand
Hi All,

I got stuck with HtmlUnit. This is my page to crawl. On this page, if you want to see more user reviews, you have to click the "Load More" option. How can you do that using HtmlUnit? I want all the review comments at once. It's internally calling Ajax. Is there any solution, guys, for crawling this page?

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher
What do you have so far, and where exactly are you stuck making progress?
 
Ramakrishna Udupa
Ranch Hand
Thanks Ulf Dittmer, Maneesh. I finally got it; it was such a silly mistake.

@Ulf Dittmer: I was stuck on getting the div element and casting it to a clickable button.

My source code is


It's just 5 lines of code, but it took almost 2 days.
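(The posted code did not survive here. Based on the XPath and the Ajax controller mentioned later in this thread, the snippet presumably looked roughly like this; a reconstruction, not the poster's exact code:)

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LoadMoreClick {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController()); // resynchronize Ajax calls
        HtmlPage currentPage = webClient.getPage("http://www.zomato.com/...");  // the restaurant page (URL shortened)
        HtmlDivision loadMore = currentPage.getFirstByXPath("//div[@class = 'load-more']");
        if (loadMore != null) {
            currentPage = loadMore.click();           // triggers the Ajax call for the next 5 reviews
        }
        System.out.println(currentPage.asText());
        webClient.closeAllWindows();
    }
}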

Once again thanks All..
Ramakrishna K.C
 
Ramakrishna Udupa
Ranch Hand
Hi All,

HtmlUnit was working until yesterday. Suddenly the very first line is giving an exception, and I can't figure out what the error is.

This is my code




Exception is:




Thanks
Ramakrishna K.C
 
Ramakrishna Udupa
Ranch Hand
@Ulf Dittmer HtmlUnit is not working. It works only for small programs. For example, the 5-line code I posted works only for 2-3 dynamic pages. In my case, lots of threads crawl simultaneously.
Is there any other tool, similar to HtmlUnit, that is good for crawling dynamic web pages? I tried Crawljax, but it has a lot of problems; the basic example code from GitHub does not work.

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher

"HtmlUnit is not working."


Please read https://coderanch.com/how-to/java/ItDoesntWorkIsUseless

"In my case, lots of threads crawl simultaneously."


Why do you mention concurrency? Do you suspect that the "failure" you describe (whatever that is) is connected to it? If so, did you notice that the javadocs of the WebClient class mention that it is not thread-safe?
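(If concurrency is indeed involved, one common approach, sketched here under the assumption that each worker can afford its own client, is to give every crawling thread its own WebClient instead of sharing one; the URLs are placeholders:)

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerThreadCrawler {
    public static void main(String[] args) {
        String[] urls = { "http://www.example.com/a", "http://www.example.com/b" }; // placeholder URLs
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String url : urls) {
            pool.submit(() -> {
                // WebClient is not thread-safe, so create one per task and do not share it
                WebClient webClient = new WebClient();
                webClient.setAjaxController(new NicelyResynchronizingAjaxController());
                try {
                    HtmlPage page = webClient.getPage(url);
                    System.out.println(url + " -> " + page.getTitleText());
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    webClient.closeAllWindows();    // cleanup (method name in 2.x-era HtmlUnit)
                }
            });
        }
        pool.shutdown();
    }
}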
 
Ramakrishna Udupa
Ranch Hand
@Ulf Dittmer I mean it is crawling, but in my program I wrote a function to click a button, and only after that do I crawl. The problem is that it crawls the page without clicking that button.

Example: see THIS page. You will find all the reviews for this restaurant. There is a button below the reviews, i.e. LOAD MORE. If you click it, you get another five reviews. The page itself is being crawled; that's not the problem. I wrote a click function in my program, but it is not clicking before crawling.
My code is


This is my snippet.


Thanks
Ramakrishna K.C
 
Ulf Dittmer
Rancher
I don't think I understand the exact problem - that DIV is present, and after it is clicked currentPage.asText() is considerably longer than before (about 23400 bytes vs. 19200 bytes), so it would seem that something gets added to the page. Is that not what you want?
 
Ramakrishna Udupa
Ranch Hand
When I run the program, I click LOAD MORE through the program (the snippet I posted). But it is not even entering that snippet. I mean, even when the page contains the LOAD MORE button, it is not entering the if condition in my snippet.

if(currentPage.getFirstByXPath("//div[@class = 'load-more']") != null)

That is the line from my posted code.

Instead of clicking that button and then crawling, it just shows this message:

Jan 09, 2014 4:36:29 PM com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController processSynchron
INFO: Re-synchronized call to http://www.zomato.com/php/social_load_more.php
Facebook Cross-Domain Messaging helper
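(One thing worth trying, offered as a hedged suggestion rather than something verified against this page: the "load-more" div is likely added by the page's own JavaScript, so it may simply not exist yet when getFirstByXPath runs. Giving the background JavaScript time to finish before querying the DOM, and again after the click, sometimes helps. Same imports as the earlier sketches:)

WebClient webClient = new WebClient();
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage currentPage = webClient.getPage("http://www.zomato.com/...");   // page URL shortened

// Let pending Ajax/JavaScript jobs finish (up to 10 seconds) before querying the DOM
webClient.waitForBackgroundJavaScript(10000);

HtmlDivision loadMore = currentPage.getFirstByXPath("//div[@class = 'load-more']");
if (loadMore != null) {
    currentPage = loadMore.click();
    webClient.waitForBackgroundJavaScript(10000);   // and again after the click, so the new reviews are present
}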
 
Ulf Dittmer
Rancher
That's odd - it does that for me.
 
Ramakrishna Udupa
Ranch Hand
Are there any solutions for that?

I tried Crawljax, a crawling tool similar to HtmlUnit, and got fed up with it.
 
Ulf Dittmer
Rancher
This code works fine for me:
 
Ramakrishna Udupa
Ranch Hand
How is this possible? I think you didn't change anything, right? Did you change anything in the code I posted?
 