
How to download the complete webpage with HtmlUnit or crawler4j?

 
Niti Kapoor
Ranch Hand
Posts: 259


As far as I know, HtmlUnit will give me the HTML and JS of a page, but I don't know how to display it the way it appears in a browser.

This is how I'm reading the webpage, but how do I download it?
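For reference, HtmlUnit's HtmlPage class has a save(File) method that writes the page to disk together with the images and stylesheets it references. A minimal sketch (assuming HtmlUnit 2.x; the URL and output path are only examples):

```java
import java.io.File;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SavePage {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // many real-world pages have JS errors; don't abort on them
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage("http://www.thehindu.com/");
            // save(File) writes the page plus its referenced images/CSS
            page.save(new File("saved/page.html"));
        }
    }
}
```

This saves one page and its resources; crawling a whole site is a separate concern.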
 
Norm Radder
Rancher
Posts: 2240
"how to display like it was in browser"

What do the third-party packages (com.gargoylesoftware) you are using do?
Do they provide all the functionality that a browser does, like executing JavaScript properly?
 
Swastik Dey
Rancher
Posts: 1812
By "download", do you mean save the contents to local disk?
 
Tim Moores
Saloon Keeper
Posts: 3967
I think you need to tell us what you mean by "download". On July 27 you had posted code that saves an HtmlPage object to a file (https://coderanch.com/t/682566/java/download-individual-item-local-storage#3202839); if that is not "downloading" by your definition, please tell us what is.
 
Niti Kapoor
Ranch Hand
Posts: 259
It prints the HTML of the website, but I want a complete crawler.
 
Niti Kapoor
Ranch Hand
Posts: 259
I'm trying with crawler4j, but nothing downloads to my folder. Can you tell me why?
 
Niti Kapoor
Ranch Hand
Posts: 259
BasicCrawler.java — this is the code, using crawler4j.

 
Niti Kapoor
Ranch Hand
Posts: 259
Please ignore the second code snippet above; it is BasicController.java.
 
Dave Tolls
Ranch Foreman
Posts: 3011
What links is it not finding?
 
Niti Kapoor
Ranch Hand
Posts: 259
It's finding them, but it's not downloading them to the storage folder.
 
Niti Kapoor
Ranch Hand
Posts: 259
Yes, Swastik, I mean save the content to local disk.
 
Tim Moores
Saloon Keeper
Posts: 3967
You'll need to write the code that saves the page to disk yourself. Note that the visit method does not currently do that. The ImageCrawler example does it for all the images - it's probably easier to extend that example to also save the HTML, since the code already shows how to treat file names.

Note that the example as is does not work as it assumes that all URLs  start with "http://uci.edu/" - which due to the redirect to "https://uci.edu/" is not correct. But that's an easy fix.
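A sketch of what such a visit override could look like (assuming crawler4j's WebCrawler/Page API; the storage folder and the flat file-name scheme are my own placeholders, not part of the library):

```java
import java.io.File;
import java.nio.file.Files;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class SavingCrawler extends WebCrawler {

    // hypothetical target folder; crawler4j does not create this for you
    private static final File STORAGE = new File("storage");

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            // derive a flat file name from the URL; a real crawler would keep
            // the directory structure so saved pages can reference each other
            String name = url.replaceFirst("^https?://", "")
                             .replaceAll("[^A-Za-z0-9._-]", "_");
            try {
                STORAGE.mkdirs();
                Files.write(new File(STORAGE, name).toPath(), page.getContentData());
            } catch (Exception e) {
                logger.warn("Could not save " + url, e);
            }
        }
    }
}
```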
 
Niti Kapoor
Ranch Hand
Posts: 259
I don't want to save just the HTML; I want to save the page as it is when it opens in a browser.
 
Tim Moores
Saloon Keeper
Posts: 3967
What does that mean? The page consists of HTML, images, CSS and JavaScript. How is saving the constituent parts different from what you want to achieve? Please give an example web page, and list what you would want to save as a result of crawling it.
 
Niti Kapoor
Ranch Hand
Posts: 259
http://www.thehindu.com/ — this is an example. I want to download this webpage as it is, consisting of HTML, CSS and JS.
 
Tim Moores
Saloon Keeper
Posts: 3967
OK, so you want the HTML, JS and CSS, but not the images. You may need to enable binary content in the config, as crawler4j seems to regard part of what that site serves as binary. (There's an error message to that effect in its output.)

Apart from that, you'll need to alter the "visit" method to save HTML, JS and CSS files. I had already mentioned where to find example code for that.

You should also read the site's "Terms of use" to make sure what you're doing is in accordance with them.
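As far as I recall, the binary-content switch lives on CrawlConfig. A hedged sketch (method names as in crawler4j 4.x; the storage folder name is just an example):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class CrawlerSetup {
    static CrawlConfig makeConfig() {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("crawl-root");      // crawler4j's internal working dir
        config.setIncludeBinaryContentInCrawling(true);  // don't skip content served as binary
        return config;
    }
}
```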
 
Niti Kapoor
Ranch Hand
Posts: 259
I basically want to save the webpage for offline browsing. I hope that's clearer now.
 
Tim Moores
Saloon Keeper
Posts: 3967
OK, so you DO want the images after all. In that case, starting with the image crawler example might be easier; you'll just need to adapt it to store HTML, JS and CSS as well.

Note that that particular web site also has an uncommon extension (".ece"), so the code needs to accommodate it and treat it as HTML. But that, too, is a small change.

Let us know if you have specific questions about making these changes. I don't know if crawler4j actually supports this use case - it would mean keeping file names in sync so that the HTML files reference the corresponding JS, CSS and image files; have you found anything regarding this?
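The ".ece" handling can be isolated in a small helper. This is a hypothetical method of my own (not part of crawler4j) that maps a crawled URL to a flat local file name, treating ".ece" as HTML:

```java
// Hypothetical helper: map a crawled URL to a flat local file name,
// treating the site's uncommon ".ece" extension as HTML.
public class LocalNames {

    public static String toLocalFileName(String url) {
        // drop the scheme, then replace anything unsafe in a file name
        String name = url.replaceFirst("^https?://", "")
                         .replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.endsWith(".ece")) {
            // The Hindu serves articles with ".ece"; store them as HTML
            name = name.substring(0, name.length() - 4) + ".html";
        } else if (!name.matches(".*\\.(html?|css|js|png|jpe?g|gif)")) {
            // extension-less pages (e.g. a section front page) are HTML too
            name = name + ".html";
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(toLocalFileName("http://www.thehindu.com/news/article123.ece"));
    }
}
```

Note this flattens the site's directory structure, so it sidesteps (rather than solves) the link-rewriting problem mentioned above.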
 