
How to download the complete webpage with HtmlUnit or crawler4j?

 
Ranch Hand
Posts: 407


As far as I know, HtmlUnit gives me the HTML and JS of a page, but I don't know how to display it the way it looked in the browser.

This is how I'm reading the web page, but how do I download it?
 
Rancher
Posts: 5008

how to display like it was in browser


What do the third-party packages (com.gargoylesoftware) you are using do?
Do they provide all the functionality that a browser does, such as executing JavaScript properly?
 
Bartender
Posts: 2270
By "download", do you mean saving the contents to local disk?
 
Saloon Keeper
Posts: 7551
I think you need to tell us what you mean by "download". On July 27 you posted code that saves an HtmlPage object to a file (https://coderanch.com/t/682566/java/download-individual-item-local-storage#3202839); if that is not "downloading" by your definition, please tell us what is.
 
Niti Kapoor
Ranch Hand
Posts: 407
It prints the HTML of the website, but I want a complete crawler.
 
Niti Kapoor
Ranch Hand
Posts: 407
I'm trying with crawler4j, but nothing is downloaded to my folder. Can you tell me why?
 
Niti Kapoor
Ranch Hand
Posts: 407
This is the code using crawler4j, from BasicCrawler.java.

 
Niti Kapoor
Ranch Hand
Posts: 407
Please ignore the second code snippet above (basic controller.java).
 
Rancher
Posts: 4801
What links is it not finding?
 
Niti Kapoor
Ranch Hand
Posts: 407
It is finding the links, but it is not downloading anything to the storage folder.
 
Niti Kapoor
Ranch Hand
Posts: 407
Yes, Swastik, I mean saving the content to local disk.
 
Tim Moores
Saloon Keeper
Posts: 7551
You'll need to write the code that saves the page to disk yourself. Note that the visit method does not currently do that. The ImageCrawler example does it for all the images - it's probably easier to extend that example to also save the HTML, since the code already shows how to treat file names.

Note that the example does not work as-is, because it assumes that all URLs start with "http://uci.edu/", which is no longer correct due to the redirect to "https://uci.edu/". But that's an easy fix.
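One way to make that scope check tolerant of the http-to-https redirect is to compare the host rather than a fixed URL prefix. The following is a minimal sketch, not the code from the crawler4j example: the class name `CrawlScope` and its method are hypothetical, and accepting subdomains such as "www." is an extra assumption you may not want.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: decides whether a URL belongs to the site being crawled. */
final class CrawlScope {
    private final String host; // for example "uci.edu"

    CrawlScope(String host) {
        this.host = host.toLowerCase(Locale.ROOT);
    }

    /** True for http and https URLs on the host itself or on one of its subdomains. */
    boolean contains(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat it as out of scope
        }
        String scheme = uri.getScheme();
        String urlHost = uri.getHost();
        if (scheme == null || urlHost == null) {
            return false;
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        urlHost = urlHost.toLowerCase(Locale.ROOT);
        boolean web = scheme.equals("http") || scheme.equals("https");
        // Assumption: subdomains count as part of the same site.
        return web && (urlHost.equals(host) || urlHost.endsWith("." + host));
    }
}
```

In a crawler4j crawler, a check like `new CrawlScope("uci.edu").contains(href)` would presumably replace the `startsWith` test inside `shouldVisit`.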
 
Niti Kapoor
Ranch Hand
Posts: 407
I don't want to save just the HTML. I want to save the page as it is, the way it opens in a browser.
 
Tim Moores
Saloon Keeper
Posts: 7551
What does that mean? The page consists of HTML, images, CSS and JavaScript. How is saving the constituent parts different from what you want to achieve? Please give an example web page, and list what you would want to save as a result of crawling it.
 
Niti Kapoor
Ranch Hand
Posts: 407
For example, http://www.thehindu.com/ is a web page. I want to download this page as it is, consisting of HTML, CSS and JS.
 
Tim Moores
Saloon Keeper
Posts: 7551
OK, so you want the HTML, JS and CSS, but not the images. You may need to enable binary content in the config, as crawler4j seems to regard part of what that site serves as binary. (There's an error message to that effect in its output.)

Apart from that you'll need to alter the "visit" method to save HTML, JS and CSS files. I had already mentioned where to find example code for that.
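The saving step itself needs nothing beyond the JDK. Below is a rough sketch of a helper that maps a URL to a file under the storage folder and writes the bytes there. `PageSaver` and its methods are hypothetical names, not part of crawler4j. The mapping ignores query strings and assumes URL paths are valid file names on your platform.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that mirrors crawled resources under a storage folder. */
final class PageSaver {
    private final Path root;

    PageSaver(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    /** Maps a URL to root/host/path, using "index.html" for directory-like URLs. */
    Path localPathFor(String url) {
        URI uri = URI.create(url);
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html";
        }
        // Drop the leading slash so the path resolves below root/host.
        String relative = path.startsWith("/") ? path.substring(1) : path;
        Path target = root.resolve(uri.getHost()).resolve(relative).normalize();
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("URL escapes the storage folder: " + url);
        }
        return target;
    }

    /** Writes the content to the mapped file, creating parent folders as needed. */
    Path save(String url, byte[] content) throws IOException {
        Path target = localPathFor(url);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }
}
```

Inside crawler4j's `visit(Page page)` method, the arguments would presumably be `page.getWebURL().getURL()` and `page.getContentData()`; check both against the crawler4j version you are using.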

You should also read the "Terms of use" to make sure what you're doing is in accordance with those.
 
Niti Kapoor
Ranch Hand
Posts: 407
Basically, I want to save the web page for offline browsing. I hope it's clearer now.
 
Tim Moores
Saloon Keeper
Posts: 7551
OK, so you DO want the images after all. In that case, starting with the image crawler example might be easier; you'll just need to adapt it to store HTML, JS and CSS as well.

Note that that particular web site also has an uncommon extension (".ece"), so the code needs to accommodate it and treat it as HTML. But that, too, is a small change.
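A small classifier by file extension is one way to accommodate ".ece". This is only a sketch: `ResourceTypes` and `Kind` are hypothetical names, the extension lists are illustrative rather than complete, and treating extension-less URLs as HTML is an assumption.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: guesses the kind of resource from the URL's file extension. */
final class ResourceTypes {
    enum Kind { HTML, SCRIPT, STYLESHEET, IMAGE, OTHER }

    static Kind classify(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // the path excludes query and fragment
        } catch (IllegalArgumentException e) {
            return Kind.OTHER; // malformed URL
        }
        if (path == null) {
            return Kind.OTHER;
        }
        path = path.toLowerCase(Locale.ROOT);
        int dot = path.lastIndexOf('.');
        if (dot <= path.lastIndexOf('/')) {
            return Kind.HTML; // assumption: no extension means a page
        }
        switch (path.substring(dot + 1)) {
            case "html":
            case "htm":
            case "ece": // the uncommon extension mentioned above, treated as HTML
                return Kind.HTML;
            case "js":
                return Kind.SCRIPT;
            case "css":
                return Kind.STYLESHEET;
            case "png":
            case "jpg":
            case "jpeg":
            case "gif":
            case "svg":
                return Kind.IMAGE;
            default:
                return Kind.OTHER;
        }
    }
}
```

The Content-Type response header is usually a more reliable signal than the extension, so this check is best treated as a fallback.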

Let us know if you have specific questions about making these changes. I don't know if crawler4j actually supports this use case - it would mean keeping file names in sync so that the HTML files reference the corresponding JS, CSS and image files; have you found anything regarding this?
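Whether or not crawler4j supports it directly, the synchronization could be done as a post-processing step: once every resource has a local file name, rewrite the `src` and `href` attributes in the saved HTML. Here is a minimal sketch, assuming you have built a map from absolute URL to local path yourself. `LinkRewriter` is a hypothetical name, and the regular expression is a shortcut, not a real HTML parser.

```java
import java.net.URI;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: points src/href attributes at locally saved copies. */
final class LinkRewriter {
    // Matches src="..." or href='...' with either quote style.
    private static final Pattern ATTR = Pattern.compile(
            "\\b(src|href)\\s*=\\s*([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    /**
     * @param html       the page source
     * @param baseUrl    URL the page was fetched from, used to resolve relative links
     * @param savedFiles absolute URL mapped to the local path to reference instead
     */
    static String rewrite(String html, String baseUrl, Map<String, String> savedFiles) {
        URI base = URI.create(baseUrl);
        Matcher m = ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // default: leave the attribute untouched
            try {
                String absolute = base.resolve(m.group(3).trim()).toString();
                String local = savedFiles.get(absolute);
                if (local != null) {
                    replacement = m.group(1) + "=" + m.group(2) + local + m.group(2);
                }
            } catch (IllegalArgumentException e) {
                // Unparseable reference: keep the original attribute.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

References that are not in the map, such as links to other sites, are left as they are. URLs inside CSS files or generated by JavaScript are not handled by this sketch.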
 