
HtmlUnit - Finding target page content type before loading

Vinoth Kumar Kannan
Ranch Hand
Posts: 276
Hello All,
Lately I've been using HtmlUnit 2.8 for web scraping. For scraping I need only HTML pages, not PDFs, MP3s, RARs, etc.
The code I've been using so far is:

The main problem is that the entire target page is loaded into memory before its content type is checked. If the URL points to a 1 MB PDF, the whole 1 MB is downloaded, and only then is it reported as application/pdf. This eats up memory and takes too much time as well. I've done some digging in the API but found nothing promising. Is there an alternative solution?
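
One possible direction, offered only as a sketch and not as HtmlUnit's own mechanism: send an HTTP HEAD request first, which asks the server for the response headers without the body, and fetch the page only when the Content-Type looks like HTML. The example below uses the JDK's HttpURLConnection instead of HtmlUnit. The class name ContentTypeProbe, the helpers isHtml and fetchContentType, the 5-second timeouts, and the example URL are all assumptions made for illustration. It also assumes the server answers HEAD with the same Content-Type it would send for GET, which not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of HtmlUnit: probes a URL's Content-Type
// with a HEAD request before any page body is downloaded.
class ContentTypeProbe {

    /** Returns true if a Content-Type header value denotes an HTML or XHTML page. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no header sent: treat the type as unknown and skip
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request, so only headers are transferred, and returns the Content-Type. */
    static String fetchContentType(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeouts, tune as needed
            conn.setReadTimeout(5000);
            return conn.getContentType(); // connects implicitly; may be null
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        String type = fetchContentType(url);
        if (isHtml(type)) {
            System.out.println("Looks like HTML, worth loading: " + url);
        } else {
            System.out.println("Skipping " + url + " (Content-Type: " + type + ")");
        }
    }
}
```

The trade-off is one extra round trip per URL, and the probe does not share cookies or session state with the HtmlUnit WebClient, so pages behind a login may report a different type than the scraper would actually receive.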