
Get links from website RSS feed

 
Smita Ahuja
Greenhorn
Posts: 28
Hi,

I want to extract all the document hyperlinks listed on the website (basically to download the documents):
http://pdonline.brisbane.qld.gov.au/masterview/modules/applicationmaster/default.aspx?page=wrapper&key=A003608957
I tried jsoup, but it gives me all the links except the PDF hyperlinks.

I tried:


Any advice would be highly appreciated.

Thanks,
Smita
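For what it's worth, the filtering step itself can be sketched with the JDK alone. This is a minimal, hypothetical helper (the class and method names are made up here), and note a regex only sees anchors present in the raw HTML; links generated by JavaScript or ASP.NET postbacks, as on the page above, will not appear in the fetched source at all:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull every href value out of an HTML string
// and keep only the ones that point at PDF documents.
public class PdfLinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static List<String> extractPdfLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (url.toLowerCase().endsWith(".pdf")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/plan.pdf'>Plan</a>"
                + "<a href=\"index.html\">Home</a>"
                + "<a href='/docs/Report.PDF'>Report</a>";
        // prints [/docs/plan.pdf, /docs/Report.PDF]
        System.out.println(extractPdfLinks(html));
    }
}
```

With jsoup itself the equivalent selection would normally be `doc.select("a[href]")` filtered the same way, but as the replies below explain, the real problem is likely what page the client is being served, not the selection.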
 
E Armitage
Rancher
Posts: 989
First, scraping is not always legal. You should check that the site's terms of use allow it before proceeding.
Second, the site requires people to agree to those terms of use before they can access anything on it. Your code is most likely being served the terms-of-service acceptance page rather than the page with the PDFs that you requested. I don't know whether jsoup handles that kind of dynamic interaction (HtmlUnit does).
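A rough HtmlUnit sketch of that flow, assuming the interstitial exposes an agree control (the XPath and the id `agree` are placeholders, check the actual page source), and requiring the HtmlUnit dependency on the classpath:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PdOnlineScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            HtmlPage page = client.getPage(
                "http://pdonline.brisbane.qld.gov.au/masterview/modules/applicationmaster/default.aspx"
                + "?page=wrapper&key=A003608957");

            // Hypothetical: if the site serves a terms-acceptance page first,
            // click its agree control. The id "agree" is a placeholder.
            HtmlElement agree = page.getFirstByXPath("//input[@id='agree']");
            if (agree != null) {
                page = agree.click();
            }

            // List every anchor on the resulting page whose href points at a PDF.
            for (HtmlAnchor a : page.getAnchors()) {
                String href = a.getHrefAttribute();
                if (href.toLowerCase().endsWith(".pdf")) {
                    System.out.println(href);
                }
            }
        }
    }
}
```

HtmlUnit runs the page's JavaScript and carries cookies across requests, which is exactly what a plain jsoup fetch does not do.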
 
Smita Ahuja
Greenhorn
Posts: 28
Thanks for your quick reply. Could you please share an example for HtmlUnit?

I tried, but I am getting the exception below:

Exception in thread "main" org.apache.http.conn.HttpHostConnectException: Connection to http://google.com refused
 
Ulf Dittmer
Rancher
Posts: 42972
Have you checked that google.com is accessible in general from the machine where this code runs?
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37465
Your code doesn't mention Google. Why is it trying to access Google?
 