Validate URL - Looking for something robust

 
Ranch Hand
Hi!

There are loads of topics around the net describing how to check whether a URL exists in Java. They seem to work fine when the URLs are rather simple, but fail on messier URLs.

For example, if I pass "http://google.com", the method returns true. That's good. If I pass "http://google.com/i_dont_exsits", the method returns false (404). Also good. Now if I pass a "messy" URL, like this:

The method returns false (503). That's not right. If I enter that URL in a browser, I see that the URL is perfectly valid and working.

Why does my method return false when it should return true?
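
For readers following along: the original method is not shown in this thread, so the following is only a sketch of the kind of HEAD-based check the question describes. The class name UrlCheck, the 5-second timeouts, and the choice to count any 2xx or 3xx status as "exists" are all assumptions, not the poster's code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class UrlCheck {

    // Assumption: any 2xx or 3xx status counts as "the URL exists".
    static boolean isSuccess(int status) {
        return status >= 200 && status < 400;
    }

    // Sends a HEAD request and reports whether the server answered with a success status.
    // Returns false for malformed URLs, non-HTTP URLs, and network errors.
    static boolean exists(String address) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // hypothetical timeout values
            conn.setReadTimeout(5000);
            try {
                return isSuccess(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (IOException | ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String address = args.length > 0 ? args[0] : "http://google.com";
        System.out.println(address + " exists: " + exists(address));
    }
}
```

A check like this sends Java's default User-Agent, so a server that filters on that header can answer 503 even though the same URL loads in a browser.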
 
author & internet detective
My guess is that Amazon is trying to avoid people scraping the site and looking for a user agent header or the like.

This Selenium code correctly returns response code 200 for your URL.

 
Saloon Keeper

Jeanne Boyarsky wrote:My guess is that Amazon is trying to avoid people scraping the site and looking for a user agent header or the like.


I think you are right.  If you do a GET rather than a HEAD you will see this in the content returned with the 503:
 
Ron McLeod
Saloon Keeper
I think your best chance of not having the server reject your request would be to:
  • specify a User-Agent for a real browser (for me, running your code as-is, the User-Agent was set as Java/1.8.0_45)
  • use the GET method rather than HEAD
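
The two suggestions above could be sketched roughly as follows with plain HttpURLConnection. This is a minimal sketch under assumptions: the class name BrowserLikeCheck, the timeouts, and the User-Agent string are illustrative only, and the User-Agent value is a placeholder for whatever your own browser actually sends.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class BrowserLikeCheck {

    // Illustrative browser-style value (an assumption); copy the string your own browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36";

    // Prepares (but does not yet send) a request with the given method and User-Agent.
    static HttpURLConnection open(String address, String method) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod(method);
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(5000); // hypothetical timeout values
        conn.setReadTimeout(5000);
        return conn;
    }

    // Sends a GET and returns the HTTP status code.
    static int statusOf(String address) throws IOException {
        HttpURLConnection conn = open(address, "GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        String address = args.length > 0 ? args[0] : "http://google.com";
        try {
            System.out.println(address + " -> " + statusOf(address));
        } catch (IOException e) {
            System.out.println(address + " -> request failed: " + e);
        }
    }
}
```

Whether a given server accepts the request still depends on its own filtering rules, so treat the result as a hint rather than a guarantee.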
    Ron McLeod
    Saloon Keeper
    Interestingly, Amazon replies successfully to HEAD requests when the User-Agent is set to the one for this old browser:
     
    P Marksson
    Ranch Hand

    Ron McLeod wrote:I think your best chance of not having the server reject your request would be to:

  • specify a User-Agent for a real browser (for me, running your code as-is, the User-Agent was set as Java/1.8.0_45)
  • use the GET method rather than HEAD

    That seems to solve the problem (as does doing a GET with Jersey Client without setting a user agent). However, a GET will fetch the entire page, I think. I assume that would be a problem in an environment where performance is critical.
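
One possible way to limit the cost of GET, sketched here under assumptions, is to try HEAD first and fall back to GET only when the server rejects HEAD. The class name HeadThenGet, the retry statuses (403, 405, 503), the timeouts, and the User-Agent value are all hypothetical choices for illustration. Note also that this code never reads the response body: it asks only for the status code and then disconnects, although how much data the server has already sent by then is outside its control.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class HeadThenGet {

    // Illustrative browser-style value (an assumption); substitute a real browser's string.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36";

    // Assumption: these statuses suggest HEAD was rejected and GET is worth a try.
    static boolean shouldRetryWithGet(int headStatus) {
        return headStatus == 403 || headStatus == 405 || headStatus == 503;
    }

    // Sends one request and returns its status code without reading the body.
    static int request(String address, String method) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod(method);
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(5000); // hypothetical timeout values
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Cheap HEAD first; GET only when HEAD looks like it was refused.
    static boolean exists(String address) {
        try {
            int status = request(address, "HEAD");
            if (shouldRetryWithGet(status)) {
                status = request(address, "GET");
            }
            return status >= 200 && status < 400;
        } catch (IOException | ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String address = args.length > 0 ? args[0] : "http://google.com";
        System.out.println(address + " exists: " + exists(address));
    }
}
```

With this arrangement, sites that answer HEAD normally cost a single lightweight request, and only the stricter ones pay for the second round trip.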

    Jeanne Boyarsky wrote:My guess is that Amazon is trying to avoid people scraping the site and looking for a user agent header or the like.

    This Selenium code correctly returns response code 200 for your URL.



    Is that code opening a browser?
     
    Jeanne Boyarsky
    author & internet detective

    P Marksson wrote:Is that code opening a browser?


    No. It's using HtmlUnit which does everything in memory.
     