Validate URL - Looking for something robust

 
P Marksson
Ranch Hand
Posts: 41
Hi!

There are loads of topics around the net describing how to check whether a URL exists in Java. They seem to work fine when the URLs are rather simple, but fail on messier URLs.

For example, if I pass "http://google.com", the method returns true. That's good. If I pass "http://google.com/i_dont_exsits", the method returns false (404). Also good. Now if I pass a "messy" URL, like this:

The method returns false (503). That's not right. If I enter that URL in a browser, I see that the URL is perfectly valid and working.

Why does my method return false when it should return true?
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37513
My guess is that Amazon is trying to stop people from scraping the site and is checking for a User-Agent header or the like.

This Selenium code correctly returns response code 200 for your URL.

 
Ron McLeod
Bartender
Posts: 1603
Jeanne Boyarsky wrote:My guess is that Amazon is trying to avoid people scraping the site and looking for a user agent header or the like.

I think you are right.  If you do a GET rather than a HEAD you will see this in the content returned with the 503:
 
Ron McLeod
Bartender
Posts: 1603
I think your best chance of not having the server reject your request would be to:
  • specify a User-Agent for a real browser (for me, running your code as-is, the User-Agent was set as Java/1.8.0_45)
  • use the GET method rather than HEAD
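The two suggestions above could be combined roughly as follows. This is only a sketch using the JDK's HttpURLConnection: the class name, method names, timeouts and the exact User-Agent string are assumptions chosen for illustration, and it assumes an http or https address.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class BrowserLikeUrlCheck {

    // Assumption: any browser-style User-Agent string; this one is just an example.
    static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    /** Sends a GET with a browser-like User-Agent and returns the HTTP status code. */
    static int statusOf(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", BROWSER_USER_AGENT);
            conn.setConnectTimeout(5000); // milliseconds, arbitrary choice
            conn.setReadTimeout(5000);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Treats 2xx and 3xx as "exists"; network errors and malformed input count as false. */
    static boolean exists(String address) {
        try {
            int code = statusOf(address);
            return code >= 200 && code < 400;
        } catch (IOException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String address = args.length > 0 ? args[0] : "http://google.com";
        System.out.println(address + " exists: " + exists(address));
    }
}
```

Whether a given server accepts this is up to the server; a site that wants to block automated requests can still do so on other grounds.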
 
Ron McLeod
Bartender
Posts: 1603
Interestingly, Amazon will reply successfully to HEAD requests when the User-Agent is set to the one for this old browser
     
P Marksson
Ranch Hand
Posts: 41
Ron McLeod wrote:I think your best chance of not having the server reject your request would be to:
  • specify a User-Agent for a real browser (for me, running your code as-is, the User-Agent was set as Java/1.8.0_45)
  • use the GET method rather than HEAD

That seems to solve the problem (so does doing a GET with Jersey Client without setting a user agent). However, I think doing a GET will fetch the entire page. I assume that would be a problem in an environment where performance is critical.
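On the performance worry, one possible mitigation is sketched below. It is not something suggested elsewhere in this thread, and the class and method names are made up for illustration. With HttpURLConnection, getResponseCode() returns once the status line and headers have arrived, so the code never has to read the body; adding a Range header additionally asks the server for a single byte, although a server is free to ignore it and answer with a normal 200.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class CheapGetCheck {

    /**
     * Sends a GET that asks for only the first byte and returns the status code.
     * A server that honours the Range header typically answers 206; one that
     * ignores it answers as it would for a plain GET.
     */
    static int statusWithoutBody(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Range", "bytes=0-0"); // optional hint, may be ignored
            conn.setConnectTimeout(5000); // milliseconds, arbitrary choice
            conn.setReadTimeout(5000);
            // Only the status line and headers are needed; the body is never read.
            return conn.getResponseCode();
        } finally {
            conn.disconnect(); // drop the connection instead of draining the body
        }
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "http://google.com";
        System.out.println(statusWithoutBody(address));
    }
}
```

If you use this, treat 206 as a success alongside 200 when deciding whether the URL exists.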
Jeanne Boyarsky wrote:My guess is that Amazon is trying to avoid people scraping the site and looking for a user agent header or the like.

This Selenium code correctly returns response code 200 for your URL.

Is that code opening a browser?
     
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37513
P Marksson wrote:Is that code opening a browser?

No. It's using HtmlUnit, which does everything in memory.
     