
Proxy Set in java

 
Ramakrishna Udupa
Ranch Hand
Posts: 254
Hi All,

I'm using a proxy to crawl some data. For now I have this snippet:



Now, I have a thread, say T, which calls this IPchange() method to use the proxy. Thread T then creates other threads, say T1 and T2, and I give them URLs to crawl. T1 and T2 do not call IPchange(). When they crawl those URLs, are T1 and T2 using my original IP?

Thanks:
Ramakrishna K.C
 
Paul Clapham
Sheriff
Posts: 22836
  • Likes 1
I'm not sure what your question is... do you want to know how to do that, or do you want to know whether it will happen?

At any rate there are two things you need to know:

1. The proxySet system property doesn't do anything, and hasn't done anything since Java 1.1. But people keep copying it from other people without understanding it.

2. Those are system properties. That means that their current values will be used by any code in the JVM which asks for them.
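
As a rough sketch of point 2 (not code from this thread): the class below sets the two proxy properties once and then starts worker threads that never touch them, to show that every thread in the JVM reads the same values. `configureProxy` is a hypothetical stand-in for the `IPchange()` method in the question, and `proxy.example.com:8080` is a placeholder, not a real proxy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedProxyProperties {

    // Hypothetical stand-in for IPchange(); host and port are placeholders.
    static void configureProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
        // No "proxySet" property here: it has no effect.
    }

    // Starts one thread per name. The threads never call configureProxy();
    // each one only records the proxy host it can see.
    static Map<String, String> proxyHostSeenByWorkers(String... names)
            throws InterruptedException {
        Map<String, String> seen = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[names.length];
        for (int i = 0; i < names.length; i++) {
            String name = names[i];
            workers[i] = new Thread(
                    () -> seen.put(name, System.getProperty("http.proxyHost", "<none>")),
                    name);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        configureProxy("proxy.example.com", 8080);
        System.out.println(proxyHostSeenByWorkers("T1", "T2"));
    }
}
```

The properties only affect connections opened after they are set. For `https://` URLs the corresponding properties are `https.proxyHost` and `https.proxyPort`.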
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
Then what I'm crawling is not going through the proxy? The pages are being crawled from my original IP?

I want to use a proxy to crawl the pages. How do I do that, then?

Thanks:
Ramakrishna K.C
 
Paul Clapham
Sheriff
Posts: 22836
You still didn't clarify the question. Let me restate what I just said:

If those system properties have values at a particular time, then your code will use the values if it runs at that time. If not, then the code won't use the values. If that doesn't help then you're going to have to explain.
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
Paul Clapham wrote:I'm not sure what your question is... do you want to know how to do that, or do you want to know whether it will happen?


I thought my code was correct and that it was setting the proxy, i.e. using the proxy to crawl. But you said:

Paul Clapham wrote:The proxySet system property doesn't do anything, and hasn't done anything since Java 1.1. But people keep copying it from other people without understanding it.


Now I want to know how to set a proxy in Java.

I want to crawl the pages through a proxy, so my program needs to use one. How do I set the proxy in my code?
Currently I'm using the code I posted. It is not working; I checked just now by crawling blocked pages.

Thanks:
Ramakrishna K.C
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
I found another snippet, but the problem is that connection timeout exceptions occur often.



Is there any way to fetch the page through a proxy with this jsoup code without so many exceptions?

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher
Posts: 42972
Note that Paul said something about "proxySet" being non-functional, but he didn't say anything about "http.proxyHost" and "http.proxyPort" being non-functional - those are, in fact, functional. A different approach is to use a Proxy, like the latest code snippet you posted does. You need to decide which of these approaches you want to implement - you should not mix them.

I would assume that the Proxy object applies only to the connection it's used with, unlike system properties, which apply to all connections (like Paul said). You should read the javadocs to confirm exactly how that works.
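
For comparison, here is a minimal sketch of the per-connection approach using only the standard library (it is not the jsoup snippet from the post above). The proxy is passed to `URL.openConnection(Proxy)`, so it applies to that one connection and no system properties are involved. The class and method names, the host `proxy.example.com:8080`, and the 10-second timeouts are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

class PerConnectionProxy {

    // Describes an HTTP proxy; createUnresolved avoids a DNS lookup at this point.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Prepares a connection that uses the given proxy for this connection only.
    // Nothing is sent until a method such as getResponseCode() is called.
    static HttpURLConnection open(String url, Proxy proxy) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        conn.setConnectTimeout(10_000); // milliseconds; placeholder values
        conn.setReadTimeout(10_000);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        Proxy proxy = httpProxy("proxy.example.com", 8080);
        HttpURLConnection conn = open("http://example.com/", proxy);
        System.out.println("Proxy: " + proxy + ", target: " + conn.getURL());
    }
}
```

With this approach each worker thread has to be handed the `Proxy` (or create its own), since nothing is shared JVM-wide.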
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
Ulf Dittmer wrote:I would assume that the Proxy object applies only to the connection it's used with, unlike system properties, which apply to all connections


I'm using multi-threading, so I want all threads to use the proxy. Then I have to use System.setProperty("http.proxyHost", proxy);. But the code I posted (the older one) does not work. I referred to this page.

I'm using jsoup to crawl the pages. In jsoup there is no proxy option.

Thanks:
Ramakrishna K.C
 
Paul Clapham
Sheriff
Posts: 22836
Ramakrishna Udupa wrote:I'm using multi-threading, so I want all threads to use the proxy. Then I have to use System.setProperty("http.proxyHost", proxy);. But the code I posted (the older one) does not work.


Then your original code should work just fine. Line 6 is useless and should be deleted so other people reading your code don't get the idea that it does anything, but the rest of the code is OK. You would have to call the method before you tried to make any connections, of course. As for "will not work"... have a look at our FAQ page ItDoesntWorkIsUseless (<-- click that link) and let us know.
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
@Paul Clapham I'm trying the same code (the old one) because my original IP got blocked by some sites. When I use proxies, it shows a 403 exception. Does that mean the proxy IP is not being set and my original IP is being sent?
 
Paul Clapham
Sheriff
Posts: 22836
Usually 403 means "Forbidden", i.e. the request was refused. You didn't post the whole exception, which means that people reading the post don't know enough. Possibly the proxy server is refusing access to your code -- maybe it requires authentication? Or it might mean something else. It might be a good idea if you spent a few minutes talking to your network people, the ones who run that proxy server. They might have some useful information, not necessarily about Java programming but about how proxy servers actually work.
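
If the proxy does turn out to require a username and password, one standard-library way to supply them is a default `java.net.Authenticator`. The sketch below is only an illustration under that assumption: the class name `ProxyCredentials` and the credentials are made up, and nothing in this thread confirms that the proxy in question uses authentication at all.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

class ProxyCredentials {

    // Installs a JVM-wide authenticator that answers only proxy challenges.
    // The user name and password are placeholders supplied by the caller.
    static void install(String user, String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Hand out credentials to the proxy, never to the target site.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null;
            }
        });
    }

    public static void main(String[] args) {
        install("proxyUser", "proxyPassword");
        System.out.println("Default authenticator installed");
    }
}
```

Like the proxy system properties, the default authenticator is shared by every thread in the JVM, so it would need to be installed before the first connection is made.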
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
This is the exception:

 
Paul Clapham
Sheriff
Posts: 22836
So either the proxy server is refusing the connection or www.yelp.com is refusing it. The error message doesn't say which, and anyway your code (and the code it's calling) can't tell which because the low-level code is dealing with the proxy.

But if it's really www.yelp.com you're trying to connect to, then no doubt the site has noticed that you're connecting via a bot and not via a browser. That's contrary to the site's Terms of Service, in particular section 6.B.iii.

If you want programmatic access to www.yelp.com you should use their developer tools instead of trying to scrape their site. Here's the link: http://www.yelp.com/developers

 
Ramakrishna Udupa
Ranch Hand
Posts: 254
Paul Clapham wrote:if it's really www.yelp.com you're trying to connect to

Yes. I'm connecting to that site.

Paul Clapham wrote:then no doubt the site has noticed that you're connecting via a bot and not via a browser

Is there any way to crawl their site without using the developer tools?


Paul Clapham wrote:If you want programmatic access to www.yelp.com you should use their developer tools instead of trying to scrape their site. Here's the link: http://www.yelp.com/developers

Those are API examples. I think we have to buy that. Without buying that can I crawl?

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher
Posts: 42972
Ramakrishna Udupa wrote:Without buying that can I crawl?

You quote extensively from Paul's post, but it seems you overlooked the part that addresses that. After reading his post again, including the links it contains, what do you think the answer is?
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
How can I check whether a request to a given site is actually sent through the proxy?

I mean, see this code:



In this code I set the proxy, but I don't know whether the request is actually sent through it. Is there any way to confirm that we are using the proxy?

Thanks:
Ramakrishna K.C
 
Ulf Dittmer
Rancher
Posts: 42972
You could use a firewall to block all outgoing traffic except to the proxy host.

If the proxy runs on the same machine as the client, a tool like Wireshark will tell you what's happening.
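
A lighter-weight check from inside the JVM, offered as a sketch rather than a replacement for the network-level checks above: ask the default `ProxySelector` which proxy it would choose for a given URL. This only reports what the JVM intends to do based on its configuration; it does not prove what actually leaves the machine. The class name `ProxyCheck` and the host `proxy.example.com:8080` are assumptions for illustration.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

class ProxyCheck {

    // Returns the proxies the JVM's default selector would try for this URL.
    // A result containing only a DIRECT entry means no proxy would be used.
    static List<Proxy> proxiesFor(String url) {
        return ProxySelector.getDefault().select(URI.create(url));
    }

    public static void main(String[] args) {
        // Placeholder proxy settings.
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");

        for (Proxy proxy : proxiesFor("http://example.com/")) {
            System.out.println(proxy.type() + " -> " + proxy.address());
        }
    }
}
```

`HttpURLConnection` also has a `usingProxy()` method that reports whether a particular connection is going through a proxy.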
 
Ramakrishna Udupa
Ranch Hand
Posts: 254
Ulf Dittmer wrote:After reading his post again, including the links it contains, what do you think the answer is?


Sorry Ulf, I really didn't get it. That's why I asked: SHOULD I BUY?
 
Ulf Dittmer
Rancher
Posts: 42972
You asked "Without buying that can I crawl?" - is that question really still open after you have read the section of the Terms of Service that Paul pointed you to?

We're not in a position to tell you what you should or should not buy. But there should be no doubt what the answer to your question is.