
Avoiding web crawlers

 
saikrishna cinux
Ranch Hand
Posts: 689
Hi All,

I am planning to develop a site using JSP and HTML, and I want to provide complete protection against web crawlers (web bots).

Nowadays, with the rapid growth of the internet, anyone can easily copy the content of pages using some kind of software (web crawlers).

So, I want to stop web crawlers from crawling my pages and copying the entire website.

How can I do this? Has anybody overcome this problem?

Please suggest something about this.
Thanks in advance!

regards,
sai krishna C
 
Amos Matt
Greenhorn
Posts: 17
Could you tell me the folder structure of your website project?
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Amos Matt:
Could you tell me the folder structure of your website project?


For example, it is laid out this way:
www.xxx.com/~yyy
/index.jsp
/x.jsp
/y.html
.....
and so on.

Please suggest something.
 
Ben Souther
Sheriff
Posts: 13411
If your pages aren't password protected and are open to the public, anyone will be able to crawl through them.
Why are you concerned with this?
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ben Souther:
If your pages aren't password protected and are open to the public, anyone will be able to crawl through them.
Why are you concerned with this?


Ben, I am not using any authentication on my site; you can think of it as being like the JavaRanch site.
On such a site a user may explore everywhere as a guest (without any password) and crawl everywhere using some sort of software.
 
Joe Ess
Bartender
Posts: 9362
Most crawlers obey the robots exclusion standard. There is also an HTML meta tag (see same article) that can instruct crawlers to ignore a page or links.
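For reference, the exclusion rules live in a plain-text robots.txt file at the site root, and the meta tag goes in a page's <head>. Both are advisory only, so they stop polite crawlers but not a determined copier. A minimal example might look like this:

    # robots.txt - ask all crawlers to stay out of the whole site
    User-agent: *
    Disallow: /

    <!-- per-page alternative, placed inside <head> -->
    <meta name="robots" content="noindex, nofollow">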
 
Ben Souther
Sheriff
Posts: 13411
How can I do this? Has anybody overcome this problem?

I guess a good question at this point would be:
What, exactly, has you concerned?
or
What, exactly, is the problem?

Is it search engines that have you concerned or someone else?
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ben Souther:

I guess a good question at this point would be:
What, exactly, has you concerned?
or
What, exactly, is the problem?

Is it search engines that have you concerned or someone else?


Hi Ben,
There is no problem with search engine crawlers.
The problem is only with third-party software used by an intruder (end user) to copy all the HTML pages to his local system.
I hope I have been specific this time!



 
Ilja Preuss
author
Sheriff
Posts: 14112
The only way to totally prevent that is to not put the page publicly online. After all, anyone who can look at a page also can save it to his hard disk.
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ilja Preuss:
The only way to totally prevent that is to not put the page publicly online. After all, anyone who can look at a page also can save it to his hard disk.



Then there is no security over our content.
There should be something to stop this. As software professionals we should not say negative things.



+hp =Everything is possible
 
Ben Souther
Sheriff
Posts: 13411
Firefox Browser Redhat VI Editor
Originally posted by saikrishna cinux:



Then there is no security over our content.
There should be something to stop this. As software professionals we should not say negative things.



+hp =Everything is possible


Think about it.
What does a web browser do?
It downloads your material to the user's local machine.
That's what it's supposed to do.
The server is supposed to make that content available.
Hyperlinks are there to show users (often someone clicking on links) what other pages are available for download.

If your content needs to be secured, password protect your site.
Then only people with the necessary credentials can download content from it.
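In a JSP/servlet container, one common way to do that is container-managed security declared in web.xml. The following is only a sketch, assuming BASIC authentication and a made-up role name ("member"); how the realm and its users are configured depends on your server:

    <security-constraint>
        <web-resource-collection>
            <web-resource-name>Protected pages</web-resource-name>
            <url-pattern>/*</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <role-name>member</role-name>
        </auth-constraint>
    </security-constraint>
    <login-config>
        <auth-method>BASIC</auth-method>
        <realm-name>Protected Site</realm-name>
    </login-config>
    <security-role>
        <role-name>member</role-name>
    </security-role>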
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ben Souther:


Think about it.
What does a web browser do?
It downloads your material to the user's local machine.
That's what it's supposed to do.
The server is supposed to make that content available.
Hyperlinks are there to show users (often someone clicking on links) what other pages are available for download.

If your content needs to be secured, password protect your site.
Then only people with the necessary credentials can download content from it.



OK Ben, that is a good answer so far,
and you are exactly correct on this point.
The web browser will copy all the data to the local system (irrespective of any login or password).

But there must be some kind of restriction on the end user copying the entire content of the page.

If we can do this, then we can bring a revolution and create a benchmark.

What do you say?
 
Ulf Dittmer
Rancher
Posts: 42970
Originally posted by saikrishna cinux:
But there must be some kind of restriction on the end user copying the entire content of the page.

Why? What's the difference between someone reading a page, and someone copying the contents of a page? The web is a public medium; if you don't want something disseminated, don't put it online, or add a login for accessing it.
If we can do this, then we can bring a revolution and create a benchmark.

I have no idea what this means.
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ulf Dittmer:

I have no idea what this means.


Hi Ulf,

Congratulations on your 10K posts here; I saw your 9,999th post.
OK, you are right: there is no difference between seeing a web page's content and copying it to the local system.

But the thing is, when a user runs web crawler or web spider software he can pull millions of pages onto his local system,
and sites like the large community "Orkut" can easily be crawled to mine (extract) phone numbers, email IDs, and so on; confidential or personal data can be accessed at once and misused.

I hope I have made my idea clear!

Thanks!


regards
sai krishna c

 
Ulf Dittmer
Rancher
Posts: 42970
If you leave personal or confidential information on a publicly accessible page, what do you expect? Of course it gets misappropriated. In any case, I'm at a loss to understand the amount of personal detail some people choose to make available about themselves on the web; naive is about the nicest word I can find for this behavior.
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24213
You cannot stop automated page downloads because they don't look any different from non-automated ones.

Except...

Downloading a million pages could take a long, long time if your site limited the bandwidth used by any one client. You might add a servlet filter that checked each request against a list of recent request IP addresses, and refused to serve a page if the previous request was less than X seconds ago. I imagine there are commercial products with this sort of capability built in.
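A minimal sketch of such a filter is shown below. The class name, the two-second threshold, and the error response are illustrative choices, not from any particular product; a real deployment would also have to evict stale entries and cope with proxies or many users sharing one IP address:

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;

    // Refuses to serve a page if the same client IP made a request
    // less than MIN_INTERVAL_MS milliseconds ago.
    public class ThrottleFilter implements Filter {

        private static final long MIN_INTERVAL_MS = 2000; // the "X seconds" threshold

        // remote IP address -> time (ms) of the last request served to it
        private final Map<String, Long> lastRequest = new ConcurrentHashMap<String, Long>();

        public void init(FilterConfig config) { }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String ip = req.getRemoteAddr();
            long now = System.currentTimeMillis();
            Long last = lastRequest.get(ip);
            if (last != null && (now - last) < MIN_INTERVAL_MS) {
                // Too many requests in a short time: refuse this one.
                ((HttpServletResponse) res).sendError(
                        HttpServletResponse.SC_SERVICE_UNAVAILABLE, "Too many requests");
                return;
            }
            lastRequest.put(ip, now);
            chain.doFilter(req, res);
        }

        public void destroy() { }
    }

The filter would be mapped to /* in web.xml like any other servlet filter.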
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ernest Friedman-Hill:
You cannot stop automated page downloads because they don't look any different from non-automated ones.

Except...

Downloading a million pages could take a long, long time if your site limited the bandwidth used by any one client. You might add a servlet filter that checked each request against a list of recent request IP addresses, and refused to serve a page if the previous request was less than X seconds ago. I imagine there are commercial products with this sort of capability built in.



Ernest, may I know the names of some commercial (and/or free) products?

Of course, this is a really good idea!

Great suggestion, boss.
 
Ben Souther
Sheriff
Posts: 13411
Originally posted by saikrishna cinux:

Ernest, may I know the names of some commercial (and/or free) products?


Read Ernest's post again.

I imagine there are commercial products with this sort of capability built in.


"Imagine" is the keyword there.
You will have to search for such a product yourself.
[ August 31, 2007: Message edited by: Ben Souther ]
 
saikrishna cinux
Ranch Hand
Posts: 689
Originally posted by Ben Souther:


"Imagine" is the keyword there.
You will have to search for such a product yourself.

[ August 31, 2007: Message edited by: Ben Souther ]


OK, dear Ben.
Anyway, you have a good eye for each and every word in the posts!
Good!
 