Extracting Data
Hi forum,

I want to extract PDFs and text files from popular news sites. I have already found the sites that host PDFs and offer a download option. I want to pull those PDFs into my site's database and later retrieve them from the database to display on my site. If I supply the link, is there any way to extract the file directly into our database? I am still unsure which technology would give the best results. Are there any other ways to get PDFs into my database?
Alternatively, if I enter the URL of the content web page in a text field, the data would be extracted and saved directly in the database, and I would later retrieve it from the database for my site.
Is either of these two approaches possible?
Can anyone help me with their valuable suggestions?

thank you
Aside from the technical issues, you should check if you are legally allowed to copy content from those other websites before you do this. Just because other websites make content available to you doesn't mean you're allowed to republish that content yourself.
Thank you, Jesper Young, for your suggestion.
But I have all the rights to access that data. Those are my colleagues' sites, so there is no problem in that regard. Do you have any ideas regarding the process and the coding?
Thank you
The Apache HttpClient library can access and download web content if you know the URLs. If you only know the domain names (but not the exact URLs), it gets a lot trickier; you'd essentially have to implement a web spider that parses the HTML pages for PDF links (and links to other pages where you could continue spidering). You can probably find existing spiders written in Java on java-source.net.
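As a rough sketch of the link-finding part of such a spider: the snippet below scans a page's HTML for href attributes ending in .pdf and resolves them against the page URL. The class and method names (PdfLinkExtractor, extractPdfLinks) are made up for illustration, and the regular expression is a simplification; a real spider would more likely use a proper HTML parser and would also follow links to other pages.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class PdfLinkExtractor {
    // Matches href="..." or href='...'; a simplification of real HTML parsing.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns absolute URIs of all links on the page whose target ends in ".pdf".
    static List<URI> extractPdfLinks(String html, URI pageUrl) {
        List<URI> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            if (href.toLowerCase().endsWith(".pdf")) {
                try {
                    // Turn relative links into absolute ones.
                    result.add(pageUrl.resolve(href));
                } catch (IllegalArgumentException e) {
                    // Skip links that are not valid URIs.
                }
            }
        }
        return result;
    }
}
```

Each URI this returns could then be handed to whatever code does the actual download.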
Thank you very much, Ulf Dittmer.
I have just started searching as you suggested.
In the meantime, any other suggestions are welcome.

thank you
Hi Ulf Dittmer,
As you suggested, I looked into web spiders and related concepts. There is some code available, but it is for database applications, not web-based ones.
What I have in mind is that the data files (PDF, HTML, or images) go directly into my database server whenever I supply the URL of the file.
Is this possible? The first thing I am confused about is where to start. I know how to get data from the database to the website, but not how to get a data file from a web page into the database.

thank you
There are two steps to it. First, you need to download the page/file through HTTP; the HttpClient library can help you with that. That will result either in a file on disk, or in a byte[] in memory. Either way, the second step is to store the binary data in the DB, most likely in a Blob field; that's probably the simpler part, but neither part is really hard. Let us know if you encounter any problems.
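A minimal sketch of those two steps, using only the JDK (URL.openStream and JDBC) instead of HttpClient so that it needs no extra libraries. The class name DocumentFetcher, the table name documents, and its columns source_url and content are assumptions made up for this example; adjust them to your own schema.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper, not part of any library.
class DocumentFetcher {
    // Step 1: download the resource at the given URL into a byte[] in memory.
    static byte[] download(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes(); // requires Java 9 or later
        }
    }

    // Step 2: store the bytes in a binary (Blob-style) column.
    // Assumes a table: documents(source_url VARCHAR, content BLOB).
    static void store(Connection conn, String url, byte[] content) throws SQLException {
        String sql = "INSERT INTO documents (source_url, content) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, url);
            ps.setBytes(2, content);
            ps.executeUpdate();
        }
    }
}
```

With an open JDBC Connection conn, usage would be along the lines of DocumentFetcher.store(conn, url, DocumentFetcher.download(url)). For very large files, streaming to the database with setBinaryStream would avoid holding the whole file in memory.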
Odd that you first say the files are on "popular news sites" and then say the sites belong to your colleagues as soon as someone questions your legal standing.

Why can't you just link to the documents where they are? Why do you have to make copies?
Thank you very much, Ulf Dittmer.
I felt there must be a solution along the lines you describe, but I am not familiar with the HttpClient library,
so I will start by looking into that. I am hopeful there is a solution. I will be back soon with my results; in the meantime, please share any ideas you have on this.

As for Ernest Friedman-Hill: I will certainly give all the links if I succeed. I want to make a site for people who speak my regional language.

thanks for your suggestions
I'm still not understanding. In your HTML on www.yoursite.com, you can just include

<a href="http://www.othersite.com/document.pdf">Click here to read PDF document on other site</a>

and you're done. What more do you need?

Thank you, Ernest Friedman-Hill. Of course I can do as you said to display the file, but it is not a static site. After publishing my site, I will need to keep adding data, and I can't change the HTML pages for every link.
For that I am creating an admin panel; along with the data I will add fields such as name, content site name, date, etc. for display.
Those entries are also displayed under latest items, most-read items, etc.
I think you get my point now.
Thank you again for your valuable suggestion.

kishore venv wrote:
I think you get my point now.

No, I am still not getting it. If the site is dynamic, then the HTML is generated dynamically, but it can still link to a document in its original location, rather than a pirated copy on your own site. Your database can simply contain the link to the remote document.
Why can't you keep (and update) the metadata in a DB on your site without having to keep a copy of the content?
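To illustrate, here is a small sketch of generating the link dynamically from metadata kept in your DB, with the document itself staying on the remote site. The class DocumentLinkRenderer and the fields passed in (title, site name, URL) are hypothetical; in practice the values would come from a row in your own metadata table.

```java
// Hypothetical helper, not part of any library.
class DocumentLinkRenderer {
    // Escapes the characters that are significant in HTML text and attributes.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds an anchor pointing at the document in its original location.
    static String renderLink(String title, String siteName, String url) {
        return "<a href=\"" + escapeHtml(url) + "\">"
                + escapeHtml(title) + " (" + escapeHtml(siteName) + ")</a>";
    }
}
```

Adding a new document then means inserting one metadata row through the admin panel; no HTML page has to be edited by hand.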
OK, fine, let me explain clearly.
I am building a site with categories such as
regional news,
film reviews,
help desk (for social awareness programs held in the surrounding area), etc.
It is just a showcase of that information.
As I already said, when a user searches by category, they find data files along with the content site name;
of course, the content site name is shown as a link. If the user wants to read the data, he/she reads the file and then chooses another category. If the user wants to read more topics in that category, he/she goes to the content site by clicking the content site name link. Maybe my language has confused you. That is the project I want to do.
Thanks for your suggestions
So they are not from your colleagues' sites like you said? I guess I am confused too.

