
Search Engine for pdf

 
Allen Bandela
Ranch Hand
Posts: 128
Eclipse IDE MS IE Tomcat Server
Hi everybody,
I'm a student trying to design a search engine for my school's offices where I work.
So far I have implemented a basic search engine that searches a set of PDF files. I pre-indexed a list of keywords for each PDF and display the PDFs whose keywords match the user's search phrase. I rank them by number of matches (sorting them by hit count with bubble sort). By no means is this a real search engine.

I would like to implement a real search engine that either serializes each PDF and stores every major keyword in a list, or works by some other means. The problem is that I have PDF files (and only PDFs). I tried to serialize them to get at the words, but without success.

If anyone has any suggestions, please help. Thanks.
I hope to use applets, since it should be put up on a website.
 
Dave Wingate
Ranch Hand
Posts: 262
I don't have an answer to your question about serializing the text in PDFs to extract keywords, but I can recommend that you not use bubble sort. As your list of documents grows, the performance of the search will be poor. A better solution would be to use merge sort, which is used when you call Collections.sort().

The technical explanation is that the time complexity of bubble sort is O(n^2), which is worse than the time complexity of merge sort: O(n lg n).

The non-technical explanation is that the sort routine provided by the Java Collections Framework is stable and about the best you can hope to do. Using the sort defined in that framework will help you focus on the problem that you really want to solve (creating a search engine).
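To make that suggestion concrete, here is a minimal sketch of ranking results by hit count with Collections.sort() instead of a hand-written bubble sort. The SearchHit and HitRanking classes, their fields, and the file names are hypothetical and only serve the illustration; they are not from the original poster's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one search result: a file name plus its keyword hit count.
class SearchHit {
    final String fileName;
    final int hits;

    SearchHit(String fileName, int hits) {
        this.fileName = fileName;
        this.hits = hits;
    }
}

class HitRanking {
    // Returns a new list ordered by descending hit count; the input list is not modified.
    static List<SearchHit> rank(List<SearchHit> results) {
        List<SearchHit> sorted = new ArrayList<SearchHit>(results);
        Collections.sort(sorted, new Comparator<SearchHit>() {
            @Override
            public int compare(SearchHit a, SearchHit b) {
                return Integer.compare(b.hits, a.hits); // more hits first
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<SearchHit> results = new ArrayList<SearchHit>();
        results.add(new SearchHit("parking.pdf", 1));      // made-up sample data
        results.add(new SearchHit("admissions.pdf", 4));
        results.add(new SearchHit("housing.pdf", 2));
        for (SearchHit hit : rank(results)) {
            System.out.println(hit.fileName + " (" + hit.hits + " hits)");
        }
    }
}
```

Because the library sort is stable, documents with the same hit count keep the order they had in the input list.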

See this tutorial for an example of sorting through the collections framework.
 
Ulf Dittmer
Rancher
Posts: 42970
73
What do you mean by "serializing a PDF"? From what you describe it seems that you would need to extract text from a PDF so that you can index it, correct? If that's the case, JPedal might be what you need.
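As a rough illustration of the indexing step, the sketch below builds a small in-memory index once the text is available. It assumes the text has already been extracted from each PDF by some tool and is passed in as a plain String; it does not call JPedal or any other PDF library. The KeywordIndex class and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory inverted index: word -> names of documents containing it.
class KeywordIndex {
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Splits already-extracted text into lower-case words and records the document under each.
    void addDocument(String name, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            Set<String> docs = postings.get(word);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(word, docs);
            }
            docs.add(name);
        }
    }

    // For each matching document, counts how many distinct query words it contains.
    Map<String, Integer> search(String query) {
        Map<String, Integer> hits = new HashMap<String, Integer>();
        Set<String> seen = new HashSet<String>();
        for (String word : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty() || !seen.add(word)) {
                continue; // skip empty tokens and repeated query words
            }
            Set<String> docs = postings.get(word);
            if (docs == null) {
                continue; // word does not occur in any indexed document
            }
            for (String doc : docs) {
                Integer old = hits.get(doc);
                hits.put(doc, old == null ? 1 : old + 1);
            }
        }
        return hits;
    }
}
```

The returned hit counts could then be sorted to rank the matching documents. This simple tokenizer only recognizes ASCII letters and digits, which is a simplification.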
 
Allen Bandela
Ranch Hand
Posts: 128
Eclipse IDE MS IE Tomcat Server
Hi Dittmer,
You're correct! I need to extract the words from the PDFs. I have tried to use JPedal, but to no avail. I can't understand how to integrate it with Eclipse, the editor that I'm using. It's so confusing to me. I wonder what the people who made JPedal did to extract those words from a PDF.
For now, I'm going ahead with pre-indexing them manually (i.e. by typing in the keywords; this is a fixed set of PDFs that will never change), and I'll try to use merge sort to do the ranking. Thanks, guys. About JPedal, Dittmer, is there any direction you can point me in? Thanks again.
 
Ulf Dittmer
Rancher
Posts: 42970
73
This page, which is just 2 clicks away from the home page, links to several examples and their source code.
 
Chetya Benkipuri
Greenhorn
Posts: 5
You might also have a look at Lucene.
[ April 26, 2006: Message edited by: Chetya Benkipuri ]
 