Hi everybody, I'm a student trying to design a search engine for the school offices where I work. So far I have implemented a basic search over a bunch of PDF files. I pre-indexed a list of keywords for each PDF, and I display the PDFs whose keywords match the user's search phrase. I rank them by number of matches (sorted by hit count using bubble sort). By no means is this a real search engine.
I would like to implement a real search engine, one that either serializes each PDF and stores every major keyword in a list, or works by some other means. The problem is that I have PDF files (and only PDFs). I tried to serialize them to get at the words, but without success.
If anyone has any suggestions, please help. Thanks. I hope to use applets, since this should be put up on a website.
I don't have an answer to your question about serializing the text in PDFs to extract keywords, but I can recommend that you not use bubble sort. As your list of documents grows, the performance of the search will be poor. A better solution would be to use merge sort, which is what you get when you call Collections.sort().
The technical explanation is that the time complexity of bubble sort is O(n^2), which is worse than the time complexity of merge sort: O(n lg n).
The non-technical explanation is that the sort routine provided by the Java Collections Framework is stable and about the best you can hope to do. Using the sort defined in that framework will let you focus on the problem that you really want to solve (creating a search engine).
See this tutorial for an example of sorting through the collections framework.
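To give a rough idea of what that could look like for ranking by hits, here is a minimal sketch. The Result class, its field names and the file names are made up for illustration; they are not from your code. It only assumes you already have a match count for each PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class RankByHits {

    // Hypothetical holder for one search result: a file name and its match count.
    static class Result {
        final String fileName;
        final int hits;

        Result(String fileName, int hits) {
            this.fileName = fileName;
            this.hits = hits;
        }
    }

    // Returns a new list ordered from most hits to fewest.
    // Collections.sort is stable, so results with equal hits keep their original order.
    static List<Result> rank(List<Result> results) {
        List<Result> sorted = new ArrayList<Result>(results);
        Collections.sort(sorted, new Comparator<Result>() {
            @Override
            public int compare(Result a, Result b) {
                return Integer.compare(b.hits, a.hits); // descending by hit count
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<Result> results = Arrays.asList(
                new Result("handbook.pdf", 2),
                new Result("calendar.pdf", 5),
                new Result("forms.pdf", 2));
        for (Result r : rank(results)) {
            System.out.println(r.fileName + " (" + r.hits + " hits)");
        }
    }
}
```

Running main lists calendar.pdf first, then handbook.pdf and forms.pdf in their original order, since the two tie on hits and the sort is stable.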
What do you mean by "serializing a PDF"? From what you describe it seems that you would need to extract text from a PDF so that you can index it, correct? If that's the case, JPedal might be what you need.
Hi Dittmer, you're correct! I need to extract the words from the PDFs. I have tried to use JPedal, but to no avail. I can't understand how to integrate it with Eclipse, the editor that I'm using. It's so confusing to me. I wonder what the guys who made JPedal did to extract those words from a PDF. As of now, I'm going ahead with pre-indexing them manually (i.e. by typing in the keywords; this is a fixed set of PDFs that will never change), and I'll try to use merge sort to do the ranking. Thanks, guys. About JPedal, Dittmer, do you have any direction you can point me to? Thanks again.
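For the manual pre-indexing approach described above, one possible shape is a small inverted index: a map from each typed-in keyword to the PDFs tagged with it, plus a lookup that counts how many query words hit each PDF. This is only a sketch under assumptions; the KeywordIndex class, its method names and the sample file names and keywords are hypothetical, and it does no text extraction from the PDFs themselves.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class KeywordIndex {

    // Lower-cased keyword -> names of the PDFs tagged with that keyword.
    private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    // Registers the manually typed keywords for one PDF.
    void add(String fileName, String... keywords) {
        for (String keyword : keywords) {
            String key = keyword.toLowerCase(Locale.ROOT);
            Set<String> files = index.get(key);
            if (files == null) {
                files = new TreeSet<String>();
                index.put(key, files);
            }
            files.add(fileName);
        }
    }

    // Counts, for each PDF, how many distinct words of the query match its keywords.
    Map<String, Integer> countHits(String query) {
        String[] parts = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> words = new HashSet<String>(Arrays.asList(parts));
        Map<String, Integer> hits = new TreeMap<String, Integer>();
        for (String word : words) {
            Set<String> files = index.get(word);
            if (files == null) {
                continue; // no PDF is tagged with this word
            }
            for (String file : files) {
                Integer old = hits.get(file);
                hits.put(file, old == null ? 1 : old + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordIndex idx = new KeywordIndex();
        idx.add("handbook.pdf", "parking", "permit", "fees");
        idx.add("calendar.pdf", "exam", "schedule", "fees");
        System.out.println(idx.countHits("parking fees"));
    }
}
```

The hit counts that come back can then be sorted to produce the ranking, and PDFs that match none of the query words simply do not appear in the result.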