I have a collection of around 1,500 documents. I parse each document and extract tokens. These tokens are stored in a HashMap (as keys), and the total number of times each token occurs in the collection (i.e. its frequency) is stored as the value.
I have to extend this to build an inverted index. That is, for each term (key): the number of documents it occurs in --> DocNo | frequency in that document. For example,
I have to build on this code, but I can't really think of a good way to implement an inverted index.
So far, I've thought of making the value a 2D array. The term would be the key, and the value (i.e. the 2D array) would store the docId and termFreq.
Please let me know if my logic is correct. Thanks!
That still sounds like something you can do with a Map. If you look up Map in the Java Tutorials, there is a counting example. It is obviously simpler than your problem, but it might suggest some principles about keeping counts.
You could create a class to hold a DocNum and TermFreq value (called, say, DocFreq). Then have a Map keyed on the term, with a List of DocFreq objects as the value.
Then, for any given term, the size of the List gives you the document frequency, and each object in the List gives you a DocNum and TermFreq value.
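A minimal sketch of what that suggestion might look like. The names DocFreq, index, and addOccurrence are illustrative assumptions, not code from this thread, and it assumes documents are parsed one at a time, in order:

```java
import java.util.*;

class DocFreq {
    final int docNum;
    int termFreq;

    DocFreq(int docNum) {
        this.docNum = docNum;
    }
}

public class InvertedIndexSketch {
    // term -> list of (docNum, termFreq) postings
    static Map<String, List<DocFreq>> index = new HashMap<>();

    // Record one occurrence of term in document docNum.
    static void addOccurrence(String term, int docNum) {
        List<DocFreq> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
        DocFreq last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last != null && last.docNum == docNum) {
            // Same document as the previous occurrence: just bump the count.
            last.termFreq++;
        } else {
            DocFreq df = new DocFreq(docNum);
            df.termFreq = 1;
            postings.add(df);
        }
    }

    public static void main(String[] args) {
        addOccurrence("java", 1);
        addOccurrence("java", 1);
        addOccurrence("java", 2);
        List<DocFreq> p = index.get("java");
        System.out.println(p.size());          // document frequency: 2 documents
        System.out.println(p.get(0).termFreq); // 2 occurrences in doc 1
    }
}
```

The document frequency falls out for free as the size of the postings list, which is exactly the point of the suggestion above.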
F Merchant wrote:I have a collection of around 1,500 documents. I parse each document and extract tokens. These tokens are stored in a HashMap (as keys), and the total number of times each token occurs in the collection (i.e. its frequency) is stored as the value.
I have to extend this to build an inverted index. That is, for each term (key): the number of documents it occurs in --> DocNo | frequency in that document.
That doesn't sound like an "inverted" index to me; just a different level of frequency count.
Also, "DocNo | frequency in that document" doesn't make much sense to me. Frequency of what? The number?
Is this for work or for school? Are you interested in the index itself or is the index just a means to an end? If it's for work and if the index is just a means to an end, you might want to consider a tool like Lucene instead of implementing this yourself.
Go for a custom object that holds everything about a word: a String (probably in upper case), a list of document numbers where it was found, and a parallel list of occurrence counts per document. Store these in a TreeMap to keep an alphabetical list. Note that you could use ArrayLists while parsing but convert them to int arrays for compact serialization and fast logic.
Things to watch out for: what counts as punctuation, how to handle different case, plurals, and trash words to discard.
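One way those concerns might be handled before a token reaches the index. The stop-word list and the stripping rule here are made-up assumptions for illustration, not the poster's actual rules:

```java
import java.util.*;

public class TokenNormalizer {
    // Tiny illustrative stop-word ("trash word") list; a real one would be longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Returns the normalized term, or null if the token should be discarded.
    static String normalize(String raw) {
        // Crude punctuation handling: strip anything that is not a letter or digit.
        String t = raw.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase();
        if (t.isEmpty() || STOP_WORDS.contains(t)) {
            return null; // pure punctuation or a trash word
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello,")); // punctuation stripped, lower-cased
        System.out.println(normalize("the"));    // stop word -> discarded (null)
    }
}
```

Plural handling (stemming) is deliberately left out here; it is a harder problem than case-folding and usually done with a dedicated stemmer.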
Now, onward to logic! With the document-number lists from two words A and B, you can find A AND B, A OR B, and A NOT B.
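Those three boolean queries map directly onto the standard Set bulk operations. A small sketch, with the two document-number sets invented for the example:

```java
import java.util.*;

public class BooleanQueries {
    public static void main(String[] args) {
        // Documents containing word A and word B (made-up data).
        Set<Integer> docsA = new TreeSet<>(Arrays.asList(1, 2, 3, 5));
        Set<Integer> docsB = new TreeSet<>(Arrays.asList(2, 4, 5));

        Set<Integer> and = new TreeSet<>(docsA);
        and.retainAll(docsB);   // intersection: A AND B

        Set<Integer> or = new TreeSet<>(docsA);
        or.addAll(docsB);       // union: A OR B

        Set<Integer> not = new TreeSet<>(docsA);
        not.removeAll(docsB);   // difference: A NOT B

        System.out.println(and); // [2, 5]
        System.out.println(or);  // [1, 2, 3, 4, 5]
        System.out.println(not); // [1, 3]
    }
}
```

The copies before each operation matter: retainAll, addAll, and removeAll all mutate the set they are called on.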
Man, that takes me back: my first commercial program did this in Forth around 1979. A great way to expand your programming skills, but as Junilu said, Lucene can do that.
Well, this is what I've come up with so far:
A HashMap<docNum, Map<term, termFreq>> and a TreeMap<term, Set<docNum>>. The docFreqs can be read off as set.size() in the values of the second map.
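A sketch of how those two maps might be populated while parsing. The method name record and the sample data are assumptions for illustration only:

```java
import java.util.*;

public class TwoMapIndex {
    // docNum -> (term -> frequency of the term in that document)
    static Map<Integer, Map<String, Integer>> termFreqs = new HashMap<>();
    // term -> set of documents containing it, kept alphabetical by TreeMap
    static Map<String, Set<Integer>> postings = new TreeMap<>();

    // Call once per token occurrence while parsing document docNum.
    static void record(int docNum, String term) {
        termFreqs.computeIfAbsent(docNum, d -> new HashMap<>())
                 .merge(term, 1, Integer::sum); // per-document term frequency
        postings.computeIfAbsent(term, t -> new TreeSet<>())
                .add(docNum);                   // which documents contain the term
    }

    public static void main(String[] args) {
        record(1, "index"); record(1, "index"); record(2, "index");
        System.out.println(postings.get("index").size());  // docFreq = set.size()
        System.out.println(termFreqs.get(1).get("index")); // termFreq in doc 1
    }
}
```

computeIfAbsent and merge (both on java.util.Map since Java 8) keep the bookkeeping to one line each; that is the usual answer to "how do I assign objects as Map values".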
@Tony, I'm trying to work on that. But this is my first time with Maps and I don't really know how to assign objects as values. If possible, could someone please explain with an example?
@Winston, Sorry for that... frequency of the term in docNo.
@Junilu, Yes, it's for school, so I can't use Lucene.
@William, Working on that. And I've already taken care of the terms (punctuation, stemming and special characters). So true! Writing this program is really improving my Java.
I have to implement a project that is somewhat similar to yours. I have 1,000 documents and want to parse them and convert them into tokens. Which parser should I use? I am not asking for exact code, but a hint on how to proceed will do for me.