• Post Reply Bookmark Topic Watch Topic
  • New Topic

Building an inverted index-logic  RSS feed

 
Greenhorn
Posts: 11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have a collection of around 1500 documents. I parse through each document and extract tokens. These tokens are stored in an hashmap(as key) and the total number of times they occur in the collection(i.e. frequency) is stored as the value.

I have to extend this to build an inverted index. That is, the term(key)| number of documents it occurs it-->DocNo|Frequency in that document. For exmple,





I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I thought of making value a 2D array. So the term would be the key and the value(i.e 2D array) would store the docId and termFreq.

Please let me know if my logic is correct. Thanks!!
 
Marshal
Posts: 56600
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
That still sounds like something you can do with a Map. If you look for Map in the Java Tutorials, there is a counting example. It is obviously simpler than your problem, but it might suggest some principles about counts.
 
Bartender
Posts: 3271
82
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You could create a class to hold a DocNum and TermFreq value (called say DocFreq). Then have a Map keyed on the term and with a List of DocFreq objects as a value.
Then, for any given term, The size of the List gives you the DocFreq and each object in the List gives you a DocNum and TermFreq value.
 
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
F Merchant wrote:I have a collection of around 1500 documents. I parse through each document and extract tokens. These tokens are stored in an hashmap(as key) and the total number of times they occur in the collection(i.e. frequency) is stored as the value.
I have to extend this to build an inverted index. That is, the term(key)| number of documents it occurs it-->DocNo|Frequency in that document.

Doesn't sound like an "inverted" index to me; just a different level of frequency count.

Also, "DocNo|Frequency in that document" doesn't make much sense to me. Frequency of what? The number?

Winston
 
Sheriff
Posts: 11494
180
Android Debian Eclipse IDE IntelliJ IDE Java Linux Mac Spring Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Is this for work or for school? Are you interested in the index itself or is the index just a means to an end? If it's for work and if the index is just a means to an end, you might want to consider a tool like Lucene instead of implementing this yourself.
 
Author and all-around good cowpoke
Rancher
Posts: 13078
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Go for a custom object that holds everything about a word - String (in upper case probably), list of document numbers where found, parallel list of incidence counts per document. Store these in a TreeMap to keep alphabetic list. Note that you could use ArrayLists while parsing but convert to int[] for compact serialization and fast logic.

Things to watch out for - what is punctuation, how to handle different case, plurals, trash words to discard.

Now, onward to logic! With the document number lists from two words A, B you can find A and B, A or B, and A not B.

Man that takes me back, my first commercial program did this in Forth about 1979 - great way to expand your programming skills, but as Junilu said, Lucene can do that.

Bill
 
F Merchant
Greenhorn
Posts: 11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks everyone!!

Well this is what I've come up with so far..
HashMap<docnum, Map<term,termFreq>> and a TreeMap<term, Set<docnum>>.The docFreqs can be read off as set.size in the values of the second map.

@Tony, I'm trying to work on that. But this si my first time with Maps and I don't really know how to assign objects as values. If possible could someone please explain with an example!
@Winston, Sorry for that....Frequency of term in docNO
@Junilu, Yes its for school so I can't use Lucene.
@William, Working on that. And I've already taken care of the terms(punctuation, stemming and special characters). So true!! Writing this program is really improving my Java
 
Campbell Ritchie
Marshal
Posts: 56600
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Are you sure those maps are right? I am sure you are correct to use maps, but that solution looks so complicated I am getting suspicious about it.
 
Greenhorn
Posts: 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I am to implement a project and it is somewhat similar to yours. I do have 1000 documents and want to parse those and convert into tokens. Which parser should i use? I am not asking for exact code but a hint on how to proceed will do for me.
 
Rancher
Posts: 42972
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Unless this is an academic project, have you considered using Lucene?
 
Rahul Rlg
Greenhorn
Posts: 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is a academic project. And I just need a start to approach this problem.

BY now, i am thinking to use SAX parser to parse 1500 documents. This is fine. But how could i remove wiki markup from those documents. Any hints?
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!