I have a PDF file which I parsed into text using the PDFBox API. Now I want to extract finance-related terms/words and their frequency from that text file. I also googled for this and found that we can use GATE/OpenNLP, but I didn't find any concrete example. Please help.
As Bear said, you need to determine what makes a word "financial". That has nothing to do with Java or writing code - it has to do with your brain and/or your specs.
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
You should use an int array of size 1 instead of Integer as the value. The reason is that Integer objects are immutable and have no simple increment method - what you have written would require a new object creation every time. Incrementing the int[] at index 0 does not.
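To make the int[] suggestion concrete, here is a minimal sketch of a word counter whose Map values are one-element int arrays. The class name WordCounter, the method countWords and the split on non-letters are my own assumptions for illustration, not the original poster's code, and it needs Java 8 or later for computeIfAbsent.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class WordCounter {

    // Counts each word, using a one-element int[] as a mutable counter
    // so that incrementing never replaces the value stored in the Map.
    static Map<String, int[]> countWords(String text) {
        Map<String, int[]> counts = new TreeMap<>();
        // Assumption: a "word" is any run of the letters a-z after lowercasing.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue; // split() yields "" when the text starts with a non-letter
            }
            // Create the counter on first sight, then increment it in place.
            counts.computeIfAbsent(word, k -> new int[1])[0]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = countWords("Bond yield rises as bond prices fall");
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue()[0]);
        }
    }
}
```

The int[] is allocated once per distinct word; every later occurrence just bumps index 0.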
Assuming you have that parsing and counting code working, look at the output - do you see "financial" terms?
rastogi payam wrote: Yes, I can see financial terms in the output along with general English terms. Now how can I differentiate between these two categories?
As Bear and Fred have said, that's for you to decide. If this is anything more than just an exercise then the design specification should tell you the answer. If it is just an exercise, then presumably you can just select your own terms.
Edit: Having just reread your post, maybe that isn't what you meant. You say you can see the financial terms, so presumably you know what they are and you just want to know how to separate them.
In that case you just need to put a check before the bit of code that adds the terms to the Map in your code. If it's a financial term you add it, otherwise you don't. You could put all the financial terms in a Collection of some sort and then just check if the collection contains the term.
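A rough sketch of that check, assuming the financial terms live in a HashSet. The term list here is made up purely for illustration (yours would come from your specification), and the names FinancialTermCounter and countFinancialTerms are hypothetical. Note that this only handles single-word terms; multi-word terms such as "interest rate" would need a different matching approach.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class FinancialTermCounter {

    // Made-up sample list; replace it with the terms your spec defines as "financial".
    static final Set<String> FINANCIAL_TERMS =
            new HashSet<>(Arrays.asList("bond", "yield", "equity", "dividend"));

    // Counts only those words that appear in the given set of terms.
    static Map<String, int[]> countFinancialTerms(String text, Set<String> terms) {
        Map<String, int[]> counts = new TreeMap<>();
        // Assumption: a "word" is any run of the letters a-z after lowercasing.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!terms.contains(word)) {
                continue; // the check before adding to the Map: skip non-financial words
            }
            counts.computeIfAbsent(word, k -> new int[1])[0]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The bond yield rose while the bond price fell";
        Map<String, int[]> counts = countFinancialTerms(text, FINANCIAL_TERMS);
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue()[0]);
        }
    }
}
```

A HashSet is a reasonable choice for the Collection because contains() is a constant-time lookup on average, whereas a List would be scanned for every word.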