How to extract finance related terms/words from a given text using Java

 
Ranch Hand
Posts: 47
Hi All,

I have a PDF file which I parsed into text using the PDFBox API. Now I want to extract finance-related terms/words and their frequency from that text file. I googled this and found that GATE or OpenNLP could be used, but I didn't find any concrete example. Please help.
 
Sheriff
Posts: 67754
What criteria are you using to determine if a term is "financial" or not?
 
Author and all-around good cowpoke
Posts: 13078
As an initial exercise, try extracting all words and their frequency. Consider how to handle UPPER/lower case and plurals.

Natural Language Processing can get pretty tricky, so it is best to start simple.

Bill
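A minimal sketch of that starting exercise, assuming plain English text. The class name, the `normalize` helper, and the "strip a trailing s" plural rule are illustrative assumptions rather than anything posted in this thread, and the plural rule is deliberately crude (it will mangle words such as "business" less often than "bus", but it is not a real stemmer).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordFrequency {

    // Hypothetical helper: lower-cases a token and strips one trailing "s"
    // as a very rough plural rule ("bonds" -> "bond"). Words ending in "ss"
    // and words of three letters or fewer are left alone.
    static String normalize(String token) {
        String word = token.toLowerCase(Locale.ROOT);
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            word = word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Splits on anything that is not a letter and counts each normalized word.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            counts.merge(normalize(token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Bonds and stocks: a BOND, a stock, and Stocks."));
    }
}
```

A TreeMap is used only so the output comes out in alphabetical order; a HashMap would do if order does not matter.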
 
rastogi payam
Ranch Hand
Posts: 47
Hi William,
Here is the code which I have written to extract and count the words from the text. What should be my next step? Please advise.
 
lowercase baba
Posts: 13091
As Bear said, you need to determine what makes a word "financial". That has nothing to do with Java or writing code - it has to do with your brain and/or your specs.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
For performance's sake, your Map should use an int array of size 1 instead of Integer as the value. The reason is that Integer objects have no simple increment method, so what you have written would require creating a new object every time. Incrementing the int[] at index 0 does not.

Assuming you have that parsing and counting code working, look at the output - do you see "financial" terms?

Bill
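The int[] suggestion above could look roughly like the sketch below. The class and method names are made up for illustration, and the original poster's counting code is not shown in this thread, so this is not a patch to it. Whether the difference is measurable depends on the size of the text.

```java
import java.util.HashMap;
import java.util.Map;

class IntArrayCounter {

    // Counts words using a one-element int[] as a mutable counter,
    // so updating a count never replaces the value stored in the map.
    static Map<String, int[]> count(String[] words) {
        Map<String, int[]> counts = new HashMap<>();
        for (String word : words) {
            int[] slot = counts.get(word);
            if (slot == null) {
                slot = new int[1];   // allocated once per distinct word
                counts.put(word, slot);
            }
            slot[0]++;               // in-place increment
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = count(new String[] {"bond", "stock", "bond"});
        counts.forEach((word, slot) -> System.out.println(word + " = " + slot[0]));
    }
}
```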


 
Sheriff
Posts: 28394
The java.util.concurrent.atomic.AtomicInteger class has methods to increment (etc.) the integer it contains.
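A short sketch of the AtomicInteger variant, using the standard `Map.computeIfAbsent` and `AtomicInteger.incrementAndGet` methods. The class and method names are hypothetical. Note that the HashMap here is still not thread-safe; AtomicInteger is used only as a convenient mutable counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class AtomicCounter {

    // Stores one AtomicInteger per distinct word and increments it in place.
    static Map<String, AtomicInteger> count(String[] words) {
        Map<String, AtomicInteger> counts = new HashMap<>();
        for (String word : words) {
            // Create the counter on first sight of the word, then bump it.
            counts.computeIfAbsent(word, key -> new AtomicInteger()).incrementAndGet();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, AtomicInteger> counts = count(new String[] {"bond", "stock", "bond"});
        counts.forEach((word, n) -> System.out.println(word + " = " + n.get()));
    }
}
```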
 
rastogi payam
Ranch Hand
Posts: 47
Yes, I can see financial terms in the output along with general English terms. Now how can I differentiate between these two categories?
 
Rancher
Posts: 3742

rastogi payam wrote: Yes, I can see financial terms in the output along with general English terms. Now how can I differentiate between these two categories?


As Bear and Fred have said, that's for you to decide. If this is anything more than just an exercise then the design specification should tell you the answer. If it is just an exercise, then presumably you can just select your own terms.

Edit: Having just reread your post, maybe that isn't what you meant. You say you can see the financial terms, so presumably you know what they are and you just want to know how to separate them.
In that case you just need to put a check before the bit of code that adds the terms to the Map. If it's a financial term you add it, otherwise you don't. You could put all the financial terms in a Collection of some sort and then just check whether the collection contains the term.
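The check described above might be sketched like this. The term list is a made-up placeholder (which terms count as financial is still something the spec has to define), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class FinancialTermFilter {

    // Placeholder list for illustration only; a real list comes from the spec.
    static final Set<String> SAMPLE_TERMS =
            new HashSet<>(Arrays.asList("bond", "equity", "dividend", "interest"));

    // Counts only those words that appear in the given set of terms.
    static Map<String, Integer> countFinancialTerms(String text, Set<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("[^\\p{L}]+")) {
            String word = token.toLowerCase(Locale.ROOT);
            if (terms.contains(word)) {          // the check before adding to the Map
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The bond pays interest; the Bond matures.";
        System.out.println(countFinancialTerms(text, SAMPLE_TERMS));
    }
}
```

A HashSet keeps the `contains` check cheap even for a long term list. This sketch matches single words only; multi-word terms would need a different approach.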
 
Greenhorn
Posts: 4
First you need a dictionary that contains the financial terms, and then you need to classify the document using NLP.
This link may be helpful to you:

http://www.eur.nl/edsc/english/databases/financial_databases/

After that you need to write the Java code that applies the dictionary.
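As a rough sketch of the dictionary step, the code below loads terms from any Reader. The format (one term per line, blank lines and lines starting with "#" ignored) is an assumption made for this example and says nothing about the format of the resource linked above. The class name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class TermDictionary {

    // Reads one term per line, lower-cased and trimmed.
    // Blank lines and "#" comment lines are skipped (assumed format).
    static Set<String> load(Reader source) {
        return new BufferedReader(source).lines()
                .map(line -> line.trim().toLowerCase(Locale.ROOT))
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // A StringReader stands in for a file here; a FileReader would work the same way.
        Set<String> terms = load(new StringReader("# sample list\nBond\nequity\n\nDividend\n"));
        System.out.println(terms);
    }
}
```

The resulting Set can then be used for a `contains` check on each word while counting.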
 