Grant Ingersoll

Author
Greenhorn
since Jan 03, 2013

Recent posts by Grant Ingersoll

OpenNLP has a new life at the Apache Software Foundation: http://opennlp.apache.org/
Congrats and thanks for the great questions!
It is probably closer to ready for company documentation or an intranet, but it is still a non-trivial exercise. What do you have in place for search? I'd probably start there first.
Hi Paul,

The Frankenstein example in the first chapter is really just a toy to get people thinking about the problem space. Chapter 8 contains a system that is a few levels up, but still not production ready, IMO. I would suggest that the concepts and basic principles are applicable to a web-based engine, but a whole lot more engineering and capability needs to go into a system to make it effective in that area. It is a bit closer to ready if you are working at a smaller scale, but you still have a lot of work to do: the example really only handles simple fact-based questions and only returns a window of text around the candidate answer.

As for performance at web scale, you will often need to leverage some type of distributed text analysis pipeline up front to handle the incoming documents.

HTH,
Grant
Hi Qunfeng,

Oftentimes in text analytics you are trying to extract information from the text so that it fits into other, structured ways of looking at it. Many of the statistical models used elsewhere are applicable to text once you have the proper feature selection and representation of the text in place. The book is, in many ways, a description of how to do these things.
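
To make "representation of the text" a bit more concrete, here is a minimal sketch of one common choice, a bag-of-words count vector, in Python. It is not taken from the book; the function names (`tokenize`, `bag_of_words`), the regex tokenizer, and the two sample sentences are all assumptions made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Turn raw documents into count vectors over one shared, sorted vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary term.
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

# Hypothetical sample documents, invented for this sketch.
docs = ["The monster fled.", "The creature fled the village."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['creature', 'fled', 'monster', 'the', 'village']
print(vectors)  # [[0, 1, 1, 1, 0], [1, 1, 0, 2, 1]]
```

Once text is in this kind of fixed-length numeric form, it can be handed to the same statistical models used on other structured data. Real systems usually add steps this sketch skips, such as stopword removal, stemming, and term weighting.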

As an aside, I often find it ironic that people call text "unstructured", since it's very highly structured. We just haven't been that great at getting computers to understand that structure.

-Grant
Hi,

It's a great question, and much of it is answered throughout the "core" chapters as well as in chapter 9, which discusses new and upcoming text applications.

Here are a few examples of things that can be built using the concepts in the book:
1. Sentiment analysis -- is this text positive or negative about a product/person/idea
2. Trend detection -- identifying what is trending in the news or in social media
3. Recommendation engine -- i.e. people who bought this also bought that.
4. Automatically identifying and extracting people, places, etc. from text
5. Classifying news into buckets like politics, sports, etc.

There are of course many others. At the end of the day, many of these techniques, esp. search, give you a really fast ranking engine, so any problem that needs a ranking of the top X items is a good candidate for search. Clustering, classification, and named entity recognition are really good at helping you better organize unstructured content. They are also quite helpful in applications that have some component of text but aren't purely text based, like customer profile segmentation, etc.
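
As a rough illustration of the "ranking of the top X items" idea, here is a small Python sketch that scores documents against a query with a simple TF-IDF sum and returns the best matches. This is not how Solr or Lucene score documents; the function names, the `log(1 + n/df)` weighting, and the sample documents are assumptions chosen only to keep the example short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def top_k(query, docs, k=2):
    """Return up to k (score, doc_index) pairs, best first, using a simple TF-IDF sum."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    query_terms = set(tokenize(query))
    scored = []
    for i, counts in enumerate(doc_counts):
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in doc_counts if term in c)
            if df:
                # Rarer terms get a larger weight; this exact formula is an assumption.
                score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((score, i))
    # Highest score first; ties broken by document order.
    return sorted(scored, key=lambda s: (-s[0], s[1]))[:k]

# Hypothetical sample documents, invented for this sketch.
docs = ["solr search engine", "sports news today", "search search ranking"]
results = top_k("search ranking", docs)
print([i for _, i in results])  # [2, 0]
```

A real search engine gets its speed from an inverted index rather than scanning every document as this sketch does, but the shape of the problem is the same: score candidates against a query and keep the top few.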
I like to think of Taming Text as an engineer's practical guide to text-based applications like search and natural language processing, with examples using popular open source tools like Apache Solr and Lucene. In many ways, we (the authors) wanted to write the book we all wish we had had way back when we started in this field. Our target audience is people who are new to text-based applications or new to the tools we introduce. If you are looking for a hard-core explanation of the math underlying this stuff, this book is likely not for you. However, if you want a good sense of the concepts and issues involved in building a real text-based application, we hope you will find the book useful.
Thanks, Mohamed, it's nice to be here. I'm looking forward to the discussion!