Ted Dunning

Greenhorn
since Aug 16, 2011

Recent posts by Ted Dunning

APL was generally state of the art with respect to matrix computation for its time. The state of the art has progressed a good bit since then, of course.

8 years ago
Owen is correct. I do always recommend R.

In fact, I also use R quite a lot. It is really valuable as a reference implementation or for exploratory use. For production, I very much prefer Mahout.

Another interesting related project is Vowpal Wabbit. Its focus is even narrower than Mahout's, and it is much harder to integrate VW models into a working system. See http://hunch.net/~vw/ for more information.
Mahout is a machine learning and recommendation framework that focuses on scalability rather than breadth of algorithm selection.

See http://mahout.apache.org/ for more information. Or read the book!

For more interactive discussions, send email to user@mahout.apache.org
Remember that Mahout is an open source project and, as such, doesn't really have a roadmap. What does exist is a set of desires that contributors have. As the contributors feel a need for something, it happens.

This means that you guys can influence the future of Mahout quite heavily.

To your point, however, it is true that the clustering code is rather inflexible about input. So is the Naive Bayes classifier family. The recommender framework is much more flexible (Sean recently added a Cassandra interface with very little work, for instance). The SGD classifier family is all about in-memory APIs, which makes it pretty easy to interface with.

The primary limitation right now on how the clustering and Naive Bayes systems accept data is that there is very little consensus on how that should work. Your input would be very helpful here.

Try emailing dev@mahout.apache.org to start a discussion around what you need.
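
To illustrate the point above about in-memory APIs being easy to interface with, here is a minimal sketch of an online logistic regression trained by SGD. This is a generic illustration in plain Python, not Mahout's actual API: the class name, the `train` and `classify` methods, the learning rate, and the toy data are all assumptions made up for the example. The idea it shows is that the caller just hands over one in-memory example at a time, so any data source that can produce a label and a feature vector can drive it.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical binary logistic regression trained one example at a time."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.learning_rate = learning_rate

    def classify(self, features):
        """Return the estimated probability that the example belongs to class 1."""
        score = sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-score))

    def train(self, label, features):
        """Take one SGD step on a single (label, features) example."""
        error = label - self.classify(features)
        for i, x in enumerate(features):
            self.weights[i] += self.learning_rate * error * x


def make_example(rng):
    """Toy data source (an assumption): label is 1 when the feature is positive."""
    x = rng.uniform(-1.0, 1.0)
    label = 1 if x > 0 else 0
    return label, [1.0, x]  # constant bias term plus one feature


rng = random.Random(42)
model = OnlineLogisticRegression(num_features=2)

# The training loop only needs in-memory (label, features) pairs, so the
# examples could just as well come from a file, a queue, or a database cursor.
for _ in range(2000):
    label, features = make_example(rng)
    model.train(label, features)
```

After training, `model.classify([1.0, 0.8])` should return a probability above 0.5 and `model.classify([1.0, -0.8])` one below 0.5.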
I have personally used Mahout a good bit for click prediction, and the company that I helped with a lot of that work has gone on to use Mahout for all kinds of additional functions where they needed systems to make autonomous decisions.
Please note that the examples from the book are available online and we will keep these examples up to date with the latest Mahout. Some of the examples are integrated into Mahout, so we won't be able to change Mahout too much without updating those.
I would definitely second what Sean says. What we have in Mahout is a collection of algorithms that we know will scale. Mahout is not about being the broadest collection of learning algorithms. It is about having scalable algorithms that are reliable in scaling situations.

If you want a broad selection of algorithms with much less attention paid to scalability and deployability, then use R.

If you think we really, really need some algorithm in Mahout and you know how it should be implemented to be scalable, please do come over to dev@mahout.apache.org and let's talk about it!
Not really. We had an implementation in Pig contributed a long time ago, but nobody really had time to bring it up to snuff.

We do have a very reasonable parallel implementation of LDA, however, which is often considered the next-generation algorithm after PLSI.

We did not talk much about the integration of Solr and Lucene with Mahout. You are correct that this is a good form of data for processing with Mahout, but I think that in actual production cases, it isn't really a large part of the mix. Perhaps it should be, and perhaps more explanation would help with that.

On the other hand, another forthcoming Manning book, "Taming Text", definitely does talk quite a bit about this integration.
The book doesn't go into massive detail on the algorithms. Some of the algorithms can be a bit scary, but the fundamental ideas are really pretty straightforward.

To use the stuff in the book, you need to know the general shape of the algorithms, but not the details. We talk a lot about shape in the book.
The Stanford course is broader and more fundamental than our book. We present hands-on practical knowledge; they present lots of theory and are much less judgmental about what works in practice. We don't present everything, by any stretch, but we do present practical methods that get the job done.

It is probably good to have both.