posted 4 years ago
There seem to be two key areas to address when looking at Hadoop. First, the nuts and bolts of just getting it up and running on various hardware. It sounds like EMR might address a real need here, since much of this is being done in the cloud anyway. Second, the 'why' of doing this in the first place. To this end, I'm looking forward to perusing this book further. There are lots of ways to attack a problem, and Hadoop provides the framework to do the heavy lifting. However, we still need to understand how to write the algorithms to get the knowledge we are looking for. In this regard, you've listed using Mahout for document classification with naive Bayes in your chapter on Text processing. Can you describe roughly how you approach this problem with Hadoop and Mahout?