
Mahout data access future

 
Robert-Zsolt Kabai
Greenhorn
Posts: 3
Hi,

I'm wondering how data access and interoperability will evolve in the near future. As authors, you may have some information, a vision, or an opinion about that.
While the current data access methods are fine, the only way for Mahout algorithms to use data from other Hadoop projects that have table storage (Hive, Cassandra, HBase) is a series of data extractions and transformations, which is quite painful because it requires multiple HDFS writes. This is because I can't simply tell Mahout to use one of these tables. First we extract data from Hive/Cassandra/HBase and write it to a CSV file on HDFS, then we convert that CSV data into the vector format that Mahout algorithms can consume. That is a lot of I/O work, which costs a lot of time and resources.
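To make the conversion step concrete, here is a minimal stand-in sketch of the CSV-to-vector transformation described above. It parses one CSV record (as extracted from Hive/Cassandra/HBase) into a plain dense `double[]`; the class and method names are illustrative, not Mahout's actual API, where the result would typically become a `VectorWritable` in a SequenceFile.

```java
import java.util.Arrays;

// Hypothetical sketch: turn one CSV record into a dense vector.
// Stands in for the "CSV -> Mahout vector" conversion step; this is
// not the real Mahout input pipeline, just the core transformation.
public class CsvToVector {

    /** Parse a comma-separated line of numeric fields into a dense vector. */
    public static double[] parseLine(String csvLine) {
        String[] fields = csvLine.split(",");
        double[] vector = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            vector[i] = Double.parseDouble(fields[i].trim());
        }
        return vector;
    }

    public static void main(String[] args) {
        // One record extracted to CSV form.
        double[] v = parseLine("1.0, 2.5, 0.0, 4.2");
        System.out.println(Arrays.toString(v)); // prints [1.0, 2.5, 0.0, 4.2]
    }
}
```

The pain point in the post is that this whole step, plus the CSV extraction before it, has to round-trip through HDFS instead of reading the tables directly.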

Do you see these operations and the dataflow between these tools evolving to become more efficient? After all, we have data storage tools and data analytics tools (like Mahout), and the need for efficient data flow between them is obvious. I've seen an incubator project, HCatalog, recently started to standardize table data and help interoperability. Do you think this may be the short- or long-term answer to the question?

Thank you for your answers.
Robert

 
Ted Dunning
Greenhorn
Posts: 11
Remember that Mahout is an open source project and, as such, doesn't really have a roadmap. What does exist is a set of desires that contributors have. As the contributors feel a need for something, it happens.

This means that you guys can influence the future of Mahout quite heavily.

To your point, however, it is true that the clustering code is rather inflexible about input, as is the Naive Bayes classifier family. The recommender framework is much more flexible (for instance, Sean recently added a Cassandra interface with very little work). The SGD classifier family is all about in-memory APIs, which makes it pretty easy to interface with.
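The "in-memory API" point can be sketched like this: an SGD-style learner consumes one example at a time from ordinary Java memory, so any data source that can produce a feature array can feed it, with no fixed on-disk input format. The class below is an illustrative online logistic-regression update in plain Java, not Mahout's actual SGD classes.

```java
// Stand-in sketch of an in-memory, one-example-at-a-time SGD interface.
// Illustrative only; Mahout's SGD family has its own classes and signatures.
public class SgdSketch {
    private final double[] weights;
    private final double learningRate;

    public SgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Predicted probability of the positive class. */
    public double predict(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < features.length; i++) {
            dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One online update from a single labeled example (label is 0 or 1). */
    public void train(int label, double[] features) {
        double error = label - predict(features);
        for (int i = 0; i < features.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
    }
}
```

Because `train` takes an in-memory array, hooking such a learner up to Hive, Cassandra, or HBase is just a matter of iterating rows and producing arrays, which is why this family is easier to interface with than the file-format-bound clustering code.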

The primary limitation right now on how the clustering and Naive Bayes systems accept data is that there is very little consensus on how that should work. Your input would be very helpful here.

Try emailing dev@mahout.apache.org and start a discussion around what you need.
 