I'm wondering how data access and interoperability will evolve in the near future. As authors, you may have some information, a vision, or an opinion about that.
While the current data access methods are fine, the only way for Mahout algorithms to use data from other Hadoop projects that implement some form of table storage (Hive, Cassandra, HBase) is a series of data extractions and transformations, which is quite painful because it requires multiple HDFS writes. This is, of course, because I can't simply point Mahout at one of those tables. First we extract the data from Hive/Cassandra/HBase and write it to a CSV file on HDFS, then we convert that CSV data into the vector format that Mahout algorithms can consume. That is a lot of I/O work, which means a lot of time and resources.
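To make the second step concrete, here is a minimal sketch of the kind of CSV-to-vector conversion I mean. This is not Mahout's actual API (Mahout expects `VectorWritable`s in Hadoop SequenceFiles); the function name and label-column convention are my own illustration of the transformation:

```python
import csv
import io

def csv_rows_to_vectors(csv_text, label_column=0):
    """Turn each CSV row into a list of floats, dropping the label
    column -- a stand-in for the vectorization Mahout needs before
    any of its algorithms can run on the data."""
    vectors = []
    for row in csv.reader(io.StringIO(csv_text)):
        features = [float(v) for i, v in enumerate(row) if i != label_column]
        vectors.append(features)
    return vectors

# Example: two rows exported from a table, first column is a row key.
sample = "a,1.0,2.0\nb,3.0,4.0\n"
print(csv_rows_to_vectors(sample))  # [[1.0, 2.0], [3.0, 4.0]]
```

Even this toy version shows the problem: the data has already been written to HDFS once as CSV, and now every row must be read, parsed, and written again in a second format before any analytics can start.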
Do you see these operations and the dataflow between these tools evolving to become more efficient? After all, we have data storage tools and data analytics tools (like Mahout), and the need for an efficient data flow between them is obvious. I've seen that an incubator project named HCatalog was recently started to somewhat standardize table data and help interoperability. Do you think this may be the short- or long-term answer to the question?
Thank you for your answers.