Accumulo, Zookeeper and Hadoop Integration

 
Greenhorn
Posts: 1
I have an existing Hadoop cluster and would like to install Accumulo, Mahout, and some other tools on a separate machine and integrate them into this environment. I can probably stand up some ZooKeeper VMs if necessary. Also, when I install ZooKeeper by itself (RHEL 6.4, yum install zookeeper), it pulls in a copy of Hadoop and seems to want it running on the box, even though I already have namenodes/datanodes on another set of boxes. Installing on a single machine is cake; integrating the pieces across machines, however, seems to be quite an undertaking.

Here is what I have gleaned thus far:

1. Accumulo NEEDS ZooKeeper?
2. ZooKeeper seems to want to keep data in memory on a znode (does it EVER write it to HDFS?).
3. Using MapReduce/Hadoop works great in batch mode.
4. I have thought about (and tried) installing Cloudera/Hortonworks in this environment ... Cloudera only supports RHEL 6.2; Hortonworks seems to work OK so far.

I am thinking of installing three ZooKeeper VMs and having them point to the Hadoop cluster, and then having my Accumulo/Mahout VM point to the ZooKeeper ensemble. Is this the best way? Will this ultimately use the Hadoop cluster? Do I need to run a base Hadoop service on all of these boxes to make it all communicate?

Any/all help in this matter is greatly appreciated.

Environment: High Performance Computing infrastructure, VMs/Boxes running RHEL 6.4, all using a private network

-Bob-
 
Greenhorn
Posts: 1
1. Yup, ZooKeeper is essential: it keeps bootstrapping state for Accumulo, and Accumulo relies heavily on its locking functionality to coordinate distributed events.
2. As you noted, ZooKeeper keeps its data in memory; it persists that data to the local filesystem, not to HDFS. ZooKeeper does not depend on Apache Hadoop's DFS. When more than one ZooKeeper server runs together, the servers are redundant on their own, without an external distributed filesystem.
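To put a number on that redundancy, here is a minimal Python sketch of the majority-quorum arithmetic a ZooKeeper ensemble relies on: the ensemble stays available as long as a strict majority of its servers is up. The function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


# Print the arithmetic for a few common ensemble sizes.
for n in (1, 2, 3, 5):
    print(f"{n} server(s): quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Note that two servers tolerate no more failures than one, which is why ensembles are usually sized with an odd number of servers.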

You can certainly use a single ZooKeeper server; the level of redundancy and availability your application requires is up to you. ZooKeeper isn't a very heavy service, so if you have separate nodes, it would be good to run 3 servers. You can easily run it alongside nodes which are also tasktrackers and/or datanodes. As far as the location of each service goes, as long as Accumulo can reach the ZooKeepers, namenode, and datanodes over the same network, you should be fine.
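One rough way to confirm that reachability from the Accumulo box is a plain TCP connect test. The Python sketch below only checks that a port accepts connections; it says nothing about whether the service behind it is healthy. The hostnames in the commented usage are hypothetical, and the ports are assumptions (2181 is ZooKeeper's default client port; the namenode and datanode ports depend on your Hadoop configuration).

```python
import socket
from typing import Dict, Tuple


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(services: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Map each service name to whether its (host, port) accepted a connection."""
    return {name: can_reach(host, port) for name, (host, port) in services.items()}


# Example usage (hypothetical hostnames, assumed ports):
# results = check_services({
#     "zookeeper-1": ("zk1.example.internal", 2181),
#     "namenode": ("nn.example.internal", 8020),
#     "datanode-1": ("dn1.example.internal", 50010),
# })
# for name, ok in results.items():
#     print(name, "reachable" if ok else "NOT reachable")
```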

Also, you don't need to run a datanode and tasktracker process on every node; however, you'll most often see this, sans a node or two to run the jobtracker and namenode. It heavily depends on the kind of workload you intend to process.

A word to the wise if you do run Accumulo in VMs: keep in mind that Accumulo is very sensitive to time. Virtualization can skew clocks, so be cognizant of the actual system resources underneath your VMs.
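If you want to keep an eye on this, a small Python sketch like the following can summarize the clock spread across hosts. It assumes you have already collected one epoch-seconds reading per host at roughly the same instant (how you collect them is up to you); the function name and the hostnames in the example are hypothetical, and what counts as too much skew for Accumulo is not something this sketch decides.

```python
from typing import Dict, Tuple


def max_clock_skew(readings: Dict[str, float]) -> Tuple[float, str, str]:
    """Return (skew_seconds, slowest_host, fastest_host) for per-host clock readings."""
    if len(readings) < 2:
        raise ValueError("need readings from at least two hosts")
    slowest = min(readings, key=readings.get)  # host with the earliest clock
    fastest = max(readings, key=readings.get)  # host with the latest clock
    return readings[fastest] - readings[slowest], slowest, fastest


# Example with made-up readings (epoch seconds) from three hypothetical VMs.
skew, slow, fast = max_clock_skew({"vm-a": 100.0, "vm-b": 100.5, "vm-c": 99.0})
print(f"max skew {skew:.1f}s between {slow} and {fast}")
```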
 