
Hadoop newbie having knowledge of Java & Linux

 
Manish Hardasmalani
Greenhorn
Posts: 1
Hello Friends, Greetings,

I am new to Hadoop but have in-depth knowledge of Java and Linux. Since learning the whole of Hadoop would be a mammoth task, I would like to understand which areas of Hadoop I should concentrate on, and where I can use my existing knowledge of Java and Linux. Which areas are mandatory for me to learn?

Looking forward to your advice.

Many Thanks,
MH
 
chris webster
Bartender
Posts: 2407
Don't even try to learn the "whole of Hadoop". These days Hadoop is really a huge collection of open-source projects, and you can't learn them all. In fact, I would say don't even try to install the individual packages yourself, because there are lots of mutual incompatibilities and installing Hadoop from scratch is just a world of pain. Instead, go for a bundled Hadoop distribution, which you can download for free from a provider such as Cloudera or Hortonworks. These companies offer a bundle of popular Hadoop-based tools, pre-installed and configured, which you can download as a "sandbox" VM and run in VirtualBox or VMware Player.

If you go for Cloudera, then you might like to try Udacity's online course Intro to Hadoop and MapReduce which allows free access to the course materials so you can work through it on your own. I think this course uses a VM based on the free Cloudera Express bundle.

Alternatively, download the Hortonworks Sandbox which is another free Hadoop bundle in a VM, but also includes lots of introductory tutorials to help you get started with Hadoop.

Work through the basic tutorials, e.g. using core tools like HDFS, Hue, Hive, and Pig. Then, when you understand a bit about Hadoop, look at application coding, e.g. in Java. But bear in mind that writing pure Java MapReduce programs is no longer the preferred approach to coding for Hadoop. There are lots of higher-level libraries, such as Cascading, and tools such as Cloudera's Impala SQL engine, which are designed to let you code your business logic at a more abstract level instead of having to break everything down into MapReduce steps, which are hard to write and often do not perform particularly well on larger workloads.
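To give a concrete feel for why pure Java MapReduce is considered verbose, here is the classic word-count job written against the org.apache.hadoop.mapreduce API. It is a minimal sketch along the lines of the standard Hadoop tutorial example, so treat the details as illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; args are input and output paths
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even for this trivial counting problem you need a Mapper, a Reducer, a driver, and the Writable wrapper types. That boilerplate is exactly what libraries like Cascading and engines like Impala try to hide from you.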

And if you want to go beyond MapReduce and see the current state of the art, have a look at Apache Spark with Python, Scala, or Java: a high-performance distributed computing engine that runs stand-alone (e.g. on your local PC or a cluster) or on top of Hadoop's YARN engine.
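For comparison, here is the same word count in Spark's Java API. This is a sketch assuming the Spark 2.x JavaRDD API with a local master; on a real cluster you would drop setMaster and launch the job with spark-submit instead:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" runs on all local cores, handy for trying things in a sandbox VM
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]);
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
        .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
        .reduceByKey((a, b) -> a + b);                                 // sum per word

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}

The whole job is a handful of chained transformations, and the Scala and Python versions are shorter still, which is a big part of why Spark has become the preferred engine for new development.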
 