
Languages used in Hadoop Implementation and Real World Problems

 
Meenal Abhijit Borkar
Greenhorn
Posts: 2
Hello Friends,

I am an academic and am studying Hadoop for my class presentations. As I am new to Hadoop, I seek your expert opinions on the following two aspects:
1. Which languages are popularly used in industry to implement Hadoop: Python or Java?
2. Are there any sources where I can find real business scenarios/examples of Hadoop being used in industry, and where the data sets are available?

I need this information to convince my students during my course presentation. Kindly help me.

Thanking you in anticipation.

Regards,
Meenal
 
chris webster
Bartender
Posts: 2407
For real-world business use cases, you should probably start with the examples provided by major Hadoop vendors like Cloudera or Hortonworks:

http://www.cloudera.com/content/cloudera/en/our-customers.html

http://hortonworks.com/customers/

Hadoop itself is implemented mainly in Java as far as I know, and there is a fairly low-level Java API which a lot of people have used for Hadoop programming. However, it is often easier to use higher-level APIs such as Cascading (for Java) or alternative languages like Pig, Hive SQL and others. Pig is a Hadoop-based scripting language, and scripts are converted internally into a series of MapReduce tasks. Hive is a way to manage your data in HDFS as if it were held in relational database tables, and you can use SQL to manipulate your data, which is much easier than trying to do this in Java/MapReduce. As with Pig, the SQL is converted into MapReduce tasks underneath. Hadoop is also the foundation for other tools such as the NoSQL database HBase.
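To give a flavour of what that low-level Java API looks like, here is a minimal word-count sketch (the classic example); the class names and input/output paths are just placeholders, so treat it as an illustration rather than production code:

// Minimal sketch of the low-level Java MapReduce API (word count).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even for something this trivial you need a fair amount of boilerplate, which is why people often reach for Pig or Hive (where the same job is a couple of lines of script or SQL) instead.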

However, Hadoop v.2+ now provides the YARN resource manager, which allows you to plug in alternative processing engines e.g. Tez or Spark instead of the older MapReduce engine. Using these engines can speed up your Hive SQL or Pig jobs significantly. Apache Spark is a distributed processing engine that can run independently or on top of Hadoop's YARN engine. Spark has APIs for Scala, Python and Java, and provides a powerful high-level coding paradigm that many people are starting to see as an alternative to traditional Java/MapReduce with/without Hadoop. One of the nice things about Spark is that you can code your whole data-processing pipeline using the same language/API and a consistent programming model, instead of having to switch between e.g. Java, Pig and Hive SQL to complete different stages in the processing.
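For comparison, here is a minimal sketch of the same word count using Spark's Java API (this assumes a Spark 2.x-style API where flatMap returns an iterator; the app name and paths are placeholders). It can run locally or be submitted to YARN:

// Minimal sketch of word count with Spark's Java API.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The whole pipeline is expressed in one language/API
        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}

The point is not the word count itself, but that the read/transform/write stages all live in the same program, instead of being split across Java, Pig and Hive scripts.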

Many other languages and tools (e.g. ETL and BI tools) provide interfaces of various kinds to Hadoop, and it seems to be getting easier to use Hadoop as a distributed data store while using lots of other tools to access and manipulate your data in Hadoop, even if you are not executing your code directly on Hadoop's processing engine.



 
chris webster
Bartender
Posts: 2407
PS: Welcome to JavaRanch!
 
amit punekar
Ranch Hand
Posts: 544
Hello,
Re: datasets, you can find a collection here - http://ibmhadoop.challengepost.com/details/data

Regards,
Amit
 
Meenal Abhijit Borkar
Greenhorn
Posts: 2
Thank you Chris, Amit and Gartner. I will certainly follow up with new doubts. Thank you once again.
 