Hadoop on Different Platforms

 
Ranch Hand
Posts: 119
Are there any references or benchmarks for Hadoop running on different platforms and OSes?

Thanks,
Mohamed
 
Ranch Hand
Posts: 221
Scala Python Java
Hadoop runs on Linux only.
Your mileage may vary.
Some customers get really good performance on Cisco UCS and HP DL380 servers, among others.
Hadoop uses the notion of data locality, so the closer the data is to the node where the task is running, the better the performance you get.
Depending on the application, SSDs might have a positive impact versus classic HDDs.
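If you're curious what locality looks like from the client side, the HDFS API can tell you which DataNodes hold each block of a file. A minimal sketch (the file path is just a placeholder, and the Configuration picks up whatever cluster settings are on your classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path -- point this at a real file in your cluster
        Path file = new Path("/data/input/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // getHosts() lists the DataNodes holding a replica of this block
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}

The task scheduler uses exactly this kind of information to try to run each map task on one of the hosts returned by getHosts().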

The MapR Enterprise Grade Distribution for Hadoop holds several records on the most popular Hadoop benchmarks.
 
Mohamed El-Refaey
Ranch Hand
Posts: 119
Thanks, Carlos. So how much does data locality improve performance?
 
author
Posts: 15
Mohamed,

The data locality optimization is one of the key techniques that allow Hadoop to scale and perform so well.

The basic idea is that if you have a cluster with large amounts of data, you really don't want to be moving that data around the cluster in order to process it. So when a MapReduce job is scheduled, the framework determines which pieces of data (blocks) need to be processed and on which machines they are located, and then starts tasks to process the data on those hosts. By default Hadoop keeps 3 copies of each block, so in the best case, if you have 10 blocks to process (usually it's many more), the framework will schedule tasks on the hosts where replicas of those blocks reside.
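To make that concrete, here is a rough sketch of a job driver. Note that nothing in it is locality-specific: FileInputFormat derives roughly one input split per HDFS block, and the scheduler then tries to run each map task on a node holding a replica of that block. The input/output paths are placeholders, and the default (identity) mapper and reducer are left in place, since locality is decided by the framework, not by your job code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalityDemoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "locality demo");
        job.setJarByClass(LocalityDemoDriver.class);

        // Identity mapper/reducer are enough for this sketch; the default
        // TextInputFormat hands each mapper (LongWritable offset, Text line) pairs
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The input is split per HDFS block; the scheduler places each map
        // task on (or near) a DataNode that holds a replica of its block
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If you look at the job counters afterwards, the "Data-local map tasks" and "Rack-local map tasks" counters show how often the scheduler actually achieved locality.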

Obviously, as the data size increases this becomes more difficult; if you have 50 machines but 20,000 blocks to process, scheduling becomes much more complex. But by aiming to process data where it resides, a lot of data transfer and I/O is avoided.

Garry
 
Mohamed El-Refaey
Ranch Hand
Posts: 119
Fantastic. Thanks, Garry, for the detailed explanation.
 