Do I need to learn Hadoop first to learn Apache Spark?

 
Greenhorn
Posts: 4
Which is better to learn: Hadoop alone, or Hadoop together with Spark, Scala, and Storm? I'm confused; please advise.
 
Bartender
Posts: 1210
Let's get Scala out of the way first. Unlike the other three, which are data processing technologies, Scala is a general-purpose programming language.
You don't need to learn Scala to work with any of those technologies.
That said, Scala is still an interesting language with unique concepts and approaches, and I think you should learn it just to expand your mind to different possibilities.

Storm is also a bit of a different beast. Its focus is near-real-time, low-latency processing of streaming data as it arrives, because certain kinds of data need to be acted on immediately.
This is in contrast to batch processing, where data is first stored somewhere and then processed later in bulk.

For example, a meteorologist wants to know right now the probability of a storm (no pun intended!) coming in based on current weather sensor readings.
She can't wait 24 hours to collect data and then bulk process it, because then it'll be too late.
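That contrast can be sketched in a few lines of plain Python. This is just a conceptual toy, not the Storm or Spark API; the threshold and sensor readings are made up:

```python
def on_arrival(reading, alert, threshold=90):
    """Streaming style: react to each reading the moment it arrives."""
    if reading > threshold:
        alert(reading)  # low latency: act now, don't wait for the rest

def process_batch(stored, threshold=90):
    """Batch style: data was stored first, processed later in bulk."""
    return [r for r in stored if r > threshold]

store = []
alerts = []
for r in [42, 95, 10, 99]:        # readings trickling in over time
    on_arrival(r, alerts.append)  # streaming path fires immediately
    store.append(r)               # batch path just persists for later

print(alerts)                # alerts raised as the data arrived
print(process_batch(store))  # same answer, but only after the fact
```

Both paths find the same anomalies; the streaming path just finds them while they still matter.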

That leaves us with Hadoop vs. Spark as batch processing solutions. Both also have near-real-time streaming solutions of their own (with Spark's being the stronger), but that isn't their focus.
The thing is, there's really no "vs." here at all: Hadoop is a big ecosystem, and components of it like HDFS for storage and YARN for cluster resource allocation are used by Spark deployments.
So in practice you can't use Spark in the enterprise without also using some components of the Hadoop ecosystem.
Where Spark excels is performance, simply because it keeps intermediate results in memory, whereas Hadoop's MapReduce writes them to disk between stages.
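That in-memory vs. on-disk difference is easy to picture with a two-stage pipeline. The sketch below is a deliberate simplification in plain Python (not Hadoop or Spark code): one path spills the intermediate result to a temp file and reads it back, the other hands it straight to the next stage:

```python
import json
import os
import tempfile

def stage1(data):
    return [x * x for x in data]  # "map"-like step: square each value

def stage2(intermediate):
    return sum(intermediate)      # "reduce"-like step: total them up

data = [1, 2, 3, 4]

# MapReduce-style: spill the intermediate result to disk, then read it
# back for the next stage (durable between stages, but slower).
path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
with open(path, "w") as f:
    json.dump(stage1(data), f)
with open(path) as f:
    disk_result = stage2(json.load(f))

# Spark-style: keep the intermediate result in memory between stages.
mem_result = stage2(stage1(data))

print(disk_result, mem_result)  # same answer, fewer disk round trips
```

Multiply that disk round trip by every stage of a long job and you can see where Spark's speed advantage comes from.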

I've explained what each one focuses on. Now it's up to you to decide exactly what your data processing goals are and then choose the tools.
Much more important than learning the tools is learning data processing algorithms like clustering and prediction.
If you have no specific goal and you have enough time, learn all of them.
 