
Is MapReduce not used for any new code because of a faster option, Apache Spark?

 
Monica Shiralkar
Ranch Hand
Apache Spark can be used for fast processing of huge volumes of data, and is many times faster than MapReduce. For new code, why would someone write MapReduce instead of Spark? I think Spark is preferred over MapReduce for every use case. So does that mean MapReduce should never be implemented for any project now, unless it is legacy code that has to be maintained? Is MapReduce no longer used for new implementations?

thanks
 
Winston Gutkowski
Bartender
Monica Shiralkar wrote: I think Spark is preferred over MapReduce for every use case. So does that mean MapReduce should never be implemented for any project now, unless it is legacy code that has to be maintained? Is MapReduce no longer used for new implementations?

Well, this article would suggest that it is a later development, but whether it's fully backward-compatible is another question. Since it wasn't developed by Apache themselves, I'd suspect not unless it's explicitly stated in the Apache docs.

But not knowing anything much about either, I couldn't say much more.

Winston
 
chris webster
Bartender
Apache Spark is a completely separate system from Hadoop, although it can run on a Hadoop cluster via the YARN manager. It can also run on its own cluster or share cluster resources intelligently with an Apache Cassandra database. Spark is an in-memory distributed processing engine that can read/write data on many different storage platforms. Hadoop has its own distributed storage (HDFS), which also provides the basis for Hadoop's Hive SQL layer and its HBase column-family datastore. So Hadoop provides distributed storage and processing, but the processing is based on MapReduce. Hadoop programs have usually been written in Java or Pig, which get converted to MapReduce tasks.
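To make the MapReduce model concrete, here is a minimal word-count sketch in plain Java (no Hadoop involved; the class and variable names are just illustrative). It walks through the same three phases a real Hadoop job goes through: a map phase emitting (word, 1) pairs, a shuffle phase grouping pairs by key, and a reduce phase summing each group.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: split each input line into (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());

        // Shuffle phase: group the pairs by key, as Hadoop does
        // between the map and reduce stages.
        Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: sum the counts for each word.
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                reduced.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(List.of("big data big cluster", "big data"));
        System.out.println(counts);  // {big=3, cluster=1, data=2}
    }
}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data across the network, which is where the disk I/O overhead comes from that Spark's in-memory approach avoids.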

Spark is written in Scala and provides excellent APIs in Scala and Python. There is also a Java API but it is kind of clunky as Java is not so good at supporting the functional programming techniques that Spark is built on. If you want to adopt Spark instead of Hadoop MapReduce, then you'll probably want/need to learn Scala in order to make best use of it. All the major commercial Hadoop suppliers are backing Spark and integrating it into their bundled Hadoop platforms, and it has gained a lot of interest over the last couple of years. I've been prototyping with Spark over the last year and I cannot imagine why anybody would choose MapReduce for a new project when Spark is so much more expressive, powerful and flexible. I also feel the same way about Scala - I never want to go back to Java!

On the other hand, Spark is much less mature than Hadoop, and there is a real shortage of tools, skills and documentation for Spark, especially around admin and management. If you're using one of the integrated Hadoop packages like Cloudera, then you might be happy to turn to Cloudera for support with these issues. But if you are running your own Hadoop cluster without support, you might be concerned about the extra challenges and potential risks involved in using Spark as well. It will depend on your resources - skills, external support, finance, extra hardware/VM capacity etc.

Of course there are a lot of Java MapReduce applications out there (Hadoop MapReduce has been around for 10 years), so there will still be a need for people who can work with these, but I think you're right that these will increasingly be seen as legacy code, while new Big Data applications will probably be more likely to use Spark, on top of or alongside Hadoop, or in combination with other distributed data storage.

Finally, it's worth remembering that there are other alternatives to MapReduce or Spark, e.g. using streaming and in-memory processing. I haven't used these myself but you might want to Google some of these.

(EDIT: 17/01/2016)
  • Cascading is a mature and well-documented Java library that provides a more flexible high-level abstraction layer on top of MapReduce, so you can model your data processing as a pipeline (similarly to Spark), but I think it still gets turned into MapReduce tasks underneath.
  • Apache Flink is "a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams", so I guess it provides some of the same features as Spark, but I don't know anything else about this.
chris webster
Bartender
Winston Gutkowski wrote:...But not knowing anything much about either, I couldn't say much more.

Winston

In case you're curious, Scala guru Dean Wampler gives a good overview of Why Spark Is the Next Top (Compute) Model for Big Data.
     
Monica Shiralkar
Ranch Hand
chris webster wrote: If you want to adopt Spark instead of Hadoop MapReduce, then you'll probably want/need to learn Scala in order to make best use of it

Thanks. Why should one prefer Scala when one can use familiar Java, given that Spark code can be written in Java, Scala and Python? Why not Java?
     
chris webster
Bartender
Monica Shiralkar wrote:
If you want to adopt Spark instead of Hadoop MapReduce, then you'll probably want/need to learn Scala in order to make best use of it

Thanks. Why should one prefer Scala when one can use familiar Java, given that Spark code can be written in Java, Scala and Python? Why not Java?

Because Scala makes it much easier to write Spark code, and you can still interoperate with Java if you have to. Compare the Scala and Java code on the Spark samples to get a feel for the difference.
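To illustrate the verbosity gap being discussed (this is not actual Spark code, just plain Java collections, and the names are made up for the example): before Java 8 lambdas, a simple map operation in Spark's Java API required an anonymous inner class, whereas Scala expresses the same thing as a one-liner like words.map(_.length).

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class VerbosityDemo {
    // Pre-Java-8 style: the anonymous inner class that early
    // Spark Java code was full of.
    static List<Integer> lengthsOldStyle(List<String> words) {
        Function<String, Integer> length = new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };
        return words.stream().map(length).collect(Collectors.toList());
    }

    // Java 8 lambda style: much closer to the Scala equivalent.
    static List<Integer> lengthsLambdaStyle(List<String> words) {
        return words.stream().map(s -> s.length()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = List.of("spark", "mapreduce", "scala");
        System.out.println(lengthsOldStyle(words));     // [5, 9, 5]
        System.out.println(lengthsLambdaStyle(words));  // [5, 9, 5]
    }
}
```

Both methods do exactly the same work; the difference is purely how much ceremony the language forces on you, which is the complaint about the early Spark Java API.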

This guy from Cloudera explains why he chose to use Scala rather than Java or Python for working with Spark.
     
Paul Clapham
Sheriff
Good question, brought out some good answers. Have a cow for the good question!
     
Monica Shiralkar
Ranch Hand
thanks
     