Apache Spark can be used for fast processing of huge volumes of data, and it is many times faster than MapReduce. If it is new code, why would someone write it in MapReduce instead of Spark? I think Spark is preferred over MapReduce for every use case. So does that mean MapReduce should never be implemented for any project now, unless it is some legacy code that has to be maintained? Is MapReduce no longer used for new implementations?
Monica Shiralkar wrote: I think Spark is preferred over MapReduce for every use case. So does that mean MapReduce should never be implemented for any project now, unless it is some legacy code that has to be maintained? Is MapReduce no longer used for new implementations?
Well, this article would suggest that it is a later development, but whether it's fully backward-compatible is another question. Since it wasn't developed by Apache themselves, I'd suspect not unless it's explicitly stated in the Apache docs.
But not knowing anything much about either, I couldn't say much more.
Apache Spark is a completely separate system from Hadoop, although it can run on a Hadoop cluster via the YARN manager. It can also run on its own cluster, or share cluster resources intelligently with an Apache Cassandra database. Spark is an in-memory distributed processing engine that can read and write data on many different storage platforms. Hadoop has its own distributed storage (HDFS), which also provides the basis for Hadoop's Hive SQL layer and its HBase column-family datastore. So Hadoop provides distributed storage and processing, but the processing is based on MapReduce. Hadoop programs have usually been written in Java or Pig, which get converted to MapReduce tasks.
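To make the MapReduce model concrete, here is a minimal sketch of its three phases (map, shuffle, reduce) using plain Scala collections. This is not Hadoop code: a real Hadoop job implements Mapper and Reducer classes and runs distributed across a cluster, and the object and method names below are just illustrative.

```scala
// Sketch of the MapReduce processing model on a single machine.
// A real Hadoop MapReduce job would distribute each phase over a cluster.
object MapReduceSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // Map phase: turn each line into (word, 1) pairs.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.toLowerCase.split("\\s+"))
           .filter(_.nonEmpty)
           .map(word => (word, 1))

    // Shuffle phase: group all pairs sharing the same key (word).
    val shuffled: Map[String, Seq[(String, Int)]] = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each word.
    shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }
}
```

For example, `MapReduceSketch.wordCount(Seq("to be or", "not to be"))` yields a map with "to" and "be" counted twice each.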
Spark is written in Scala and provides excellent APIs in Scala and Python. There is also a Java API, but it is rather clunky, as Java is not so good at supporting the functional programming techniques that Spark is built on. If you want to adopt Spark instead of Hadoop MapReduce, you'll probably want or need to learn Scala in order to make the best use of it. All the major commercial Hadoop suppliers are backing Spark and integrating it into their bundled Hadoop platforms, and it has gained a lot of interest over the last couple of years. I've been prototyping with Spark over the last year, and I cannot imagine why anybody would choose MapReduce for a new project when Spark is so much more expressive, powerful and flexible. I feel the same way about Scala: I never want to go back to Java!
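The functional style Spark is built on can be sketched with ordinary Scala collections. In real Spark you would start from `sc.textFile(...)` and the operations would run distributed over RDDs; the local `reduceByKey` helper below is a hypothetical stand-in for the RDD operation of the same name, written here so the example runs without a Spark cluster.

```scala
// Spark-RDD-style word count expressed with plain Scala collections.
// The chained flatMap/map/reduceByKey pipeline mirrors how the same job
// reads in Spark's Scala API; only the data here is local, not distributed.
object SparkStyleSketch {
  // Local stand-in for Spark's reduceByKey: merge all values per key with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reduceByKey(
      lines.flatMap(_.split("\\s+"))
           .filter(_.nonEmpty)
           .map(word => (word, 1))
    )(_ + _)
}
```

The point of the comparison: the whole job is one short expression, whereas the equivalent Java MapReduce program needs separate Mapper and Reducer classes plus driver boilerplate.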
On the other hand, Spark is much less mature than Hadoop, and there is a real shortage of tools, skills and documentation for Spark, especially around admin and management. If you're using one of the integrated Hadoop packages like Cloudera, then you might be happy to turn to Cloudera for support with these issues. But if you are running your own Hadoop cluster without support, you might be concerned about the extra challenges and potential risks involved in using Spark as well. It will depend on your resources - skills, external support, finance, extra hardware/VM capacity etc.
Of course there are a lot of Java MapReduce applications out there (Hadoop MapReduce has been around for 10 years), so there will still be a need for people who can work with these, but I think you're right that these will increasingly be seen as legacy code, while new Big Data applications will probably be more likely to use Spark, on top of or alongside Hadoop, or in combination with other distributed data storage.
Finally, it's worth remembering that there are other alternatives to MapReduce and Spark, e.g. engines based on streaming and in-memory processing. I haven't used these myself, but you might want to Google them.
Cascading is a mature and well-documented Java library that provides a more flexible high-level abstraction layer on top of MapReduce, so you can model your data processing as a pipeline (similar to Spark), but I think it still gets turned into MapReduce tasks underneath.
Apache Flink is "a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams", so I guess it provides some of the same features as Spark, but I don't know anything else about it.
Agreed, Spark is the newer technology; the MapReduce craze is over, for now!
Is MapReduce the only member of the Hadoop ecosystem that has been affected by this, and is thus being used less for new projects? Are other members like Hive, HBase, HDFS, etc. still being used as much as before?