Map-Reduce and Homogeneous vs Heterogeneous Data sets

Posts: 5
What is the difference between homogeneous and heterogeneous data sets, and why is simple MapReduce not suitable for relational algebra? Also, why was Map-Reduce-Merge developed? A proper explanation would be highly appreciated. Thanks in advance, as no one was able to answer this on Quora.
Posts: 2407
Scala Python Oracle Postgres Database Linux
Not sure I understand what you're asking, but here are a few thoughts.

Homogeneous data-sets would probably be the kind of thing that fits nicely into a relational schema i.e. "rectangular" data with a common format, predictable columns, where the data in each column is all the same data-type, and so on.

Heterogeneous data is likely to be data where the structure is unpredictable so you can't (easily) enforce a relational schema, and/or where the data itself might be of different types e.g. text, images, etc. Various NoSQL databases offer alternatives here e.g. MongoDB stores JSON documents with no fixed schema.
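To make the distinction concrete, here is a minimal Python sketch. The sample records and the `is_rectangular` helper are invented for illustration only; the check simply asks whether every record has the same fields with the same data-type in each field, which is roughly what a relational schema would enforce.

```python
# Hypothetical sample data, made up purely for illustration.
homogeneous = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": 41},
]
heterogeneous = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "tags": ["x", "y"]},               # different fields
    {"id": 3, "name": "Cy", "age": "unknown"},   # same fields, different type
]


def is_rectangular(records):
    """Return True if all records share the same fields and per-field types."""
    if not records:
        return True
    columns = set(records[0])
    types = {col: type(records[0][col]) for col in columns}
    for rec in records:
        if set(rec) != columns:
            return False  # a column is missing or extra
        if any(type(rec[col]) is not types[col] for col in columns):
            return False  # a column holds mixed data-types
    return True


print(is_rectangular(homogeneous))
print(is_rectangular(heterogeneous))
```

The first list would load straight into a table; the second is the kind of data a document store is more comfortable with.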

Hadoop's Distributed File System (HDFS) can handle all of this data, because it's just a file system and doesn't care what's in each file. However, most real applications need to work with structured data of some kind, and you need at least a key and a value in order to run MapReduce after all. Hadoop's Hive database allows you to define a rectangular table-like structure for files (e.g. CSV) that you have loaded into HDFS, and you can then run SQL queries (no updates) against these tables. The SQL commands are translated into MapReduce steps by the Hive query engine. Alternatively, HBase is a column-family database that sits on top of HDFS, so you have other ways to organise your data, depending on your requirements.
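The key/value requirement is easier to see in a toy example. The following is a plain-Python sketch of the map, shuffle and reduce steps running in a single process; it is not Hadoop's API, and the CSV layout (`city,amount`) and all function names are assumptions made up for illustration. Conceptually it computes what a SQL query like `SELECT city, SUM(amount) ... GROUP BY city` would ask for.

```python
from collections import defaultdict


def map_phase(lines, mapper):
    """Apply the mapper to every input line, yielding (key, value) pairs."""
    for line in lines:
        yield from mapper(line)


def shuffle(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each (key, list_of_values) group."""
    return {key: reducer(key, values) for key, values in groups}


# Hypothetical CSV lines with the layout: city,amount
lines = ["London,10", "Paris,5", "London,7"]


def sales_mapper(line):
    city, amount = line.split(",")
    yield city, int(amount)  # the key is the city, the value is the amount


def sum_reducer(key, values):
    return sum(values)


totals = reduce_phase(shuffle(map_phase(lines, sales_mapper)), sum_reducer)
print(totals)
```

Without a key there is nothing for the shuffle step to group on, which is why even "unstructured" files need some structure imposed on them before MapReduce can do useful work.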

Most relational operations are based on some kind of key, e.g. PK look-ups, joins on foreign keys, etc. But unlike an RDBMS, Hadoop isn't optimised for random-access reads, i.e. it typically has to scan (map/filter) all your data in a given file or Hive table to find particular records. Also, the usual way to join in MapReduce (a reduce-side join) relies on sorting both inputs by the join key before the matching records can be merged, and this is another expensive operation if you are dealing with large volumes of data.
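Here is a rough single-process sketch of a join done the MapReduce way, to show where the sort comes in. The two tiny tables, the "C"/"O" tags and the function names are all invented for illustration; a real job would spread the sort across the cluster during the shuffle, which is where the cost lies.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tables, made up for illustration.
customers = [(1, "Ann"), (2, "Bob")]                       # (customer_id, name)
orders = [(101, 2, 30.0), (102, 1, 12.5), (103, 2, 8.0)]   # (order_id, customer_id, total)


def map_customers(rows):
    """Emit each customer keyed by the join key, tagged with its source."""
    for customer_id, name in rows:
        yield customer_id, ("C", name)


def map_orders(rows):
    """Emit each order keyed by the join key, tagged with its source."""
    for order_id, customer_id, total in rows:
        yield customer_id, ("O", (order_id, total))


def reduce_side_join(customers, orders):
    tagged = list(map_customers(customers)) + list(map_orders(orders))
    # The "shuffle": sort by join key, then by tag so customers ("C")
    # arrive before their orders ("O") within each key.
    tagged.sort(key=lambda kv: (kv[0], kv[1][0]))
    joined = []
    for customer_id, group in groupby(tagged, key=itemgetter(0)):
        names = []
        for _, (tag, payload) in group:
            if tag == "C":
                names.append(payload)
            else:
                order_id, total = payload
                for name in names:
                    joined.append((customer_id, name, order_id, total))
    return joined


for row in reduce_side_join(customers, orders):
    print(row)
```

Note that every record from both inputs gets read, tagged and sorted, even if you only wanted the orders for one customer. An RDBMS with an index on the key would avoid most of that work.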

This means a simple MapReduce approach is inefficient for most relational operations, unless you are simply reading all the data from a file in no particular order. Of course, you can still implement these SQL-style operations in MapReduce (as in Hive, for example), but it tends to be quite slow. Most serious tasks will require more than a single MapReduce phase, which is also slow because Hadoop's default MapReduce engine writes the intermediate data out to files between MapReduce phases. Various options are available to speed this up e.g. the alternative Tez processing engine, or the Impala SQL engine, which do a lot more in-memory processing to speed up your task execution. Hive SQL is slow, but Impala is pretty fast and may be a reasonable option for interactive SQL queries.
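The cost of chaining phases can be sketched like this. It is a toy illustration in plain Python, not Hadoop code: the two phases, the file name `phase1.jsonl` and the word-count task are assumptions chosen for the example. The point is only that phase two cannot start until phase one has written its full output to disk and that output has been read back in.

```python
import json
import os
import tempfile
from collections import Counter


def phase_one(lines, out_path):
    """First job: count words, then write the result out as an intermediate file."""
    counts = Counter(word for line in lines for word in line.split())
    with open(out_path, "w") as out:
        for word, n in sorted(counts.items()):
            out.write(json.dumps([word, n]) + "\n")


def phase_two(in_path):
    """Second job: read the intermediate file back and group words by count."""
    words_by_count = {}
    with open(in_path) as src:
        for line in src:
            word, n = json.loads(line)
            words_by_count.setdefault(n, []).append(word)
    return words_by_count


def run_pipeline(lines):
    """Run both phases, passing data between them via a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        intermediate = os.path.join(tmp, "phase1.jsonl")  # hypothetical file name
        phase_one(lines, intermediate)
        return phase_two(intermediate)


print(run_pipeline(["a b a", "b c a"]))
```

Engines that keep intermediate results in memory are, in effect, skipping the write-then-read step between the two functions.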

Another option is Apache Spark, which is a general purpose processing engine for distributed systems, and offers a rapidly growing set of tools for reading/writing/transforming data from a variety of data sources using SQL and data-frames (Spark SQL) and/or functional programming (map/reduce etc). Spark turns your process into a DAG of operations which it optimises before execution, and it runs the task in memory as far as possible. The Spark API (Scala, Python, Java) encapsulates all this behind a nice coherent abstraction layer that means you can achieve your goals with far less code in Spark than with traditional MapReduce in Java.
Noman Saleem
Posts: 5
Great answer, thanks!