Map-Reduce and Homogeneous vs Heterogeneous Data sets

 
Noman Saleem
Greenhorn
Posts: 5
What is the difference between homogeneous and heterogeneous data sets, and why is simple MapReduce not suitable for relational algebra? Also, why was Map-Reduce-Merge developed? A proper explanation would be highly appreciated. Thanks in advance; nobody was able to answer this on Quora.
 
chris webster
Bartender
Posts: 2407
Not sure I understand what you're asking, but here are a few thoughts.

Homogeneous data-sets would probably be the kind of thing that fits nicely into a relational schema i.e. "rectangular" data with a common format, predictable columns, where the data in each column is all the same data-type, and so on.

Heterogeneous data is likely to be data where the structure is unpredictable so you can't (easily) enforce a relational schema, and/or where the data itself might be of different types e.g. text, images, etc. Various NoSQL databases offer alternatives here e.g. MongoDB stores JSON documents with no fixed schema.
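To make the difference concrete, here's a toy illustration (plain Python, all the field names invented) of a homogeneous "rectangular" record set versus heterogeneous schema-less documents of the kind MongoDB stores:

    # Homogeneous: every record has the same columns and types,
    # so it maps cleanly onto a relational table.
    homogeneous_rows = [
        {"id": 1, "name": "Alice", "age": 34},
        {"id": 2, "name": "Bob",   "age": 41},
    ]

    # Heterogeneous: each document has its own shape, so there is no
    # single schema you could enforce up front.
    heterogeneous_docs = [
        {"id": 1, "name": "Alice", "tags": ["admin", "beta"]},
        {"id": 2, "title": "Server log", "text": "...", "attachments": 3},
        {"id": 3, "image": "photo.png", "resolution": [1920, 1080]},
    ]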

Hadoop's Distributed File System (HDFS) can handle all of this data, because it's just a file system and doesn't care what's in each file. However, most real applications need to work with structured data of some kind, and you need at least a key and a value in order to run MapReduce after all. Hadoop's Hive database allows you to define a rectangular table-like structure for files (e.g. CSV) that you have loaded into HDFS, and you can then run SQL queries (no updates) against these tables. The SQL commands are translated into MapReduce steps by the Hive query engine. Alternatively, HBase is a column-family database that sits on top of HDFS, so you have other ways to organise your data, depending on your requirements.
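For example, a simple aggregation query over a CSV-backed Hive table ends up as a single map and reduce step, something like this hand-rolled Python sketch (the query, column positions and file layout are made up here; Hive's generated plan is far more elaborate):

    # Hand-written sketch of what a query like
    #   SELECT country, COUNT(*) FROM users GROUP BY country;
    # roughly boils down to as one MapReduce step.
    from collections import defaultdict

    def map_phase(csv_lines):
        # Emit (key, value) pairs: the GROUP BY column becomes the key.
        for line in csv_lines:
            fields = line.rstrip("\n").split(",")
            country = fields[2]          # assume country is the 3rd column
            yield (country, 1)

    def reduce_phase(pairs):
        # The framework groups pairs by key before reduce runs;
        # here we simulate that grouping in memory.
        groups = defaultdict(int)
        for key, value in pairs:
            groups[key] += value
        return dict(groups)              # {country: row_count}

    lines = ["1,Alice,US", "2,Bob,DE", "3,Carol,US"]
    print(reduce_phase(map_phase(lines)))   # {'US': 2, 'DE': 1}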

Most relational operations are based on some kind of key, e.g. PK look-ups, joins on foreign keys, etc., but unlike an RDBMS, Hadoop isn't optimised for random-access reads, i.e. it typically has to do a scan (map/filter) of all your data in a given file or Hive table to find particular records. Also, relational joins require a sort to be performed before you can merge the joined records, and this is another expensive operation in MapReduce if you are dealing with large volumes of data.
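As a rough illustration of why joins hurt, here's a toy reduce-side join in Python: both inputs get mapped to (key, tagged record) pairs, everything is shuffled and sorted by key, and only then can the reducer merge the matching rows. The tables are invented for the example; on a real cluster the shuffle/sort step is the expensive, network- and disk-heavy part:

    # Toy reduce-side join: customers joined to orders on customer_id.
    from itertools import groupby

    customers = [(1, "Alice"), (2, "Bob")]
    orders    = [(1, "book"), (1, "lamp"), (2, "chair")]

    def map_side():
        # Tag each record with its source so the reducer can tell them apart.
        for cid, name in customers:
            yield (cid, ("C", name))
        for cid, item in orders:
            yield (cid, ("O", item))

    # Shuffle/sort: group all tagged records by the join key.
    shuffled = sorted(map_side(), key=lambda kv: kv[0])

    def reduce_side(pairs):
        for cid, group in groupby(pairs, key=lambda kv: kv[0]):
            records = [v for _, v in group]
            names = [v for tag, v in records if tag == "C"]
            items = [v for tag, v in records if tag == "O"]
            for name in names:
                for item in items:
                    yield (cid, name, item)

    print(list(reduce_side(shuffled)))
    # [(1, 'Alice', 'book'), (1, 'Alice', 'lamp'), (2, 'Bob', 'chair')]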

This means a simple MapReduce approach is inefficient for most relational operations, unless you are simply reading all the data from a file in no particular order. Of course, you can still implement these SQL-style operations in MapReduce (as in Hive, for example), but it tends to be quite slow. Most serious tasks will require more than a single MapReduce phase, which is also slow because Hadoop's default MapReduce engine writes the intermediate data out to files between MapReduce phases. Various options are available to speed this up e.g. the alternative Tez processing engine, or the Impala SQL engine, which do a lot more in-memory processing to speed up your task execution. Hive SQL is slow, but Impala is pretty fast and may be a reasonable option for interactive SQL queries.

Another option is Apache Spark, which is a general purpose processing engine for distributed systems, and offers a rapidly growing set of tools for reading/writing/transforming data from a variety of data sources using SQL and data-frames (Spark SQL) and/or functional programming (map/reduce etc). Spark turns your process into a DAG of operations which it optimises before execution, and it runs the task in memory as far as possible. The Spark API (Scala, Python, Java) encapsulates all this behind a nice coherent abstraction layer that means you can achieve your goals with far less code in Spark than with traditional MapReduce in Java.
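For instance, the kind of join and aggregation sketched above shrinks to a few lines of PySpark (a sketch only; the SparkSession setup, file paths and column names are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("example").getOrCreate()

    customers = spark.read.csv("hdfs:///data/customers.csv", header=True)
    orders    = spark.read.csv("hdfs:///data/orders.csv", header=True)

    # Spark builds a DAG for the join + aggregation and optimises it
    # before anything runs; execution stays in memory where possible.
    result = (orders
              .join(customers, on="customer_id")
              .groupBy("customer_name")
              .agg(F.count("*").alias("order_count")))

    result.show()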
 
Noman Saleem
Greenhorn
Posts: 5
Great answer, thanks!
 
ratnesh singh
Greenhorn
Posts: 3
Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks, such as data-processing jobs for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied to relational operations like joins. That is why Map-Reduce was extended into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by the map and reduce modules. This new model can express relational algebra operators as well as implement several join algorithms. MapReduce is designed for datasets which are too large for existing databases, and Map-Reduce-Merge brings more useful techniques to that environment. In short, plain MapReduce can express relational algebra, but not efficiently, and Map-Reduce-Merge was introduced to address that.
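Roughly, the merge idea can be sketched like this (a toy illustration in Python, not the actual Map-Reduce-Merge API, and it assumes unique keys on each side): two datasets each go through their own map/reduce lineage that leaves them partitioned and sorted by key, and the extra merge phase then walks the two sorted outputs together, the way a sort-merge join does.

    # Toy merge phase: two already-reduced, key-sorted outputs are
    # combined without another full shuffle, sort-merge-join style.
    def merge(left_sorted, right_sorted):
        i = j = 0
        while i < len(left_sorted) and j < len(right_sorted):
            lk, lv = left_sorted[i]
            rk, rv = right_sorted[j]
            if lk == rk:
                yield (lk, lv, rv)   # matching keys: emit the merged record
                i += 1
                j += 1
            elif lk < rk:
                i += 1
            else:
                j += 1

    # Pretend these came out of two separate map/reduce lineages,
    # partitioned and sorted on the same key.
    employees = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
    bonuses   = [(1, 500), (3, 250)]

    print(list(merge(employees, bonuses)))   # [(1, 'Alice', 500), (3, 'Carol', 250)]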


The difference between homogeneous and heterogeneous datasets is as follows:

A DDBMS may be classified as homogeneous or heterogeneous. In a homogeneous system, all sites use the same DBMS product. In a heterogeneous system, sites may run different DBMS products, which need not be based on the same underlying data model, so the system may be composed of relational, network, hierarchical and object-oriented DBMSs. Homogeneous systems are much easier to design and manage. This approach provides incremental growth, making it easy to add a new site to the DDBMS, and allows increased performance by exploiting the parallel processing capability of multiple sites.

Heterogeneous systems usually result when individual sites have implemented their own databases and integration is considered at a later stage. In a heterogeneous system, translations are required to allow communication between different DBMSs. To provide DBMS transparency, users must be able to make requests in the language of the DBMS at their local site. The system then has the task of locating the data and performing any necessary translation. Data may be required from another site that may have:

• Different hardware;

• Different DBMS products;

• Different hardware and different DBMS products.

If the hardware is different but the DBMS products are the same, the translation is straightforward, involving the change of codes and word lengths. If the DBMS products are different, the translation is complicated, involving the mapping of the data structures in one data model to the equivalent data structures in another data model. For example, relations in the relational data model are mapped to records and sets in the network model. It is also necessary to translate the query language used (for example, SQL SELECT statements are mapped to the network FIND and GET statements). If both the hardware and the software are different, then both types of translation are required, which makes the processing extremely complex. The typical solution used by some relational systems that are part of a heterogeneous DDBMS is to use gateways, which convert the language and model of each different DBMS into the language and model of the relational system. Source: http://alturl.com/soqps
 