Q1. What is Hadoop?
A1. Hadoop is an open-source software framework for storing large amounts of data and processing/querying those data on a cluster with multiple nodes of commodity hardware (i.e. low cost hardware). In short, Hadoop consists of
1. HDFS (Hadoop Distributed File System): HDFS allows you to store huge amounts of data in a distributed and a redundant manner. For example, a 1 GB (i.e 1024 MB) text file can be split into 16 * 128MB files and stored on 8 different nodes in a Hadoop cluster. Each split can be replicated 3 times for fault tolerance so that if 1 node goes down, you have backups. HDFS is good for sequential write-once-and-read-many times type access.
2. MapReduce: A computational framework. This processes large amounts of data in a distributed and parallel manner. When you do a query on the above 1 GB file for all users with age > 18, there will be say “8 map” functions running in parallel to extract users with age > 18 within its 128MB split file, and then the “reduce” function will run to combine all the individual outputs into a single final result.
3. YARN (Yet Another Resource Nagotiator): A framework for job scheduling and cluster resource management.
4. Hadoop eco system, with 15+ frameworks & tools like Sqoop, Flume, Kafka, Pig, Hive, Spark, Impala, etc to ingest data into HDFS, to wrangle data (i.e. transform, enrich, aggregate, etc) within HDFS, and to query data from HDFS for business intelligence & analytics. Some tools like Pig & Hive are abstraction layers on top of MapReduce, whilst the other tools like Spark & Impala are improved architecture/design from MapReduce for much improved latencies to support near real-time (i.e. NRT) & real-time processing.
Q2. Why are organizations moving from traditional data warehouse tools to smarter data hubs based on Hadoop eco systems?
A2. Organizations are investing on enhancing their
Existing data infrastructure:
predominantly using “structured data” stored in high-end & expensive hardwares
predominantly processed as ETL batch jobs for ingesting data into RDBMS and data warehouse systems for data mining, analysis & reporting to make key business decisions.
predominantly handle data volumes in gigabytes to terabytes
Smarter data infrastructure based on Hadoop where
structured (e.g. RDBMS), unstructured (e.g, images, PDFs, docs ), & semi-structured (e.g. logs, XMLs) data can be stored in cheaper commodity machines in a scalable and fault tolerant manner.
data can be ingested via batch jobs and near real time (i.e. NRT, 200ms to 2 seconds) streaming (e.g. Flume & Kafka).
data can be queried with low latency (i.e under 100ms) capabilities with tools like Spark & Impala.
larger data volumes in terabytes to petabytes can be stored.
which empowers organizations to make better business decisions with smarter & bigger data with more powerful tools to ingest data, to wrangle stored data (e.g. aggregate, enrich, transform, etc), and to query the wrangled data with low-latency capabilities for reporting & business intelligence.
Q3. How does a smarter & bigger data hub architectures differ from a traditional data warehouse architectures?
Traditional Enterprise Data Warehouse Architecture: