• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Devaka Cooray
  • Knute Snortum
  • Paul Clapham
  • Tim Cooke
Sheriffs:
  • Liutauras Vilda
  • Jeanne Boyarsky
  • Bear Bibeault
Saloon Keepers:
  • Tim Moores
  • Stephan van Hulst
  • Ron McLeod
  • Piet Souris
  • Frits Walraven
Bartenders:
  • Ganesh Patekar
  • Tim Holloway
  • salvin francis

6+ Hadoop interview Q&As for Java developers who want a career change into BigData  RSS feed

 
Author
Posts: 3443
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Q1. What is Hadoop?
A1. Hadoop is an open-source software framework for storing large amounts of data and processing/querying those data on a cluster with multiple nodes of commodity hardware (i.e. low cost hardware). In short, Hadoop consists of

1. HDFS (Hadoop Distributed File System): HDFS allows you to store huge amounts of data in a distributed and a redundant manner. For example, a 1 GB (i.e 1024 MB) text file can be split into 16 * 128MB files and stored on 8 different nodes in a Hadoop cluster. Each split can be replicated 3 times for fault tolerance so that if 1 node goes down, you have backups. HDFS is good for sequential write-once-and-read-many times type access.



2. MapReduce: A computational framework. This processes large amounts of data in a distributed and parallel manner. When you do a query on the above 1 GB file for all users with age > 18, there will be say “8 map” functions running in parallel to extract users with age > 18 within its 128MB split file, and then the “reduce” function will run to combine all the individual outputs into a single final result.

3. YARN (Yet Another Resource Nagotiator): A framework for job scheduling and cluster resource management.

4. Hadoop eco system, with 15+ frameworks & tools like Sqoop, Flume, Kafka, Pig, Hive, Spark, Impala, etc to ingest data into HDFS, to wrangle data (i.e. transform, enrich, aggregate, etc) within HDFS, and to query data from HDFS for business intelligence & analytics. Some tools like Pig & Hive are abstraction layers on top of MapReduce, whilst the other tools like Spark & Impala are improved architecture/design from MapReduce for much improved latencies to support near real-time (i.e. NRT) & real-time processing.



Q2. Why are organizations moving from traditional data warehouse tools to smarter data hubs based on Hadoop eco systems?
A2. Organizations are investing on enhancing their

Existing data infrastructure:

   predominantly using “structured data” stored in high-end & expensive hardwares
   predominantly processed as ETL batch jobs for ingesting data into RDBMS and data warehouse systems for data mining, analysis & reporting to make key business decisions.
   predominantly handle data volumes in gigabytes to terabytes

To

Smarter data infrastructure based on Hadoop where

   structured (e.g. RDBMS), unstructured (e.g, images, PDFs, docs ), & semi-structured (e.g. logs, XMLs) data can be stored in cheaper commodity machines in a scalable and fault tolerant manner.
   data can be ingested via batch jobs and near real time (i.e. NRT, 200ms to 2 seconds) streaming (e.g. Flume & Kafka).
   data can be queried with low latency (i.e under 100ms) capabilities with tools like Spark & Impala.
   larger data volumes in terabytes to petabytes can be stored.

which empowers organizations to make better business decisions with smarter & bigger data with more powerful tools to ingest data, to wrangle stored data (e.g. aggregate, enrich, transform, etc), and to query the wrangled data with low-latency capabilities for reporting & business intelligence.

Q3. How does a smarter & bigger data hub architectures differ from a traditional data warehouse architectures?
A3.

Traditional Enterprise Data Warehouse Architecture:



Hadoop based Data Hub Architecture




More Hadoop interview Questions & Answers with diagrams


 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!