
Data Read functionality in HDFS

 
vikas gunti
A 150 MB file is stored in HDFS: the first 128 MB is stored in two blocks and the remaining 22 MB in a third block. During a file read, after the data in the first block has been read, how does a MapReduce job know which block to go to next so that it reads the file completely?

Thanks in advance.
 
Karthik Shiraly
The HDFS NameNode is responsible for tracking such metadata. It knows which datanode(s) store which blocks of which files.

A mapper or reducer actually knows nothing about HDFS blocks. It just receives key-values read by the configured RecordReader object.
The RecordReader too knows nothing about HDFS blocks. It just asks HDFS to open the file and give it an InputStream.
It's this InputStream implementation (called DFSInputStream) that is responsible for getting the block metadata from the NameNode and reading the blocks in sequence from whichever datanode(s) they are stored on.
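
To make the idea of block metadata concrete, here is a minimal sketch of how a file's byte range can be mapped onto an ordered list of blocks. It is a toy model, not HDFS code: the class BlockMapSketch, its methods, and the round-robin choice of datanode are all made up for illustration, and the 64 MB block size is an assumption taken from the numbers in the question (128 MB in two blocks).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of per-file block metadata; these are not HDFS's real classes.
class BlockMapSketch {

    // One entry per block: where it starts in the file, its length, and the node holding it.
    static class BlockInfo {
        final long offset;
        final long length;
        final String dataNode;

        BlockInfo(long offset, long length, String dataNode) {
            this.offset = offset;
            this.length = length;
            this.dataNode = dataNode;
        }
    }

    // Split a file of fileLength bytes into fixed-size blocks, kept in file order.
    // The round-robin placement is only for illustration.
    static List<BlockInfo> splitIntoBlocks(long fileLength, long blockSize, String[] dataNodes) {
        List<BlockInfo> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            String node = dataNodes[blocks.size() % dataNodes.length];
            blocks.add(new BlockInfo(offset, length, node));
        }
        return blocks;
    }

    // Return the index of the block that contains the given byte position, or -1 if none does.
    static int blockIndexFor(List<BlockInfo> blocks, long position) {
        for (int i = 0; i < blocks.size(); i++) {
            BlockInfo b = blocks.get(i);
            if (position >= b.offset && position < b.offset + b.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        String[] nodes = {"datanode1", "datanode2", "datanode3"};
        List<BlockInfo> blocks = splitIntoBlocks(150 * mb, 64 * mb, nodes);
        for (BlockInfo b : blocks) {
            System.out.println("offset=" + b.offset + " length=" + b.length + " on " + b.dataNode);
        }
        // The byte right after the first block belongs to the second block (index 1).
        System.out.println("block index for position 64 MB: " + blockIndexFor(blocks, 64 * mb));
    }
}
```

With these numbers the file ends up as three blocks of 64 MB, 64 MB and 22 MB. Because each entry records its starting offset, the reader never has to guess which block comes next: the next byte position determines it.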
 
vikas gunti
Hi Karthik,


Thank you for your answer. To be more specific, my question is: suppose the file data consists of the line "Hadoop is wonderful framework", and it is stored with "Hadoop is" in one block on datanode1 and "wonderful framework" in another block on datanode2. These datanodes may also hold data from other files. So when reading a file from HDFS, how does the Hadoop framework do its work? How exactly does it find the correct block for the continuation of the data?
 
Karthik Shiraly
When we actually try reading a file from HDFS, we know for a fact that it comes back in the right sequence. Ergo, the HDFS namenode knows how to find those fragments stored in different data nodes.
Exactly how it does this in code requires knowledge of HDFS's internal implementation, which I don't have. Perhaps you can start from DFSInputStream and look at the data structures it uses to understand how it works.
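
Without going into the real internals, the principle can be sketched with a toy model using the example line from the question. Everything here is hypothetical (the class SequentialReadSketch, the block IDs, the in-memory "datanodes") and it is not how HDFS is actually coded. The only thing it relies on is that the file's metadata lists its blocks in file order, so a datanode holding blocks of other files causes no confusion.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: in-memory maps stand in for datanodes, and a list stands in for file metadata.
class SequentialReadSketch {

    // A reference to one block of a file: its ID and the node that stores it.
    static class BlockRef {
        final long blockId;
        final String dataNode;

        BlockRef(long blockId, String dataNode) {
            this.blockId = blockId;
            this.dataNode = dataNode;
        }
    }

    // Read a file by fetching its blocks one after another, in the order the metadata lists them.
    static String readFile(List<BlockRef> orderedBlocks, Map<String, Map<Long, byte[]>> dataNodes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (BlockRef ref : orderedBlocks) {
            byte[] data = dataNodes.get(ref.dataNode).get(ref.blockId);
            out.write(data, 0, data.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Build the example from the question and read the file back.
    static String demo() {
        Map<String, Map<Long, byte[]>> dataNodes = new HashMap<>();
        dataNodes.put("datanode1", new HashMap<>());
        dataNodes.put("datanode2", new HashMap<>());

        // Blocks of our file (hypothetical IDs 101 and 102) live on different nodes.
        dataNodes.get("datanode1").put(101L, "Hadoop is ".getBytes(StandardCharsets.UTF_8));
        dataNodes.get("datanode2").put(102L, "wonderful framework".getBytes(StandardCharsets.UTF_8));

        // The same nodes also hold blocks that belong to other files.
        dataNodes.get("datanode1").put(555L, "some other file".getBytes(StandardCharsets.UTF_8));
        dataNodes.get("datanode2").put(777L, "yet another file".getBytes(StandardCharsets.UTF_8));

        // The metadata for our file: its blocks, in file order.
        List<BlockRef> fileBlocks = new ArrayList<>();
        fileBlocks.add(new BlockRef(101L, "datanode1"));
        fileBlocks.add(new BlockRef(102L, "datanode2"));

        return readFile(fileBlocks, dataNodes);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The reader only ever asks for the block IDs listed for its own file, so the unrelated blocks 555 and 777 are never touched.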
 
vikas gunti
Hi Karthik,

Thank you for the reply. I will dig further into this.
 