Hadoop - FileInputFormat Question

Jeff Napier
Greenhorn

Joined: Oct 14, 2013
Posts: 2
I have a question about how FileInputFormat works when used in a MapReduce job. More specifically, how does FileInputFormat feed data from the files in the input path to the mapper? The files are of the format: Text Text, and I'm not quite sure how to get that information from the file while in the Mapper. Would I just set the input key and value types to Text and Text, and then assume in the Mapper that I'll be getting those two Text fields from the file?

Thanks,
Jeff
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 497
    
Hi Jeff, Welcome to CodeRanch!

how does FileInputFormat feed data from the files in the input path to the mapper? The files are of the format: Text Text, and I'm not quite sure how to get that information from the file while in the Mapper.

Use Job.setInputFormatClass() (new API) or JobConf.setInputFormat() (old API) to specify an InputFormat implementation suited to your input file format.
Since your files consist of lines of key-value pairs, you can use KeyValueTextInputFormat.

When the job starts, KeyValueTextInputFormat creates a KeyValueLineRecordReader.
Record readers are responsible for converting raw input data into key-value pairs suitable for Mappers.
So the KeyValueLineRecordReader produces a <Text, Text> pair from each line.
These pairs are then passed to the Mappers.
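To make that concrete, here is a plain-Java sketch of the per-line splitting that KeyValueLineRecordReader performs. This is my own simulation of the behavior, not the actual Hadoop class: the line is split at the first occurrence of the separator, with everything before it becoming the key and everything after it the value; if the separator is absent, the whole line becomes the key and the value is empty.

```java
public class KeyValueSplit {
    // Simulates KeyValueLineRecordReader's split: key = text before the
    // FIRST separator, value = text after it. No separator -> whole line
    // is the key and the value is the empty string.
    public static String[] parseLine(String line, String separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos),
                              line.substring(pos + separator.length()) };
    }

    public static void main(String[] args) {
        String[] kv = parseLine("apple\t42", "\t"); // default tab separator
        System.out.println(kv[0] + " -> " + kv[1]); // prints: apple -> 42
    }
}
```

Note that only the first separator matters: a line like "apple\t42\t99" would give the key "apple" and the value "42\t99".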

The KeyValueLineRecordReader uses tab ("\t") as the default separator. If you want a different separator, such as whitespace or a comma, you have to set a property in the job configuration:
In versions 0.x or 1.x : set key.value.separator.in.input.line
In versions 2.x : set mapreduce.input.keyvaluelinerecordreader.key.value.separator



Jeff Napier
Greenhorn

Joined: Oct 14, 2013
Posts: 2
I'm using version 0.2 and I don't quite understand what I would be setting for the KeyValueTextInputFormat. The conf.set("...") call is a little too vague for me, because I'm not as familiar with Hadoop yet.

Thanks,
Jeff

Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 497
    
The conf.set('...") is a little to vague for me because I'm not as familiar with Hadoop yet.

Try out the basic MapReduce examples on the Hadoop wiki to understand job configuration settings. Some of the examples - like this one - even demonstrate how to set the appropriate input format class using Job.setInputFormatClass().

In your first post, you mentioned:
The files are of the format: Text Text

I interpreted that as your input files consisting of lines of key-value pairs separated by whitespace, a tab, or some other separator string. Hence I mentioned KeyValueTextInputFormat and how to override its default separator string.
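To make the configuration part concrete, here is a minimal driver sketch, assuming the Hadoop 2.x mapreduce API (the class names, property name, and method calls are the standard Hadoop ones; KeyValueJob, EchoMapper, and the job name are my own placeholder names). Because KeyValueTextInputFormat is set as the input format, the mapper's input key and value types are both Text, which answers the original question: each line arrives in the mapper as one <Text, Text> pair.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueJob {

    // Input types are <Text, Text> because KeyValueTextInputFormat
    // hands the mapper one key-value pair per input line.
    public static class EchoMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value); // just pass the pair through
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default tab separator (2.x property name shown;
        // use key.value.separator.in.input.line on 0.x/1.x instead).
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "key-value example");
        job.setJarByClass(KeyValueJob.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(EchoMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since you're on a 0.x release, the old-API equivalent is to call JobConf.setInputFormat(KeyValueTextInputFormat.class) on a JobConf and set the 0.x property name shown in the comment; the overall shape of the driver is the same.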
 