
Multiple inputs on a single mapper in hadoop

 
Ivan Zandon
Greenhorn
Posts: 3
I'm developing an algorithm that needs to run two sequential MapReduce jobs, where the second job takes as input both the input and the output of the first one. I found four ways to do it, and I want to know which of these is the most efficient, or whether there are other methods.

Distributed Cache

Merging all the reducer output into a single file and loading it into the Distributed Cache
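A minimal sketch of the second job's driver using the Distributed Cache, assuming the newer `org.apache.hadoop.mapreduce` Job API (Hadoop 2.x); the paths and class name are hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "second job");
        job.setJarByClass(SecondJobDriver.class);

        // Ship the merged output of job 1 to every task node; the
        // "#job1out" fragment creates a local symlink with that name.
        job.addCacheFile(new URI("/user/ivan/job1-merged/part-all#job1out"));

        FileInputFormat.addInputPath(job, new Path("/user/ivan/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/ivan/job2-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In the mapper's setup() the cached file can then be opened as the local file "job1out" with plain java.io.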



Adding it as a resource to the configuration class

As before, I merge the output, saving it in a String, and then:
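A sketch of this approach; the property key "job1.output" and the class name are made up for illustration. Note that the configuration is serialized and copied to every task, so this is only sensible for small outputs:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side: stash the merged reducer output under a custom key.
//   conf.set("job1.output", mergedOutputString);

// Mapper side: read it back once per task in setup().
public class SecondMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String job1Output;

    @Override
    protected void setup(Context context) {
        job1Output = context.getConfiguration().get("job1.output");
    }
}
```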



Reading from hdfs

The second mapper reads the output files of the first job's reducers directly from HDFS
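A sketch of a mapper that loads the first job's output from HDFS in setup(); the output path and the key/value types are hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> job1Lines = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // Read every part-* file that the first job's reducers wrote.
        for (FileStatus status : fs.listStatus(new Path("/user/ivan/job1-output"))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    job1Lines.add(line);
                }
            }
        }
    }
}
```

Be aware that every mapper task re-reads the whole output, which can get expensive if there are many tasks.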

Passing two values as input

On this webpage I found pseudocode where it seems they are passing two inputs to the second mapper, but I don't know how to do that.
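The standard way to feed two datasets into one job is MultipleInputs, which lets each input path have its own mapper class. A sketch, with hypothetical paths and mapper class names:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Both mappers must emit the same key/value types; the actual join of
// the two datasets then happens in the reducer (a reduce-side join).
MultipleInputs.addInputPath(job, new Path("/user/ivan/input"),
        TextInputFormat.class, OriginalInputMapper.class);
MultipleInputs.addInputPath(job, new Path("/user/ivan/job1-output"),
        TextInputFormat.class, Job1OutputMapper.class);
```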

 
Rajesh Nagaraju
Ranch Hand
Posts: 63
Loading the reducer output into the Distributed Cache will only work if the reducer output is small, not very big.

The local.cache.size parameter controls the size of the Distributed Cache. By default, it's set to 10 GB.
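If needed, the limit can be raised in the cluster configuration; the value is given in bytes (this sketch sets 20 GB):

```xml
<!-- mapred-site.xml: raise the local Distributed Cache limit to 20 GB -->
<property>
  <name>local.cache.size</name>
  <value>21474836480</value>
</property>
```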
 
Ivan Zandon
Greenhorn
Posts: 3
What about performance (in time and space)? Is the method I use to merge all the outputs from the reducers correct, or are there any better methods?
 
Rajesh Nagaraju
Ranch Hand
Posts: 63
Merging the reducer output can be done with a command.
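Presumably the command meant here is hadoop fs -getmerge, which concatenates all the part files in a directory into a single local file; the paths below are hypothetical:

```shell
# Merge all part files from the first job's output into one local file,
# then push the merged file back to HDFS for the second job to use.
hadoop fs -getmerge /user/ivan/job1-output /tmp/job1-merged
hadoop fs -put /tmp/job1-merged /user/ivan/job1-merged/part-all
```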

Then we can read this file as an input for the mapper. Use the Distributed Cache if the file is small; if it is large, add another mapper to process this file and then use MultipleInputs (this becomes a reduce-side join).

If you still want to do a map-side join, then you can use CompositeInputFormat.

Lastly, to automate the whole process you could use Oozie or Spring Batch.

Hope to hear from others their views.
 
Ivan Zandon
Greenhorn
Posts: 3
How can I execute that command automatically, after the first job's reducers finish and before the second mapper starts?
Unfortunately I cannot use Oozie, because I don't have the rights to install it on the production environment.
 
Rajesh Nagaraju
Ranch Hand
Posts: 63
Spring Batch can be used, since only the jar file references are needed on the classpath.
You can launch it from a shell script and use the Spring for Apache Hadoop package to run the Hadoop command.
 