
How to set up a map-reduce task?

Joseph Hwang
Posts: 24
The following data was extracted by my first map-reduce task:

country ; title ; sex ; units ; file location
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRQ647N.csv
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRA647N.csv
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRQ647S.csv
Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRA647S.csv

I am now trying to set up a second map-reduce task that takes the CSV files listed in the file location column as input. Each CSV file has the following format:

year ; population
2004 ; 2130034
2005 ; 2239913
2006 ; 2437712
2007 ; 2210673

However, I have no idea how to set up the second map-reduce task so that it uses the file location values produced by the first one. The final output should look like this:

country ; year ; population
Turkey ; 2004 ; 2130034
Turkey ; 2005 ; 2239913
Turkey ; 2006 ; 2437712
Turkey ; 2007 ; 2210673

As far as I know, the input file path is set only in the driver class, with the FileInputFormat.setInputPaths() method, but in my map-reduce task the file location is available only in the map and reduce classes. How can I pass the input file paths from the map and reduce classes to the driver class?
How can I pass a file location value to FileInputFormat.setInputPaths(), for example FileInputFormat.setInputPaths(job, new Path("L/F/W/A/5/LFWA55MATRQ647N.csv")); ?
I need your advice. Thank you in advance for your help!
Rajesh Nagaraju
Ranch Hand
Posts: 63
One way to chain MR jobs is to use Spring Batch.
amit punekar
Ranch Hand
Posts: 544
Are your CSV files on HDFS? How big is one file, i.e. how many "year ; population" rows does it contain? If they are not on HDFS yet, you could copy them there first.

Then run a Pig script, which would automatically chain the MR jobs required to process the data.

The Pig script would roughly look like this (assuming the output of your 1st MR job is in a file):
1) Read the 1st MR output with the schema country, title, sex, units, file location (or name).
2) If the CSV files are on HDFS, read them with the schema file location (or name), year, population. [You may have to write your own loader function for this, because we want the file location as one of the output fields.]
3) Join 1 and 2 on "file location (or name)", which results in the desired output, i.e. country, year, population.
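
To make the join in steps 1 to 3 concrete, here is a rough sketch of the same logic in plain Java, without Pig or Hadoop. It only illustrates what the join computes; it is not a Pig script and not the actual job. The class and method names (LocationJoinSketch, countryByLocation, joinFile) are made up for this example, and it assumes the " ; " separated layout shown in the question, including the header rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop or Pig.
class LocationJoinSketch {

    // Step 1: map each file location to its country, from lines such as
    // "Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRQ647N.csv".
    static Map<String, String> countryByLocation(List<String> firstJobLines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : firstJobLines) {
            String[] fields = line.trim().split("\\s*;\\s*");
            // Skip the header row and any line without five columns.
            if (fields.length < 5 || fields[0].equalsIgnoreCase("country")) {
                continue;
            }
            result.put(fields[4], fields[0]);
        }
        return result;
    }

    // Steps 2 and 3: tag every "year ; population" row of one CSV file
    // with the country looked up by that file's location.
    static List<String> joinFile(Map<String, String> countries,
                                 String location,
                                 List<String> csvLines) {
        List<String> joined = new ArrayList<>();
        String country = countries.get(location);
        if (country == null) {
            return joined; // no matching row in the first job's output
        }
        for (String line : csvLines) {
            String[] fields = line.trim().split("\\s*;\\s*");
            // Skip the header row and any line without two columns.
            if (fields.length < 2 || fields[0].equalsIgnoreCase("year")) {
                continue;
            }
            joined.add(country + " ; " + fields[0] + " ; " + fields[1]);
        }
        return joined;
    }

    public static void main(String[] args) {
        String location = "L/F/W/A/5/LFWA55MATRQ647N.csv";
        Map<String, String> countries = countryByLocation(Arrays.asList(
                "country ; title ; sex ; units ; file location",
                "Turkey ; Population ; Males ; Persons ; " + location));
        List<String> rows = joinFile(countries, location, Arrays.asList(
                "year ; population",
                "2004 ; 2130034",
                "2005 ; 2239913"));
        rows.forEach(System.out::println);
    }
}
```

Keying the lookup on the file location is what makes the join possible, which is why step 2 needs the location as an explicit field of every CSV row.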

Of course, all of this can be done with plain MR as well, but you will have to chain the jobs together yourself. Whichever way you proceed, I believe you need to have the CSV files on the HDFS cluster.
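
For the plain MR route, one possible approach (a sketch, not the only way) is to let the driver read the first job's output after that job has finished, collect the file location column, and hand the result to the second job. FileInputFormat.setInputPaths() also has an overload that takes a comma-separated string of paths, so the helper below builds such a string. The code is plain Java so it can run without a cluster; the class and method names are invented for this example, and reading the lines from HDFS and creating the Job are left out.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for the driver of the second job; not a Hadoop class.
class InputPathsSketch {

    // Collects the distinct values of the fifth column (file location)
    // and joins them with commas, keeping their first-seen order.
    static String commaSeparatedInputPaths(List<String> firstJobLines) {
        Set<String> locations = new LinkedHashSet<>();
        for (String line : firstJobLines) {
            String[] fields = line.trim().split("\\s*;\\s*");
            // Skip the header row and any line without five columns.
            if (fields.length < 5 || fields[0].equalsIgnoreCase("country")) {
                continue;
            }
            locations.add(fields[4]);
        }
        return String.join(",", locations);
    }

    public static void main(String[] args) {
        // In a real driver these lines would be read from the first job's
        // output directory once that job has completed (assumption).
        List<String> firstJobLines = Arrays.asList(
                "country ; title ; sex ; units ; file location",
                "Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRQ647N.csv",
                "Turkey ; Population ; Males ; Persons ; L/F/W/A/5/LFWA55MATRA647N.csv");
        String paths = commaSeparatedInputPaths(firstJobLines);
        // In the Hadoop driver the next call would be, roughly:
        //   FileInputFormat.setInputPaths(secondJob, paths);
        System.out.println(paths);
    }
}
```

Inside the second job's mapper, the file a record came from can usually be obtained from the input split, e.g. ((FileSplit) context.getInputSplit()).getPath() with the default file-based input formats, which gives the key needed to look up the country.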
