posted 10 years ago
Hello,
Are your CSV files on HDFS? How big is one file, i.e. how many "year";"population" rows does it contain? If they are not on HDFS yet, you could copy them there first.
Then run a Pig script, which would automatically chain the MR jobs required to process the data.
The Pig script would look roughly like this (assuming the output of your 1st MR job is in a file):
1) Read the 1st MR output with the schema: country, title, sex, units, file location (or name)
2) If the CSV files are on HDFS, read those files with the schema: file location (or name), year, population [You may have to write your own Loader Function for this, since we want the file location as one of the output fields]
3) Join 1 and 2 on "file location (or name)", which gives the desired output, i.e.
country, year, population
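
A minimal Pig Latin sketch of the three steps above, in case it helps. All paths, relation names and the ',' delimiter of the 1st MR output are made up for illustration. Instead of a custom Loader Function it assumes your Pig version's PigStorage accepts the '-tagFile' option (I believe this came in around Pig 0.12), which prepends the input file name to every record. If your version does not have it, step 2 would need your own loader as described above.

```pig
-- 1) Output of the 1st MR job (hypothetical path; ',' delimiter is an assumption)
meta = LOAD '/user/hypothetical/mr1_output' USING PigStorage(',')
       AS (country:chararray, title:chararray, sex:chararray,
           units:chararray, file_name:chararray);

-- 2) The CSV files (hypothetical path). '-tagFile' adds the file name as the
--    first field. Fields are kept as chararray here: header rows and quotes,
--    if present, would have to be cleaned before casting to numbers.
data = LOAD '/user/hypothetical/csv' USING PigStorage(';', '-tagFile')
       AS (file_name:chararray, year:chararray, population:chararray);

-- 3) Join on the file name and keep only the desired fields
joined = JOIN meta BY file_name, data BY file_name;
result = FOREACH joined GENERATE meta::country    AS country,
                                 data::year       AS year,
                                 data::population AS population;

STORE result INTO '/user/hypothetical/output' USING PigStorage(',');
```

Note that the join only matches if the 1st MR output stores the bare file name, because that is what '-tagFile' produces. If it stores the full location, '-tagPath' would be the option to try instead.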
Of course, all of this can be done with plain MR as well, but you would have to chain those jobs together yourself. Whichever way you proceed, I believe the CSV files need to be on HDFS.
Regards,
Amit