Win a copy of Functional Reactive Programming this week in the Other Languages forum!
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

Sorting using Mapreduce

 
Madhumitha Baskaran
Ranch Hand
Posts: 66
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi,

I want to implement distributed sorting using Map reduce in Java. From whatever I know, Merge sort will look better for distributed sorting. Are there are any other algorithms that I could use? Please guide.

Thanks,
Madhu
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13074
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Note that using Mapreduce will involve two separate phases and thus two algorithm choices.

Obviously the phase which collects and merges separate sort jobs will have to be a merge but the separate sort jobs will operate just like any other sort with the same considerations.

Bill
 
Madhumitha Baskaran
Ranch Hand
Posts: 66
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks.

I have some questions popping out.

I am not able to find what is the exact key over here. Out of Map function, if I need to find separate keys, it is not possible. Because if I have large amount of integers to be sorted, then key should be the same for everything for me to merge it together later at reduce phase.

Shall I do something like.. I will go through the input data set and picks few keys randomly. I will make all the integers lesser than a key to get associated with that key. I will sort each group of a key using Map function. Later I will just merge everything in the reduce phase depending on the sorting order of keys. Will this work out well?

I dont know how to use Mapreduce efficiently for large data set. Can you help me out.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13074
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I don't understand the function of "keys" in your example - sounds like an extra step for no gain.

I'm not an expert on Mapreduce but if this was my problem I would cut the data set into roughly equal portions, one for each sorting processor/thread/whatever, then merge by looking at the "top" element of each of the returned sorted arrays, take the lowest (or highest), remove it and repeat till all sources are exhausted.

Bill
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic