Not getting performance with MapReduce

Priyanka Suresh Shinde

Joined: Nov 27, 2012
Posts: 2
I am using Hadoop MapReduce to get a performance benefit, but when I run my program on Hadoop it takes about 37 minutes, whereas a simple C++ program doing the same task takes only about 5 minutes.
Jayesh A Lalwani
Saloon Keeper

Joined: Jan 17, 2008
Posts: 2744

Please TellTheDetails. What is your application doing? Where is it spending most of its time?
Priyanka Suresh Shinde

Joined: Nov 27, 2012
Posts: 2
The input file contains a number of records, one per line. I have written a simple program that prints those lines which have three words in common. In the map function I pass each word as the key and the record as the value, and in the reduce function I compare those records.
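The approach described above could be sketched roughly as follows. This is plain Python that imitates the map, shuffle and reduce steps in memory, not the Hadoop API and not the poster's actual code. It assumes that "three words in common" means two records sharing at least three distinct words; the function names and the `min_common` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_record(record):
    """Emit one (word, record) pair per distinct word in the record."""
    for word in set(record.split()):
        yield word, record


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(records, min_common=3):
    """Compare every pair of records that arrived under the same word."""
    for a, b in combinations(sorted(set(records)), 2):
        if len(set(a.split()) & set(b.split())) >= min_common:
            yield a, b


def find_matches(records):
    """Run map, shuffle and reduce; return (emitted pair count, matches)."""
    pairs = [kv for record in records for kv in map_record(record)]
    matches = set()  # the same pair can be found under several words
    for group in shuffle(pairs).values():
        matches.update(reduce_word(group))
    return len(pairs), sorted(matches)


if __name__ == "__main__":
    emitted, matches = find_matches(["a b c d", "a b c e", "x y z"])
    print(emitted, matches)
```

Note that in this sketch every record is emitted once per distinct word it contains, so the data moved between map and reduce is several times larger than the input, while each comparison is only a cheap set intersection.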
Martin Vajsar

Joined: Aug 22, 2010
Posts: 3732

Welcome to the Ranch, Priyanka!

Parallel processing is not a silver bullet that instantly makes every program run x times faster. It adds a lot of overhead: creating the workers, distributing the work to them, and then collecting and aggregating the results. If I understand your description right, there is hardly any actual processing; your workers do next to nothing.

Imagine you need to do a project that will take a man-year of work. You can do it yourself in a year, or you can hire ten developers, distribute the work among them, manage them and deliver the project in, perhaps, three months. You might expect the project to be finished in five or six weeks, given that there are now ten people working on it, but that won't be the case. The developers won't spend all their time coding; they will need to meet and coordinate their work, which isn't needed if just one person does the work.

And now imagine that you hired ten developers to write a 20-line "Hello, world!" application. They would probably spend much, much more time doing so than if you whipped up the program yourself. Each of them would in theory write just two lines of code, but the overhead of coordinating their work is so big in this case that it exceeds, several times over, any benefit from having multiple people working on it.

Your program is similar: the individual workers have very little work to do, but the amount of work needed to coordinate them is the same as if they worked hard. This simple program won't work well with Hadoop. Only programs that do a substantial amount of work in the Map and Reduce functions can experience any speedup at all. Hadoop is best suited for cases where you can distribute a lot of work among a lot of workers.
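To put rough numbers on the argument above, here is a toy cost model in Python. It is only a sketch: it assumes the work divides evenly among the workers and that coordination costs a fixed amount plus a small amount per worker. The function name, its parameters and all the figures are made up for illustration; none of them are measurements from this thread or from a real cluster.

```python
def job_time(work_minutes, workers, fixed_overhead_minutes, per_worker_overhead_minutes):
    """Estimated wall-clock minutes for a job split evenly across workers.

    Toy model (an assumption, not a Hadoop formula):
    coordination overhead + the share of real work each worker does.
    """
    overhead = fixed_overhead_minutes + per_worker_overhead_minutes * workers
    return overhead + work_minutes / workers


if __name__ == "__main__":
    # Hypothetical figures: 30 minutes of fixed overhead, 0.2 minutes per worker.
    for work in (5, 5000):
        alone = job_time(work, 1, 0, 0)
        spread = job_time(work, 10, 30, 0.2)
        print(f"work={work} min: one worker {alone:.1f}, ten workers {spread:.1f}")
```

With these invented figures, the small job gets slower when spread over ten workers because the overhead dwarfs the work, while the large job gets much faster because the overhead is small next to the work being divided.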
Saurabh Rana

Joined: Jul 05, 2012
Posts: 7
How many rows are you trying to process? What are the details of your cluster? How many nodes does it have?