MapReduce using Java

 
Madhumitha Baskaran
Ranch Hand
Posts: 66
Hi all,

Is it necessary to use Hadoop to implement MapReduce programs in Java? Is it possible for me to implement it without Hadoop, using Java classes alone?

Please help me.

Thanks in advance,
Madhu
 
Karthik Shiraly
Bartender
Posts: 1210
  • Report post to moderator
Of course it's possible.
MapReduce is just a technique to break some input up into smaller chunks, process each chunk to get a result, and finally aggregate all those chunk results to get a final result.
It lends itself to parallelism, because each chunk can be processed by a different thread, core, processor, or machine, and the results then collected at a central point and aggregated.
But such a simplistic implementation may not scale well, or may simply become too time consuming, for large datasets.
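
To make that concrete, here is a minimal single-JVM sketch of the idea in plain Java (the class and method names are just illustrative): the "map" step turns each chunk of lines into a partial word count, each chunk is processed on its own worker thread, and the "reduce" step merges the partial results.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal single-JVM sketch of the map/reduce idea (illustrative, not Hadoop code):
// count word frequencies across chunks of lines in parallel.
public class MiniMapReduce {

    // "Map" phase: turn one chunk of lines into a partial word-count result.
    static Map<String, Integer> mapChunk(List<String> lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }

    // "Reduce" phase: merge all partial results into one final result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                Integer c = total.get(e.getKey());
                total.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("the quick brown fox", "jumps over the lazy dog"),
                Arrays.asList("the dog barks", "the fox runs"));

        // Each chunk is mapped by a separate worker thread.
        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
        List<Future<Map<String, Integer>>> futures = new ArrayList<Future<Map<String, Integer>>>();
        for (final List<String> chunk : chunks) {
            futures.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    return mapChunk(chunk);
                }
            }));
        }

        // Collect the partial results at a central point and aggregate them.
        List<Map<String, Integer>> partials = new ArrayList<Map<String, Integer>>();
        for (Future<Map<String, Integer>> f : futures) {
            partials.add(f.get());
        }
        pool.shutdown();

        System.out.println(reduce(partials)); // e.g. {the=4, fox=2, dog=2, ...}
    }
}
```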

What Hadoop brings to the table is the infrastructure and physical architecture to perform distributed MapReduce on a large scale using a cluster of machines, with features like centralized tracking and supervision.
Since it stores chunk results in a distributed file system, it's also fault tolerant. It's suitable when datasets are in the hundreds of MBs and above in size.
If your problem does not require that level of scalability or fault tolerance, or does not involve large dataset sizes, you don't need Hadoop.

What's the nature of your problem?
 
Madhumitha Baskaran
Ranch Hand
Posts: 66
Thanks. Your answer was helpful.

I am working on a project to implement distributed grep and distributed sorting using MapReduce. I have an ordinary Core i5 laptop and I don't have a distributed environment to work on. So I am thinking that I can take the simplistic approach.

If I use threads to implement the same, will it be possible for me to let each thread use one core and get executed simultaneously? Please help.

Thanks,
Madhu
 
Karthik Shiraly
Bartender
Posts: 1210

Madhumitha Baskaran wrote:I am working on a project to implement distributed grep and distributed sorting using MapReduce. I have an ordinary Core i5 laptop and I don't have a distributed environment to work on. So I am thinking that I can take the simplistic approach.


From your description of the problem, it looks like the intended system should indeed be distributed across machines at some point ("distributed grep and distributed sorting"), and even the strategy to do so has been decided as MapReduce (presumably using Hadoop).

The only problem seems to be that for your development purposes you don't have a cluster of machines at the moment.

I don't think the solution should be decided by the non-availability of development resources. Rather, it should be decided by how the system is going to be finally deployed in production.
You can start off by installing Hadoop in single-machine mode on your laptop (they have a tutorial that explains how - very easy to do).
You can later simulate a cluster of machines on your laptop by installing a virtualization product like VirtualBox and creating at least one virtual machine (your host machine and the virtual machine will play the roles of name node/job tracker and data node/task tracker), installing Hadoop on both of them, and running your jobs on this "virtual cluster". There is a learning curve involved here, but it'll be well worth it.
If at a later point you have access to more machines, you can very easily include them in your Hadoop setup. The grepping and sorting logic (Hadoop already supports sorted aggregation) will remain the same regardless of whether Hadoop is on a single machine or a cluster.
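
To give you an idea of the shape of the code, here is a rough sketch of a grep-style job against Hadoop's org.apache.hadoop.mapreduce API. Treat it as an illustration, not a tutorial: the class names, the grep.pattern configuration key, and the argument order are all made up for this example, and Job.getInstance assumes a reasonably recent Hadoop version.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Grep-style Hadoop job sketch: find regex matches in the input and
// count the occurrences of each matched string.
public class DistributedGrep {

    public static class GrepMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private Pattern pattern;

        protected void setup(Context context) {
            // The regex is passed in through the job configuration.
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (matchedString, 1) for every regex match in this line.
            Matcher m = pattern.matcher(value.toString());
            while (m.find()) {
                context.write(new Text(m.group()), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum up the counts for each matched string.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Usage (illustrative): hadoop jar grep.jar DistributedGrep <in> <out> <regex>
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("grep.pattern", args[2]); // e.g. "ERROR.*"
        Job job = Job.getInstance(conf, "distributed grep");
        job.setJarByClass(DistributedGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A nice side effect: Hadoop sorts the keys between the map and reduce phases (that's the sorted aggregation mentioned above), so the reducer sees keys in sorted order, and the same jar runs unchanged on a single machine or a cluster.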

Madhumitha Baskaran wrote:If I use threads to implement the same, will it be possible for me to let each thread use one core and get executed simultaneously? Please help.


How each thread is scheduled and assigned to a core depends on how the JVM is implemented, how the underlying OS in turn schedules threads, what other applications are occupying the processor, and so on. It's rather emergent behaviour. Java has no explicit core-affinity capability - you can't tell Java "I have 4 cores and I want this thread to run on this core and that other thread to run on that core". You just implement multithreading using the Java APIs (Java 7, for example, introduces the fork/join API, which makes tasks like yours easier), hope for the best, measure performance, and see if any code-level optimizations are possible to utilize the threads better.
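
For example, here is a rough Java 7 fork/join sketch for the grep part - the class name and the threshold are just illustrative. A ForkJoinPool defaults to one worker thread per available core and uses work stealing to keep the cores busy, but which core actually runs which subtask is still up to the OS:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join sketch (illustrative): recursively split a list of lines,
// count lines containing a search term in parallel, and merge the counts.
public class ForkJoinGrep extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000; // process small chunks directly
    private final List<String> lines;
    private final String term;

    ForkJoinGrep(List<String> lines, String term) {
        this.lines = lines;
        this.term = term;
    }

    @Override
    protected Long compute() {
        if (lines.size() <= THRESHOLD) {
            // Small enough: count matching lines sequentially.
            long count = 0;
            for (String line : lines) {
                if (line.contains(term)) count++;
            }
            return count;
        }
        // Too big: split into two halves and process them in parallel.
        int mid = lines.size() / 2;
        ForkJoinGrep left = new ForkJoinGrep(lines.subList(0, mid), term);
        ForkJoinGrep right = new ForkJoinGrep(lines.subList(mid, lines.size()), term);
        left.fork();                        // schedule left half asynchronously
        long rightResult = right.compute(); // compute right half in this thread
        return left.join() + rightResult;   // wait for the left half and merge
    }

    public static void main(String[] args) {
        List<String> lines = java.util.Collections.nCopies(10000, "an ERROR line");
        ForkJoinPool pool = new ForkJoinPool(); // defaults to one worker per core
        long matches = pool.invoke(new ForkJoinGrep(lines, "ERROR"));
        System.out.println(matches + " matching lines");
    }
}
```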

From the description of your problem, I think you should stick with Hadoop instead of going this route, since you need a distributed solution. Going the threading route means you'll have to roll your own distribution logic later on, using RMI or something like that. Hadoop already has all of that, and it's much less coding work. You can concentrate on the core analysis logic from the beginning, instead of on the infrastructure to run that logic.
 
Madhumitha Baskaran
Ranch Hand
Posts: 66
Thanks a lot. Your reply is extremely helpful. I will go with Hadoop itself, because I might lose points if I do a simple implementation using Java threads alone. I am hoping that getting myself familiar with Hadoop will be a manageable task.
 
Karthik Shiraly
Bartender
Posts: 1210
"lose points"?? Is this an academic project? Hadoop is easy to learn - no worries there.
 
Madhumitha Baskaran
Ranch Hand
Posts: 66
Yes, it is a project for my graduate studies. Thanks a lot for your help - otherwise I would have ended up doing an ordinary implementation using threads.
 
Karthik Shiraly
Bartender
Posts: 1210
No problem. Good luck!
 
Satyaprakash Joshii
Ranch Hand
Posts: 440
I want to know: what is in Hadoop MapReduce that was not in Google's MapReduce?
 
Greenhorn
Posts: 15

Satyaprakash Joshii wrote:I want to know: what is in Hadoop MapReduce that was not in Google's MapReduce?



The difference is that Hadoop is open source Apache software, whereas Google's is not. Hadoop was built based on a white paper that Google published on MapReduce. Look at Hadoop's history for more info.
 