Rajesh Nagaraju

Ranch Hand
since Nov 27, 2003

Recent posts by Rajesh Nagaraju

Add a counter and check in your reducer whether the sum is correct.
7 years ago
I did my Cloudera Certified Hadoop Developer certification on my own.

The MapR Certified Admin certification was through my company.

However, I don't list the MapR certification, as I cannot do justice to it.

I am also a Java programmer, so you should ask yourself why you want both, rather than concentrating entirely on the developer part.
7 years ago
Spring Hadoop is essentially Spring Batch.

It offers a workflow to schedule and run MR, Pig, Hive and other related jobs.

It is not going to help you create MR jobs.

Thanks and Regards
Rajesh Nagaraju
7 years ago
The approach would be to have 3 reference data sets,

1> Positive words
2> Negative words
3> Confusion matrix

Then check whether the comment contains positive or negative words, and classify it as a positive or a negative comment.

A confusion matrix is a contingency table.

The challenges will be capturing positive words combined with negative words, and sarcasm in comments.

Example: "This is not the best movie I have watched."

The word "not" here does not necessarily mean it is a negative comment.
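
As a rough sketch of the word-list idea (plain Python rather than Hadoop code; the tiny word sets, the NEGATIONS set and the "ambiguous" label are my own assumptions for illustration), a comment could be scored like this, with negated comments set aside instead of being forced into a class:

```python
import re

# Hypothetical, tiny reference sets; real ones would be far larger.
POSITIVE_WORDS = {"good", "great", "best", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "worst", "boring", "awful", "hate"}
# Assumed negation markers used only to flag hard cases.
NEGATIONS = {"not", "never", "no"}


def classify(comment):
    """Classify a comment by counting hits in the reference word sets."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    has_negation = any(token in NEGATIONS for token in tokens)

    # A negation next to sentiment words is exactly the hard case,
    # so flag it instead of guessing.
    if has_negation and (positive_hits or negative_hits):
        return "ambiguous"
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


label = classify("This is not the best movie, I have watched.")
```

Sarcasm will still slip through a check like this; the flagged comments are the ones that need a better model or a manual look.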

Can you share more information about your dataset? The computational power of Hadoop can help you process such a
huge dataset; however, Hadoop will not do anything by itself.


Hope this helps

Thanks and Regards
Rajesh Nagaraju

7 years ago

This could be a data issue, or it might have reached the max threshold.



Can you please elaborate on the max threshold?

Thanks and Regards
Rajesh Nagaraju
7 years ago
Hi Baran,

How did you proceed?

Thanks and Regards
Rajesh Nagaraju
8 years ago
Additionally, due to the parallel processing of the mappers and the sort and shuffle phase, the ordering of the lines in the output could change.
Hence the line numbers in the output will not be in sync with the input.
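
A toy illustration of the reordering (plain Python, not Hadoop; the sample lines are made up, and sorting by key stands in for the sort and shuffle phase):

```python
# Each input line is emitted as a key, tagged with its input line number.
lines = ["pear", "apple", "mango"]
mapped = [(line, number) for number, line in enumerate(lines, start=1)]

# The shuffle delivers records to the reducer sorted by key,
# not in the order they were read.
shuffled = sorted(mapped)

# Input line numbers in the order the reducer sees them.
output_order = [number for _, number in shuffled]
```

Here output_order comes out as [2, 3, 1], so numbering the output lines 1, 2, 3 would not match the input.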
8 years ago

amit punekar wrote:Hello,
Why can't you
1) Run the mapper that outputs "word" as key and its length as value.
2) Setting the number of reducers to 1 would make sure that all mappers' output is passed to a single reducer, which can then look at the map and output the MAX length words.

I do understand that I am not addressing the "setup" question that you asked. However, this way you could handle it easily and in a better manner.

As someone mentioned, you could use the reducer as a custom combiner as well (similar to the standard Weather example).

Regards,
Amit



This is Approach 1 that I mentioned. The limitation is that you have only 1 reducer, which could end up with a lot to do and
hence affect performance. I have not mentioned a combiner because we don't need one: each mapper's output is just the longest word
and its length.
8 years ago
In this approach

I mean to say, use map to output word and its length; and then in reduce, use static variables for max length, and compute max length of word from input,
and finally write the output in cleanup() method



The variables will give you the max only for the particular map task, assuming JVM re-use is left at its default of 1.
You will still need 1 reducer to get the actual max length and the word across all the mappers.

My approach would be to have a global counter for the max length; you start with the counter value being the length of the first word.
If a word is longer than the counter's max length, write the word as the key and its length as the value, and change the counter value to the new length.
Then,

Approach 1: Use a single reducer to get the max. The advantage is that the number of records to process in the reducer is reduced.
Approach 2: More complicated, but it will perform better: use the length of the word as the key, then use a custom partitioner to send each range of lengths to a reducer.
Then find the max in each reducer; the output of your last reducer will hold the max length and the word.
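
To make the two approaches concrete, here is a small simulation in plain Python (not Hadoop code). It is only a sketch under assumptions: it keeps a per-task running maximum in a local variable instead of a Hadoop counter, and the function names, the sample splits and the max_len bound are all made up for illustration.

```python
def map_task(lines):
    """One simulated map task: emit only the longest word it has seen."""
    best_word, best_len = None, 0
    for line in lines:
        for word in line.split():
            if len(word) > best_len:
                best_word, best_len = word, len(word)
    # Emitted once at the end of the task, like writing in cleanup().
    return [(best_word, best_len)] if best_word is not None else []


def single_reducer(pairs):
    """Approach 1: one reducer picks the overall max from all mapper outputs."""
    return max(pairs, key=lambda pair: pair[1])


def range_partition(length, num_reducers, max_len=30):
    """Approach 2: hypothetical partitioner, longer words go to later reducers."""
    return min(length, max_len) * num_reducers // (max_len + 1)


def partitioned_max(pairs, num_reducers=3, max_len=30):
    """Approach 2: each reducer finds its own max; the last non-empty one wins."""
    buckets = [[] for _ in range(num_reducers)]
    for word, length in pairs:
        buckets[range_partition(length, num_reducers, max_len)].append((word, length))
    per_reducer = [single_reducer(bucket) for bucket in buckets if bucket]
    return per_reducer[-1]


# Three made-up input splits, one per simulated map task.
splits = [
    ["the quick brown fox"],
    ["jumps over the lazy dog"],
    ["hadoop mapreduce example"],
]
mapper_outputs = [pair for split in splits for pair in map_task(split)]
longest_1 = single_reducer(mapper_outputs)
longest_2 = partitioned_max(mapper_outputs, num_reducers=3, max_len=10)
```

Both approaches end up with the same word here. In the second one the answer sits in the highest-numbered reducer that actually received records, which is why the partitioner has to keep the length ranges in order.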



8 years ago
Spring Batch can be used, as only the jar file references are needed in the classpath;
you can launch it from a shell script and use the Spring Hadoop package to run the hadoop command.
8 years ago
Merging of the reducer could be done with command

Then we can read this as an input for the mapper. If the file is small, use the Distributed Cache; if it is large, you will need another Mapper to process this file and then use MultipleInputs (this becomes a reduce-side join).

If you still want to do a map side join then you can use CompositeInputFormat.
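
For the small-file case, the Distributed Cache idea boils down to loading the small data set into memory once per map task and joining every record against it. A minimal sketch in plain Python (not the Hadoop API; the tab-separated layout, the sample data and the function names are assumptions):

```python
def load_cache(lines):
    """Parse 'key<TAB>value' lines of the small data set into a dict."""
    return dict(line.split("\t", 1) for line in lines)


def join_mapper(record, cache):
    """Join one 'key<TAB>value' record against the in-memory small data set."""
    key, value = record.split("\t", 1)
    if key in cache:
        # Inner join: records without a match in the cache are dropped.
        yield key, (value, cache[key])


# Made-up small side (would come from the cached file) and large side.
small_side = ["1\tBooks", "2\tMusic"]
large_side = ["1\tJava", "3\tUnknown", "2\tJazz"]

cache = load_cache(small_side)  # done once, like in setup()
joined = [out for record in large_side for out in join_mapper(record, cache)]
```

This only works while the small side fits in the memory of each map task; beyond that, the reduce-side join is the safer route.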

Lastly, to automate the whole process you could use Oozie or Spring Batch.

I hope to hear others' views.
8 years ago
Loading the reducer output into the Distributed Cache will work only if that output is small.

local.cache.size parameter controls the size of the DistributedCache. By default, it’s set to 10 GB.
8 years ago
Sorry, I had not checked for some time.

Hive should be run as a server: when we run it as a server, it runs the Thrift service, which enables us to connect through Java.
8 years ago
The other aspect to note when deciding on compression is whether the compression technique is splittable or not.
8 years ago
One way to chain MR jobs is to use Spring Batch.
8 years ago