
Hadoop Text Object to String Object Problems

 
akshat sehgal
Greenhorn
Posts: 7
My Hadoop mapper emits a CSV file line by line as Text objects, as follows:



Within my reducer, I need to split each CSV line and use the resulting strings for further processing. Here is the reducer code:



PROBLEM:

The elems string array ends up with a length of 1 and contains only the first element of each line in my CSV.

For example, if we have the below CSV file:
1,2,3
4,5,6
7,8,9

The elems array is populated with 1, 4 and 7 in successive reducer calls.

EXPECTED:

The elems array should be populated with {1,2,3}, {4,5,6} and {7,8,9} in successive reducer calls.

PLEASE NOTE:
I tried to debug the issue. After converting the Text object to a String as in the code snippet above, I observed the following:

(a) elems.length is 1
(b) elems[0] is 1 (for the above example)

Kindly help. Thanks.

 
Knute Snortum
Sheriff
Posts: 4274
127
Chrome Eclipse IDE Java Postgres Database VI Editor
I don't know Hadoop but I think your code has some problems.  In reduce(...) you have a variable x of type double[] (line 9), but in line 12 you use Integer.parseInt(...).  If you get a double, the error branch will be taken.
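To illustrate the point about Integer.parseInt(...), here is a minimal sketch. The class and method names (ParseDemo, parsesAsInt, parsesAsDouble) are invented for this example and are not taken from the original reducer.

```java
class ParseDemo {
    // Returns true if s is accepted by Integer.parseInt.
    static boolean parsesAsInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Returns true if s is accepted by Double.parseDouble.
    static boolean parsesAsDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsInt("42"));     // true
        System.out.println(parsesAsInt("4.2"));    // false: not an integer literal
        System.out.println(parsesAsDouble("4.2")); // true
    }
}
```

If the values are stored in a double[], Double.parseDouble(...) is probably the better match for the array type.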
 
Knute Snortum
Sheriff
Posts: 4274
This code is problematic:

It's odd to have the try block surround a for loop and the continue in catch block is ambiguous (at first glance).  It's not immediately clear if the intent is to continue the for loop or the surrounding while loop.
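One way to make the intent explicit is to move the try block inside the for loop, so that the continue can only refer to that loop. The sketch below assumes that a bad element should be skipped on its own rather than discarding the whole line; the names SafeParse and parseAll are hypothetical, and the surrounding while loop from the original code is not reproduced.

```java
import java.util.ArrayList;
import java.util.List;

class SafeParse {
    // Parses each element separately; an unparseable element is
    // skipped without abandoning the rest of the line.
    static List<Double> parseAll(String[] elems) {
        List<Double> out = new ArrayList<>();
        for (String e : elems) {
            try {
                out.add(Double.parseDouble(e.trim()));
            } catch (NumberFormatException ex) {
                // The try is inside the for loop, so this 'continue'
                // unambiguously moves on to the next element.
                continue;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(new String[] {"1", "x", "2.5"})); // [1.0, 2.5]
    }
}
```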
 
akshat sehgal
Greenhorn
Posts: 7
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Yeah, but that doesn't solve the problem with the elems[] array.
 
Knute Snortum
Sheriff
Posts: 4274
What does iter.next().toString() look like?
 
akshat sehgal
Greenhorn
Posts: 7
Suppose the Text object (the first row of the CSV in my file) is 211,222,455

iter.next().toString() comes out as 211,222,455

I also checked its length (iter.next().toString().length())

This comes out as 1!
 
Knute Snortum
Sheriff
Posts: 4274
I have a guess as to what's wrong, but let's do this:

If you do this:

...then the length that you're looking at is that of the next line!  The next line may be of length one if the text delimiter is \r\n (Windows) and values was built with \n delimiters (Unix).
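The effect can be reproduced outside Hadoop. This is only a sketch: a plain Iterator<String> stands in for Hadoop's Iterator<Text>, and the class name and sample data are invented for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

class DoubleNextDemo {
    // Mimics the suspected bug: the value and the length come from two
    // separate next() calls, so they describe two different elements.
    static String mismatched(Iterator<String> iter) {
        String value = iter.next();
        int length = iter.next().length(); // advances the iterator a second time
        return value + " -> " + length;
    }

    public static void main(String[] args) {
        Iterator<String> iter = Arrays.asList("211,222,455", "7").iterator();
        System.out.println(mismatched(iter)); // 211,222,455 -> 1
    }
}
```

The printed length belongs to the second element ("7"), not to the value printed next to it.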
 
akshat sehgal
Greenhorn
Posts: 7
Sorry for the late reply; I had a problem with my Hadoop configuration.

I think you are misunderstanding me. Have a look at this:
https://github.com/akshatsehgal01/HCDistributed/blob/master/src/main/java/com/apporiented/algorithm/clustering/HCDistributed.java

Please check the reduce function.

Within my reduce function, I am converting the 'Text' type values to a 'String' array.

Suppose the values contain 222,223,224,227

After debugging, I found that elems[] contains only one element, i.e. 222.

On Hadoop, I can't actually do a System.out.println, as the computation is happening on a (UNIX-based) cluster.

But here is how I debugged:

(a) To check the string output:


Within the reduce function, I wrote the following:



Output, as (key, value) pairs; here the value is the string:
1 42,9
1 40,9
1 37,10
1 34,11

(b) To check the length:

Within the reduce function, I wrote the following:



Output, as (key, value) pairs; here the value is the length:
1 1 --------------> This length should be 4 (for 42,9)
1 1
1 1
1 1

As you can see in the above outputs, the full string comes through; however, Hadoop still reports the length as 1.
 
akshat sehgal
Greenhorn
Posts: 7
Correction:

For part (a) of the debugging, str is assigned as follows:
str = iter.next().toString();
 
Knute Snortum
Sheriff
Posts: 4274
I'd like you to combine debugging (a) and (b) together, something like this:

What doesn't matter is the way you send the debug data to the console or a log file.  What does matter is that you put the value of iter.next().toString() into a String variable and log both the value of the String and its length.
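A small sketch of that idea, assuming nothing about how the result is written out: the helper name describe and the bracket format are invented here, and in the reducer the string would come from a single iter.next().toString() call.

```java
class DebugLine {
    // Formats a value together with its own length. The brackets make
    // stray whitespace or a trailing '\r' easier to spot in the output.
    static String describe(String s) {
        return "[" + s + "] length=" + s.length();
    }

    public static void main(String[] args) {
        // In the reducer this would be: String str = iter.next().toString();
        String str = "42,9";
        System.out.println(describe(str)); // [42,9] length=4
    }
}
```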
 
akshat sehgal
Greenhorn
Posts: 7
Hello,

Here is the debug code:


Here is the output:


So it works now, what is the catch here?
 
akshat sehgal
Greenhorn
Posts: 7
Correction:

The output for {42,9} is as follows:


 
Knute Snortum
Sheriff
Posts: 4274
So it works now, what is the catch here? 

I think the problem was that you were calling iter.next() more than once in the while loop.
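For reference, a sketch of a loop that calls next() exactly once per pass and then splits the stored string. A plain Iterator<String> stands in for Hadoop's Iterator<Text> here, and the names SplitOnce and splitAll are made up; in the actual reducer the line would come from iter.next().toString().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SplitOnce {
    // Reads each value exactly once, keeps it in a local variable,
    // and splits that variable on commas.
    static List<String[]> splitAll(Iterator<String> iter) {
        List<String[]> rows = new ArrayList<>();
        while (iter.hasNext()) {
            // The only next() call in the loop body; trim() also drops a stray '\r'.
            String line = iter.next().trim();
            rows.add(line.split(","));
        }
        return rows;
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList("1,2,3", "4,5,6", "7,8,9").iterator();
        for (String[] elems : splitAll(values)) {
            System.out.println(Arrays.toString(elems));
        }
        // [1, 2, 3]
        // [4, 5, 6]
        // [7, 8, 9]
    }
}
```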
 