
Transfer large file (>50 GB) with DistCp from S3 to cluster

 
Juan Felipe Morales Castellanos
Greenhorn
Posts: 2
Hello guys

I have a problem using DistCp to transfer a large file from S3 to an HDFS cluster. Whenever I try to make the copy, I only see CPU and memory usage on one of the nodes, not on all of them, and I don't know whether this is the expected behaviour or a configuration problem. If I transfer multiple files instead, each node handles a single file at the same time. I understood the transfer would happen in parallel, but it doesn't seem to.

I am using the Hadoop 0.20.2 distribution on a cluster of two EC2 instances. I was hoping one of you might have an idea of how DistCp works and which properties I could tweak to improve the transfer rate, which is currently about 0.7 GB per minute.
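
For reference, the command I run is roughly the following (the bucket name, paths, and NameNode address are placeholders, and the AWS credentials are configured in core-site.xml):

  hadoop distcp -m 20 s3n://mybucket/bigfile hdfs://namenode:9000/data/bigfile

I know the -m option sets the maximum number of simultaneous copies, but it doesn't seem to change anything when there is only a single file.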

Regards.
 
Srinivas Mupparapu
Greenhorn
Posts: 14
DistCp is for copying large amounts of data to and from Hadoop filesystems in parallel. I haven't heard of anyone using it to copy files from a non-HDFS filesystem to HDFS. I am curious to know whether you have solved your problem.
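
The typical use case, from the Hadoop documentation, is an inter-cluster copy like this:

  hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo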
 
Juan Felipe Morales Castellanos
Greenhorn
Posts: 2
Hello Srinivas

Sorry, I didn't make myself clear. When I talked about transferring from S3, I didn't mean converting from an S3 format to HDFS; I was talking about a file stored in an S3 bucket being transferred to the HDFS cluster running on the EC2 instances.

In the end I found that this can't be done the way I expected: DistCp does make copies in parallel, but only across multiple files; for a single file there is only one task in charge of the transfer, which I didn't know. It seems that Facebook managed to transfer a single large file in parallel by modifying one version of Hadoop (0.20.2, I think), but I haven't tried that modified version. To fix the issue, I ended up writing a simple MapReduce job that let me transfer the file in parallel, sketched below.
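
In case it is useful to someone, here is a minimal sketch of the idea (not my exact job; the paths, URIs, and argument order are placeholders): a map-only job where each map task copies one byte range of the source file into its own part file in HDFS. The input is a small text file that lists one "offset<TAB>length" range per line, and NLineInputFormat hands one line to each map task.

  import java.io.IOException;
  import java.io.OutputStream;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.lib.NLineInputFormat;
  import org.apache.hadoop.mapred.lib.NullOutputFormat;

  public class ParallelRangeCopy {

    public static class RangeCopyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private JobConf conf;

      @Override
      public void configure(JobConf job) { this.conf = job; }

      public void map(LongWritable key, Text value,
                      OutputCollector<NullWritable, NullWritable> out,
                      Reporter reporter) throws IOException {
        // Each input line names one byte range: "offset<TAB>length".
        String[] parts = value.toString().split("\t");
        long offset = Long.parseLong(parts[0]);
        long length = Long.parseLong(parts[1]);

        Path src = new Path(conf.get("copy.src"));   // e.g. s3n://mybucket/bigfile
        Path dst = new Path(conf.get("copy.dst.dir"), "part-" + offset);

        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);

        FSDataInputStream in = srcFs.open(src);
        OutputStream os = dstFs.create(dst, true);
        try {
          in.seek(offset);                           // jump to this task's slice
          byte[] buf = new byte[64 * 1024];
          long remaining = length;
          int n;
          while (remaining > 0
              && (n = in.read(buf, 0, (int) Math.min(buf.length, remaining))) > 0) {
            os.write(buf, 0, n);
            remaining -= n;
            reporter.progress();                     // keep the task from timing out
          }
        } finally {
          in.close();
          os.close();
        }
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ParallelRangeCopy.class);
      conf.setJobName("parallel-range-copy");
      conf.set("copy.src", args[0]);        // source file URI
      conf.set("copy.dst.dir", args[2]);    // HDFS directory for the part files
      conf.setMapperClass(RangeCopyMapper.class);
      conf.setNumReduceTasks(0);            // map-only job
      conf.setOutputKeyClass(NullWritable.class);
      conf.setOutputValueClass(NullWritable.class);
      conf.setInputFormat(NLineInputFormat.class);            // one ranges line per map
      FileInputFormat.setInputPaths(conf, new Path(args[1])); // the ranges file
      conf.setOutputFormat(NullOutputFormat.class);           // parts are written directly
      JobClient.runJob(conf);
    }
  }

The ranges file is generated up front by dividing the total file length into as many pieces as you want map tasks, and at the end you still have to concatenate the part files in order to rebuild the original file.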

Regards, and thanks for your interest.
 