Transfer large file >50Gb with DistCp from s3 to cluster

 
Juan Felipe Morales Castellanos
Greenhorn
Posts: 2
Hello guys

I have a problem using DistCp to transfer a large file from S3 to an HDFS cluster. Whenever I try to make the copy, I see CPU and memory usage on only one of the nodes, not on all of them, and I don't know whether this is the expected behaviour or a configuration problem. If I transfer multiple files, each node handles a single file at the same time, so I understood the transfer would be parallel, but for a single file it doesn't seem to be.

I am using the Hadoop 0.20.2 distribution on a two-node EC2 cluster. I was hoping one of you might know how DistCp works and which properties I could tweak to improve the transfer rate, which is currently about 0.7 GB per minute.
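For reference, this is the kind of invocation I'm running; the bucket name, credentials, and paths below are placeholders, not my real ones:

```shell
# Copy a single large object from an S3 bucket into HDFS with distcp.
# AWS_KEY, AWS_SECRET, bucket, key, and the namenode host are placeholders.
hadoop distcp s3n://AWS_KEY:AWS_SECRET@my-bucket/big-file.dat \
              hdfs://namenode:9000/data/big-file.dat
```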

Regards.
 
Greenhorn
Posts: 15
Java
distcp is for copying large amounts of data to and from Hadoop filesystems in parallel. I haven't heard of anyone using it to copy files from non-HDFS storage to HDFS. I am curious to know whether you have solved your problem.
 
Juan Felipe Morales Castellanos
Greenhorn
Posts: 2
Hello Srinivas

No, I didn't make myself clear. When I talked about transferring from S3, I didn't mean converting from an S3 format to HDFS; I was talking about a file stored in an S3 bucket being transferred to the HDFS cluster on the EC2 instances.

Finally I found that this can't be done the way I expected: distcp copies in parallel, but only across multiple files; for a single file, only one task is in charge of the transfer. I didn't know that. It seems Facebook managed to transfer a single large file in parallel by modifying a version of Hadoop (0.20.2, I think), but I haven't tried that modified version. In the end, to work around the issue I wrote a simple Map-Reduce job that allowed me to transfer the file in parallel.
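The idea behind the job can be sketched outside Hadoop: split the single file into fixed-size chunks (analogous to assigning one byte range per map task), copy the chunks concurrently, then reassemble and verify on the destination side. This is only an illustration, with local files standing in for S3 and HDFS; all paths are made up:

```shell
# Clean up any leftovers from a previous run.
rm -f /tmp/bigfile /tmp/chunk_* /tmp/dest_chunk_* /tmp/reassembled

# Create a sample "large" file (8 MB) standing in for the S3 object.
dd if=/dev/urandom of=/tmp/bigfile bs=1M count=8 2>/dev/null

# Split it into 2 MB chunks -- one chunk per "map task".
split -b 2M /tmp/bigfile /tmp/chunk_

# Copy each chunk in a background job, i.e. in parallel.
for c in /tmp/chunk_*; do
  cp "$c" "/tmp/dest_$(basename "$c")" &
done
wait

# Reassemble on the destination side and verify the copy is byte-identical.
cat /tmp/dest_chunk_* > /tmp/reassembled
cmp /tmp/bigfile /tmp/reassembled && echo "parallel copy OK"
```

The real job did the same thing with S3 byte-range reads on the input side and HDFS writes on the output side.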

Regards and thanks for the interest.
 