Java: Copy files from S3 to HDFS using distcp or s3distcp

Warning: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me), citing the original question: http://stackoverflow.com/questions/22678748/

Date: 2020-08-13 17:14:25 | Source: igfitidea

Copy files from S3 to HDFS using distcp or s3distcp

Tags: java, hadoop, amazon-web-services, amazon-s3

Asked by scalauser

I am trying to copy files from S3 to HDFS using the following command:

hadoop distcp s3n://bucketname/filename hdfs://namenodeip/directory

However, this does not work; I get the following error:

ERROR tools.DistCp: Exception encountered 
java.lang.IllegalArgumentException: Invalid hostname in URI

I have tried adding the S3 keys in the Hadoop conf.xml, but that does not work either. Please help me with the appropriate step-by-step procedure to copy files from S3 to HDFS.
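
For reference, the s3n keys are normally set as properties in core-site.xml. A minimal sketch, assuming the standard fs.s3n.* property names documented on the Hadoop AmazonS3 wiki (the key values below are placeholders):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>  <!-- placeholder -->
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>  <!-- placeholder -->
</property>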

Thanks in advance.

Accepted answer by scalauser

The command should be like this:

hadoop distcp s3n://bucketname/directoryname/test.csv /user/myuser/mydirectory/

This will copy the test.csv file from S3 into the HDFS directory /user/myuser/mydirectory/. Note that the destination is given as a plain HDFS path rather than a full hdfs://namenodeip URI, so the cluster's default file system is used. Here the S3 file system is used in native mode (s3n). More details can be found at http://wiki.apache.org/hadoop/AmazonS3
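
If the keys are not set in the configuration, they can also be passed as Hadoop properties on the command line. A sketch using the same standard fs.s3n.* property names (the key values are placeholders):

hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID \
  -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY \
  s3n://bucketname/directoryname/test.csv /user/myuser/mydirectory/

The result can then be verified with hadoop fs -ls /user/myuser/mydirectory/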

Answer by Sathish

This copies log files stored in an Amazon S3 bucket into HDFS. Here the --srcPattern option is used to limit the copied data to the daemon logs.

Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'

Windows users:

ruby elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'

Please check this link for more:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
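
Note that the elastic-mapreduce Ruby client shown above has since been deprecated. On EMR release 4.x and later, an equivalent s3-dist-cp step can be submitted through the AWS CLI instead; a sketch reusing the cluster id, paths, and pattern from the example above:

aws emr add-steps --cluster-id j-3GY8JC4179IOJ \
  --steps 'Type=CUSTOM_JAR,Name=S3DistCp,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest=hdfs:///output,--srcPattern=.*daemons.*-hadoop-.*]'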

Hope this helps!