Scala: read files recursively from sub directories with Spark from S3 or the local filesystem

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/27914145/

read files recursively from sub directories with spark from s3 or local filesystem

Tags: scala, hadoop, apache-spark

Asked by venuktan

I am trying to read files from a directory which contains many sub directories. The data is in S3 and I am trying to do this:

val rdd = sc.newAPIHadoopFile(data_loc,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],  // key class expected by TextInputFormat
    classOf[org.apache.hadoop.io.Text])          // value class expected by TextInputFormat

this does not seem to work.

Appreciate the help

Answered by venuktan

Yes, it works; it took a while to get the individual blocks/splits though. Basically, use a glob that reaches the specific directory in every sub directory: s3n://bucket/root_dir/*/data/*/*/*

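For reference, a minimal sketch of how that glob can be plugged into the call from the question (the bucket and directory names are the placeholders from the answer above; adjust the number of * levels to match how deep the files sit):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Placeholder glob from the answer: one "*" per directory level to descend.
val data_loc = "s3n://bucket/root_dir/*/data/*/*/*"

val rdd = sc.newAPIHadoopFile(data_loc,
    classOf[TextInputFormat],   // input format: splits the files into lines
    classOf[LongWritable],      // key: byte offset of each line
    classOf[Text])              // value: the line itself

// Usually only the line contents are needed.
val lines = rdd.map { case (_, line) => line.toString }

The same glob also works with sc.textFile(data_loc) when plain lines are all that is required.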

Answered by venuktan

OK, try this:

hadoop fs -lsr
drwxr-xr-x   - venuktangirala supergroup          0 2014-02-11 16:30 /user/venuktangirala/-p
drwxr-xr-x   - venuktangirala supergroup          0 2014-04-15 17:00 /user/venuktangirala/.Trash
drwx------   - venuktangirala supergroup          0 2015-02-11 16:16 /user/venuktangirala/.staging
-rw-rw-rw-   1 venuktangirala supergroup      19823 2013-10-24 14:34 /user/venuktangirala/data
drwxr-xr-x   - venuktangirala supergroup          0 2014-02-12 22:50 /user/venuktangirala/pandora

-lsr lists recursively; then parse the entries that do not start with "d" (those are the files rather than the directories).

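A sketch of the same idea done from Scala instead of parsing the shell output, using the Hadoop FileSystem API's recursive listing (the root path is a placeholder; an s3n:// path can be used the same way):

import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

val root = new Path("/user/venuktangirala")          // placeholder root directory
val fs = root.getFileSystem(sc.hadoopConfiguration)

// listFiles(path, recursive = true) walks every sub directory and yields only
// files -- the programmatic equivalent of keeping the non-"d" lines from -lsr.
val files = ArrayBuffer[String]()
val it = fs.listFiles(root, true)
while (it.hasNext) {
  files += it.next().getPath.toString
}

// sc.textFile accepts a comma-separated list of paths.
val rdd = sc.textFile(files.mkString(","))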