java - recursively fetch file contents from subdirectories using sc.textFile

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/28817940/

Recursively fetch file contents from subdirectories using sc.textFile

java apache-spark

Asked by javadba

It seems that SparkContext textFile expects only files to be present in the given directory location - it does not either:

  • (a) recurse or
  • (b) even support directories (it tries to read directories as files)

Any suggestion on how to structure the recursion - potentially simpler than creating the recursive file list / descent logic manually?

Here is the use case: files under

/data/tables/my_table

I want to be able to read via an hdfs call all the files at all directory levels under that parent directory.

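For reference, here is a minimal sketch of the manual descent the question hopes to avoid: walk the tree with the Hadoop FileSystem API, collect the leaf files, and hand sc.textFile a comma-separated list of their paths. This assumes a spark-shell session (sc in scope); the input path and the helper name are illustrative only.

    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    // Recursively collect every plain file underneath `dir`.
    def listFilesRecursively(fs: FileSystem, dir: Path): Seq[Path] =
      fs.listStatus(dir).toSeq.flatMap { status =>
        if (status.isDirectory) listFilesRecursively(fs, status.getPath) // descend into subdirectory
        else Seq(status.getPath)                                         // keep leaf file
      }

    val fs    = FileSystem.get(sc.hadoopConfiguration)
    val files = listFilesRecursively(fs, new Path("/data/tables/my_table"))
    // sc.textFile accepts a comma-separated list of paths
    val lines = sc.textFile(files.map(_.toString).mkString(","))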

UPDATE

sc.textFile() invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat. The logic to do the recursive directory reading does exist inside - i.e. it first detects whether an entry is a directory and, if so, descends into it:

<!-- language: java -->

    for (FileStatus globStat: matches) {
      if (globStat.isDir()) {
        for (FileStatus stat: fs.listStatus(globStat.getPath(), inputFilter)) {
          result.add(stat);
        }
      } else {
        result.add(globStat);
      }
    }

However, when invoking sc.textFile there are errors on directory entries: "not a file". This behavior is confusing, given that the proper support appears to be in place for handling directories.

Answered by javadba

I was looking at an old version of FileInputFormat..

BEFORE setting the recursive config mapreduce.input.fileinputformat.input.dir.recursive:

scala> sc.textFile("dev/*").count
     java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build

The default is null/not set which is evaluated as "false":

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
res1: String = null

AFTER:

Now set the value:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

Now retry the recursive operation:

scala> sc.textFile("dev/*/*").count
..
res5: Long = 3481

So it works.

Update: added /* for full recursion, per the comment by @Ben.

Answered by Paul

I have found that these parameters must be set in the following way:

.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
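
These settings can also be baked into the SparkConf before the context is created; the spark.hadoop. prefix is how Spark routes the second entry into the underlying Hadoop Configuration. A minimal sketch of where the .set(...) calls would live, assuming a standalone app; the app name, master and input path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("recursive-read")   // placeholder
      .setMaster("local[*]")          // placeholder
      .set("spark.hive.mapred.supports.subdirectories", "true")
      .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")

    val sc = new SparkContext(conf)
    // With the recursive flag in place, nested subdirectories under the glob are read too.
    sc.textFile("/data/tables/my_table/*").count()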