Scala: Use Spark to list all files in a Hadoop HDFS directory?

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/23352311/

Use Spark to list all files in a Hadoop HDFS directory?

scala apache-spark hadoop

Asked by poliu2s

I want to loop through all text files in a Hadoop dir and count all the occurrences of the word "error". Is there a way to do a hadoop fs -ls /users/ubuntu/ to list all the files in a dir with the Apache Spark Scala API?

From the first example given, the Spark context seems to only access files individually, through something like:

val file = spark.textFile("hdfs://target_load_file.txt")

In my problem, I do not know how many files are in the HDFS folder beforehand, nor their names. I looked at the Spark context docs but couldn't find this kind of functionality.

Answered by Daniel Darabos

You can use a wildcard:

val errorCount = sc.textFile("hdfs://some-directory/*")
                   .flatMap(_.split(" ")).filter(_ == "error").count
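
If you also need the individual file names (closer to the hadoop fs -ls behaviour the question asks about), here is a minimal sketch, assuming a Spark shell where sc is available, that the files are small enough to be read whole, and that hdfs://some-directory is a placeholder path; the per-file count is an illustrative extra:

// wholeTextFiles yields (path, content) pairs, so the keys alone give a file listing.
val filesAndContents = sc.wholeTextFiles("hdfs://some-directory/*")
filesAndContents.keys.collect().foreach(println)   // the HDFS paths, like hadoop fs -ls

// Count "error" per file instead of across the whole directory.
filesAndContents.mapValues(_.split("\\s+").count(_ == "error"))
                .collect()
                .foreach { case (path, n) => println(s"$path: $n") }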

Answered by Animesh Raj Jha

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.Stack

// Walk the HDFS directory tree iteratively, collecting every file path.
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()
val files = scala.collection.mutable.ListBuffer.empty[String]

dirs.push("/user/username/")

while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}

files.foreach(println)
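
To tie the listing back to the original word count, the collected paths can be handed to textFile in one go, since Spark's textFile accepts a comma-separated list of paths. A sketch, assuming the spark-shell sc and the files list built above:

// Feed the discovered paths back into Spark and count "error" across them.
val errorCount = sc.textFile(files.mkString(","))
                   .flatMap(_.split(" "))
                   .filter(_ == "error")
                   .count
println(s"Total occurrences of 'error': $errorCount")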