Scala: Use Spark to list all files in a Hadoop HDFS directory?

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/23352311/

Use Spark to list all files in a Hadoop HDFS directory?

scala apache-spark hadoop

Asked by poliu2s

I want to loop through all text files in a Hadoop dir and count all the occurrences of the word "error". Is there a way to do a hadoop fs -ls /users/ubuntu/ to list all the files in a dir with the Apache Spark Scala API?

From the first example given, the Spark context seems to only access files individually, through something like:

val file = spark.textFile("hdfs://target_load_file.txt")

In my problem, I do not know how many files are in the HDFS folder beforehand, nor their names. I looked at the Spark context docs but couldn't find this kind of functionality.

Answered by Daniel Darabos

You can use a wildcard:

val errorCount = sc.textFile("hdfs://some-directory/*")
                   .flatMap(_.split(" ")).filter(_ == "error").count
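
If you also need the individual file names (closer to the hadoop fs -ls behaviour the question asks about), here is a minimal sketch, assuming a Spark shell where sc is available, that the files are small enough to be read whole, and that hdfs://some-directory is a placeholder path; the per-file count is an illustrative extra:

// wholeTextFiles yields (path, content) pairs, so the keys alone give a file listing.
val filesAndContents = sc.wholeTextFiles("hdfs://some-directory/*")
filesAndContents.keys.collect().foreach(println)   // the HDFS paths, like hadoop fs -ls

// Count "error" per file instead of across the whole directory.
filesAndContents.mapValues(_.split("\\s+").count(_ == "error"))
                .collect()
                .foreach { case (path, n) => println(s"$path: $n") }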

Answered by Animesh Raj Jha

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.Stack

// Walk the HDFS directory tree iteratively, collecting every file path.
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()
val files = scala.collection.mutable.ListBuffer.empty[String]

dirs.push("/user/username/")

while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}

files.foreach(println)
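
To tie the listing back to the original word count, the collected paths can be handed to textFile in one go, since Spark's textFile accepts a comma-separated list of paths. A sketch, assuming the spark-shell sc and the files list built above:

// Feed the discovered paths back into Spark and count "error" across them.
val errorCount = sc.textFile(files.mkString(","))
                   .flatMap(_.split(" "))
                   .filter(_ == "error")
                   .count
println(s"Total occurrences of 'error': $errorCount")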