Spark Scala list folders in directory

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33394884/


Spark Scala list folders in directory

scala hadoop apache-spark

Asked by AlexL

I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/

I tried it with:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)

val path = new Path("hdfs://sandbox.hortonworks.com/demo/")

val files = fs.listFiles(path, false)

But it does not seem to look in the Hadoop directory, as I cannot find my folders/files there.

I also tried with:

FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)

But this also does not help.

Do you have any other ideas?

PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me, as it does not seem to search the HDFS directory, only the local file system with the file:// scheme.

Answered by nil

We are using Hadoop 1.4, which does not have the listFiles method, so we use listStatus to get the directories. It does not have a recursive option, but the recursive lookup is easy to manage yourself.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x => println(x.getPath))
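
listStatus is not recursive, so if you need to walk subdirectories yourself, a minimal sketch of such a lookup could look like the following (the helper name is mine, not part of the original answer; on Hadoop 1.x use status.isDir instead of isDirectory):

import org.apache.hadoop.fs.{FileSystem, Path}

// Print every path under `path`, descending into subdirectories.
// Hypothetical helper, assuming the `fs` handle created above.
def printTree(fs: FileSystem, path: Path): Unit = {
  fs.listStatus(path).foreach { status =>
    println(status.getPath)
    if (status.isDirectory) // use status.isDir on Hadoop 1.x
      printTree(fs, status.getPath)
  }
}

printTree(fs, new Path(YOUR_HDFS_PATH))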

Answered by Ajay Ahuja

In Spark 2.0+,

import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// hdfsPath is your directory path as a String
fs.listStatus(new Path(hdfsPath)).filter(_.isDir).map(_.getPath).foreach(println)

Hope this is helpful.

Answered by user3190018

In Ajay Ahuja's answer, isDir is deprecated.

Use isDirectory instead... please see the complete example and output below.

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

object ListHDFSDirectories extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  val hdfspath = "." // your path here
  import org.apache.hadoop.fs.{FileSystem, Path}
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}

Result :

file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src

Answered by Lejla

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)

This will create an iterator it over org.apache.hadoop.fs.LocatedFileStatus entries, which are your subdirectories.

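Note that listLocatedStatus returns a Hadoop RemoteIterator rather than a Scala Iterator, so it has to be drained with an explicit loop; a minimal sketch using the it value from above:

// RemoteIterator is a Hadoop type, not scala.collection.Iterator,
// so hasNext/next must be called explicitly.
while (it.hasNext) {
  val status = it.next()
  if (status.isDirectory) println(status.getPath) // keep only subdirectories
}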

Answered by sun007

import java.net.URI

val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
  .globStatus(new org.apache.hadoop.fs.Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path:" + urlStatus.getPath())
}

Answered by Franzi

I was looking for the same thing, but for S3 instead of HDFS.

I solved it by creating the FileSystem with my S3 path, as below:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
  val hadoopConf = sparkContext.hadoopConfiguration
  val uri = new URI(path)

  FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
    _.getPath.toString
  }
}
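
A usage sketch (the bucket name and prefix are hypothetical, and the S3 connector still has to be on the classpath and configured with credentials):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical bucket and prefix, purely for illustration.
implicit val sc: SparkContext =
  new SparkContext(new SparkConf().setAppName("list-s3-folders").setMaster("local[*]"))

getSubFolders("s3://my-bucket/some/prefix/").foreach(println)

sc.stop()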

I know this question was about HDFS, but maybe others like me will come here looking for an S3 solution. Without specifying the URI when getting the FileSystem, it will look for HDFS paths and fail with:

java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020

Answered by Shan Hadoop Learner

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HDFSProgram extends App {
  val uri = new URI("hdfs://HOSTNAME:PORT")
  val fs = FileSystem.get(uri, new Configuration())
  val filePath = new Path("/user/hive/")
  val status = fs.listStatus(filePath)
  status.map(sts => sts.getPath).foreach(println)
}

This is sample code to get the list of HDFS files or folders present under /user/hive/.

Answered by Yogesh_JavaJ2EE

Azure Blob Storage is mapped to an HDFS location, so all the Hadoop operations work on it.

On the Azure Portal, go to the Storage Account and you will find the following details:

Azure 门户上,转到存储帐户,您将找到以下详细信息:

  • Storage account

  • Key -

  • Container -

  • Path pattern – /users/accountsdata/

  • Date format – yyyy-mm-dd

  • Event serialization format – json

  • Format – line separated

The Path Pattern here is the HDFS path. You can log in (e.g., via PuTTY) to the Hadoop edge node and run:

hadoop fs -ls /users/accountsdata 

The above command will list all the files. In Scala you can use:

import scala.sys.process._ 

val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!

Answered by Matthew Graves

Because you're using Scala, you may also be interested in the following:

import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!

This will, unfortunately, return the entire output of the command as a string, so parsing it down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands that are easy on the command line and you are unsure how to do them in Scala, just use the command line through scala.sys.process._. (Use a single ! if you just want to get the return code.)

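If you do stick with the shell output, here is a minimal parsing sketch. It assumes the usual hadoop fs -ls layout, where directory lines start with d and the path is the last whitespace-separated column; the leading "Found N items" line is filtered out along with the file entries.

import scala.sys.process._

val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://sandbox.hortonworks.com/demo/").!!

// Keep only directory entries (permission string starts with 'd')
// and take the last column, which is the full path.
val dirPaths = lsResult.split("\n")
  .filter(_.startsWith("d"))
  .map(_.split("\\s+").last)

dirPaths.foreach(println)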