Spark Scala list folders in directory

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/33394884/


Spark Scala list folders in directory

scala hadoop apache-spark

Asked by AlexL

I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/

I tried it with:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)

val path = new Path("hdfs://sandbox.hortonworks.com/demo/")

val files = fs.listFiles(path, false)

But it does not seem to look in the Hadoop directory, as I cannot find my folders/files there.

I also tried with:

FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)

But this also does not help.

Do you have any other ideas?

PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me, as it does not seem to search the HDFS directory, only the local file system with the file:// scheme.

Answered by nil

We are using Hadoop 1.4, which does not have the listFiles method, so we use listStatus to get the directories. It does not have a recursive option, but the recursive lookup is easy to manage yourself.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x => println(x.getPath))
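
listStatus is not recursive, so if you need to walk subdirectories yourself, a minimal sketch of such a lookup could look like the following (the helper name is mine, not part of the original answer; on Hadoop 1.x use status.isDir instead of isDirectory):

import org.apache.hadoop.fs.{FileSystem, Path}

// Print every path under `path`, descending into subdirectories.
// Hypothetical helper, assuming the `fs` handle created above.
def printTree(fs: FileSystem, path: Path): Unit = {
  fs.listStatus(path).foreach { status =>
    println(status.getPath)
    if (status.isDirectory) // use status.isDir on Hadoop 1.x
      printTree(fs, status.getPath)
  }
}

printTree(fs, new Path(YOUR_HDFS_PATH))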

Answered by Ajay Ahuja

In Spark 2.0+,

import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// hdfsPath is your directory path as a String
fs.listStatus(new Path(hdfsPath)).filter(_.isDir).map(_.getPath).foreach(println)

Hope this is helpful.

Answered by user3190018

In Ajay Ahuja's answer, isDir is deprecated.

Use isDirectory instead... please see the complete example and output below.

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession

object ListHDFSDirectories extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  val hdfspath = "." // your path here
  import org.apache.hadoop.fs.{FileSystem, Path}
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(s"${hdfspath}")).filter(_.isDirectory).map(_.getPath).foreach(println)
}

Result :

file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src

Answered by Lejla

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)

This will create an iterator it over org.apache.hadoop.fs.LocatedFileStatus entries, which are your subdirectories.

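Note that listLocatedStatus returns a Hadoop RemoteIterator rather than a Scala Iterator, so it has to be drained with an explicit loop; a minimal sketch using the it value from above:

// RemoteIterator is a Hadoop type, not scala.collection.Iterator,
// so hasNext/next must be called explicitly.
while (it.hasNext) {
  val status = it.next()
  if (status.isDirectory) println(status.getPath) // keep only subdirectories
}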

Answered by sun007

import java.net.URI

val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration)
  .globStatus(new org.apache.hadoop.fs.Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path:" + urlStatus.getPath())
}

Answered by Franzi

I was looking for the same thing, but for S3 instead of HDFS.

I solved it by creating the FileSystem with my S3 path, as below:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def getSubFolders(path: String)(implicit sparkContext: SparkContext): Seq[String] = {
  val hadoopConf = sparkContext.hadoopConfiguration
  val uri = new URI(path)

  FileSystem.get(uri, hadoopConf).listStatus(new Path(path)).map {
    _.getPath.toString
  }
}
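
A usage sketch (the bucket name and prefix are hypothetical, and the S3 connector still has to be on the classpath and configured with credentials):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical bucket and prefix, purely for illustration.
implicit val sc: SparkContext =
  new SparkContext(new SparkConf().setAppName("list-s3-folders").setMaster("local[*]"))

getSubFolders("s3://my-bucket/some/prefix/").foreach(println)

sc.stop()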

I know this question was about HDFS, but maybe others like me will come here looking for an S3 solution. Without specifying the URI when getting the FileSystem, it will look for HDFS paths and fail with:

java.lang.IllegalArgumentException: Wrong FS: s3://<bucket>/dummy_path
expected: hdfs://<ip-machine>.eu-west-1.compute.internal:8020

Answered by Shan Hadoop Learner

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HDFSProgram extends App {
  val uri = new URI("hdfs://HOSTNAME:PORT")
  val fs = FileSystem.get(uri, new Configuration())
  val filePath = new Path("/user/hive/")
  val status = fs.listStatus(filePath)
  status.map(sts => sts.getPath).foreach(println)
}

This is sample code to get the list of HDFS files or folders present under /user/hive/.

Answered by Yogesh_JavaJ2EE

Azure Blob Storage is mapped to an HDFS location, so all the Hadoop operations work on it.

On the Azure Portal, go to the Storage Account and you will find the following details:

Azure 门户上,转到存储帐户,您将找到以下详细信息:

  • Storage account

  • Key -

  • Container -

  • Path pattern – /users/accountsdata/

  • Date format – yyyy-mm-dd

  • Event serialization format – json

  • Format – line separated

The Path Pattern here is the HDFS path. You can log in (e.g., via PuTTY) to the Hadoop edge node and run:

hadoop fs -ls /users/accountsdata 

The above command will list all the files. In Scala you can use:

import scala.sys.process._ 

val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!

Answered by Matthew Graves

Because you're using Scala, you may also be interested in the following:

import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!

This will, unfortunately, return the entire output of the command as a string, so parsing it down to just the filenames requires some effort. (Use fs.listStatus instead.) But if you find yourself needing to run other commands that are easy on the command line and you are unsure how to do them in Scala, just use the command line through scala.sys.process._. (Use a single ! if you just want to get the return code.)

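If you do stick with the shell output, here is a minimal parsing sketch. It assumes the usual hadoop fs -ls layout, where directory lines start with d and the path is the last whitespace-separated column; the leading "Found N items" line is filtered out along with the file entries.

import scala.sys.process._

val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://sandbox.hortonworks.com/demo/").!!

// Keep only directory entries (permission string starts with 'd')
// and take the last column, which is the full path.
val dirPaths = lsResult.split("\n")
  .filter(_.startsWith("d"))
  .map(_.split("\\s+").last)

dirPaths.foreach(println)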