Scala: how to delete files in an HDFS directory after reading them with Scala

Declaration: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/45104284/

Date: 2020-10-22 09:21:20  Source: igfitidea

How do I delete files in an HDFS directory after reading them using Scala?

scala · hadoop · apache-spark · spark-streaming

Asked by user1125829

I use fileStream to read files in an HDFS directory from Spark (streaming context). If my Spark application shuts down and restarts after some time, I would like to read only the new files in the directory. I don't want to re-read old files in the directory that were already read and processed by Spark. I am trying to avoid duplicates here.


val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")

Any code snippets to help?


Answered by Ishan Kumar

You can use the FileSystem API:


import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val outPutPath = new Path("/abc")

if (fs.exists(outPutPath))
  fs.delete(outPutPath, true)
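
For the streaming case in the question, here is a minimal sketch of how the delete could be wired into the job. It assumes the "/home/File" input directory and the ssc/lines variables from the question; deleting everything left in the directory after each batch is an assumption of this sketch, and files that land while a batch is running could be removed before they are processed.

import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = new Path("/home/File")

lines.foreachRDD { rdd =>
  // ... process the batch here ...

  // After the batch is handled, remove the files currently in the directory.
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  if (fs.exists(inputDir)) {
    fs.listStatus(inputDir).foreach(status => fs.delete(status.getPath, false))
  }
}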

Answered by Tzach Zohar

fileStream already handles that for you - from its Scaladoc:


Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them using the given key-value types and input format.


This means that fileStream will only load new files (created after the streaming context was started); any files that already existed in the folder before you started your streaming application will be ignored.

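
Since the question is about surviving a restart, a common complement to this behavior is checkpointing. Below is a minimal sketch (not from the answer) of the getOrCreate pattern; the checkpoint directory, application name, and 30-second batch interval are placeholders, and recovery of already-seen files is subject to Spark's checkpoint-recovery semantics.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/stream-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("file-stream")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)

  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")
  lines.map(_._2.toString).print()   // replace with the real processing
  ssc
}

// On restart the context is rebuilt from the checkpoint, so the stream's
// record of recently seen files is recovered rather than starting from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()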