Scala: how to delete files in an HDFS directory after reading them with Scala

Declaration: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/45104284/

Date: 2020-10-22 09:21:20  Source: igfitidea

How do I delete files in an HDFS directory after reading them using Scala?

scala · hadoop · apache-spark · spark-streaming

Asked by user1125829

I use fileStream to read files in an HDFS directory from Spark (streaming context). If my Spark application shuts down and restarts after some time, I would like to read only the new files in the directory. I don't want to re-read old files in the directory that were already read and processed by Spark. I am trying to avoid duplicates here.


val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")

Any code snippets to help?


Answered by Ishan Kumar

You can use the FileSystem API:


import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val outPutPath = new Path("/abc")

if (fs.exists(outPutPath))
  fs.delete(outPutPath, true)
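
For the streaming case in the question, here is a minimal sketch of how the delete could be wired into the job. It assumes the "/home/File" input directory and the ssc/lines variables from the question; deleting everything left in the directory after each batch is an assumption of this sketch, and files that land while a batch is running could be removed before they are processed.

import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = new Path("/home/File")

lines.foreachRDD { rdd =>
  // ... process the batch here ...

  // After the batch is handled, remove the files currently in the directory.
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  if (fs.exists(inputDir)) {
    fs.listStatus(inputDir).foreach(status => fs.delete(status.getPath, false))
  }
}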

Answered by Tzach Zohar

fileStream already handles that for you - from its Scaladoc:


Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them using the given key-value types and input format.


This means that fileStream will only load new files (created after the streaming context was started); any files that already existed in the folder before you started your streaming application will be ignored.

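
Since the question is about surviving a restart, a common complement to this behavior is checkpointing. Below is a minimal sketch (not from the answer) of the getOrCreate pattern; the checkpoint directory, application name, and 30-second batch interval are placeholders, and recovery of already-seen files is subject to Spark's checkpoint-recovery semantics.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/stream-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("file-stream")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)

  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")
  lines.map(_._2.toString).print()   // replace with the real processing
  ssc
}

// On restart the context is rebuilt from the checkpoint, so the stream's
// record of recently seen files is recovered rather than starting from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()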