Note: this page is a translated reproduction of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/31782763/

Time: 2020-10-22 07:25:40  Source: igfitidea

How to use regex to include/exclude some input files in sc.textFile?

Tags: scala, apache-spark

Asked by eboni

I am trying to select input files by the date in their names when loading them with Apache Spark's file-to-RDD function, sc.textFile().

I have attempted to do the following:

sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")

This should match the following:

/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz

Any idea how to achieve this?

Answered by nhahtdh

Looking at the accepted answer, sc.textFile() seems to accept some form of glob syntax. The answer also reveals that the API exposes Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.

The glob syntax includes:

  • * (matches zero or more characters)
  • ? (matches a single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)
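
To get a feel for these constructs without a cluster, here is a minimal sketch that tries a few of them against one of the file names from the question. It assumes java.nio's glob matcher as a stand-in: that is a separate implementation from Hadoop's, which happens to share the *, ?, [a-b] and {a,b} forms (negation and escaping are left out because the two may differ there). The object name GlobSyntaxDemo and the helper matches are made up for illustration.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobSyntaxDemo {
  // Assumption: java.nio glob rules stand in for Hadoop's glob rules here.
  // Returns true when the whole file name matches the glob.
  def matches(glob: String, name: String): Boolean =
    FileSystems.getDefault.getPathMatcher("glob:" + glob).matches(Paths.get(name))

  // File name taken from the question.
  val name = "201507270010033.gz"

  def main(args: Array[String]): Unit = {
    println(matches("201507*", name))            // * : any run of characters
    println(matches("2015072?0010033.gz", name)) // ? : exactly one character
    println(matches("2015072[7-9]*", name))      // [a-b] : character range
    println(matches("2015072[89]*", name))       // [ab] : class that excludes '7'
    println(matches("201507{27,30}*", name))     // {a,b} : alternation
  }
}
```

Only the fourth check should come back false, since the day digit in the sample name is 7 and the class [89] does not contain it.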

Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
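
When there are more than a couple of patterns, the comma-separated string can be assembled rather than typed by hand. The following is only a sketch: the object name OrderPaths and its fields are hypothetical, and the final call assumes a live SparkContext named sc, so it is left as a comment.

```scala
object OrderPaths {
  // Directory from the question.
  val baseDir = "/user/Orders"

  // One glob per group of days: 27-29 July and 30-31 July 2015.
  val dayGlobs = Seq("2015072[7-9]*", "2015073[0-1]*")

  // Prefix each glob with the directory and join the results with commas.
  val inputPaths: String = dayGlobs.map(g => s"$baseDir/$g").mkString(",")

  // With a SparkContext in scope (assumed, not created here):
  //   val orders = sc.textFile(inputPaths)
}
```

Here OrderPaths.inputPaths evaluates to the same string as the one passed to sc.textFile above.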

At first it was not clear whether the alternation syntax could be used here, since the comma also delimits the list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
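
As a local sanity check, the alternation glob can be compared with the regular expression the question was reaching for. This sketch again assumes java.nio's glob matcher as an approximation of Hadoop's. The first two file names come from the question. The third is a hypothetical name added to show a non-match, and the object name AlternationCheck is made up.

```scala
import java.nio.file.{FileSystems, Paths}
import scala.util.matching.Regex

object AlternationCheck {
  // Two names from the question plus one hypothetical file dated 26 July.
  val names = Seq("201507270010033.gz", "201507300060052.gz", "201507260010001.gz")

  // File-name part of the glob from the answer, and the regex with the same intent.
  val glob = "201507{2[7-9],3[0-1]}*"
  val regex: Regex = "201507(2[7-9]|3[0-1]).*".r

  // Assumption: java.nio glob rules stand in for Hadoop's glob rules here.
  private val matcher = FileSystems.getDefault.getPathMatcher("glob:" + glob)

  // Names selected by the glob and by the regex (whole-name matches only).
  val byGlob: Seq[String] = names.filter(n => matcher.matches(Paths.get(n)))
  val byRegex: Seq[String] = names.filter(n => regex.pattern.matcher(n).matches())
}
```

Both filters keep the two files from the question and drop the 26 July one, which suggests the glob expresses the same selection as the original regex attempt.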