
Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/34038904/

Date: 2020-10-22 07:50:08  Source: igfitidea

Pattern matching - spark scala RDD

regex scala apache-spark pattern-matching rdd

Asked by user3560220

I am new to Spark and Scala, coming from an R background. After a few RDD transformations, I get an RDD of type

Description: RDD[(String, Int)]

Now I want to apply a regular expression to the String in each row of the RDD, extract substrings from it, and add just those substrings as new columns.

Input data:

BMW 1er Model,278
MINI Cooper Model,248

Output I am looking for:

Input                  | Brand | Series
BMW 1er Model,278      | BMW   | 1er
MINI Cooper Model,248  | MINI  | Cooper

where Brand and Series are substrings newly computed from the String in each row of the RDD.

What I have done so far:

I could achieve this for a single String using a regular expression, but I can't apply it to all lines.

 val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI

Then I can use


brandRegEx.findFirstIn("hello this mini is bmW testing")

But how can I use it for all the lines of the RDD, and apply different regular expressions to achieve the output above?

I read about this code snippet, but I am not sure how to put it all together.

val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r

def getBrand(Col4: String) : String = Col4 match {
    case brandRegEx(str)  =>  
    case _ => ""
    return 'substring
}

Any help would be appreciated!

Thanks


Answered by mattinbits

To apply your regex to each item in the RDD, you should use the RDD map function, which transforms each row in the RDD using some function (in this case, a partial function, used to extract the two parts of the tuple that makes up each row):

import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model",278),
    ("MINI Cooper Model",248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map{
    case (inString, inInt) =>
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
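      // findFirstIn returns an Option[String]; fall back to a marker when nothing matches
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")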
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}

Note that I think you have some problems in your regular expression, and you also need a regular expression for finding the series. This code outputs:


(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)

But if you correct your regexes for your needs, this is how you can apply them to each row.
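
As a rough sketch of what corrected regexes could look like: the pattern below is my own assumption, not something from the question. It treats BMW and MINI as the only brands and takes the token right after the brand as the series. The names BrandSeries and extract are made up for illustration.

```scala
import scala.util.matching.Regex

object BrandSeries {
  // Hypothetical pattern: (?i) makes it case-insensitive; group 1 is the brand,
  // group 2 is the next whitespace-delimited token, assumed here to be the series.
  // unanchored lets the pattern match anywhere in the string when used in a case clause.
  val brandSeries: Regex = """(?i)\b(BMW|MINI)\s+(\S+)""".r.unanchored

  // Returns (brand, series), or a pair of empty strings when no known brand is found.
  def extract(s: String): (String, String) = s match {
    case brandSeries(brand, series) => (brand, series)
    case _                          => ("", "")
  }
}
```

Inside the map above, this could be used as `val (brand, series) = BrandSeries.extract(inString)`. Note that a regex used in a case clause needs capture groups, one per variable being bound.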


Answered by Ishan Kumar

Hi, I was just looking for another question and came across this one. The above problem can be solved using ordinary transformations.

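// collection is assumed to be a Seq[(String, Int)], such as the input data in the question
val a = sc.parallelize(collection)
// split each string once and join its first two words
a.map { case (x, y) => val parts = x.split(" "); parts(0) + " " + parts(1) }.collect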
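
If the brand and series are wanted as separate columns rather than one joined string, the same split idea might be sketched as below. This assumes the brand is always the first word and the series the second; the names SplitColumns and brandAndSeries are hypothetical.

```scala
object SplitColumns {
  // Splits on runs of whitespace and keeps the first two tokens as
  // (brand, series); a missing token becomes an empty string.
  def brandAndSeries(s: String): (String, String) =
    s.trim.split("\\s+") match {
      case Array(brand, series, _*) => (brand, series)
      case Array(brand)             => (brand, "")
      case _                        => ("", "")
    }
}
```

With the RDD above this could be applied as `a.map { case (x, y) => val (b, s) = SplitColumns.brandAndSeries(x); (x, y, b, s) }`. Unlike indexing with (0) and (1) directly, the pattern match does not throw on strings with fewer than two words.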