Pattern matching - spark scala RDD

Question

Asked by user3560220

I am new to Spark and Scala, coming from an R background. After a few transformations, I get an RDD of type

Description: RDD[(String, Int)]

Now I want to apply a regular expression to the String part of the RDD, extract substrings from it, and add each substring as a new column.

Input data:

BMW 1er Model,278
MINI Cooper Model,248

Output I am looking for:

Input                  | Brand | Series
BMW 1er Model,278      | BMW   | 1er
MINI Cooper Model,248  | MINI  | Cooper

where Brand and Series are newly computed substrings of the String in each row of the RDD.

What I have done so far:

I can achieve this for a single String using a regular expression, but I cannot apply it to all lines.

 val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI

Then I can use

brandRegEx.findFirstIn("hello this mini is bmW testing")

But how can I use it on all the lines of the RDD, and apply different regular expressions, to get the output shown above?

I came across this code snippet, but I am not sure how to put it all together.

val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r

def getBrand(Col4: String) : String = Col4 match {
    case brandRegEx(str)  =>  
    case _ => ""
    return 'substring
}

Any help would be appreciated!

Thanks

Answer 1

Answered by mattinbits

To apply your regex to each item in the RDD, use the RDD map function, which transforms each row using a function you supply (in this case a partial function, so that the two parts of the tuple making up each row can be extracted by pattern matching):

import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model",278),
    ("MINI Cooper Model",248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map{
    case (inString, inInt) =>
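      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
      // findFirstIn returns an Option[String]; fall back to a marker when nothing matches
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")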
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}

Note that I think there are some problems in your regular expression, and you also need a regular expression for finding the series. This code outputs:

(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)

But if you correct your regexes for your needs, this is how you can apply them to each row.
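
One possible way to correct the regex and also fill in the missing series part is a single pattern with two capture groups, used as an extractor in a match expression. This is only a minimal sketch, assuming the brand is always the first word (BMW or MINI, in any case) and the series is the second word. The names BrandSeries, brandSeries and extract are hypothetical, and a plain Seq stands in for the RDD so the snippet runs without Spark.

```scala
import scala.util.matching.Regex

object BrandSeries {
  // Assumption: brand is the first word (BMW or MINI, any case),
  // series is the second whitespace-separated word.
  // (?i) makes the whole pattern case-insensitive.
  val brandSeries: Regex = """(?i)^\s*(bmw|mini)\s+(\S+).*$""".r

  // Returns (brand, series), or a marker pair when the pattern does not match.
  // A Regex used in a pattern match must match the entire input string.
  def extract(s: String): (String, String) = s match {
    case brandSeries(brand, series) => (brand, series)
    case _                          => ("NOT FOUND", "NOT FOUND")
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("BMW 1er Model", 278), ("MINI Cooper Model", 248))
    data.map { case (inString, inInt) =>
      val (brand, series) = extract(inString)
      (inString, inInt, brand, series)
    }.foreach(println)
    // prints:
    // (BMW 1er Model,278,BMW,1er)
    // (MINI Cooper Model,248,MINI,Cooper)
  }
}
```

The body of the map over the Seq has the same shape as the one passed to dataRDD.map in the example above, so extract could be called from there in the same way.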

Answer 2

Answered by Ishan Kumar

Hi, I was looking for another question and came across this one. The above problem can be solved using ordinary transformations.

val a = sc.parallelize(collection)
a.map { case (x, y) => x.split(" ")(0) + " " + x.split(" ")(1) }.collect
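
A hedged variant of the same split-based idea, which splits each string only once, keeps brand and series as separate fields next to the original row, and guards against strings with fewer than two words. SplitExample and splitBrandSeries are hypothetical names, and a plain Seq stands in for the RDD so the sketch runs without Spark.

```scala
object SplitExample {
  // Hypothetical helper: split on whitespace once and take the first two tokens.
  // Assumption: the first word is the brand and the second is the series.
  def splitBrandSeries(s: String): (String, String) =
    s.trim.split("\\s+") match {
      case Array(brand, series, _*) => (brand, series)
      case _                        => ("", "") // fewer than two words
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("BMW 1er Model", 278), ("MINI Cooper Model", 248))
    rows.map { case (x, y) =>
      val (brand, series) = splitBrandSeries(x)
      (x, y, brand, series)
    }.foreach(println)
    // prints:
    // (BMW 1er Model,278,BMW,1er)
    // (MINI Cooper Model,248,MINI,Cooper)
  }
}
```

Unlike indexing the split result directly with (0) and (1), the pattern match does not throw on a row that contains a single word.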
