Scala Spark dataframe filter

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/42951905/

Date: 2020-10-22 09:09:08 · Source: igfitidea

Spark dataframe filter

Tags: scala, apache-spark, apache-spark-sql

Asked by Ramesh

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+

I want to filter out the records where the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.

So the output should look like the following.

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+

Can anyone please help with this?

I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?

Version: Spark 1.6.2, Scala 2.10

Accepted answer by pasha701

import org.apache.spark.sql.functions.{col, not, substring}

// substring is 1-based in Spark SQL (a pos of 0 behaves like 1), so this takes the first three characters
df.filter(not(substring(col("c2"), 0, 3).isin("MSL", "HCP")))
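For reference, an equivalent filter (a minimal sketch of an alternative, not part of the original answer) uses Column.startsWith together with the unary negation operator; both are available on Spark 1.6:

import org.apache.spark.sql.functions.col

// Keep only the rows whose c2 does not start with either prefix
df.filter(!(col("c2").startsWith("MSL") || col("c2").startsWith("HCP")))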

Answer by Jegan

This works too. It is concise and very similar to SQL.

df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
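The same condition can also be run as a plain SQL query. A minimal sketch assuming the Spark 1.6 API from the question, with sqlContext already in scope (as in spark-shell); on Spark 2.x, createOrReplaceTempView replaces registerTempTable:

// Register the dataframe as a temporary table, then filter with ordinary SQL
df.registerTempTable("records")
sqlContext.sql("SELECT c1, c2 FROM records WHERE c2 NOT LIKE 'MSL%' AND c2 NOT LIKE 'HCP%'").show()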

Answer by Priyanshu Singh

I used the code below to filter rows from a dataframe, and it worked for me (Spark 2.2):

val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
  option("header", "true").
  option("delimiter", "|").
  option("inferSchema", "true").
  load("D:\\test.csv")   // backslashes must be escaped in Scala string literals

import spark.implicits._
val filter = data.filter($"dept" === "IT")

OR


val filter = data.filter($"dept" =!= "IT")
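Note that the =!= operator was only introduced in Spark 2.0. On Spark 1.6 (the asker's version), the equivalent inequality test is the since-deprecated !== operator, sketched here:

// Spark 1.6 equivalent of =!=
val filter16 = data.filter($"dept" !== "IT")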

Answer by Ramesh

import org.apache.spark.sql.functions.not

val df1 = df.filter(not(df("c2").rlike("MSL")) && not(df("c2").rlike("HCP")))

This worked.

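A caveat on this approach (my note, not part of the original answer): rlike matches the regular expression anywhere in the string, so rlike("MSL") would also match a value such as "XMSL". Anchoring the pattern with ^ restricts the match to the start of the string, which is what the question asks for:

// Anchored patterns: match MSL/HCP only as a prefix
val df1 = df.filter(not(df("c2").rlike("^MSL")) && not(df("c2").rlike("^HCP")))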