Original question: http://stackoverflow.com/questions/42951905/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Spark dataframe filter
Asked by Ramesh
val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+
I want to filter out records where the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.
So the output should look like this:
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
Can anyone please help with this?
I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?
Version: Spark 1.6.2, Scala 2.10
Accepted answer by pasha701
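import org.apache.spark.sql.functions.{col, not, substring}

// Keep rows whose first 3 characters of c2 are neither "MSL" nor "HCP".
// substring positions in Spark SQL are 1-based.
df.filter(not(substring(col("c2"), 1, 3).isin("MSL", "HCP")))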
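The same prefix test can also be written with Column.startsWith, which avoids hard-coding the prefix length. The following is a minimal sketch rather than part of the original answer: it assumes Spark 2.x or later (it builds a local SparkSession instead of using the `sc` from the question), and `PrefixFilterSketch` and `excludePrefixes` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object PrefixFilterSketch {
  // Hypothetical helper: keeps rows whose `column` starts with none of `prefixes`.
  def excludePrefixes(df: DataFrame, column: String, prefixes: Seq[String]): DataFrame = {
    // OR together one startsWith test per prefix; lit(false) is the neutral start value.
    val startsWithAny: Column =
      prefixes.foldLeft(lit(false))((acc, p) => acc || col(column).startsWith(p))
    df.filter(!startsWithAny)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("prefix-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq(
      (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
      (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
    ).toDF("c1", "c2")

    excludePrefixes(df, "c2", Seq("MSL", "HCP")).show()
    spark.stop()
  }
}
```

Note that with any of these negated predicates, rows where c2 is null are dropped as well, because the condition evaluates to null rather than true.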
Answer by Jegan
This works too. It is concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
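With more than a couple of prefixes, the chain of "not like" clauses gets long. One possible variant, shown here as a sketch and not as part of the original answer, keeps the SQL-string style but compares the first three characters against a list with NOT IN. It assumes Spark 2.x or later for the local SparkSession, and `SqlExprFilterSketch` and `excludeByExpr` are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlExprFilterSketch {
  // Hypothetical helper: SQL-string filter on the first 3 characters of c2.
  def excludeByExpr(df: DataFrame): DataFrame =
    df.filter("substring(c2, 1, 3) not in ('MSL', 'HCP')")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sql-expr-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq(
      (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
      (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
    ).toDF("c1", "c2")

    excludeByExpr(df).show()
    spark.stop()
  }
}
```

This only works when all prefixes have the same length; for mixed lengths the "not like" form above is the safer choice.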
Answer by Priyanshu Singh
I used the code below to filter rows from a dataframe, and it worked for me on Spark 2.2.
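val spark = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._

// Read a pipe-delimited CSV with a header row, inferring column types.
val data = spark.read.format("csv").
          option("header", "true").
          option("delimiter", "|").
          option("inferSchema", "true").
          load("D:\\test.csv")  // backslash escaped; "\t" would be read as a tab

// Keep only the rows where dept equals "IT".
val filter = data.filter($"dept" === "IT")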
OR
val filter=data.filter($"dept" =!= "IT" )
Answer by Ramesh
val df1 = df.filter(not(df("c2").rlike("MSL")) && not(df("c2").rlike("HCP")))
This worked.
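One caveat: rlike does a regular-expression search, so rlike("MSL") matches "MSL" anywhere in the value, not only in the first 3 characters. That makes no difference for the sample data, but it would for a value such as "XMSL". The sketch below contrasts the unanchored pattern with one anchored by ^. It assumes Spark 2.x or later for the local SparkSession; the row (10, "XMSL") is made up for illustration, and `AnchoredRlikeSketch` and its methods are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AnchoredRlikeSketch {
  // Unanchored: drops any row whose c2 contains MSL or HCP anywhere.
  def excludeContaining(df: DataFrame): DataFrame =
    df.filter(!col("c2").rlike("MSL|HCP"))

  // Anchored with ^: drops only rows whose c2 starts with MSL or HCP.
  def excludePrefixed(df: DataFrame): DataFrame =
    df.filter(!col("c2").rlike("^(MSL|HCP)"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("anchored-rlike-sketch")
      .getOrCreate()
    import spark.implicits._

    // (10, "XMSL") is a made-up row that is not in the original question.
    val df = Seq(
      (1, "Emailab"), (6, "MSL12"), (8, "HCP"), (10, "XMSL")
    ).toDF("c1", "c2")

    excludeContaining(df).show() // keeps only c1 = 1
    excludePrefixed(df).show()   // keeps c1 = 1 and c1 = 10
    spark.stop()
  }
}
```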

