Filtering a pyspark dataframe using isin by exclusion

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not this site): http://stackoverflow.com/questions/41775281/

python, apache-spark, pyspark, pyspark-sql

Asked by gabrown86

I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion).

As an example:

df = sqlContext.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

I get the data frame:

+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+

I only want to exclude rows where bar is ('a' or 'b').

Using an SQL expression string, it would be:

df.filter('bar not in ("a","b")').show()

Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?
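
For reference, "excluding one item at a time" would mean chaining a separate condition per value, something like this sketch (my illustration, not from the original question):

# Workable, but clumsy once the exclusion list grows.
df.filter(df.bar != 'a').filter(df.bar != 'b').show()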

Edit:

I am likely to have a list, ['a','b'], of the excluded values that I would like to use.

Answered by gabrown86

It looks like the ~ gives the functionality that I need, but I have yet to find any appropriate documentation on it.

from pyspark.sql.functions import col

df.filter(~col('bar').isin(['a', 'b'])).show()



+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
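
On the documentation point: ~ is Python's bitwise-invert operator, which pyspark's Column overloads (via __invert__) to mean logical NOT, the same way & and | stand in for AND and OR on Column expressions. As a small aside, isin accepts either a single list or individual values, so an equivalent spelling of the filter above would be:

from pyspark.sql.functions import col

# Same result as the list form: isin also takes the values as varargs.
df.filter(~col('bar').isin('a', 'b')).show()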

Answered by Alezis

It can also be done like this:

from pyspark.sql.functions import col

df.filter(col('bar').isin(['a', 'b']) == False).show()

Answered by Ryan Collingwood

Here's a gotcha for those with their headspace in Pandas who are moving to pyspark.

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)

records = [
    {"colour": "red"},
    {"colour": "blue"},
    {"colour": None},
]

pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)

So if we wanted the rows that are not red:

pandas_df[~pandas_df["colour"].isin(["red"])]

As expected in Pandas:
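
Running the Pandas version should give something like the following — the None row survives, because Pandas treats None as simply "not in the list":

  colour
1   blue
2   None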

Looking good. And in our pyspark DataFrame:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"])).collect()

Not what I expected:
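
The collect() should come back with only the blue row; the None row vanishes entirely, because isin on a null evaluates to null, ~null is still null, and filter only keeps rows where the predicate is true:

[Row(colour='blue')]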

So after some digging, I found this: https://issues.apache.org/jira/browse/SPARK-20617. So to include nothingness in our results:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()

much ado about nothing
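
To avoid repeating the null clause everywhere, the pattern can be wrapped in a small helper. This is a sketch of my own with illustrative names, not part of the original answer:

from pyspark.sql import functions as F

def filter_not_in(df, column, values, keep_nulls=True):
    # Exclude rows whose `column` value is in `values`; optionally keep nulls.
    cond = ~F.col(column).isin(values)
    if keep_nulls:
        cond = cond | F.col(column).isNull()
    return df.filter(cond)

filter_not_in(pyspark_df, "colour", ["red"]).show()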

Answered by Assaf Mendelson

df.filter((df.bar != 'a') & (df.bar != 'b'))
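
Note that this comparison-based form shares the null behaviour described in the previous answer: bar != 'a' evaluates to null when bar is null, and filter drops rows where the predicate is not true. A quick sketch, assuming the same sqlContext as in the question:

# Hypothetical data with a null row, to illustrate.
df2 = sqlContext.createDataFrame([('1', 'a'), ('2', 'b'), ('6', None)], schema=('id', 'bar'))
df2.filter((df2.bar != 'a') & (df2.bar != 'b')).show()  # the null row is dropped as well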