Filtering a pyspark dataframe using isin by exclusion
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41775281/
Asked by gabrown86
I am trying to get all rows in a dataframe where a column's value is not in a given list (i.e. filtering by exclusion).
As an example:
df = sqlContext.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))
I get the data frame:
+---+---+
| id|bar|
+---+---+
| 1| a|
| 2| b|
| 3| b|
| 4| c|
| 5| d|
+---+---+
I only want to exclude rows where bar is ('a' or 'b').
Using an SQL expression string it would be:
df.filter('bar not in ("a","b")').show()
Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?
Edit:
I am likely to have a list, ['a','b'], of the excluded values that I would like to use.
Answered by gabrown86
It looks like the ~ gives the functionality that I need, but I have yet to find any appropriate documentation on it.
df.filter(~col('bar').isin(['a','b'])).show()
+---+---+
| id|bar|
+---+---+
| 4| c|
| 5| d|
+---+---+
Answered by Alezis
It could also be done like this:
df.filter(col('bar').isin(['a','b']) == False).show()
Answered by Ryan Collingwood
Here's a gotcha for those with their headspace in Pandas who are moving to pyspark:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)
records = [
{"colour": "red"},
{"colour": "blue"},
{"colour": None},
]
pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)
So if we wanted the rows that are not red:
pandas_df[~pandas_df["colour"].isin(["red"])]
Looking good. And in our pyspark DataFrame:
pyspark_df.filter(~pyspark_df["colour"].isin(["red"])).collect()
So after some digging, I found this: https://issues.apache.org/jira/browse/SPARK-20617. So to include nothingness in our results:
pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()
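The reason the null row silently disappears is SQL three-valued logic: comparing NULL with `isin` (or `NOT IN`) yields NULL rather than true, and a filter keeps only rows where the predicate is strictly true. A minimal plain-Python sketch of that semantics (an illustration only, not Spark's actual implementation):

```python
def sql_not_in(value, excluded):
    """Return True/False/None following SQL three-valued logic."""
    if value is None:
        return None  # NULL NOT IN (...) evaluates to NULL (unknown)
    return value not in excluded

rows = ["red", "blue", None]

# A WHERE clause keeps only rows whose predicate is strictly True,
# so the None row is dropped along with "red":
kept = [r for r in rows if sql_not_in(r, ["red"]) is True]
print(kept)  # ['blue']
```

This is why the explicit `isNull()` check above is needed to bring the null rows back.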
Answered by Assaf Mendelson
df.filter((df.bar != 'a') & (df.bar != 'b'))