Filtering a pyspark dataframe using isin by exclusion

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not this site): http://stackoverflow.com/questions/41775281/

python, apache-spark, pyspark, pyspark-sql

Asked by gabrown86

I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion).

As an example:

df = sqlContext.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

I get the data frame:

+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+

I only want to exclude rows where bar is ('a' or 'b').

Using an SQL expression string, it would be:

df.filter('bar not in ("a","b")').show()

Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?
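
For reference, "excluding one item at a time" would mean chaining a separate condition per value, something like this sketch (my illustration, not from the original question):

# Workable, but clumsy once the exclusion list grows.
df.filter(df.bar != 'a').filter(df.bar != 'b').show()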

Edit:

I am likely to have a list, ['a','b'], of the excluded values that I would like to use.

Answered by gabrown86

It looks like the ~ gives the functionality that I need, but I have yet to find any appropriate documentation on it.

from pyspark.sql.functions import col

df.filter(~col('bar').isin(['a', 'b'])).show()



+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
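
On the documentation point: ~ is Python's bitwise-invert operator, which pyspark's Column overloads (via __invert__) to mean logical NOT, the same way & and | stand in for AND and OR on Column expressions. As a small aside, isin accepts either a single list or individual values, so an equivalent spelling of the filter above would be:

from pyspark.sql.functions import col

# Same result as the list form: isin also takes the values as varargs.
df.filter(~col('bar').isin('a', 'b')).show()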

Answered by Alezis

It can also be done like this:

from pyspark.sql.functions import col

df.filter(col('bar').isin(['a', 'b']) == False).show()

Answered by Ryan Collingwood

Here's a gotcha for those with their headspace in Pandas who are moving to pyspark.

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)

records = [
    {"colour": "red"},
    {"colour": "blue"},
    {"colour": None},
]

pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)

So if we wanted the rows that are not red:

pandas_df[~pandas_df["colour"].isin(["red"])]

As expected in Pandas:
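
Running the Pandas version should give something like the following — the None row survives, because Pandas treats None as simply "not in the list":

  colour
1   blue
2   None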

Looking good. And in our pyspark DataFrame:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"])).collect()

Not what I expected:
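
The collect() should come back with only the blue row; the None row vanishes entirely, because isin on a null evaluates to null, ~null is still null, and filter only keeps rows where the predicate is true:

[Row(colour='blue')]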

So after some digging, I found this: https://issues.apache.org/jira/browse/SPARK-20617. So to include nothingness in our results:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()

much ado about nothing
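
To avoid repeating the null clause everywhere, the pattern can be wrapped in a small helper. This is a sketch of my own with illustrative names, not part of the original answer:

from pyspark.sql import functions as F

def filter_not_in(df, column, values, keep_nulls=True):
    # Exclude rows whose `column` value is in `values`; optionally keep nulls.
    cond = ~F.col(column).isin(values)
    if keep_nulls:
        cond = cond | F.col(column).isNull()
    return df.filter(cond)

filter_not_in(pyspark_df, "colour", ["red"]).show()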

Answered by Assaf Mendelson

df.filter((df.bar != 'a') & (df.bar != 'b'))
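
Note that this comparison-based form shares the null behaviour described in the previous answer: bar != 'a' evaluates to null when bar is null, and filter drops rows where the predicate is not true. A quick sketch, assuming the same sqlContext as in the question:

# Hypothetical data with a null row, to illustrate.
df2 = sqlContext.createDataFrame([('1', 'a'), ('2', 'b'), ('6', None)], schema=('id', 'bar'))
df2.filter((df2.bar != 'a') & (df2.bar != 'b')).show()  # the null row is dropped as well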