Python: Filter PySpark dataframe column with None value

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original source, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/37262762/

Date: 2020-08-19 19:08:33  Source: igfitidea

Filter Pyspark dataframe column with None value

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Ivan

I'm trying to filter a PySpark dataframe that has None as a row value:

df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
 Row(dt_mvmt=u'2016-03-30'),
 Row(dt_mvmt=u'2016-03-31')]

and I can filter correctly with a string value:

df[df.dt_mvmt == '2016-03-31']
# some results here

but this fails:

df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0

But there are definitely values in each category. What's going on?

Answered by zero323

You can use Column.isNull / Column.isNotNull:

from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())

df.where(col("dt_mvmt").isNotNull())

If you want to simply drop NULL values you can use na.drop with the subset argument:

df.na.drop(subset=["dt_mvmt"])

Equality-based comparisons with NULL won't work because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+


sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+

The only valid methods to compare a value with NULL are IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.

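For completeness, a minimal sketch of the same check expressed in SQL, assuming df has been registered as a temporary table named "df" (the registration call is an assumption, using the pre-Spark-2.0 registerTempTable API):

df.registerTempTable("df")
# Keep only rows where dt_mvmt is NULL
sqlContext.sql("SELECT * FROM df WHERE dt_mvmt IS NULL").show()
# Keep only rows where dt_mvmt is not NULL
sqlContext.sql("SELECT * FROM df WHERE dt_mvmt IS NOT NULL").show()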

Answered by Anthony

Try to just use the isNotNull function.

df.filter(df.dt_mvmt.isNotNull()).count()
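
As a companion sketch (not part of the original answer), the rows where dt_mvmt is null can be counted the same way with isNull:

df.filter(df.dt_mvmt.isNull()).count()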

Answered by timctran

To obtain entries whose values in the dt_mvmt column are not null, we have

df.filter("dt_mvmt is not NULL")

and for entries which are null we have

df.filter("dt_mvmt is NULL")

Answered by Rae

If you want to stick with the Pandas syntax, this worked for me.

df = df[df.dt_mvmt.isNotNull()]
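
The complementary Pandas-style expression for keeping only the null rows would be (a sketch, not from the original answer):

df = df[df.dt_mvmt.isNull()]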

Answered by user10238559

If the column contains None values:

COLUMN_OLD_VALUE
----------------
None
1
None
100
20
----------------

Create a temp table on the data frame and query it:

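Assuming the DataFrame has first been registered as a temp table (a sketch using the pre-Spark-2.0 registerTempTable API; the table name tempTable matches the query below):

df.registerTempTable("tempTable")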

sqlContext.sql("select * from tempTable where column_old_value='None' ").show()

So use: column_old_value='None'

Answered by neeraj bhadani

There are multiple ways you can remove/filter the null values from a column in a DataFrame.

Let's create a simple DataFrame with the code below:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

date = ['2016-03-27','2016-03-28','2016-03-29', None, '2016-03-30','2016-03-31']
df = spark.createDataFrame(date, StringType())

Now you can try one of the approaches below to filter out the null values.

# Approach - 1
df.filter("value is not null").show()

# Approach - 2
df.filter(col("value").isNotNull()).show()

# Approach - 3
df.filter(df["value"].isNotNull()).show()

# Approach - 4
df.filter(df.value.isNotNull()).show()

# Approach - 5
df.na.drop(subset=["value"]).show()

# Approach - 6
df.dropna(subset=["value"]).show()

# Note: You can also use the where function instead of filter; see the sketch below.
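# For example (an illustrative sketch, not one of the original approaches):
df.where(col("value").isNotNull()).show()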

You can also check the section "Working with NULL Values" on my blog for more information.

I hope it helps.

Answered by Swaminathan Meenakshisundaram

PySpark provides various filtering options based on arithmetic, logical and other conditions. The presence of NULL values can hamper further processing. Removing them, or statistically imputing them, could be an option.

The following code can be considered:

# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number

# Filter here
df = df.filter(df.dt_mvmt.isNotNull())

# Check the count to confirm the rows with NULL values were removed
# (this is important when dealing with a large dataset)
df.count() # Count should be reduced if NULL values were present
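
If imputation rather than removal is preferred, a minimal sketch using fillna (the placeholder date here is an assumption, not part of the original answer):

# Replace NULLs in dt_mvmt with a hypothetical placeholder value instead of dropping them
df = df.fillna({"dt_mvmt": "1900-01-01"})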

Answered by information_interchange

I would also try:

df = df.dropna(subset=["dt_mvmt"])

Answered by yogesh

If you want to filter out records having a None value in a column, then see the example below:

df=spark.createDataFrame([[123,"abc"],[234,"fre"],[345,None]],["a","b"])

Now filter out null value records:

df=df.filter(df.b.isNotNull())

df.show()

If you want to remove those records from the DF, then see below:

df1=df.na.drop(subset=['b'])

df1.show()