Python: Filter PySpark dataframe column with None value
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37262762/
Filter Pyspark dataframe column with None value
Asked by Ivan
I'm trying to filter a PySpark dataframe that has None as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'),
Row(dt_mvmt=u'2016-03-28'),
Row(dt_mvmt=u'2016-03-29'),
Row(dt_mvmt=None),
Row(dt_mvmt=u'2016-03-30'),
Row(dt_mvmt=u'2016-03-31')]
and I can filter correctly with a string value:
df[df.dt_mvmt == '2016-03-31']
# some results here
but this fails:
df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0
But there are definitely values in each category. What's going on?
Answered by zero323
You can use Column.isNull / Column.isNotNull:
df.where(col("dt_mvmt").isNull())
df.where(col("dt_mvmt").isNotNull())
If you want to simply drop NULL values, you can use na.drop with the subset argument:
df.na.drop(subset=["dt_mvmt"])
Equality-based comparisons with NULL won't work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## | null|
## +-------------+
sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## | null|
## +-------------------+
The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
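For example, the same check expressed in SQL (a minimal sketch; the temporary table name dt_table is made up for illustration, assuming the df from the question):
df.registerTempTable("dt_table")  # register df so it is visible to sqlContext.sql
sqlContext.sql("SELECT * FROM dt_table WHERE dt_mvmt IS NULL").count()
sqlContext.sql("SELECT * FROM dt_table WHERE dt_mvmt IS NOT NULL").count()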
Answered by timctran
To obtain entries whose values in the dt_mvmt column are not null, we have
df.filter("dt_mvmt is not NULL")
and for entries which are null we have
df.filter("dt_mvmt is NULL")
Answered by Rae
If you want to keep with the Pandas syntax, this worked for me.
df = df[df.dt_mvmt.isNotNull()]
Answered by user10238559
If the column contains None:
COLUMN_OLD_VALUE
----------------
None
1
None
100
20
------------------
Create a temp table on the data frame and query it:
sqlContext.sql("select * from tempTable where column_old_value='None' ").show()
So use: column_old_value='None'
Answered by neeraj bhadani
There are multiple ways you can remove/filter the null values from a column in a DataFrame.
Let's create a simple DataFrame with the code below:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

date = ['2016-03-27','2016-03-28','2016-03-29', None, '2016-03-30','2016-03-31']
df = spark.createDataFrame(date, StringType())
Now you can try one of the approaches below to filter out the null values.
# Approach - 1
df.filter("value is not null").show()
# Approach - 2
df.filter(col("value").isNotNull()).show()
# Approach - 3
df.filter(df["value"].isNotNull()).show()
# Approach - 4
df.filter(df.value.isNotNull()).show()
# Approach - 5
df.na.drop(subset=["value"]).show()
# Approach - 6
df.dropna(subset=["value"]).show()
# Note: You can also use where function instead of a filter.
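As noted in the last comment, where behaves the same as filter, so for instance (a sketch using the same df):
# Approach - 7 (where is an alias for filter)
df.where(col("value").isNotNull()).show()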
You can also check the section "Working with NULL Values" on my blog for more information.
I hope it helps.
Answered by Swaminathan Meenakshisundaram
PySpark provides various filtering options based on arithmetic, logical and other conditions. The presence of NULL values can hamper further processing. Removing them or statistically imputing them could be a choice.
The code below can be considered:
# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number
# Filter here
df = df.filter(df.dt_mvmt.isNotNull())
# Check the count again (important when dealing with a large dataset)
df.count() # The count should be reduced if NULL values were present
Answered by information_interchange
I would also try:
df = df.dropna(subset=["dt_mvmt"])
df = df.dropna(subset=["dt_mvmt"])
Answered by yogesh
If you want to filter out records having a None value in a column, see the example below:
df=spark.createDataFrame([[123,"abc"],[234,"fre"],[345,None]],["a","b"])
Now filter out the null value records:
df=df.filter(df.b.isNotNull())
df.show()
If you want to remove those records from the DF, see below:
df1=df.na.drop(subset=['b'])
df1.show()