Python PySpark: removing null values from a column in a DataFrame
Source: http://stackoverflow.com/questions/44713799/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pyspark Removing null values from a column in dataframe
Asked by Naveen Srikanth
My DataFrame looks like this:
ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,
My problem is that I have to remove row 2, since its FirstName is null.
I am using the PySpark script below:
join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()
I am getting this error:
File "D:\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()
TypeError: 'Column' object is not callable
Can anyone please help me resolve this?
Answered by Rakesh Kumar
It looks like the FirstName column of your DataFrame contains empty strings rather than null values. Below are some options to try:
df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.where(df.FirstName.isNotNull()).show() # This doesn't remove row 2, because the DataFrame holds empty strings, not nulls
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
Answered by ktheitroadalo
You should do it as below:
join_Df1.filter(join_Df1.FirstName.isNotNull()).show()
Hope this helps!
Answered by void
I think what you might need is notnull().
So this is your input in the CSV file my_test.csv:
ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen
The code:
import pandas as pd
df = pd.read_csv("my_test.csv")
print(df[df['FirstName'].notnull()])
Output:
   ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN
This is what you want: df[df['FirstName'].notnull()]
Output of df['FirstName'].notnull():
0     True
1    False
2     True
This keeps only the rows of df where df['FirstName'].notnull() returns True.
How is this checked? df['FirstName'].notnull() returns True if the value in the FirstName column is not null, and False if NaN is present.
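Note that this last answer uses pandas rather than PySpark. A minimal, self-contained sketch of why notnull() works there: by default pd.read_csv turns empty fields into NaN, so the blank FirstName really is missing. The in-memory csv_text below stands in for my_test.csv and reuses the sample rows from the question; the variable names are illustrative only.

```python
import io

import pandas as pd

# In-memory stand-in for my_test.csv, using the sample rows from the question.
csv_text = "ID,FirstName,LastName\n1,Navee,Srikanth\n2,,Srikanth\n3,Naveen,\n"

# By default read_csv converts empty fields to NaN.
df = pd.read_csv(io.StringIO(csv_text))

mask = df["FirstName"].notnull()  # True where a value is present, False for NaN
kept = df[mask]                   # boolean indexing keeps only the True rows

print(mask.tolist())        # [True, False, True]
print(kept["ID"].tolist())  # [1, 3]

# With keep_default_na=False the empty field stays an empty string,
# so notnull() is True for every row and nothing would be filtered out.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw["FirstName"].notnull().tolist())  # [True, True, True]
```

The second read shows the same null-versus-empty-string distinction that the first answer points out for the Spark DataFrame: a null check only helps when the blank value was actually loaded as missing.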