PySpark: Removing null values from a column in a DataFrame
Original question: http://stackoverflow.com/questions/44713799/
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute the content to its original authors on StackOverflow.
Asked by Naveen Srikanth
My DataFrame looks like this:
ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,
I need to remove row 2 because its FirstName is null.
I am using the PySpark script below:
join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()
I am getting this error:
File "D:\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()
TypeError: 'Column' object is not callable
Can anyone please help me resolve this?
Answered by Rakesh Kumar
It looks like the FirstName column in your DataFrame contains empty strings rather than null. Here are some options to try:
df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.where(df.FirstName.isNotNull()).show() # This doesn't remove row 2, because FirstName holds an empty string, not null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
Answered by ktheitroadalo
You should do it like this:
join_Df1.filter(join_Df1.FirstName.isNotNull()).show()
Hope this helps!
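A note on why the isNotNull() call above works where the question's isnotnull() did not: method names are case-sensitive, and PySpark's Column treats an unknown attribute name as a nested-field lookup that returns another Column instead of raising AttributeError. Calling that Column then produces the TypeError from the question. The sketch below mimics this with a toy class. The class is a hypothetical stand-in written for illustration, not pyspark.sql.Column, and its string output is invented for the demo.

```python
class Column:
    """Toy stand-in for a Spark column (hypothetical; NOT pyspark.sql.Column)."""

    def __init__(self, name):
        self.name = name

    def isNotNull(self):
        # A real method, so calling it works.
        return Column(f"({self.name} IS NOT NULL)")

    def __getattr__(self, item):
        # Runs only when normal attribute lookup fails: the unknown name is
        # treated as a nested field and another Column is returned.
        if item.startswith("__"):
            raise AttributeError(item)
        return Column(f"{self.name}.{item}")


first_name = Column("FirstName")

ok = first_name.isNotNull()           # correct spelling: a method call
print(ok.name)                        # (FirstName IS NOT NULL)

not_a_method = first_name.isnotnull   # wrong case: silently becomes a Column
print(not_a_method.name)              # FirstName.isnotnull

try:
    not_a_method()                    # calling a Column instance fails
except TypeError as exc:
    print(exc)                        # 'Column' object is not callable
```

So the first thing to check when this error appears is the spelling and capitalisation of the method name.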
Answered by void
I think what you might need is notnull().
This is your input in the CSV file my_test.csv:
ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen
The code:
import pandas as pd
df = pd.read_csv("my_test.csv")
print(df[df['FirstName'].notnull()])
Output:
   ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN
This is what you want: df[df['FirstName'].notnull()]
Output of df['FirstName'].notnull():
0     True
1    False
2     True
This creates a DataFrame from the rows of df where df['FirstName'].notnull() returns True.
How is this checked? df['FirstName'].notnull() returns True if the value in the FirstName column is not null, and False if NaN is present.
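The masking steps in this answer can be run end to end with the sketch below. It is pandas, not PySpark, and it makes two assumptions that are not in the original answer: the CSV is inlined through io.StringIO so no my_test.csv file is needed, and dropna(subset=...) is shown as an equivalent shortcut.

```python
import io

import pandas as pd

# Same three rows as the question's data, inlined so the sketch needs no file.
csv_text = "ID,FirstName,LastName\n1,Navee,Srikanth\n2,,Srikanth\n3,Naveen,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Empty CSV fields are read as NaN, so notnull() is False for row 2.
mask = df["FirstName"].notnull()
kept = df[mask]                          # keep only rows where the mask is True
same = df.dropna(subset=["FirstName"])   # shortcut that drops the same rows

print(mask.tolist())        # [True, False, True]
print(kept["ID"].tolist())  # [1, 3]
print(kept.equals(same))    # True
```

Note that this works here because read_csv turns the empty field into NaN. In the Spark DataFrame from the first answer the value is an empty string, which is why isNotNull() alone keeps the row.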

