Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/44713799/

Date: 2020-08-20 00:23:19  Source: igfitidea

PySpark: Removing null values from a column in a DataFrame

Tags: python, hadoop, apache-spark, mapreduce, pyspark

Asked by Naveen Srikanth

My DataFrame looks like this:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,

I need to remove row 2 because its FirstName is null.

I am using the PySpark script below:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I get the following error:

  File "D:\NameValidation.py", line 13, in <module>
    join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone please help me resolve this?

Answered by Rakesh Kumar

It looks like the FirstName column in your DataFrame contains empty strings rather than nulls. Here are some options to try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show() # This doesn't remove row 2 because FirstName holds an empty string, not a null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

Answered by ktheitroadalo

You should do it as follows:

join_Df1.filter(join_Df1.FirstName.isNotNull()).show()
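
The TypeError in the question is consistent with the lowercase spelling isnotnull. As far as I understand PySpark's Column, an attribute name it does not recognise is treated as a nested-field lookup and yields another Column, so the lookup succeeds and only the call fails. The toy class below is a hypothetical stand-in written for illustration, not PySpark's real implementation:

```python
class ToyColumn:
    """Simplified, hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, name):
        self.name = name

    def isNotNull(self):
        # The real method builds an IS NOT NULL expression; here we only
        # record it in the name.
        return ToyColumn(f"({self.name} IS NOT NULL)")

    def __getattr__(self, item):
        # Python calls this only when normal lookup fails, for example for
        # the misspelled "isnotnull". The unknown name is treated as a
        # nested field and another column object is returned.
        if item.startswith("__"):
            raise AttributeError(item)
        return ToyColumn(f"{self.name}.{item}")


first_name = ToyColumn("FirstName")

# Correct camelCase spelling: a real method, so the call works.
ok = first_name.isNotNull()
print(ok.name)

# Lowercase spelling: the attribute lookup returns a column object,
# and calling that object raises TypeError.
try:
    first_name.isnotnull()
except TypeError as exc:
    message = str(exc)
    print(message)
```

In the real API the fix is just the camelCase spelling, isNotNull().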

Hope this helps!

Answered by void

I think what you need is notnull().

This is your input in the CSV file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen

The code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

Output:

   ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is what you want: df[df['FirstName'].notnull()]

Output of df['FirstName'].notnull():

0     True
1    False
2     True

This creates a DataFrame containing only the rows of df where df['FirstName'].notnull() returns True.

How is this checked? df['FirstName'].notnull() returns True where the FirstName value is not null and False where it is NaN.
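
Whether notnull() catches the blank FirstName depends on how the value was loaded. The sketch below assumes the question's rows are held in an in-memory string instead of my_test.csv, so that it runs on its own. It shows that pandas turns an empty CSV field into NaN by default, while keep_default_na=False leaves it as an empty string that notnull() does not flag.

```python
import io

import pandas as pd

# Same rows as the question's data, inlined so the sketch is self-contained.
csv_text = "ID,FirstName,LastName\n1,Navee,Srikanth\n2,,Srikanth\n3,Naveen,\n"

# Default parsing: the empty FirstName field becomes NaN,
# so notnull() is False for row 2.
df_nan = pd.read_csv(io.StringIO(csv_text))
mask_nan = df_nan["FirstName"].notnull().tolist()

# keep_default_na=False keeps the field as the empty string '',
# which is not null, so notnull() is True for every row.
df_empty = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
mask_empty = df_empty["FirstName"].notnull().tolist()

# Checking both conditions handles either representation.
has_name = df_empty["FirstName"].notnull() & (df_empty["FirstName"] != "")
kept_ids = df_empty[has_name]["ID"].tolist()

print(mask_nan)
print(mask_empty)
print(kept_ids)
```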