Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/29314033/
Drop rows containing empty cells from a pandas DataFrame
Asked by Amrita Sawant
I have a pd.DataFrame that was created by parsing some Excel spreadsheets. One of its columns has empty cells. For example, below is the frequency output for that column; 32320 records have missing values for Tenant.
>>> value_counts(Tenant, normalize=False)
32320
Thunderhead 8170
Big Data Others 5700
Cloud Cruiser 5700
Partnerpedia 5700
Comcast 5700
SDP 5700
Agora 5700
dtype: int64
I am trying to drop rows where Tenant is missing; however, .isnull() does not recognize the missing values.
>>> df['Tenant'].isnull().sum()
0
The column has data type "Object". What is happening in this case? How can I drop records where Tenant is missing?
Accepted answer by McMath
Pandas will recognise a value as null if it is a np.nan object, which will print as NaN in the DataFrame. Your missing values are probably empty strings, which Pandas doesn't recognise as null. To fix this, you can convert the empty strings (or whatever is in your empty cells) to np.nan objects using replace(), and then call dropna() on your DataFrame to delete rows with null tenants.
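To check what the "empty" cells actually hold before replacing anything, a quick diagnostic can help. The following is a minimal sketch, not part of the original answer; the sample data is made up, and the column is forced to object dtype to mirror the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one real value, one empty string, one true NaN.
df = pd.DataFrame({"Tenant": ["Comcast", "", np.nan]}, dtype=object)

null_count = df["Tenant"].isnull().sum()   # counts NaN/None only
empty_count = (df["Tenant"] == "").sum()   # counts empty strings only

print(null_count, empty_count)

# repr() makes otherwise invisible values such as '' or ' ' visible.
print(df["Tenant"].map(repr).value_counts())
```

If the empty-string count is large while the null count is zero, the situation matches the one described in the question.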
To demonstrate, we create a DataFrame with some random values and some empty strings in a Tenant column:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
>>> df['Tenant'] = np.random.choice(['Babar', 'Rataxes', ''], 10)
>>> print(df)
          A         B   Tenant
0 -0.588412 -1.179306    Babar
1 -0.008562  0.725239
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
4  0.805304 -0.834214
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
9  0.066946  0.375640
Now we replace any empty strings in the Tenant column with np.nan objects, like so:
>>> # Assign the result back; an in-place replace on a selected column may not update df in recent pandas versions.
>>> df['Tenant'] = df['Tenant'].replace('', np.nan)
>>> print(df)
          A         B   Tenant
0 -0.588412 -1.179306    Babar
1 -0.008562  0.725239      NaN
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
4  0.805304 -0.834214      NaN
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
9  0.066946  0.375640      NaN
Now we can drop the null values:
>>> df.dropna(subset=['Tenant'], inplace=True)
>>> print(df)
          A         B   Tenant
0 -0.588412 -1.179306    Babar
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
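The two steps above can also be wrapped so that the original DataFrame is left untouched. This is a minimal sketch and not part of the original answer; the helper name drop_empty_rows and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_empty_rows(frame, column):
    """Return a copy of `frame` without rows where `column` is '' or null."""
    cleaned = frame[column].replace("", np.nan)  # '' becomes NaN
    return frame[cleaned.notna()].copy()         # keep non-null rows only

# Hypothetical sample data.
df = pd.DataFrame({"A": [1, 2, 3], "Tenant": ["Babar", "", "Rataxes"]})
result = drop_empty_rows(df, "Tenant")
print(result)
```

Because nothing is modified in place, df still holds all three rows afterwards, which can be handy when the unfiltered data is needed later.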
Answer by Bob Haffner
value_counts omits NaN by default so you're most likely dealing with "".
So you can just filter them out like this:
mask = df["Tenant"] != ""  # True for rows whose Tenant is not an empty string
dfNew = df[mask]
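To see the point about value_counts for yourself, compare the default call with dropna=False. This is a small sketch on made-up data, not taken from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one name, two empty strings, one true NaN.
s = pd.Series(["SDP", "", "", np.nan], dtype=object)

default_counts = s.value_counts()           # NaN is left out
all_counts = s.value_counts(dropna=False)   # NaN gets its own row

print(default_counts)
print(all_counts)
```

A blank label that shows up in the default output, as in the question, therefore cannot be NaN.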
Answer by Amir F
You can use this variation:
import pandas as pd
vals = {
    'name': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender': ['m', 'f', 'f', 'f', 'f', 'c', 'c'],
    'age': [39, 12, 27, 13, 36, 29, 10],
    'education': ['ma', None, 'school', None, 'ba', None, None]
}
df_vals = pd.DataFrame(vals)  # convert the dict to a DataFrame
This will output (** marks the rows we want to keep):
   age  education  gender  name
0   39         ma       m    n1 **
1   12       None       f    n2
2   27     school       f    n3 **
3   13       None       f    n4
4   36         ba       f    n5 **
5   29       None       c    n6
6   10       None       c    n7
So to drop everything that does not have an 'education' value, use the code below:
df_vals = df_vals[~df_vals['education'].isnull()]
('~' indicating NOT)
Result:
   age  education  gender  name
0   39         ma       m    n1
2   27     school       f    n3
4   36         ba       f    n5
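The negated isnull() mask has two common equivalents, notna() and dropna(subset=...). The sketch below, which is not from the original answer and uses a shortened version of the sample data, shows that all three select the same rows.

```python
import pandas as pd

# Shortened, hypothetical version of the data above.
df_vals = pd.DataFrame({
    "name": ["n1", "n2", "n3"],
    "education": ["ma", None, "school"],
})

by_mask = df_vals[~df_vals["education"].isnull()]   # the form used above
by_notna = df_vals[df_vals["education"].notna()]    # same mask, no negation
by_dropna = df_vals.dropna(subset=["education"])    # same rows via dropna

print(by_dropna)
```

Which one to use is mostly a matter of readability.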
Answer by Learn
Sometimes a cell contains only whitespace, which you can't see. In that case, use
df['col'] = df['col'].replace(' ', np.nan)
to replace cells holding a single space with NaN, then
df = df.dropna(subset=['col'])
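Note that replace(' ', ...) matches only cells that are exactly one space. If cells may hold an empty string or several whitespace characters, a regular expression covers all of those cases at once. This is a minimal sketch on made-up data, not part of the original answer; the column name col is just the placeholder used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with empty, single-space and multi-space cells.
df = pd.DataFrame({"col": ["a", " ", "", "   ", "b c"]}, dtype=object)

# ^\s*$ matches cells that are empty or contain only whitespace.
df["col"] = df["col"].replace(r"^\s*$", np.nan, regex=True)
df = df.dropna(subset=["col"])

print(df)
```

Strings with inner spaces, such as 'b c', are left alone because the pattern is anchored at both ends.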
Answer by cs95
Pythonic + Pandorable: df[df['col'].astype(bool)]
Empty strings are falsy, which means you can filter on bool values like this:
df = pd.DataFrame({
    'A': range(5),
    'B': ['foo', '', 'bar', '', 'xyz']
})
df
   A    B
0  0  foo
1  1
2  2  bar
3  3
4  4  xyz
df['B'].astype(bool)
0     True
1    False
2     True
3    False
4     True
Name: B, dtype: bool
df[df['B'].astype(bool)]
   A    B
0  0  foo
2  2  bar
4  4  xyz
If your goal is to remove not only empty strings, but also strings containing only whitespace, use str.strip beforehand:
df[df['B'].str.strip().astype(bool)]
   A    B
0  0  foo
2  2  bar
4  4  xyz
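One caveat worth checking on your own data: in an object column, NaN is truthy, so astype(bool) keeps rows that hold NaN. The sketch below is an illustration on made-up data and is not part of the original answer; it assumes an object-dtype column, as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical object column with a real value, an empty string and a NaN.
s = pd.Series(["foo", "", np.nan], dtype=object)

as_bool = s.astype(bool)            # bool(nan) is True, so the NaN row stays
safe_mask = s.notna() & (s != "")   # rejects both NaN and ''

print(as_bool.tolist())
print(safe_mask.tolist())
```

If the column can contain both empty strings and real nulls, combining notna() with the comparison is one way to drop both.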
Faster than you think
.astype is a vectorised operation, and it is faster than every option presented thus far, at least in my tests. YMMV.
Here is a timing comparison; I've thrown in some other methods I could think of.
Benchmarking code, for reference:
import numpy as np  # needed for np.nan in the kernels below
import pandas as pd
import perfplot

df1 = pd.DataFrame({
    'A': range(5),
    'B': ['foo', '', 'bar', '', 'xyz']
})

perfplot.show(
    setup=lambda n: pd.concat([df1] * n, ignore_index=True),
    kernels=[
        lambda df: df[df['B'].astype(bool)],
        lambda df: df[df['B'] != ''],
        lambda df: df[df['B'].replace('', np.nan).notna()],  # optimized 1-col
        lambda df: df.replace({'B': {'': np.nan}}).dropna(subset=['B']),
    ],
    labels=['astype', "!= ''", "replace + notna", "replace + dropna"],
    n_range=[2**k for k in range(1, 15)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=pd.DataFrame.equals)
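If perfplot is not installed, a rough comparison can be made with the standard-library timeit module. This is only a sketch under stated assumptions, not the author's benchmark: the repeat factor of 1000 and the run count of 20 are arbitrary choices, the column is forced to object dtype, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Same small frame as above, repeated to get a measurable workload.
small = pd.DataFrame({"A": range(5), "B": ["foo", "", "bar", "", "xyz"]})
small = small.astype({"B": object})  # assumption: object column, as in the question
df = pd.concat([small] * 1000, ignore_index=True)

candidates = {
    "astype": lambda: df[df["B"].astype(bool)],
    "!= ''": lambda: df[df["B"] != ""],
}

# Run each candidate once to confirm they agree, then time them.
results = {name: fn() for name, fn in candidates.items()}
timings = {name: timeit.timeit(fn, number=20) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s for 20 runs")
```

Checking that the candidates return equal frames before timing them plays the same role as equality_check in the perfplot call.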