Python: drop rows containing empty cells from a pandas DataFrame

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29314033/

Date: 2020-08-19 04:23:24  Source: igfitidea

Drop rows containing empty cells from a pandas DataFrame

python, pandas

Asked by Amrita Sawant

I have a pd.DataFrame that was created by parsing some Excel spreadsheets. One of its columns has empty cells. For example, below is the frequency output for that column; 32320 records have missing values for Tenant.

>>> value_counts(Tenant, normalize=False)
                              32320
    Thunderhead                8170
    Big Data Others            5700
    Cloud Cruiser              5700
    Partnerpedia               5700
    Comcast                    5700
    SDP                        5700
    Agora                      5700
    dtype: int64

I am trying to drop the rows where Tenant is missing; however, the .isnull() check does not recognize the missing values.

>>> df['Tenant'].isnull().sum()
    0

The column has data type "object". What is happening in this case? How can I drop records where Tenant is missing?

Accepted answer by McMath

Pandas will recognise a value as null if it is an np.nan object, which prints as NaN in the DataFrame. Your missing values are probably empty strings, which Pandas doesn't recognise as null. To fix this, convert the empty strings (or whatever is in your empty cells) to np.nan objects using replace(), and then call dropna() on your DataFrame to delete the rows with null tenants.

To demonstrate, we create a DataFrame with some random values and some empty strings in a Tenants column:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
>>> df['Tenant'] = np.random.choice(['Babar', 'Rataxes', ''], 10)
>>> print(df)

          A         B   Tenant
0 -0.588412 -1.179306    Babar
1 -0.008562  0.725239         
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
4  0.805304 -0.834214         
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
9  0.066946  0.375640         

Now we replace any empty strings in the Tenants column with np.nan objects, like so:

>>> df['Tenant'].replace('', np.nan, inplace=True)
>>> print(df)

          A         B   Tenant
0 -0.588412 -1.179306    Babar
1 -0.008562  0.725239      NaN
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
4  0.805304 -0.834214      NaN
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
9  0.066946  0.375640      NaN

Now we can drop the null values:

>>> df.dropna(subset=['Tenant'], inplace=True)
>>> print(df)

          A         B   Tenant
0 -0.588412 -1.179306    Babar
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
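
As an aside, on recent pandas versions, calling replace(..., inplace=True) on a selected column can raise a chained-assignment warning and, with copy-on-write enabled, may not modify the original DataFrame at all. A minimal sketch of the same two steps using plain assignment, applied to the Tenant column from the question:

import numpy as np

# Replace empty strings with NaN, then drop the rows with no Tenant.
df['Tenant'] = df['Tenant'].replace('', np.nan)
df = df.dropna(subset=['Tenant'])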

Answer by Bob Haffner

value_counts omits NaN by default, so you're most likely dealing with empty strings ("").

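To confirm whether the blanks really are empty strings rather than NaN, you can ask value_counts to keep NaN in the tally; a quick check, assuming the Tenant column from the question:

df['Tenant'].value_counts(dropna=False)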

So you can just filter them out, like this:

filter = df["Tenant"] != ""
dfNew = df[filter]

Answer by Amir F

You can use this variation:

import pandas as pd
vals = {
    'name' : ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender' : ['m', 'f', 'f', 'f',  'f', 'c', 'c'],
    'age' : [39, 12, 27, 13, 36, 29, 10],
    'education' : ['ma', None, 'school', None, 'ba', None, None]
}
df_vals = pd.DataFrame(vals)  # convert the dict to a DataFrame
print(df_vals)

This will output the following (** marks only the desired rows):

   age education gender name
0   39        ma      m   n1 **
1   12      None      f   n2    
2   27    school      f   n3 **
3   13      None      f   n4
4   36        ba      f   n5 **
5   29      None      c   n6
6   10      None      c   n7

So to drop everything that does not have an 'education' value, use the code below:

df_vals = df_vals[~df_vals['education'].isnull()] 

('~' indicating NOT)

Result:

   age education gender name
0   39        ma      m   n1
2   27    school      f   n3
4   36        ba      f   n5
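
For what it's worth, the same filter can be written without the negation, using notna() or dropna(); a rough equivalent on the same df_vals:

# Keep only the rows that have an 'education' value.
df_vals = df_vals[df_vals['education'].notna()]
# or, equivalently:
df_vals = df_vals.dropna(subset=['education'])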

Answer by Learn

There's also the case where a cell contains whitespace that you can't see. Use

df['col'].replace('  ', np.nan, inplace=True)

to replace the whitespace with NaN, then

df = df.dropna(subset=['col'])
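
Note that the literal replacement above only matches that exact whitespace string. If the cells may contain any amount of whitespace, or be completely empty, a regex-based sketch that covers both cases (same hypothetical 'col' column) would be:

# Treat empty and whitespace-only strings as missing, then drop those rows.
df['col'] = df['col'].replace(r'^\s*$', np.nan, regex=True)
df = df.dropna(subset=['col'])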

Answer by cs95

Pythonic + Pandorable: df[df['col'].astype(bool)]

Empty strings are falsy, which means you can filter on the boolean values like this:

df = pd.DataFrame({
    'A': range(5),
    'B': ['foo', '', 'bar', '', 'xyz']
})
df
   A    B
0  0  foo
1  1     
2  2  bar
3  3     
4  4  xyz

df['B'].astype(bool)                                                                                                                      
0     True
1    False
2     True
3    False
4     True
Name: B, dtype: bool

df[df['B'].astype(bool)]                                                                                                                  
   A    B
0  0  foo
2  2  bar
4  4  xyz

If your goal is to remove not only empty strings, but also strings containing only whitespace, use str.strip beforehand:

df[df['B'].str.strip().astype(bool)]
   A    B
0  0  foo
2  2  bar
4  4  xyz

Faster than you think

.astype is a vectorised operation, so this is faster than every option presented thus far. At least, from my tests. YMMV.

Here is a timing comparison; I've thrown in some other methods I could think of.

[Timing comparison plot (perfplot output) omitted; the benchmarking code is below.]

Benchmarking code, for reference:

import numpy as np
import pandas as pd
import perfplot

df1 = pd.DataFrame({
    'A': range(5),
    'B': ['foo', '', 'bar', '', 'xyz']
})

perfplot.show(
    setup=lambda n: pd.concat([df1] * n, ignore_index=True),
    kernels=[
        lambda df: df[df['B'].astype(bool)],
        lambda df: df[df['B'] != ''],
        lambda df: df[df['B'].replace('', np.nan).notna()],  # optimized 1-col
        lambda df: df.replace({'B': {'': np.nan}}).dropna(subset=['B']),  
    ],
    labels=['astype', "!= ''", "replace + notna", "replace + dropna", ],
    n_range=[2**k for k in range(1, 15)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=pd.DataFrame.equals)