Python: drop rows containing empty cells from a pandas DataFrame

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29314033/

Date: 2020-08-19 04:23:24  Source: igfitidea

Drop rows containing empty cells from a pandas DataFrame

python, pandas

Asked by Amrita Sawant

I have a pd.DataFrame that was created by parsing some Excel spreadsheets. One of its columns has empty cells. For example, below is the frequency output for that column; 32320 records have missing values for Tenant.

>>> value_counts(Tenant, normalize=False)
                              32320
    Thunderhead                8170
    Big Data Others            5700
    Cloud Cruiser              5700
    Partnerpedia               5700
    Comcast                    5700
    SDP                        5700
    Agora                      5700
    dtype: int64

I am trying to drop the rows where Tenant is missing; however, the .isnull() check does not recognize the missing values.

>>> df['Tenant'].isnull().sum()
    0

The column has data type "object". What is happening in this case? How can I drop records where Tenant is missing?

Accepted answer by McMath

Pandas will recognise a value as null if it is an np.nan object, which prints as NaN in the DataFrame. Your missing values are probably empty strings, which Pandas doesn't recognise as null. To fix this, convert the empty strings (or whatever is in your empty cells) to np.nan objects using replace(), and then call dropna() on your DataFrame to delete the rows with null tenants.

To demonstrate, we create a DataFrame with some random values and some empty strings in a Tenants column:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
>>> df['Tenant'] = np.random.choice(['Babar', 'Rataxes', ''], 10)
>>> print(df)

          A         B   Tenant
0 -0.588412 -1.179306    Babar
1 -0.008562  0.725239         
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
4  0.805304 -0.834214         
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
9  0.066946  0.375640         

Now we replace any empty strings in the Tenants column with np.nan objects, like so:

>>> df['Tenant'].replace('', np.nan, inplace=True)
>>> print(df)

          A         B   Tenant
0 -0.588412 -1.179306    Babar
1 -0.008562  0.725239      NaN
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
4  0.805304 -0.834214      NaN
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
9  0.066946  0.375640      NaN

Now we can drop the null values:

>>> df.dropna(subset=['Tenant'], inplace=True)
>>> print(df)

          A         B   Tenant
0 -0.588412 -1.179306    Babar
2  0.282146  0.421721  Rataxes
3  0.627611 -0.661126    Babar
5 -0.514568  1.890647    Babar
6 -1.188436  0.294792  Rataxes
7  1.471766 -0.267807    Babar
8 -1.730745  1.358165  Rataxes
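
As an aside, on recent pandas versions, calling replace(..., inplace=True) on a selected column can raise a chained-assignment warning and, with copy-on-write enabled, may not modify the original DataFrame at all. A minimal sketch of the same two steps using plain assignment, applied to the Tenant column from the question:

import numpy as np

# Replace empty strings with NaN, then drop the rows with no Tenant.
df['Tenant'] = df['Tenant'].replace('', np.nan)
df = df.dropna(subset=['Tenant'])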

Answer by Bob Haffner

value_counts omits NaN by default, so you're most likely dealing with empty strings ("").

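To confirm whether the blanks really are empty strings rather than NaN, you can ask value_counts to keep NaN in the tally; a quick check, assuming the Tenant column from the question:

df['Tenant'].value_counts(dropna=False)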

So you can just filter them out, like this:

filter = df["Tenant"] != ""
dfNew = df[filter]

Answer by Amir F

You can use this variation:

import pandas as pd
vals = {
    'name' : ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7'],
    'gender' : ['m', 'f', 'f', 'f',  'f', 'c', 'c'],
    'age' : [39, 12, 27, 13, 36, 29, 10],
    'education' : ['ma', None, 'school', None, 'ba', None, None]
}
df_vals = pd.DataFrame(vals)  # convert the dict to a DataFrame
print(df_vals)

This will output the following (** marks only the desired rows):

   age education gender name
0   39        ma      m   n1 **
1   12      None      f   n2    
2   27    school      f   n3 **
3   13      None      f   n4
4   36        ba      f   n5 **
5   29      None      c   n6
6   10      None      c   n7

So to drop everything that does not have an 'education' value, use the code below:

df_vals = df_vals[~df_vals['education'].isnull()] 

('~' indicating NOT)

Result:

   age education gender name
0   39        ma      m   n1
2   27    school      f   n3
4   36        ba      f   n5
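
For what it's worth, the same filter can be written without the negation, using notna() or dropna(); a rough equivalent on the same df_vals:

# Keep only the rows that have an 'education' value.
df_vals = df_vals[df_vals['education'].notna()]
# or, equivalently:
df_vals = df_vals.dropna(subset=['education'])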

Answer by Learn

There's also the case where a cell contains whitespace that you can't see. Use

df['col'].replace('  ', np.nan, inplace=True)

to replace the whitespace with NaN, then

df = df.dropna(subset=['col'])
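
Note that the literal replacement above only matches that exact whitespace string. If the cells may contain any amount of whitespace, or be completely empty, a regex-based sketch that covers both cases (same hypothetical 'col' column) would be:

# Treat empty and whitespace-only strings as missing, then drop those rows.
df['col'] = df['col'].replace(r'^\s*$', np.nan, regex=True)
df = df.dropna(subset=['col'])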

Answer by cs95

Pythonic + Pandorable: df[df['col'].astype(bool)]

Empty strings are falsy, which means you can filter on the boolean values like this:

df = pd.DataFrame({
    'A': range(5),
    'B': ['foo', '', 'bar', '', 'xyz']
})
df
   A    B
0  0  foo
1  1     
2  2  bar
3  3     
4  4  xyz

df['B'].astype(bool)                                                                                                                      
0     True
1    False
2     True
3    False
4     True
Name: B, dtype: bool

df[df['B'].astype(bool)]                                                                                                                  
   A    B
0  0  foo
2  2  bar
4  4  xyz

If your goal is to remove not only empty strings, but also strings containing only whitespace, use str.strip beforehand:

df[df['B'].str.strip().astype(bool)]
   A    B
0  0  foo
2  2  bar
4  4  xyz

Faster than you think

.astype is a vectorised operation, so this is faster than every option presented thus far. At least, from my tests. YMMV.

Here is a timing comparison; I've thrown in some other methods I could think of.

[Timing comparison plot (perfplot output) omitted; the benchmarking code is below.]

Benchmarking code, for reference:

import numpy as np
import pandas as pd
import perfplot

df1 = pd.DataFrame({
    'A': range(5),
    'B': ['foo', '', 'bar', '', 'xyz']
})

perfplot.show(
    setup=lambda n: pd.concat([df1] * n, ignore_index=True),
    kernels=[
        lambda df: df[df['B'].astype(bool)],
        lambda df: df[df['B'] != ''],
        lambda df: df[df['B'].replace('', np.nan).notna()],  # optimized 1-col
        lambda df: df.replace({'B': {'': np.nan}}).dropna(subset=['B']),  
    ],
    labels=['astype', "!= ''", "replace + notna", "replace + dropna", ],
    n_range=[2**k for k in range(1, 15)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=pd.DataFrame.equals)