Python Pandas - Drop rows based on value

Question

Asked by philshem

I have a Pandas dataframe with columns A and B

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))

I create column C, which holds the value of A when A > B and is NaN otherwise

df['C'] = np.select([ df.A > df.B ], [df.A], default=np.NaN)

That gives:

    A   B     C
0  95  19  95.0
1  46  11  46.0
2  96  86  96.0
3  22  61   NaN
4  69   1  69.0
5  78  91   NaN
6  42   7  42.0
7  24  28   NaN
8  55  92   NaN
9  92  16  92.0

I then drop rows that have df.C as NaN with one of several methods:

df = df.dropna(subset=['C'], how='any')

or

df = df.drop(df[pd.isnull(df.C)].index)

or

df = df.drop(df[(pd.isnull(df.C))].index)

and all 3 methods leave me with roughly half the rows. In this case:

    A   B     C
0  95  19  95.0
1  46  11  46.0
2  96  86  96.0
4  69   1  69.0
6  42   7  42.0
9  92  16  92.0

But when column C holds something other than a number, for example a string:

df['C'] = np.select([ df.A > df.B ], ['yes'], default=np.NaN)

Then those same 3 methods no longer drop the rows where df.C is NaN. For example, when df.A > df.B sets column C to 'yes', I get something like this:

    A   B    C
0   6  70  nan
1  85  46  yes
2  76  87  nan
3  77  36  yes
4  73  18  yes
5   1  41  nan
6  19  69  nan
7  62  89  nan
8   6   7  nan
9  35  75  nan

I can work around this by replacing np.NaN with a string like 'IGNORE' and then filtering out 'IGNORE', but I find the original result unexpected.

df['C'] = np.select([ df.A > df.B ], ['yes'], default='IGNORE')
df = df.drop(df[(df.C == 'IGNORE')].index)

What's going on here? (When df.C holds strings, are my np.NaN values being converted to strings?)

I'm using 64 bit Python 2.7.13, Pandas 0.19.2, and Numpy 1.11.3 on Windows 10.

Answer 1

Accepted answer by Psidom

Your case is similar to this one:

np.array([1,2,'3',np.nan])
# array(['1', '2', '3', 'nan'], 
#       dtype='<U21')

since np.select also returns an array. If you check further:

type(np.nan)
# float

str(np.nan)
# 'nan'

So np.nan is a float, but a NumPy array holds a single data type (structured arrays aside), so when the array contains a string element, all elements are converted to strings and np.nan becomes the string 'nan'.
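To make the consequence for pandas concrete, here is a minimal sketch with a hand-built two-element array (the values are made up for illustration). It shows that once the float NaN has been coerced into the string 'nan', the null check no longer recognizes it:

```python
import numpy as np
import pandas as pd

# Mixing a string with np.nan forces NumPy to pick one dtype: a string dtype.
mixed = np.array(['yes', np.nan])

print(mixed.dtype.kind)   # 'U' means a unicode string dtype
print(repr(mixed[1]))     # the float NaN is now the text 'nan'

# pandas sees ordinary strings, so nothing counts as missing.
s = pd.Series(mixed)
print(s.isnull().tolist())
print((s == 'nan').tolist())
```

This is why dropna() and the pd.isnull()-based filters leave every row in place: there is no missing value left to find, only a string that happens to read 'nan'.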

In your case, for a string column you can use None instead of np.nan as the default. This creates a missing value that passes the isnull() check and works with dropna():

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))
df['C'] = np.select([ df.A > df.B ], ['yes'], default=None)

df.dropna()

#     A   B    C
# 0  82   1  yes
# 3  84   8  yes
# 6  52  30  yes
# 7  68  61  yes
# 9  91  87  yes
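Because the example above uses random data, the output changes on every run. The following sketch repeats the same idea with a small fixed frame (the numbers are picked by hand for illustration) so the result can be checked by eye:

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the outcome is reproducible.
df = pd.DataFrame({'A': [95, 22, 69, 24], 'B': [19, 61, 1, 28]})

# default=None keeps a real missing value instead of the string 'nan'.
df['C'] = np.select([df.A > df.B], ['yes'], default=None)

print(df['C'].isnull().tolist())

# Only rows where A > B (index 0 and 2) should survive.
cleaned = df.dropna(subset=['C'])
print(cleaned)
```

Passing subset=['C'] limits the check to column C, which matters if other columns may legitimately contain missing values.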

Answer 2

Answered by MANOJ REDDY

Instead of dropping rows, keep only the rows with finite values.

df = df[np.isfinite(df['C'])]

Edit:

As per your comment, nan is of type string, so remove the rows based on the value:

df = df[df.C != "nan"] will work:

df[df.C.notnull()]
    A   B    C
0  67  23  yes
1  91  61  yes
2  30  92  nan
3  53  97  nan
4  81  11  yes
5  23   7  yes
6  47  39  yes
7  11  27  nan
8  46  55  nan
9  31  82  nan

df = df[df.C != "nan"]

    A   B    C
0  67  23  yes
1  91  61  yes
4  81  11  yes
5  23   7  yes
6  47  39  yes
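If the column has already been built with 'nan' strings, another option (not from the answers above, just a sketch under that assumption) is to turn those strings back into real missing values and then use dropna() as usual. The frame below is hand-built to mimic that situation:

```python
import numpy as np
import pandas as pd

# Column C already contains the text 'nan' rather than a missing value.
df = pd.DataFrame({'A': [67, 30, 81],
                   'B': [23, 92, 11],
                   'C': ['yes', 'nan', 'yes']})

# Convert the literal string back into a proper missing value...
df['C'] = df['C'].replace('nan', np.nan)

# ...so the standard tools work again.
cleaned = df.dropna(subset=['C'])
print(cleaned)
```

Note that this treats every cell whose text is exactly 'nan' as missing, so it is only appropriate when that string cannot be a legitimate value in the column.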
