Python Pandas - Drop row based on value

Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/42306473/

Date: 2020-09-14 03:00:47  Source: igfitidea

Python Pandas - Drop row based on value

Tags: python, python-2.7, pandas, numpy, dataframe

Asked by philshem

I have a Pandas dataframe with columns A and B

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))

I create column C, which holds the value of A if A > B and is NaN otherwise

df['C'] = np.select([ df.A > df.B ], [df.A], default=np.nan)

That gives:

    A   B     C
0  95  19  95.0
1  46  11  46.0
2  96  86  96.0
3  22  61   NaN
4  69   1  69.0
5  78  91   NaN
6  42   7  42.0
7  24  28   NaN
8  55  92   NaN
9  92  16  92.0

I then drop rows that have df.C as NaN with one of several methods:

df = df.dropna(subset=['C'], how='any')

or

df = df.drop(df[pd.isnull(df.C)].index)

or

df = df.drop(df[(pd.isnull(df.C))].index)

and all 3 methods leave me with roughly half the rows. In this case:

    A   B     C
0  95  19  95.0
1  46  11  46.0
2  96  86  96.0
4  69   1  69.0
6  42   7  42.0
9  92  16  92.0
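
As a quick check, the three drop variants above can be compared on a small fixed frame. This is only a minimal sketch: the sample values and the names `by_dropna`, `by_drop`, and `by_mask` are made up for illustration, and `np.where` stands in for the single-condition `np.select`.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data instead of random integers, so the result is reproducible.
df = pd.DataFrame({"A": [95, 22, 69, 24], "B": [19, 61, 1, 28]})

# Numeric C: the value of A where A > B, otherwise a real float NaN.
df["C"] = np.where(df.A > df.B, df.A, np.nan)

by_dropna = df.dropna(subset=["C"])
by_drop = df.drop(df[pd.isnull(df.C)].index)
by_mask = df[df.C.notnull()]  # boolean-mask variant of the same filter

print(by_dropna)
```

With a numeric column the NaN values are genuine floats, which is why all three variants agree.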

But when I don't use a number, for example a string:

df['C'] = np.select([ df.A > df.B ], ['yes'], default=np.nan)

Then those same 3 methods no longer drop the rows where df.C is NaN. For example, when df.A > df.B sets column C to yes, I get something like this:

    A   B    C
0   6  70  nan
1  85  46  yes
2  76  87  nan
3  77  36  yes
4  73  18  yes
5   1  41  nan
6  19  69  nan
7  62  89  nan
8   6   7  nan
9  35  75  nan

I can fix this by replacing np.nan with a string like 'IGNORE' and then filtering on 'IGNORE', but I find this result otherwise unexpected.

df['C'] = np.select([ df.A > df.B ], ['yes'], default='IGNORE')
df = df.drop(df[(df.C == 'IGNORE')].index)

What's going on here? (When df.C is a string, are my np.nan values being converted to strings?)

I'm using 64 bit Python 2.7.13, Pandas 0.19.2, and Numpy 1.11.3 on Windows 10.

Accepted answer by Psidom

Your case is similar to this one:

np.array([1,2,'3',np.nan])
# array(['1', '2', '3', 'nan'], 
#       dtype='<U21')

since np.select also returns an array. If you check further:

type(np.nan)
# float

str(np.nan)
# 'nan'

So np.nan is a float, but a NumPy array holds a single data type (structured arrays aside), so when there is a string element in the array, all elements are converted to strings.
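
The coercion can be observed directly. The sketch below uses a two-element array as a stand-in for what np.select returned in the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Mixing a string with np.nan forces a single string dtype for the whole array.
arr = np.array(["yes", np.nan])
print(arr.dtype.kind)   # 'U' denotes a unicode string dtype
print(str(arr[1]))      # the float NaN has been stored as text

# pandas treats that text as an ordinary value, not as missing data.
s = pd.Series(arr)
print(s.isnull().tolist())
```

Because every element is now a string, isnull() finds nothing to flag, and dropna() has nothing to drop.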

For your case, if you have a string column, you can use None in place of np.nan as the default. This creates a missing value that passes the isnull() check and works with dropna():

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))
df['C'] = np.select([ df.A > df.B ], ['yes'], default=None)

df.dropna()

#    A  B     C
#0  82  1   yes
#3  84  8   yes
#6  52  30  yes
#7  68  61  yes
#9  91  87  yes
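
An alternative that avoids the intermediate NumPy string array altogether is to build the column in pandas. The sketch below is not part of the answer above; it relies on `Series.where`, which keeps values where the condition holds and fills the rest with a missing value by default. The sample values and the name `labels` are made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [6, 85, 76, 77], "B": [70, 46, 87, 36]})

# Start from a column of 'yes' and blank it out wherever A > B is False.
labels = pd.Series("yes", index=df.index)
df["C"] = labels.where(df.A > df.B)

# The gaps are real missing values, so isnull() and dropna() see them.
kept = df.dropna(subset=["C"])
print(kept)
```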

Answer by MANOJ REDDY

Instead of dropping, take only finite values.

df = df[np.isfinite(df['C'])]

Edit:

As per your comment, nan is of type string, so remove rows based on their values:

df = df[df.C != "nan"] will work

df[df.C.notnull()]

    A   B    C
0  67  23  yes
1  91  61  yes
2  30  92  nan
3  53  97  nan
4  81  11  yes
5  23   7  yes
6  47  39  yes
7  11  27  nan
8  46  55  nan
9  31  82  nan

df = df[df.C != "nan"]

    A   B    C
0  67  23  yes
1  91  61  yes
4  81  11  yes
5  23   7  yes
6  47  39  yes
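
If the column has already been built with the text 'nan' in it, another option is to turn that text back into a real missing value so that the usual tools work again. This is a minimal sketch with made-up data; it uses `Series.mask`, and choosing between it and the string comparison above is largely a matter of taste.

```python
import pandas as pd

# Hypothetical frame in the state described above: C holds the text 'nan'.
df = pd.DataFrame({"A": [67, 30, 81], "B": [23, 92, 11], "C": ["yes", "nan", "yes"]})

# Replace the text 'nan' with a genuine missing value.
df["C"] = df["C"].mask(df["C"] == "nan")

# dropna() and notnull() now behave as they do for numeric columns.
cleaned = df.dropna(subset=["C"])
print(cleaned)
```

Note that np.isfinite only accepts numeric input, so it suits the numeric version of column C but not the string one.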