Python: selecting with complex criteria from pandas.DataFrame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/15315452/

Time: 2020-08-18 19:46:32  Source: igfitidea

Selecting with complex criteria from pandas.DataFrame

Tags: python, pandas

Asked by Gill Bates

For example, I have a simple DataFrame:

import pandas as pd
from random import randint

# range replaces the Python 2-only xrange
df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9)*10 for x in range(10)],
                   'C': [randint(1, 9)*100 for x in range(10)]})

Can I select the values from 'A' for which the corresponding values of 'B' are greater than 50 and the corresponding values of 'C' are not equal to 900, using the methods and idioms of pandas?

Accepted answer by DSM

Sure! Setup:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
...                    'B': [randint(1, 9)*10 for x in range(10)],
...                    'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
   A   B    C
0  9  40  300
1  9  70  700
2  5  70  900
3  8  80  900
4  7  50  200
5  9  30  900
6  2  80  700
7  2  80  400
8  5  80  300
9  7  70  800

We can apply column operations and get boolean Series objects:

>>> df["B"] > 50
0    False
1     True
2     True
3     True
4    False
5    False
6     True
7     True
8     True
9     True
Name: B
>>> (df["B"] > 50) & (df["C"] == 900)
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
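
As a minimal sketch of the step above, the comparisons can be stored as named boolean masks and combined element-wise with `&` (and), `|` (or) and `~` (not). The fixed values below are copied from the random frame shown in this answer so that the result is reproducible; the names `b_mask`, `c_mask`, `both`, `either` and `b_not_c` are illustrative only.

```python
import pandas as pd

# Fixed copy of the sample frame shown above (originally random).
df = pd.DataFrame({'A': [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
                   'B': [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
                   'C': [300, 700, 900, 900, 200, 900, 700, 400, 300, 800]})

b_mask = df["B"] > 50        # boolean Series, one entry per row
c_mask = df["C"] == 900

both = b_mask & c_mask       # element-wise AND
either = b_mask | c_mask     # element-wise OR
b_not_c = b_mask & ~c_mask   # AND with the negated second mask

# Summing a boolean Series counts the rows that are True.
print(both.sum(), either.sum(), b_not_c.sum())
```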

[Update, to switch to new-style .loc]:

And then we can use these to index into the object. For read access, you can chain indices:

>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2    5
3    8
Name: A, dtype: int64

but you can get yourself into trouble doing this for write access, because of the difference between a view and a copy. You can use .loc instead:

>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2    5
3    8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
      A   B    C
0     9  40  300
1     9  70  700
2  5000  70  900
3  8000  80  900
4     7  50  200
5     9  30  900
6     2  80  700
7     2  80  400
8     5  80  300
9     7  70  800

Note that I accidentally typed == 900 and not != 900 (or, equivalently, ~(df["C"] == 900)), but I'm too lazy to fix it. Exercise for the reader. :^)
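
For completeness, here is one possible take on that exercise: a sketch that uses `!= 900`, as the question actually asked. The data is a fixed copy of the sample frame from this answer, and the names `mask` and `selected` are illustrative only.

```python
import pandas as pd

# Fixed copy of the sample frame from the answer (originally random).
df = pd.DataFrame({'A': [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
                   'B': [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
                   'C': [300, 700, 900, 900, 200, 900, 700, 400, 300, 800]})

# B greater than 50 AND C not equal to 900.
mask = (df["B"] > 50) & (df["C"] != 900)

# Read access: values of A on the matching rows.
selected = df.loc[mask, "A"]
print(selected)

# Write access goes through a single .loc call, so it modifies df itself.
df.loc[mask, "A"] *= 1000
print(df)
```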

Answer by Nikos Tavoularis

Another solution is to use the query method:

import pandas as pd
from random import randint

df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9) * 10 for x in range(10)],
                   'C': [randint(1, 9) * 100 for x in range(10)]})
print(df)

   A   B    C
0  7  20  300
1  7  80  700
2  4  90  100
3  4  30  900
4  7  80  200
5  7  60  800
6  3  80  900
7  9  40  100
8  6  40  100
9  3  10  600

print(df.query('B > 50 and C != 900'))

   A   B    C
1  7  80  700
2  4  90  100
4  7  80  200
5  7  60  800

Now if you want to change the returned values in column A you can save their index:

my_query_index = df.query('B > 50 & C != 900').index

...and use .iloc to change them (this relies on the default integer index, where labels and positions coincide), i.e.:

df.iloc[my_query_index, 0] = 5000

print(df)

      A   B    C
0     7  20  300
1  5000  80  700
2  5000  90  100
3     4  30  900
4  5000  80  200
5  5000  60  800
6     3  80  900
7     9  40  100
8     6  40  100
9     3  10  600
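
Since `query(...).index` holds index labels rather than positions, a label-based variant with `.loc` may be the safer choice when the index is not the default 0..n-1 range. The sketch below assumes a small hand-made frame with a string index; the data and the name `matching_labels` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a non-default (string) index.
df = pd.DataFrame({'A': [7, 7, 4, 4, 7],
                   'B': [20, 80, 90, 30, 80],
                   'C': [300, 700, 100, 900, 200]},
                  index=list("vwxyz"))

# query returns the matching rows; .index gives their labels.
matching_labels = df.query('B > 50 and C != 900').index

# .loc is label-based, so it works whatever the index looks like.
df.loc[matching_labels, 'A'] = 5000
print(df)
```

With this index, passing `matching_labels` to `.iloc` would not work, because `.iloc` expects integer positions.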

Answer by Christopher Matthews

You can use pandas' built-in comparison methods. So if you want to select the values of "A" for which the conditions on "B" and "C" are met (assuming you want a pandas DataFrame back):

df[['A']][df.B.gt(50) & df.C.ne(900)]

df[['A']] will give you back column A in DataFrame format.

The pandas gt method returns a boolean Series marking the rows of column B that are greater than 50, and ne marks the rows of column C that are not equal to 900.
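
A short sketch of what these methods return, on a small hand-made frame (the data and the names `mask`, `as_frame` and `as_series` are illustrative). It also shows that `gt` and `ne` give the same result as the `>` and `!=` operators.

```python
import pandas as pd

df = pd.DataFrame({'A': [7, 7, 4, 4, 7],
                   'B': [20, 80, 90, 30, 80],
                   'C': [300, 700, 100, 900, 200]})

# gt/ne are method forms of > and != and return boolean Series.
mask = df.B.gt(50) & df.C.ne(900)
print(mask.equals((df.B > 50) & (df.C != 900)))

as_frame = df[['A']][mask]      # double brackets keep a DataFrame
as_series = df.loc[mask, 'A']   # a single label gives a Series

print(type(as_frame).__name__, type(as_series).__name__)
```

One possible advantage of the method form is that it needs no extra parentheses around each condition.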

Answer by Tomasz Bartkowiak

And remember to use parentheses!

Keep in mind that the & operator takes precedence over comparison operators such as > or <. That is why

4 < 5 & 6 > 4

evaluates to False. Therefore, if you're using df.loc, you need to put parentheses around each condition; otherwise you get an error. That's why you should write:

df.loc[(df['A'] > 10) & (df['B'] < 15)]

instead of

df.loc[df['A'] > 10 & df['B'] < 15]

which would result in

TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
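
The precedence issue can be reproduced with the small sketch below. The float-valued frame is made up for illustration. The exact exception type and message appear to depend on the column dtypes and the pandas version, so the example catches both TypeError and ValueError rather than assuming one of them.

```python
import pandas as pd

# Plain Python: & binds tighter than < and >,
# so this is parsed as 4 < (5 & 6) > 4, i.e. 4 < 4 > 4.
print(4 < 5 & 6 > 4)
print((4 < 5) & (6 > 4))

# Hypothetical frame with float columns.
df = pd.DataFrame({'A': [5.0, 12.0, 20.0],
                   'B': [10.0, 12.0, 30.0]})

# Parenthesized conditions work as intended.
ok = df.loc[(df['A'] > 10) & (df['B'] < 15)]
print(ok)

# Without parentheses, 10 & df['B'] is evaluated first and fails.
try:
    df.loc[df['A'] > 10 & df['B'] < 15]
    raised = False
except (TypeError, ValueError) as exc:
    raised = True
    print(type(exc).__name__)
```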
