Python: selecting from a pandas.DataFrame with complex criteria

Question

Asked by Gill Bates

For example, I have a simple DataFrame:

import pandas as pd
from random import randint

# range replaces the Python 2-only xrange
df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9)*10 for x in range(10)],
                   'C': [randint(1, 9)*100 for x in range(10)]})

Can I use the methods and idioms of pandas to select the values from 'A' for which the corresponding values in 'B' are greater than 50 and the corresponding values in 'C' are not equal to 900?

Answer 1

Accepted answer by DSM

Sure! Setup:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9)*10 for x in range(10)],
                   'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
   A   B    C
0  9  40  300
1  9  70  700
2  5  70  900
3  8  80  900
4  7  50  200
5  9  30  900
6  2  80  700
7  2  80  400
8  5  80  300
9  7  70  800

We can apply column operations and get boolean Series objects:

>>> df["B"] > 50
0    False
1     True
2     True
3     True
4    False
5    False
6     True
7     True
8     True
9     True
Name: B
>>> (df["B"] > 50) & (df["C"] == 900)
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False

[Update, to switch to new-style .loc]:

And then we can use these to index into the object. For read access, you can chain indices:

>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2    5
3    8
Name: A, dtype: int64

but doing this for write access can get you into trouble because of the difference between a view and a copy. You can use .loc instead:

>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2    5
3    8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
      A   B    C
0     9  40  300
1     9  70  700
2  5000  70  900
3  8000  80  900
4     7  50  200
5     9  30  900
6     2  80  700
7     2  80  400
8     5  80  300
9     7  70  800
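
The view-versus-copy caveat can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the answer: it uses the chained form df[mask]["A"] (chosen because selecting rows with a boolean mask returns a copy, so the write is lost), the data is the frame printed earlier before the update, and the names after_chained and after_loc are illustrative. The warnings filter only silences the chained-assignment warning that pandas emits.

```python
import warnings

import pandas as pd

# Same values as the example frame shown in the answer (before the update).
df = pd.DataFrame({
    "A": [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
    "B": [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
    "C": [300, 700, 900, 900, 200, 900, 700, 400, 300, 800],
})
mask = (df["B"] > 50) & (df["C"] == 900)

# Chained indexing: df[mask] builds a temporary copy, so the assignment
# modifies that copy and the original frame is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask]["A"] = 0
after_chained = df["A"].tolist()

# A single .loc call selects rows and column together and writes in place.
df.loc[mask, "A"] = 0
after_loc = df["A"].tolist()

print(after_chained)
print(after_loc)
```

Only the .loc assignment changes rows 2 and 3 of the original frame.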

Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)
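
For readers who want the condition the question actually asked for, here is a minimal sketch using != 900. It assumes the same values as the frame printed in this answer; the variable names are illustrative.

```python
import pandas as pd

# Same values as the example frame shown in the answer.
df = pd.DataFrame({
    "A": [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
    "B": [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
    "C": [300, 700, 900, 900, 200, 900, 700, 400, 300, 800],
})

# B greater than 50 AND C not equal to 900.
mask = (df["B"] > 50) & (df["C"] != 900)

# Negating the equality test with ~ gives the same boolean Series.
mask_negated = (df["B"] > 50) & ~(df["C"] == 900)

selected = df.loc[mask, "A"]
print(selected)
```

Both spellings of the mask pick rows 1, 6, 7, 8 and 9.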

Answer 2

Answered by Nikos Tavoularis

Another solution is to use the query method:

import pandas as pd
from random import randint

# range and print() replace the Python 2-only xrange and print statement
df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9) * 10 for x in range(10)],
                   'C': [randint(1, 9) * 100 for x in range(10)]})
print(df)

   A   B    C
0  7  20  300
1  7  80  700
2  4  90  100
3  4  30  900
4  7  80  200
5  7  60  800
6  3  80  900
7  9  40  100
8  6  40  100
9  3  10  600

print(df.query('B > 50 and C != 900'))

   A   B    C
1  7  80  700
2  4  90  100
4  7  80  200
5  7  60  800

Now, if you want to change the returned values in column A, you can save their index:

my_query_index = df.query('B > 50 & C != 900').index

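...and use .iloc to change them. This works here only because the default integer index labels coincide with the row positions; with any other index, df.loc[my_query_index, 'A'] would be the label-based equivalent. I.e.: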

df.iloc[my_query_index, 0] = 5000

print(df)

      A   B    C
0     7  20  300
1  5000  80  700
2  5000  90  100
3     4  30  900
4  5000  80  200
5  5000  60  800
6     3  80  900
7     9  40  100
8     6  40  100
9     3  10  600
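
As a hedged sketch of the same idea with label-based assignment: the index returned by query holds labels, so passing it to .loc should also work when the index is not the default 0..n-1 range. The data is the frame printed in this answer; the names min_b, excluded_c, labelled and hits are illustrative, and the @ prefix is the query syntax for referring to local Python variables.

```python
import pandas as pd

# Same values as the example frame printed in the answer.
df = pd.DataFrame({
    "A": [7, 7, 4, 4, 7, 7, 3, 9, 6, 3],
    "B": [20, 80, 90, 30, 80, 60, 80, 40, 40, 10],
    "C": [300, 700, 100, 900, 200, 800, 900, 100, 100, 600],
})

# A copy with string labels instead of the default integer index.
labelled = df.copy()
labelled.index = list("abcdefghij")

min_b = 50
excluded_c = 900

# query returns the matching rows; .index holds their labels.
matched = df.query("B > @min_b and C != @excluded_c").index
df.loc[matched, "A"] = 5000

# The same label-based update on the string-indexed copy.
hits = labelled.query("B > @min_b and C != @excluded_c").index
labelled.loc[hits, "A"] = 5000

print(df)
print(labelled)
```

With the string index, .iloc would reject the labels, while .loc accepts them directly.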

Answer 3

Answered by Christopher Matthews

pandas has some built-in methods for comparison. So if you want to select the values of "A" for which the conditions on "B" and "C" are met (assuming you want a pandas DataFrame back):

df[['A']][df.B.gt(50) & df.C.ne(900)]

df[['A']] will give you back column A in DataFrame format.

The 'gt' method returns a boolean Series that is True where column B is greater than 50, and 'ne' returns one that is True where column C is not equal to 900.
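
A small sketch to check that the method form and the operator form build the same mask. The data is an assumption (any numeric frame with columns A, B and C would do) and the variable names are illustrative.

```python
import pandas as pd

# Illustrative data; not taken from this answer.
df = pd.DataFrame({
    "A": [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
    "B": [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
    "C": [300, 700, 900, 900, 200, 900, 700, 400, 300, 800],
})

# Method form: gt / ne return boolean Series, combined with &.
method_mask = df["B"].gt(50) & df["C"].ne(900)

# Operator form: each comparison needs its own parentheses.
operator_mask = (df["B"] > 50) & (df["C"] != 900)

# Double brackets keep the result a one-column DataFrame.
result = df[["A"]][method_mask]
print(result)
```

Because gt and ne are method calls, they need no extra parentheses around each comparison.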

Answer 4

Answered by Tomasz Bartkowiak

And remember to use parentheses!

Keep in mind that the & operator takes precedence over comparison operators such as > or <. That is why

4 < 5 & 6 > 4

evaluates to False: Python reads it as 4 < (5 & 6) > 4, and 5 & 6 is 4. Therefore, if you're using df.loc, you need to put parentheses around each comparison, otherwise you get an error. That's why you should write:

df.loc[(df['A'] > 10) & (df['B'] < 15)]

instead of

df.loc[df['A'] > 10 & df['B'] < 15]

which would result in

TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
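
The precedence rule can be checked in plain Python and then applied to a frame. This is a minimal sketch; the float data and the variable names are assumptions made for illustration.

```python
import pandas as pd

# & binds tighter than < and >, so this is 4 < (5 & 6) > 4,
# i.e. the chained comparison 4 < 4 > 4.
unparenthesized = 4 < 5 & 6 > 4

# With parentheses, each comparison is evaluated first.
parenthesized = (4 < 5) & (6 > 4)

# Illustrative float data.
df = pd.DataFrame({"A": [5.0, 12.0, 20.0], "B": [10.0, 14.0, 30.0]})

# Parenthesize each comparison before combining the masks with &.
mask = (df["A"] > 10) & (df["B"] < 15)
rows = df.loc[mask]

print(unparenthesized, parenthesized)
print(rows)
```

Only the row with A = 12.0 and B = 14.0 satisfies both conditions.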
