Pandas OR statement ending in series contains
Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/20062684/
Asked by TristanMatthews
I have a DataFrame df that has columns type and subtype and about 100k rows. I'm trying to classify what kind of data df contains by checking type/subtype combinations. While df can contain many different combinations, there are particular combinations that only appear in certain data types. To check whether my object contains any of these combinations, I'm currently doing:
typeA = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
                            (df.subtype == 5) | (df.subtype == 6))) |
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) |
                            (df.subtype == 8))))
A = typeA.sum()
Here typeA is a long Series of Falses that might have some Trues; if A > 0 then I know it contained a True. The problem with this scheme is that if the first row of df produces a True, it still has to check everything else. Checking the whole DataFrame is faster than using a for loop with a break, but I'm wondering if there is a better way to do it.
Thanks for any suggestions.
Accepted answer by HYRY
Use pandas crosstab:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=["type", "subtype"])
# Count the rows for every (type, subtype) pair
counts = pd.crosstab(df.type, df.subtype)
print(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())
The result is the same as:
a = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
                        (df.subtype == 5) | (df.subtype == 6))) |
     ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) |
                        (df.subtype == 8))))
a.sum()
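The equivalence claimed above can be checked on seeded data. The following is only a sketch: the helper pair_count, the seed, and the row count are assumptions made for illustration, and isin is used here as a shorter stand-in for the chained == comparisons. The helper also uses reindex so that a type or subtype that never occurs counts as 0 rather than raising a KeyError.

```python
import numpy as np
import pandas as pd

# Seeded random data so the comparison is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 2)), columns=["type", "subtype"])

# Count every (type, subtype) pair once.
counts = pd.crosstab(df["type"], df["subtype"])


def pair_count(counts, type_value, subtypes):
    """Hypothetical helper: rows with the given type and any of the listed subtypes."""
    if type_value not in counts.index:
        return 0
    # reindex fills subtypes that never occur with 0 instead of raising KeyError
    return int(counts.loc[type_value].reindex(subtypes, fill_value=0).sum())


from_crosstab = pair_count(counts, 0, [2, 3, 5, 6]) + pair_count(counts, 5, [3, 4, 7, 8])

# The same combinations expressed as a boolean mask (isin replaces the == chains).
mask = ((df["type"] == 0) & df["subtype"].isin([2, 3, 5, 6])) | (
    (df["type"] == 5) & df["subtype"].isin([3, 4, 7, 8])
)
from_mask = int(mask.sum())

print(from_crosstab, from_mask)
```

Both numbers should agree, since each row belongs to exactly one cell of the crosstab.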
Answer by Andy Hayden
In pandas 0.13 (soon to be released) you can pass this as a query, which will use numexpr and should be more efficient for your use case:
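# Written for pandas 0.13; from pandas 0.14 on, a local variable such as df
# must be written as @df inside the query string.
df.query("((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | "
         "(df.subtype == 5) | (df.subtype == 6))) | "
         "((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | "
         "(df.subtype == 8)))")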
Note: I would probably clean up the indentation to make this more readable (you can also replace df.type with type in most cases):
df.query("((type == 0) & ((subtype == 2)"
"|(subtype == 3)"
"|(subtype == 5)"
"|(subtype == 6)))"
"|((type == 5) & ((subtype == 3)"
"|(subtype == 4)"
"|(subtype == 7)"
"|(subtype == 8)))")
Update: It may be possible to do this more efficiently, and certainly more concisely, using the "in" syntax:
df.query("(type == 0) & (subtype in [2, 3, 5, 6])"
"|(type == 5) & (subtype in [3, 4, 7, 8])")

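To answer the original yes/no question with the "in" form, one option is to test whether the query result is empty. This is a rough sketch on seeded random data; the variable names, seed, and row count are assumptions for illustration, and the isin mask is included only as a cross-check. Note that neither form stops at the first matching row.

```python
import numpy as np
import pandas as pd

# Seeded random data (the seed and size are arbitrary choices for this sketch).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 2)), columns=["type", "subtype"])

# Rows matching either combination, using the "in" syntax inside query.
matches = df.query(
    "(type == 0) & (subtype in [2, 3, 5, 6])"
    " | (type == 5) & (subtype in [3, 4, 7, 8])"
)
contains_type_a = not matches.empty

# Cross-check with a plain boolean mask built from isin.
mask = ((df["type"] == 0) & df["subtype"].isin([2, 3, 5, 6])) | (
    (df["type"] == 5) & df["subtype"].isin([3, 4, 7, 8])
)

print(contains_type_a, bool(mask.any()))
```

If only the boolean is needed, mask.any() avoids materialising the matching rows, while query keeps the condition readable as a single string.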
