Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/20062684/

Pandas OR statement ending in series contains

python pandas

Asked by TristanMatthews

I have a DataFrame df that has columns type and subtype and about 100k rows. I'm trying to classify what kind of data df contains by checking type/subtype combinations. While df can contain many different combinations, there are particular combinations that only appear in certain data types. To check whether my object contains any of these combinations, I'm currently doing:

# The outer parentheses let the expression continue across lines
typeA = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
          (df.subtype == 5) | (df.subtype == 6))) |
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) |
          (df.subtype == 7) | (df.subtype == 8))))
A = typeA.sum()

Here typeA is a long Series of False values that might contain some True values; if A > 0, I know it contained at least one True. The problem with this scheme is that even if the first row of df produces a True, it still has to check every other row. Checking the whole DataFrame is faster than using a for loop with a break, but I'm wondering if there is a better way to do it.

Thanks for any suggestions.
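
For reference, the check described in the question can be written more compactly. The following is a minimal sketch and not part of the original post: the sample data is made up, and it assumes the integer type and subtype columns described above. It swaps each chain of == comparisons for Series.isin and tests for a match with any() instead of comparing the sum with zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the 100k-row DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(100_000, 2)), columns=["type", "subtype"])

# Same condition as in the question, with each chain of == comparisons
# replaced by a single isin() call
type_a = ((df["type"] == 0) & df["subtype"].isin([2, 3, 5, 6])) | (
    (df["type"] == 5) & df["subtype"].isin([3, 4, 7, 8])
)

# any() asks "is there at least one True?" directly
contains_type_a = bool(type_a.any())
print(contains_type_a)
```

The mask is still computed for every row, so this version is shorter to read but does not exit early on the first match.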

Accepted answer by HYRY

Use pandas crosstab:

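import numpy as np
import pandas as pd

# Random sample data: 100 rows with type and subtype values in 0..9
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=["type", "subtype"])
# Count the rows for every (type, subtype) combination
counts = pd.crosstab(df.type, df.subtype)

# Total number of rows that match the combinations of interest
print(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())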

The result is the same as:

a = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | 
         (df.subtype == 5) | (df.subtype == 6))) | 
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | 
         (df.subtype ==  8))))
a.sum()
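
One caveat with the crosstab lookup: if a type or subtype value never occurs in the data, its row or column is absent from the table and .loc raises a KeyError. The sketch below is one possible guard, not part of the original answer. It assumes the labels run from 0 to 9, as in the random sample data, and reindexes the table so that missing combinations count as zero. It then compares the total with the boolean-mask count.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, generated the same way as in the answer
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 10, size=(100, 2)), columns=["type", "subtype"])

# Assumption: every label of interest lies in 0..9. Reindexing adds any
# label that never occurs, with a count of 0 instead of a missing entry.
labels = range(10)
counts = pd.crosstab(df["type"], df["subtype"]).reindex(
    index=labels, columns=labels, fill_value=0
)

from_crosstab = counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum()

# The same count from a boolean mask, for comparison
mask = ((df["type"] == 0) & df["subtype"].isin([2, 3, 5, 6])) | (
    (df["type"] == 5) & df["subtype"].isin([3, 4, 7, 8])
)
from_mask = mask.sum()

print(from_crosstab, from_mask)
```

The crosstab route pays off mainly when several different sets of combinations are checked against the same table, since the counting is done once.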

Answer by Andy Hayden

In pandas 0.13 (soon to be released) you can pass this as a query, which will use numexpr and should be more efficient for your use case:

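# The query is a single string; Python joins the adjacent literals.
# Recent pandas versions only resolve local variables that carry an "@"
# prefix, so df.type is written @df.type here.
df.query("((@df.type == 0) & ((@df.subtype == 2) | (@df.subtype == 3) | "
         "(@df.subtype == 5) | (@df.subtype == 6))) | "
         "((@df.type == 5) & ((@df.subtype == 3) | (@df.subtype == 4) | "
         "(@df.subtype == 7) | (@df.subtype == 8)))")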

Note: I would probably clean up the indentation to make this more readable (you can also replace df.type with type in most cases):

df.query("((type == 0) & ((subtype == 2)"
                        "|(subtype == 3)"
                        "|(subtype == 5)"
                        "|(subtype == 6)))"
        "|((type == 5) & ((subtype == 3)"
                        "|(subtype == 4)"
                        "|(subtype == 7)"
                        "|(subtype ==  8)))")

Update: You may be able to do this more efficiently, and certainly more concisely, using the "in" syntax:

df.query("(type == 0) & (subtype in [2, 3, 5, 6])"
        "|(type == 5) & (subtype in [3, 4, 7, 8])")
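
To tie the "in" query back to the original question, here is a small self-contained sketch (the sample data and variable names are made up for illustration). It runs the query and then checks whether any row matched. query() returns the matching rows rather than a boolean Series, so the test is whether the result is empty.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with type and subtype values in 0..9
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.integers(0, 10, size=(1_000, 2)), columns=["type", "subtype"])

# Adjacent string literals are joined into one query expression
matches = df.query(
    "(type == 0) & (subtype in [2, 3, 5, 6])"
    " | (type == 5) & (subtype in [3, 4, 7, 8])"
)

has_type_a = not matches.empty  # True if at least one row matched
n_type_a = len(matches)         # number of matching rows
print(has_type_a, n_type_a)
```

Like the boolean-mask version, the query still evaluates every row, so the gain here is readability and possibly speed from numexpr, not an early exit.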