Python: Pandas OR statement ending in series contains

Question

Asked by TristanMatthews

I have a DataFrame df that has columns type and subtype and about 100k rows. I'm trying to classify what kind of data df contains by checking type/subtype combinations. While df can contain many different combinations, there are particular combinations that only appear in certain data types. To check whether my object contains any of these combinations I'm currently doing:

typeA = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
         (df.subtype == 5) | (df.subtype == 6))) |
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) |
         (df.subtype == 8))))
A = typeA.sum()

Here typeA is a long Series of Falses that might contain some Trues; if A > 0 then I know it contained at least one True. The problem with this scheme is that even if the first row of df produces a True, it still has to check every other row. Checking the whole DataFrame is faster than using a for loop with a break, but I'm wondering if there is a better way to do it.

Thanks for any suggestions.

Answer 1

Accepted answer by HYRY

Use pandas crosstab:

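import numpy as np
import pandas as pd

# Random sample data with the same two columns as in the question
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=["type", "subtype"])
# Rows are type values, columns are subtype values, cells are occurrence counts
counts = pd.crosstab(df.type, df.subtype)

# Note: .loc raises KeyError if a listed type or subtype never occurs in df
print(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())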

The result is the same as:

a = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | 
         (df.subtype == 5) | (df.subtype == 6))) | 
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | 
         (df.subtype ==  8))))
a.sum()
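
As a rough cross-check of the equivalence claimed above, the sketch below computes the count both ways and compares them. The `wanted` dictionary and the `count_combinations` helper are hypothetical names introduced only for this illustration. It uses `reindex` with `fill_value=0` rather than `.loc`, on the assumption that some type or subtype values may be absent from the data, in which case a `.loc` lookup would raise a KeyError.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(100, 2)), columns=["type", "subtype"])

# Hypothetical lookup table: type value -> subtype values that mark "type A" data
wanted = {0: [2, 3, 5, 6], 5: [3, 4, 7, 8]}

counts = pd.crosstab(df["type"], df["subtype"])


def count_combinations(counts, wanted):
    """Sum the crosstab cells for the wanted pairs, treating absent labels as 0."""
    total = 0
    for type_value, subtypes in wanted.items():
        block = counts.reindex(index=[type_value], columns=subtypes, fill_value=0)
        total += int(block.to_numpy().sum())
    return total


# The same combinations expressed as a boolean mask, using isin instead of chained ORs
mask = pd.Series(False, index=df.index)
for type_value, subtypes in wanted.items():
    mask |= (df["type"] == type_value) & df["subtype"].isin(subtypes)

total_from_crosstab = count_combinations(counts, wanted)
total_from_mask = int(mask.sum())
contains_type_a = bool(mask.any())
```

Note that `mask.any()` only states the intent more directly than `mask.sum() > 0`; building the mask still evaluates every row, so neither form stops early at the first match.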

Answer 2

Answer by Andy Hayden

In pandas 0.13 (soon to be released at the time of writing) you can pass this as a query, which will use numexpr (when it is installed) and should be more efficient for your use case:

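# Adjacent string literals keep the query a single valid Python string.
# Note: current pandas does not resolve a bare df inside the string (local variables need an @ prefix).
df.query("((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | "
         "(df.subtype == 5) | (df.subtype == 6))) | "
         "((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | "
         "(df.subtype == 8)))")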

Note: I would probably clean up the indentation to make this more readable (you can also replace df.type with type in most cases):

df.query("((type == 0) & ((subtype == 2)"
                        "|(subtype == 3)"
                        "|(subtype == 5)"
                        "|(subtype == 6)))"
        "|((type == 5) & ((subtype == 3)"
                        "|(subtype == 4)"
                        "|(subtype == 7)"
                        "|(subtype ==  8)))")

Update: It may be possible to do this more efficiently, and certainly more concisely, using the "in" syntax:

df.query("(type == 0) & (subtype in [2, 3, 5, 6])"
        "|(type == 5) & (subtype in [3, 4, 7, 8])")
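
As a hedged illustration of the "in" form, the sketch below runs the query on random sample data and compares the result with an equivalent boolean mask. The list names `first` and `second` are assumptions made for this example; they are passed into the query string with the `@` prefix that pandas uses for local variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 2)), columns=["type", "subtype"])

# Hypothetical local lists, referenced from the query string with "@"
first = [2, 3, 5, 6]
second = [3, 4, 7, 8]

# Adjacent string literals are joined into one query expression
matches = df.query(
    "(type == 0) & (subtype in @first)"
    " | (type == 5) & (subtype in @second)"
)

# Equivalent boolean mask written with isin, used here only as a cross-check
mask = ((df["type"] == 0) & df["subtype"].isin(first)) | (
    (df["type"] == 5) & df["subtype"].isin(second)
)

# True when at least one row has one of the wanted combinations
contains_type_a = not matches.empty
```

Here `query` returns the matching rows rather than a boolean Series, so checking `matches.empty` (or `len(matches)`) takes the place of `typeA.sum()` in the question.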
