Original URL: http://stackoverflow.com/questions/48471648/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Pandas groupby with categories with redundant nan
Asked by jpp
I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.
I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas forces us to hold results for 1000^3 groups.
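As a rough illustration of the memory saving (a minimal sketch with made-up labels; exact numbers vary by platform and pandas version):

import pandas as pd

s = pd.Series(['a_rather_long_string_label'] * 10000)
s.memory_usage(deep=True)                      # object dtype: one full string per row
s.astype('category').memory_usage(deep=True)   # category: small integer codes plus one stored label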
My question: is there a convenient way to use groupby with categories while avoiding this untoward behaviour? I'm not looking for any of these solutions:
- Recreating all the functionality via numpy.
- Continually converting to strings/codes before groupby, reverting to categories later.
- Making a tuple column from group columns, then grouping by the tuple column.
I'm hoping there's a way to modify just this particular pandas idiosyncrasy. A simple example is below. Instead of the 4 categories I want in the output, I end up with 12.
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False).sum()
# Group1 Group2 Group3   Value
# A A A NaN
# A A C NaN
# A A D NaN
# A B A NaN
# A B C 54.34
# A B D 826.74
# B A A 765.40
# B A C 514.50
# B A D NaN
# B B A NaN
# B B C NaN
# B B D NaN
Bounty update
The issue is poorly addressed by the pandas development team (cf. github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:
- Why, with reference to pandas source code, is categorical data treated differently in groupby operations?
- Why would the current implementation be preferred? I appreciate this is subjective, but I am struggling to find any answer to this question. Current behaviour is prohibitive in many situations without cumbersome, potentially expensive, workarounds.
- Is there a clean solution to override pandas treatment of categorical data in groupby operations? Note the 3 no-go routes (dropping down to numpy; conversions to/from codes; creating and grouping by tuple columns). I would prefer a solution that is "pandas-compliant" to minimise / avoid loss of other pandas categorical functionality.
- A response from pandas development team to support and clarify existing treatment. Also, why should considering all category combinations not be configurable as a Boolean parameter?
Bounty update #2
To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas library methods so that categories are treated in a way that facilitates groupby / set_index operations.
Accepted answer by Ismael EL ATIFI
Since Pandas 0.23.0, the groupby method can take a parameter observed which fixes this issue if it is set to True (it is False by default).
Below is the exact same code as in the question, with just observed=True added:
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False, observed=True).sum()
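With observed=True, only the four combinations that actually occur in the data come back (the totals below follow from the question's data):

#   Group1 Group2 Group3   Value
# 0      A      B      C   54.34
# 1      A      B      D  826.74
# 2      B      A      A  765.40
# 3      B      A      C  514.50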
Answered by Gabriel A
I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

# Group by the underlying integer codes instead of the categorical columns.
result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()

# reset_index names the grouping levels level_0, level_1, ...; map them back.
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)

# Rebuild categorical columns from the codes.
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result
So the answer to this felt more like a general programming problem than a typical Pandas question. Under the hood, a categorical series is just a bunch of integers that index into an array of category names. I did a groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the arrays of integers back into categorical columns.
Group1 Group2 Group3 Value
A B C 54.34
A B D 826.74
B A A 765.40
B A C 514.50
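For reference, a minimal sketch of the codes/categories machinery this relies on (toy values, not from the original data):

cat = pd.Categorical(['x', 'y', 'x'], categories=['x', 'y', 'z'])
cat.codes         # array([0, 1, 0], dtype=int8) -- positions into cat.categories
cat.categories    # Index(['x', 'y', 'z'], dtype='object')
pd.Categorical.from_codes(cat.codes, categories=cat.categories)  # round-trips to the original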
So I understand that this isn't exactly your answer, but I've made my solution into a little function for people who have this problem in the future.
def categorical_groupby(df, group_cols, agg_function="sum"):
    """Does a groupby on a number of categorical columns without expanding to unobserved combinations."""
    # Group by the underlying integer codes rather than the categorical columns.
    result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_function)
    result = result.reset_index()
    # Map the auto-generated level_0, level_1, ... names back to the original column names.
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # Rebuild categorical columns from the codes.
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
    return result
call it like this:
df.pipe(categorical_groupby, group_cols)
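Any aggregation accepted by .agg() can be passed through, for example:

# e.g. means instead of sums, via the agg_function parameter defined above
df.pipe(categorical_groupby, group_cols, agg_function="mean")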
Answered by Tai
I found the behavior similar to what's documented in the operations section of Categorical Data.
In particular, similar to
In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

In [122]: df2 = pd.DataFrame({"cats":cats2, "B":["c","d","c","d"], "values":[1,2,3,4]})

In [123]: df2.groupby(["cats","B"]).mean()
Out[123]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
The docs also describe the related behavior for Series and groupby. There is also a pivot table example at the end of that section.
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:
Groupby will also show “unused” categories:
The quotes and the example above are cited from Categorical Data.
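For reference, the pivot table behavior mentioned above looks roughly like this on the same toy data (a sketch modeled on that docs example; pivot_table defaults to the mean):

pd.pivot_table(df2, values="values", index=["cats", "B"])
#         values
# cats B
# a    c     1.0
#      d     2.0
# b    c     3.0
#      d     4.0
# c    c     NaN    <- unused category still appears in the index
#      d     NaN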
Answered by Qusai Alothman
There are a lot of questions to be answered here.
Let's start by understanding what a 'category' is...
Definition of Categorical dtype
Quoting from the pandas docs for "Categorical Data":
Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.
There are two points I want to focus on here:
- The definition of categoricals as a statistical variable: basically, this means we have to look at them from a statistical point of view, not the "regular" programming one, i.e. they are not 'enumerates'. Statistical categorical variables have specific operations and use cases; you can read more about them in Wikipedia. I'll talk more about this after the second point.
- Categories are levels in R: we can understand more about categoricals if we read about R levels and factors. I don't know much about R, but I found this source simple and sufficient. Quoting an interesting example from it:

When a factor is first created, all of its levels are stored along with the factor, and if subsets of the factor are extracted, they will retain all of the original levels. This can create problems when constructing model matrices and may or may not be useful when displaying the data using, say, the table function. As an example, consider a random sample from the letters vector, which is part of the base R distribution.

> lets = sample(letters, size=100, replace=TRUE)
> lets = factor(lets)
> table(lets[1:5])
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1

Even though only five of the levels were actually represented, the table function shows the frequencies for all of the levels of the original factors. To change this, we can simply use another call to factor:

> table(factor(lets[1:5]))
a k q s z
1 1 1 1 1
Basically this tells us that displaying/using all the categories, even when they are not needed, is not that uncommon. Actually, it's the default behavior!
This is due to the usual use-cases of categorical variables in statistics. In almost all cases you do care about all the categories, even if they are unused. Take, for example, the pandas function cut.
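To make that concrete, a minimal sketch (toy bins of my own): cut returns a categorical, value_counts reports every bin including empty ones, and cat.remove_unused_categories is the pandas analogue of the second factor() call in the R example above:

ages = pd.Series([5, 25, 70])
binned = pd.cut(ages, bins=[0, 18, 40, 65, 100])
binned.value_counts(sort=False)
# (0, 18]      1
# (18, 40]     1
# (40, 65]     0    <- the empty bin is still reported
# (65, 100]    1
binned.cat.remove_unused_categories().value_counts(sort=False)
# (0, 18]      1
# (18, 40]     1
# (65, 100]    1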
I hope by this point you understand why this behavior exists in pandas.
GroupBy on Categorical Variables
As for why groupby considers all the combinations of categories: I can't say for sure, but my best guess, based on a quick review of the source code (and the GitHub issue you mentioned), is that they consider a groupby on categorical variables an interaction between them. Hence, it should consider all the pairs/tuples (like a Cartesian product). AFAIK, this helps a lot when you are trying to do something like ANOVA.
This also means that in this context you can't think of it in the usual SQL-like terminology.
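You can check the Cartesian-product arithmetic against the question's example (using the df and group_cols defined there): 2 x 2 x 3 = 12 combinations, matching the 12 rows in the question's output:

import itertools

combos = list(itertools.product(*(df[c].cat.categories for c in group_cols)))
len(combos)  # 12: Group1 and Group2 each have 2 categories, Group3 has 3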
Solutions?
OK, but what if you don't want this behavior?
To the best of my knowledge, and taking into account that I spent last night tracing this through the pandas source code, you can't "disable" it. It's hard-coded in every critical step.
However, because of the way groupby works, the actual "expanding" doesn't happen until it's needed, for example when calling sum over the groups or trying to print them.
Hence, you can do any of the following to get only the needed groups:
df.groupby(group_cols).indices
#{('A', 'B', 'C'): array([0]),
# ('A', 'B', 'D'): array([1, 4]),
# ('B', 'A', 'A'): array([3]),
# ('B', 'A', 'C'): array([2])}
df.groupby(group_cols).groups
#{('A', 'B', 'C'): Int64Index([0], dtype='int64'),
# ('A', 'B', 'D'): Int64Index([1, 4], dtype='int64'),
# ('B', 'A', 'A'): Int64Index([3], dtype='int64'),
# ('B', 'A', 'C'): Int64Index([2], dtype='int64')}
# an example: iterate over only the observed groups
grp = df.groupby(group_cols)
for g in grp.groups:
    print(g, grp.get_group(g).sum()[0])
#('A', 'B', 'C') 54.34
#('A', 'B', 'D') 826.74
#('B', 'A', 'A') 765.4
#('B', 'A', 'C') 514.5
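Alternatively (my own sketch, building on .indices above), you can assemble only the observed groups into a plain dict; .indices maps each observed key to positional row numbers, so iloc applies:

sums = {g: df['Value'].iloc[idx].sum()
        for g, idx in df.groupby(group_cols).indices.items()}
# {('A', 'B', 'C'): 54.34, ('A', 'B', 'D'): 826.74,
#  ('B', 'A', 'A'): 765.4, ('B', 'A', 'C'): 514.5}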
I know this is a no-go for you, but I'm 99% sure that there is no direct way to do this.
I agree that there should be a boolean variable to disable this behavior and use the "regular" SQL-like one.
Answered by Randall Goodwin
I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!
Here's the code that accomplishes the initial goal:
r = df.groupby(group_cols, as_index=False).agg({'Value': 'sum'})
# Flatten the column index when .agg() produces a MultiIndex (it does when you
# request multiple statistics per column); plain string columns are left alone.
if isinstance(r.columns, pd.MultiIndex):
    r.columns = ['_'.join(col).strip('_') for col in r.columns]
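A sketch of the multiple-statistics case the flattening is aimed at (hypothetical choice of statistics; column tuples like ('Value', 'sum') become 'Value_sum'):

r = df.groupby(group_cols, as_index=False).agg({'Value': ['sum', 'mean']})
r.columns = ['_'.join(col).strip('_') for col in r.columns]
# columns: ['Group1', 'Group2', 'Group3', 'Value_sum', 'Value_mean']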
The downside of this solution is that it can result in a hierarchical column index that you may want to flatten (especially if you compute multiple statistics). I included flattening of the column index in the code above.
I don't know why the instance methods:
df.groupby(group_cols).sum()
df.groupby(group_cols).mean()
df.groupby(group_cols).std()
use all unique combinations of categorical variables, while the .agg() method:
df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std'])
ignores the unused level combinations of the groups. That seems inconsistent. I'm just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.
Also, I think it is very common to have a much lower unique cardinality count than the Cartesian product. Think of all the cases where data has columns like "State", "County", "Zip"... these are all nested variables, and many data sets out there have variables with a high degree of nesting.
In our case the difference between the Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set is over 1,000,000 rows).
Consequently, I would have voted for making observed=True the default behavior.