Python 熊猫按组聚合和列排序
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14941366/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas sort by group aggregate and column
提问by beardc
Given the following dataframe
鉴于以下数据框
In [31]: rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
'B': rand.randn(6),
'C': rand.rand(6) > .5})
In [32]: df
Out[32]: A B C
0 foo 1.624345 False
1 bar -0.611756 True
2 baz -0.528172 False
3 foo -1.072969 True
4 bar 0.865408 False
5 baz -2.301539 True
I would like to sort it in groups (A) by the aggregated sum of B, and then by the value in C(not aggregated). So basically get the order of the Agroups with
我想A按 的总和将其按组 ( )排序B,然后按中的值C(未聚合)进行排序。所以基本上得到A组的顺序
In [28]: df.groupby('A').sum().sort('B')
Out[28]: B C
A
baz -2.829710 1
bar 0.253651 1
foo 0.551377 1
And then by True/False, so that it ultimately looks like this:
然后通过 True/False,使其最终看起来像这样:
In [30]: df.ix[[5, 2, 1, 4, 3, 0]]
Out[30]: A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
How can this be done?
如何才能做到这一点?
采纳答案by Zelazny7
Groupby A:
分组 A:
In [0]: grp = df.groupby('A')
Within each group, sum over B and broadcast the values using transform. Then sort by B:
在每个组内,对 B 求和并使用变换广播值。然后按B排序:
In [1]: grp[['B']].transform(sum).sort('B')
Out[1]:
B
2 -2.829710
5 -2.829710
1 0.253651
4 0.253651
0 0.551377
3 0.551377
Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:
通过从上面传递索引来索引原始 df。这将按 B 值的总和对 A 值重新排序:
In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]
In [3]: sort1
Out[3]:
A B C
2 baz -0.528172 False
5 baz -2.301539 True
1 bar -0.611756 True
4 bar 0.865408 False
0 foo 1.624345 False
3 foo -1.072969 True
Finally, sort the 'C' values within groups of 'A' using the sort=Falseoption to preserve the A sort order from step 1:
最后,使用sort=False保留步骤 1 中的 A 排序顺序的选项对'A' 组中的 'C' 值进行排序:
In [4]: f = lambda x: x.sort('C', ascending=False)
In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)
In [6]: sort2
Out[6]:
A B C
A
baz 5 baz -2.301539 True
2 baz -0.528172 False
bar 1 bar -0.611756 True
4 bar 0.865408 False
foo 3 foo -1.072969 True
0 foo 1.624345 False
Clean up the df index by using reset_indexwith drop=True:
使用reset_indexwith清理 df 索引drop=True:
In [7]: sort2.reset_index(0, drop=True)
Out[7]:
A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
回答by Andy Hayden
One way to do this is to insert a dummy column with the sums in order to sort:
一种方法是插入一个包含总和的虚拟列以进行排序:
In [10]: sum_B_over_A = df.groupby('A').sum().B
In [11]: sum_B_over_A
Out[11]:
A
bar 0.253652
baz -2.829711
foo 0.551376
Name: B
in [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)
In [13]: df
Out[13]:
A B C sum_B_over_A
0 foo 1.624345 False 0.551376
1 bar -0.611756 True 0.253652
2 baz -0.528172 False -2.829711
3 foo -1.072969 True 0.551376
4 bar 0.865408 False 0.253652
5 baz -2.301539 True -2.829711
In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
Out[14]:
A B C sum_B_over_A
5 baz -2.301539 True -2.829711
2 baz -0.528172 False -2.829711
1 bar -0.611756 True 0.253652
4 bar 0.865408 False 0.253652
3 foo -1.072969 True 0.551376
0 foo 1.624345 False 0.551376
and maybe you would drop the dummy row:
也许你会放弃虚拟行:
In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
Out[15]:
A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
回答by Mark Byers
Here's a more concise approach...
这是一个更简洁的方法......
df['a_bsum'] = df.groupby('A')['B'].transform(sum)
df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)
The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.
第一行使用分组总和向数据框中添加一列。第二行执行排序,然后删除多余的列。
Result:
结果:
A B C
5 baz -2.301539 True
2 baz -0.528172 False
1 bar -0.611756 True
4 bar 0.865408 False
3 foo -1.072969 True
0 foo 1.624345 False
NOTE: sortis deprecated, use sort_valuesinstead
注意:sort已弃用,请sort_values改用

