Python pandas: sorting by group aggregate and column

Question

Asked by beardc

Given the following dataframe:

In [31]: rand = np.random.RandomState(1)
         df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                            'B': rand.randn(6),
                            'C': rand.rand(6) > .5})

In [32]: df
Out[32]:      A         B      C
         0  foo  1.624345  False
         1  bar -0.611756   True
         2  baz -0.528172  False
         3  foo -1.072969   True
         4  bar  0.865408  False
         5  baz -2.301539   True

I would like to sort the rows by group: order the A groups by the aggregated sum of B, and then sort within each group by the value in C (not aggregated). So basically, get the order of the A groups with:

In [28]: df.groupby('A').sum().sort_values('B')
Out[28]:             B  C
         A               
         baz -2.829710  1
         bar  0.253651  1
         foo  0.551377  1

And then, within each group, by True/False (True first), so that it ultimately looks like this:

In [30]: df.loc[[5, 2, 1, 4, 3, 0]]
Out[30]:      A         B      C
         5  baz -2.301539   True
         2  baz -0.528172  False
         1  bar -0.611756   True
         4  bar  0.865408  False
         3  foo -1.072969   True
         0  foo  1.624345  False

How can this be done?

Answer 1

Accepted answer by Zelazny7

Group by A:

In [0]: grp = df.groupby('A')

Within each group, sum over B and broadcast the group total back to every row using transform. Then sort by B:

In [1]: grp[['B']].transform('sum').sort_values('B')
Out[1]:
          B
2 -2.829710
5 -2.829710
1  0.253651
4  0.253651
0  0.551377
3  0.551377
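To make the "broadcast" step concrete, here is a minimal sketch contrasting a plain aggregation with transform. The dataframe is rebuilt from the values printed in the question (rather than from the random generator) so the snippet is self-contained; the names agg and broadcast are just illustrative.

```python
import pandas as pd

# Literal copy of the question's data, using the values printed above.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# Aggregation collapses each group to a single row, indexed by the group key.
agg = df.groupby('A')['B'].sum()

# transform returns one value per original row, aligned to df's own index,
# so every row carries the total of the group it belongs to.
broadcast = df.groupby('A')['B'].transform('sum')

print(agg)
print(broadcast)
```

Because the transformed result keeps the original row index, sorting it and reading off its index gives a row order that can be fed straight back into df.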

Index the original df by passing the index from above. This will re-order the rows so that the A groups appear in order of the aggregate sum of their B values:

In [2]: sort1 = df.loc[grp[['B']].transform('sum').sort_values('B').index]

In [3]: sort1
Out[3]:
     A         B      C
2  baz -0.528172  False
5  baz -2.301539   True
1  bar -0.611756   True
4  bar  0.865408  False
0  foo  1.624345  False
3  foo -1.072969   True

Finally, sort the 'C' values within groups of 'A', using the sort=False option to preserve the A sort order from step 1:

In [4]: f = lambda x: x.sort_values('C', ascending=False)

In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)

In [6]: sort2
Out[6]:
         A         B      C
A
baz 5  baz -2.301539   True
    2  baz -0.528172  False
bar 1  bar -0.611756   True
    4  bar  0.865408  False
foo 3  foo -1.072969   True
    0  foo  1.624345  False

Clean up the df index by using reset_index with drop=True:

In [7]: sort2.reset_index(0, drop=True)
Out[7]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
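The steps above can be put together in one script. This is only a sketch under a few assumptions: the data is typed in from the question's printed values, a stable sort (kind='mergesort') is requested so rows of the same group stay adjacent in a predictable order, and the groupby(...).apply(f) step is swapped for an explicit loop over the groups plus pd.concat, which avoids the extra index level that then has to be reset.

```python
import pandas as pd

# Literal copy of the question's data, using the values printed above.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# Step 1: broadcast the per-group sum of B to every row, sort it,
# and use the resulting index to reorder the original rows.
group_sum = df.groupby('A')['B'].transform('sum')
order = group_sum.sort_values(kind='mergesort').index
sort1 = df.loc[order]

# Step 2: sort C (True first) inside each group. sort=False keeps the
# groups in the order they appear in sort1 instead of alphabetical order.
pieces = [g.sort_values('C', ascending=False)
          for _, g in sort1.groupby('A', sort=False)]
result = pd.concat(pieces)

print(result)
```

The loop is a substitution for the apply call in the answer, not the answer's own code; both rely on sort=False to keep the group order from the first step.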

Answer 2

Answer by Andy Hayden

One way to do this is to insert a dummy column with the sums in order to sort:

In [10]: sum_B_over_A = df.groupby('A').sum().B

In [11]: sum_B_over_A
Out[11]: 
A
bar    0.253652
baz   -2.829711
foo    0.551376
Name: B

In [12]: df['sum_B_over_A'] = df.A.map(sum_B_over_A)

In [13]: df
Out[13]: 
     A         B      C  sum_B_over_A
0  foo  1.624345  False      0.551376
1  bar -0.611756   True      0.253652
2  baz -0.528172  False     -2.829711
3  foo -1.072969   True      0.551376
4  bar  0.865408  False      0.253652
5  baz -2.301539   True     -2.829711

In [14]: df.sort_values(['sum_B_over_A', 'A', 'B'])
Out[14]:
     A         B      C  sum_B_over_A
5  baz -2.301539   True     -2.829711
2  baz -0.528172  False     -2.829711
1  bar -0.611756   True      0.253652
4  bar  0.865408  False      0.253652
3  foo -1.072969   True      0.551376
0  foo  1.624345  False      0.551376

and maybe you would drop the dummy column:

In [15]: df.sort_values(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
Out[15]: 
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
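For reference, the dummy-column idea can also be written so that the original df is left untouched. This is a sketch, assuming the question's printed values as data; it looks up each row's group total with Series.map and attaches it with assign, which returns a copy. The name with_sum is made up for the example. The sort keys are the same as in this answer.

```python
import pandas as pd

# Literal copy of the question's data, using the values printed above.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# One total per group, indexed by the group label.
sum_B_over_A = df.groupby('A')['B'].sum()

# Look up each row's group total by its A label; assign returns a new
# dataframe, so df itself does not gain the helper column.
with_sum = df.assign(sum_B_over_A=df['A'].map(sum_B_over_A))

result = (with_sum
          .sort_values(['sum_B_over_A', 'A', 'B'])
          .drop(columns='sum_B_over_A'))

print(result)
```

Note that these keys sort by B inside each group, not by C. On this sample the two orders coincide, because the True row in every group happens to have the smaller B.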

Answer 3

Answer by Mark Byers

Here's a more concise approach...

df['a_bsum'] = df.groupby('A')['B'].transform('sum')
df.sort_values(['a_bsum', 'C'], ascending=[True, False]).drop('a_bsum', axis=1)

The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.

Result:

    A       B           C
5   baz     -2.301539   True
2   baz     -0.528172   False
1   bar     -0.611756   True
4   bar      0.865408   False
3   foo     -1.072969   True
0   foo      1.624345   False
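The same two steps can be chained into a single expression that never modifies df. This is a sketch rather than the answer's exact code: the data is typed in from the question's printed values, and 'A' is added as a second sort key, which is my own addition. Without it, two groups whose sums happened to be equal would have their rows interleaved by C.

```python
import pandas as pd

# Literal copy of the question's data, using the values printed above.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'] * 2,
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

result = (
    # Temporary helper column holding each row's group total of B.
    df.assign(a_bsum=df.groupby('A')['B'].transform('sum'))
      # Groups by ascending total, then by label to break ties between
      # groups (an addition to the answer), then True before False.
      .sort_values(['a_bsum', 'A', 'C'], ascending=[True, True, False])
      .drop(columns='a_bsum')
)

print(result)
```

On the sample data no two groups share a sum, so the extra key does not change the result shown above.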

NOTE: DataFrame.sort is deprecated (and has been removed in later pandas releases); use sort_values instead. Likewise, .ix has been superseded by .loc.
