Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/19530568/

Date: 2020-08-19 13:59:11  Source: igfitidea

Can pandas groupby aggregate into a list, rather than sum, mean, etc?

Tags: python, pandas

Asked by M.A.Kline

I've had success using the groupby function to sum or average a given variable by groups, but is there a way to aggregate into a list of values, rather than to get a single result? (And would this still be called aggregation?)

I am not entirely sure this is the approach I should be taking anyhow, so below is an example of the transformation I'd like to make, with toy data.

That is, if the data look something like this:

    A    B    C  
    1    10   22
    1    12   20
    1    11   8
    1    10   10
    2    11   13
    2    12   10 
    3    14   0

What I am trying to end up with is something like the following. I am not totally sure whether this can be done through groupby aggregating into lists, and am rather lost as to where to go from here.

Hypothetical output:

    A    B    C   New1  New2  New3  New4  New5  New6
    1    10   22  12    20    11    8     10    10
    2    11   13  12    10
    3    14   0

Perhaps I should be pursuing pivots instead? The order by which the data are put into columns does not matter - all columns B through New6 in this example are equivalent. All suggestions/corrections are much appreciated.

Accepted answer by Roman Pekar

My solution is a bit longer than you might expect, and I'm sure it could be shortened, but:

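import pandas as pd

# df is the question's DataFrame with columns A, B and C
g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"])))
k = g.reset_index()
k["i"] = k.index
k["rn"] = k.groupby("A")["i"].rank()
# index=/columns= replace the rows=/cols= keywords of very old pandas versions
k.pivot_table(index="A", columns="rn", values=0)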

# output
# rn   1   2   3   4   5   6
# A                         
# 1   10  12  11  22  20   8
# 2   10  11  10  13 NaN NaN
# 3   14  10 NaN NaN NaN NaN

A bit of explanation. The first line, g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"]))), groups df by A and then puts columns B and C into one column:

A   
1  0    10
   1    12
   2    11
   0    22
   1    20
   2     8
2  3    10
   4    11
   3    10
   4    13
3  5    14
   5    10

Then k = g.reset_index() creates a sequential index. The result is:

    A  level_1   0
0   1        0  10
1   1        1  12
2   1        2  11
3   1        0  22
4   1        1  20
5   1        2   8
6   2        3  10
7   2        4  11
8   2        3  10
9   2        4  13
10  3        5  14
11  3        5  10

Now I want to move this index into a column (I'd like to hear how I can make a sequential column without resetting the index), k["i"] = k.index:

    A  level_1   0   i
0   1        0  10   0
1   1        1  12   1
2   1        2  11   2
3   1        0  22   3
4   1        1  20   4
5   1        2   8   5
6   2        3  10   6
7   2        4  11   7
8   2        3  10   8
9   2        4  13   9
10  3        5  14  10
11  3        5  10  11

Now, k["rn"] = k.groupby("A")["i"].rank() will add a row number inside each A (like row_number() over(partition by A order by i) in SQL):

    A  level_1   0   i  rn
0   1        0  10   0   1
1   1        1  12   1   2
2   1        2  11   2   3
3   1        0  22   3   4
4   1        1  20   4   5
5   1        2   8   5   6
6   2        3  10   6   1
7   2        4  11   7   2
8   2        3  10   8   3
9   2        4  13   9   4
10  3        5  14  10   1
11  3        5  10  11   2

And finally, just pivot with k.pivot_table(index="A", columns="rn", values=0) (older pandas versions spelled these arguments rows= and cols=):

rn   1   2   3   4   5   6
A                         
1   10  12  11  22  20   8
2   10  11  10  13 NaN NaN
3   14  10 NaN NaN NaN NaN
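
The answer asks how to build a sequential column without resetting the index. One possible approach (not the answerer's own method) is groupby(...).cumcount(), which numbers rows within each group. The sketch below assumes the question's toy data; the names long, wide and value are illustrative only.

```python
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 3],
    "B": [10, 12, 11, 10, 11, 12, 14],
    "C": [22, 20, 8, 10, 13, 10, 0],
})

# Stack B and C into one long column of values, keeping A as the key.
long = df.melt(id_vars="A", value_vars=["B", "C"], value_name="value")

# cumcount() numbers the rows inside each group 0, 1, 2, ...; add 1 to start at 1.
long["rn"] = long.groupby("A").cumcount() + 1

# One row per A, one column per position within the group; short groups get NaN.
wide = long.pivot(index="A", columns="rn", values="value")
print(wide)
```

Because melt stacks all B values before all C values, each row of wide lists the group's B values first and then its C values, which is fine here since the question says the column order does not matter.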

Answer by Woody Pride

I have been struggling with exactly the same issue, and the answer is that yes, you can use groupby to obtain lists. I am not 100% sure I am doing this in the most Pythonic way, but for what it's worth, here is my attempt at your question. You can create lists of the data contained in the groups like this:

import pandas as pd
from itertools import chain

Data = {'A': [1, 1, 1, 1, 2, 2, 3], 'B': [10, 12, 11, 10, 11, 12, 14], 'C': [22, 20, 8, 10, 13, 10, 0]}
DF = pd.DataFrame(Data)
DFGrouped = DF.groupby('A')

OutputLists = []

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs, so group[1] is the data
for group in DFGrouped:
    AList = list(group[1].A)
    BList = list(group[1].B)
    CList = list(group[1].C)
    print(AList)
    print(BList)
    print(CList)
    ZIP = list(zip(BList, CList))  # list() so the pairs can be printed in Python 3
    print(ZIP)
    OutputLists.append(list(chain(*ZIP)))

OutputLists

This outputs your data as a list of lists, in the way that I think you want it. You can then make it a data frame. The print statements above are clearly for illustrative purposes only. The most efficient way (in terms of amount of code) to do this using my method is as follows:

import pandas as pd
import numpy as np
from itertools import chain

Data = {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20, 8, 10, 13, 10, 0]}
DF = pd.DataFrame(Data)
DFGrouped = DF.groupby('A')
OutputLists = []
for group in DFGrouped:
    ZIPPED = zip(group[1].B, group[1].C)
    OutputLists.append(list(chain(*ZIPPED)))
OutputLists

The key to getting lists out of grouped data as far as I can tell is to recognise that the data themselves are stored in group[1] for each group in your grouped data.
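
The answer mentions that the list of lists can then be made into a data frame but does not show that step. A minimal sketch of one way to do it, assuming the same toy data, is below; the names rows and wide are illustrative, and unpacking each group as (key, group) is equivalent to using group[0] and group[1].

```python
import pandas as pd
from itertools import chain

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 3],
    "B": [10, 12, 11, 10, 11, 12, 14],
    "C": [22, 20, 8, 10, 13, 10, 0],
})

# Map each group key to its interleaved B and C values.
rows = {}
for key, group in df.groupby("A"):
    rows[key] = list(chain.from_iterable(zip(group["B"], group["C"])))

# orient="index" makes each dict key a row; shorter lists are padded with NaN.
wide = pd.DataFrame.from_dict(rows, orient="index")
wide.index.name = "A"
print(wide)
```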

Hope this helps!

Answer by CPBL

I am answering the question as stated in its title and first sentence: the following aggregates values into lists.

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3], 'B': [10, 12, 11, 10, 11, 12, 14], 'C': [22, 20, 8, 10, 13, 10, 0]})
print(df)

# Old version
# df2=df.groupby(['A']).apply(lambda tdf: pd.Series(  dict([[vv,tdf[vv].unique().tolist()] for vv in tdf if vv not in ['A']])  )) 
# Each column of each group is reduced to the list of its unique values
df2 = df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())
print(df2)

The output is as follows:

In [3]: run tmp
   A   B   C
0  1  10  22
1  1  12  20
2  1  11   8
3  1  10  10
4  2  11  13
5  2  12  10
6  3  14   0

[7 rows x 3 columns]
              B                C
A                               
1  [10, 12, 11]  [22, 20, 8, 10]
2      [11, 12]         [13, 10]
3          [14]              [0]

[3 rows x 2 columns]

Answer by user2623954

I used the following:

grouped = df.groupby('A')

df = grouped.aggregate(lambda x: tuple(x))

df['grouped'] = df['B'] + df['C']
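
To make the effect of these three lines concrete, here is a small sketch on the question's toy data. It assumes that adding two object columns of tuples concatenates them element by element, which is how the + operator behaves for Python tuples; the name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 3],
    "B": [10, 12, 11, 10, 11, 12, 14],
    "C": [22, 20, 8, 10, 13, 10, 0],
})

# Each cell of out holds a tuple of that group's values for that column.
out = df.groupby("A").aggregate(lambda x: tuple(x))

# Tuple + tuple concatenates, so "grouped" holds B's values followed by C's.
out["grouped"] = out["B"] + out["C"]
print(out["grouped"])
```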

Answer by muon

Here is a one-liner:

# if list of unique items is desired, use set
df.groupby('A',as_index=False)['B'].aggregate(lambda x: set(x))

# if duplicate items are okay, use list
df.groupby('A',as_index=False)['B'].aggregate(lambda x: list(x))
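
As a variation on the one-liners above, the built-in list can be passed to agg directly instead of wrapping it in a lambda. The sketch below assumes the question's toy data and uses the default as_index=True so that groups can be looked up by key; the names lists and uniques are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 3],
    "B": [10, 12, 11, 10, 11, 12, 14],
    "C": [22, 20, 8, 10, 13, 10, 0],
})

# Keep duplicates: one list of B values per group, in original row order.
lists = df.groupby("A")["B"].agg(list)

# Drop duplicates: a set has no defined order, so sort it for a stable result.
uniques = df.groupby("A")["B"].agg(lambda s: sorted(set(s)))

print(lists)
print(uniques)
```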

Answer by Data-phile

A similar solution, but fairly transparent (I think). You can get full lists or unique lists.

df = pd.DataFrame({'A':[1,1,2,2,2,3,3,3,4,5], 
                   'B':[6,7, 8,8,9, 9,9,10,11,12], 
                   'C':['foo']*10})

df
Out[24]: 
   A   B    C
0  1   6  foo
1  1   7  foo
2  2   8  foo
3  2   8  foo
4  2   9  foo
5  3   9  foo
6  3   9  foo
7  3  10  foo
8  4  11  foo
9  5  12  foo

list_agg = df.groupby(by='A').agg({'B':lambda x: list(x), 
                                   'C':lambda x: tuple(x)})

list_agg
Out[26]: 
                 C           B
A                             
1       (foo, foo)      [6, 7]
2  (foo, foo, foo)   [8, 8, 9]
3  (foo, foo, foo)  [9, 9, 10]
4           (foo,)        [11]
5           (foo,)        [12]

unique_list_agg = df.groupby(by='A').agg({'B':lambda x: list(pd.unique(x)), 
                                          'C':lambda x: tuple(pd.unique(x))})

unique_list_agg
Out[28]: 
        C        B
A                 
1  (foo,)   [6, 7]
2  (foo,)   [8, 9]
3  (foo,)  [9, 10]
4  (foo,)     [11]
5  (foo,)     [12]

Answer by SummmerFort

df2 = df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())

This seems to work perfectly, but the resulting dataframe has two layers of columns, and df.columns shows only one column in the dataframe. To correct this, use:

df2_copy=df2.copy()
df2_copy = df2_copy.reset_index(col_level=0)

You can then keep only the top column level using: df2_copy.columns=df2_copy.columns.get_level_values(0)

df2_copy should then solve this.
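
A small sketch for checking what the aggregated frame looks like before and after reset_index, assuming the question's toy data and a recent pandas version. There the columns should already be single-level, so it is worth inspecting df2.columns.nlevels before flattening anything; the name flat is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 3],
    "B": [10, 12, 11, 10, 11, 12, 14],
    "C": [22, 20, 8, 10, 13, 10, 0],
})

df2 = df.groupby("A").aggregate(lambda s: s.unique().tolist())

# A is the index after grouping, so it does not appear in df2.columns.
print(df2.columns.nlevels, list(df2.columns))

# reset_index() moves A back into an ordinary column.
flat = df2.reset_index()
print(list(flat.columns))
```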