Python: How to group dataframe rows into list in pandas groupby?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/22219004/

Date: 2020-08-19 00:30:24  Source: igfitidea

How to group dataframe rows into list in pandas groupby?

Tags: python, pandas, list, aggregate, pandas-groupby

Asked by Abhishek Thakur

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?

Accepted answer by EdChum

You can do this by using groupby to group on the column of interest and then applying list to every group:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

Answer by Acorbe

As you were saying, the groupby method of a pd.DataFrame object can do the job.

Example

 import pandas as pd

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 # wrap zip in list() so the constructor receives a concrete sequence of rows
 df = pd.DataFrame(list(zip(L,N)),columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups.

To get the elements of a single group, you can do, for instance:

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

Answer by B. M.

If performance is important, go down to the NumPy level:

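import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    # Sort rows by key, then unpack the two columns 'a' and 'b' as arrays
    keys, values = df.sort_values('a').values.T
    # Unique keys plus the index of the first occurrence of each key
    ukeys, index = np.unique(keys, True)
    # Cut the values at every boundary between consecutive keys
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2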

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
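
To see how the NumPy approach works, here is a minimal sketch on the six rows from the question (the names first_idx, chunks and result are illustrative, not from the answer). It assumes the keys are already sorted: np.unique with return_index=True reports where each key first appears, and np.split cuts the value array at those positions.

```python
import numpy as np

# Keys must already be sorted so that equal keys are contiguous
keys = np.array(['A', 'A', 'B', 'B', 'B', 'C'])
values = np.array([1, 2, 5, 5, 4, 6])

# first_idx holds the position of the first occurrence of each unique key
ukeys, first_idx = np.unique(keys, return_index=True)

# Split at every group start except the first one (which is position 0)
chunks = np.split(values, first_idx[1:])

result = {str(k): chunk.tolist() for k, chunk in zip(ukeys, chunks)}
print(result)  # {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}
```

Note that when the data is not pre-sorted, the order of values inside each group depends on the sort used; passing kind='stable' to sort_values is one way to keep the original row order within a group.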

Answer by Anamika Modi

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

Answer by Markus Dutschke

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired by Anamika Modi's answer. Thank you!

Answer by YOBEN_S

Let us use df.groupby with a list and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

Answer by cs95

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

Answer by Ganesh Kharad

Here I have grouped elements with "|" as a separator

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0    A         1
    1    A         2
    2    B         5
    3    B         5
    4    B         4
    5    C         6

    df.dropna(inplace=True)
    df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
    print(df.columns)
    # Cast to str first so that join also works when Keywords holds numbers
    df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})

    df_op.to_csv('output.csv')

    df_op
    Out[2]:
         Keywords
    Area
    a         1|2
    b       5|5|4
    c           6

Answer by Vanshika

If you are looking for a unique list while grouping multiple columns, this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()
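
One caveat with set is that it does not guarantee the original order of the values. As a hedged alternative (not part of the original answer; the name unique_lists and the sample frame are illustrative), dict.fromkeys drops duplicates while keeping first-seen order:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [3, 3, 3, 4, 4, 4]})

# dict.fromkeys keeps only the first occurrence of each value, in order
unique_lists = (
    df.groupby('a')
      .agg(lambda x: list(dict.fromkeys(x)))
      .reset_index()
)
print(unique_lists)
```

For group B this keeps column b as [5, 4], the order in which the values first appear.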

Answer by Mithril

It is time to use agg instead of apply.

Given:

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

If you want multiple columns stacked into lists, resulting in a pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

If you want a single column as lists, resulting in a pd.Series:

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

Note that producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column, so use the DataFrame form only in the multi-column case.
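
The timings quoted in these answers depend on the pandas version and the data, so it may be worth measuring on your own machine. A minimal sketch with the standard timeit module follows; the data size, the helper names as_frame and as_series, and the repeat count are arbitrary choices for illustration:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.integers(0, 60, 6000),
                   'b': rng.integers(0, 10, 6000)})

def as_frame():
    # DataFrameGroupBy.agg with a dict: the result is a DataFrame
    return df.groupby('a').agg({'b': list})

def as_series():
    # SeriesGroupBy.agg: the result is a Series
    return df.groupby('a')['b'].agg(list)

# Both forms should produce the same lists per group
same = as_frame()['b'].tolist() == as_series().tolist()

t_frame = timeit.timeit(as_frame, number=20)
t_series = timeit.timeit(as_series, number=20)
print(f"same result: {same}")
print(f"DataFrame result: {t_frame / 20 * 1e3:.2f} ms per call")
print(f"Series result:    {t_series / 20 * 1e3:.2f} ms per call")
```

The ratio between the two numbers, rather than the absolute values, is what the comparison above is about.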