Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/17841149/

Pandas groupby: How to get a union of strings

Tags: python, pandas

Asked by Anne

I have a dataframe like this:

   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

Calling

In [10]: print(df.groupby("A")["B"].sum())

will return

A
1    1.615586
2    0.421821
3    0.463468
4    0.643961

Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

A
1    {This, string}
2    {is, !}
3    {a}
4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although df.groupby("A")["B"] is a pandas.core.groupby.SeriesGroupBy object, so I was hoping any Series method would work. Any ideas?

Accepted answer by Jeff

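# Assumes `from io import StringIO`, `from pandas import read_csv, Series`, and `data` holding the table above as text
In [4]: df = read_csv(StringIO(data), sep=r'\s+')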

In [5]: df
Out[5]: 
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

In [6]: df.dtypes
Out[6]: 
A      int64
B    float64
C     object
dtype: object

When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby:

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]: 
   A         B           C
A                         
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random

sum by default concatenates:

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]: 
A
1    Thisstring
2           is!
3             a
4        random
dtype: object

You can do pretty much whatever you want:

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]: 
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

To do this on a whole frame, one group at a time, the key is to return a Series:

def f(x):
     return Series(dict(A = x['A'].sum(), 
                        B = x['B'].sum(), 
                        C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]: 
   A         B               C
A                             
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
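
For anyone who wants to reproduce the session above as a plain script, here is a minimal sketch. It assumes the question's table can be re-typed as whitespace-separated text (the `data` string below is my reconstruction, not part of the original answer) and it uses `apply(set)` to get a real Python set per group, next to the brace-formatted string from the answer.

```python
from io import StringIO

import pandas as pd

# The question's table, re-typed as text (assumed whitespace-separated).
data = """A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !
"""

df = pd.read_csv(StringIO(data), sep=r"\s+")

# One Python set per group: the literal "union" of the strings in column C.
unions = df.groupby("A")["C"].apply(set)

# Same idea, rendered as the brace-delimited text used in the answer above.
as_text = df.groupby("A")["C"].apply(lambda s: "{%s}" % ", ".join(s))

print(unions)
print(as_text)
```

The set version is handy if you need membership tests later; the string version is only for display.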

Answer by BrenBarn

You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1    [This, string]
2           [is, !]
3               [a]
4          [random]
dtype: object

If you want something else, just write a function that does what you want and then apply that.
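
As a rough illustration of "write a function and apply it", the sketch below rebuilds a frame like `d` by hand and applies both `set` and a small custom function. The helper name `sorted_union` is made up for this example; it returns the distinct strings of a group, sorted and joined, so the result does not depend on set ordering.

```python
import pandas as pd

# Hypothetical frame mirroring the one in this answer (strings live in column B).
d = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})


def sorted_union(values):
    """Join the distinct strings of one group in sorted order."""
    return ", ".join(sorted(set(values)))


sets = d.groupby("A")["B"].apply(set)             # unordered, duplicates dropped
joined = d.groupby("A")["B"].apply(sorted_union)  # deterministic text per group

print(sets)
print(joined)
```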

Answer by voithos

You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)

df.groupby('A')['B'].agg(lambda col: ''.join(col))
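
Note that in the question's frame column B holds floats and the strings are in column C, so the one-liner above would need to point at C there. A minimal sketch under that assumption (the frame is rebuilt by hand here):

```python
import pandas as pd

# The question's frame, rebuilt by hand for this example.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Join the string column C; B is numeric and cannot be passed to str.join.
joined = df.groupby("A")["C"].agg("".join)

# A separator keeps the original words distinguishable.
spaced = df.groupby("A")["C"].agg(" ".join)

print(joined)
print(spaced)
```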

Answer by UserYmY

A simple solution would be:

>>> df.groupby('A')['C'].unique().reset_index()

Answer by user3241146

You could try this:

df.groupby('A').agg({'B':'sum','C':'-'.join})

Answer by Amit

If you'd like to overwrite column B in the dataframe, this should work:

    df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
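
The line above joins every non-key column, which fits a frame whose other columns are all strings. On the question's frame, where B is numeric, that may fail on recent pandas versions, so here is a hedged variant that names the string column explicitly. The frame is rebuilt by hand and the result name `out` is only for illustration.

```python
import pandas as pd

# The question's frame, rebuilt by hand for this example.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# as_index=False keeps A as a regular column instead of moving it to the index.
# Only column C is aggregated, so the float column B is never passed to join.
out = df.groupby("A", as_index=False).agg({"C": "\n".join})

print(out)
```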

Answer by Erfan

Named aggregations with pandas >= 0.25.0

Since pandas 0.25.0 we have named aggregation, which lets us group, aggregate, and assign new names to the resulting columns in one step. This way we don't get MultiIndex columns, and the column names make more sense given the data they contain:



Aggregate and get a list of strings:

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()

print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]


Aggregate and join the strings:

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()

print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random

Answer by Paul Rougieux

Following @Erfan's good answer: most of the time, when analyzing aggregated values, you want only the unique values among the existing strings in each group:

unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))
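
To see what unique() changes, here is a small sketch on made-up data where one group repeats a string. It combines the idea with named aggregation; the frame contents and the names `unique_join`, `B_sum` and `C_unique` are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical data: group 1 repeats "This" to show the effect of unique().
df = pd.DataFrame({
    "A": [1, 1, 1, 2],
    "B": [0.1, 0.2, 0.3, 0.4],
    "C": ["This", "This", "string", "is"],
})


def unique_join(s):
    """Join the distinct values of a group, in order of first appearance."""
    return ", ".join(s.unique())


grp = (
    df.groupby("A")
      .agg(B_sum=("B", "sum"), C_unique=("C", unique_join))
      .reset_index()
)

print(grp)
```

Without unique(), group 1 would come out as "This, This, string".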