Original question: http://stackoverflow.com/questions/17841149/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas groupby: How to get a union of strings
Asked by Anne
I have a dataframe like this:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
Calling
In [10]: print(df.groupby("A")["B"].sum())
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although df.groupby("A")["B"] is a pandas.core.groupby.SeriesGroupBy object, so I was hoping any Series method would work. Any ideas?
Accepted answer by Jeff
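In [1]: from io import StringIO
In [2]: from pandas import Series, read_csv
In [4]: df = read_csv(StringIO(data), sep=r'\s+')  # data: the question's table as a whitespace-separated string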
In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() to the groupby directly.
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random
sum concatenates strings by default:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can do pretty much what you want
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
To do this on the whole frame, one group at a time, the key is to return a Series:
def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}
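Recent pandas versions warn when groupby().apply operates on the grouping column, so a function like f that reads x['A'] may need adjusting there. As a rough alternative, the per-column aggregation can be sketched with agg and a dict. The helper name braces is made up for this illustration, and unlike f this version does not sum the A column.

```python
from io import StringIO

import pandas as pd

# The question's table, typed in as whitespace-separated text.
data = """A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !
"""
df = pd.read_csv(StringIO(data), sep=r"\s+")


def braces(strings):
    # Hypothetical helper: join one group's strings inside curly braces.
    return "{%s}" % ", ".join(strings)


# Sum the numeric column and format the string column, one row per group.
result = df.groupby("A").agg({"B": "sum", "C": braces})
print(result)
```

Because each column gets its own aggregation, the numeric column can still use the fast built-in "sum" while only the string column goes through a Python function.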
Answer by BrenBarn
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
A B
0 1 This
1 2 is
2 3 a
3 4 random
4 1 string
5 2 !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
If you want something else, just write a function that does what you want and then apply that.
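To make the "set" and "write your own function" cases concrete, here is a minimal sketch using the same two-column frame d. The function name sorted_unique is only an illustration and not part of pandas.

```python
import pandas as pd

d = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})

# One set of strings per group, as asked for in the question.
as_sets = d.groupby("A")["B"].apply(set)


def sorted_unique(strings):
    # Hypothetical custom function: de-duplicate, then sort for a stable order.
    return sorted(set(strings))


as_sorted = d.groupby("A")["B"].apply(sorted_unique)
print(as_sets)
print(as_sorted)
```

Sets have no defined order, so the sorted variant may be preferable when the result is going to be printed or compared.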
Answer by voithos
You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)
df.groupby('A')['B'].agg(lambda col: ''.join(col))
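The one-liner above assumes the aggregated column holds strings. With the question's frame, where B is numeric and the strings live in C, a small sketch of the same idea could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# agg accepts any callable that reduces a group's values to a single value;
# here the bound method "".join concatenates the strings of each group.
joined = df.groupby("A")["C"].agg("".join)
print(joined)
```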
Answer by UserYmY
A simple solution would be:
>>> df.groupby(['A','B']).C.unique().reset_index()
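The line above groups by both A and B. In the question's data every B value is distinct, so each row would end up in its own group. A sketch that groups by A alone, which seems closer to what the question asks for, might be:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# unique() on the grouped column returns one array of distinct values per group;
# reset_index() turns the group key back into an ordinary column.
uniques = df.groupby("A")["C"].unique().reset_index()
print(uniques)
```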
Answer by user3241146
You could try this:
df.groupby('A').agg({'B':'sum','C':'-'.join})
Answer by Amit
If you'd like to overwrite column B in the dataframe, this should work:
df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
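This applies the join to every non-key column, so it presumably targets a frame whose remaining columns all hold strings (as in the two-column example from an earlier answer, not the question's frame with a numeric B). A minimal sketch under that assumption:

```python
import pandas as pd

# Assumed layout: key column A plus a single string column B.
d = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})

# as_index=False keeps A as a regular column in the result;
# the lambda receives each group's values for one column at a time.
out = d.groupby("A", as_index=False).agg(lambda x: "\n".join(x))
print(out)
```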
Answer by Erfan
Named aggregations with pandas >= 0.25.0
Since pandas version 0.25.0 we have named aggregations, where we can group by, aggregate, and at the same time assign new names to our columns. This way we won't get MultiIndex columns, and the column names make more sense given the data they contain:
aggregate and get a list of strings
grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 [This, string]
1 2 0.421821 [is, !]
2 3 0.463468 [a]
3 4 0.643961 [random]
aggregate and join the strings
grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 This, string
1 2 0.421821 is, !
2 3 0.463468 a
3 4 0.643961 random
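The same named-aggregation pattern can also produce a set per group, which is what the question literally asks for. This is only a sketch: the output column names C_set and C_count are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Each keyword is an output column name mapped to (input column, aggregation).
grp = (
    df.groupby("A")
      .agg(B_sum=("B", "sum"),
           C_set=("C", set),         # set of strings per group
           C_count=("C", "count"))   # number of strings per group
      .reset_index()
)
print(grp)
```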
Answer by Paul Rougieux
Following @Erfan's good answer, most of the time in an analysis of aggregate values you want the unique combinations of the existing string values:
unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))
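The question's data has no repeated strings within a group, so the effect of unique() is easier to see on a frame with duplicates. The small frame below is made up for this illustration:

```python
import pandas as pd

# Hypothetical data with repeated strings inside each group.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "C": ["This", "string", "This", "is", "is"],
})


def unique_chars(x):
    # Series.unique() keeps values in order of first appearance.
    return ", ".join(x.unique())


out = df.groupby(["A"]).agg({"C": unique_chars})
print(out)
```

Each group's duplicates collapse to a single occurrence before joining, so group 1 yields "This, string" rather than "This, string, This".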