Python: Apply function to pandas groupby
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/15374597/
Apply function to pandas groupby
Asked by turtle
I have a pandas DataFrame with a column called my_labels which contains the strings 'A', 'B', 'C', 'D', 'E'. I would like to count the number of occurrences of each of these strings, then divide each count by the sum of all the counts. I'm trying to do this in Pandas like this:
func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)
This code throws an error: 'DataFrame' object has no attribute 'size'. How can I apply a function to calculate this in Pandas?
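For reference, a minimal fix for the snippet above (my own sketch, not from the original thread): group the column itself so that each group is a Series, whose size property (no parentheses) gives the per-group count.

```python
import pandas as pd

# Hypothetical reconstruction of the questioner's DataFrame
frame = pd.DataFrame({'my_labels': ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# `size` is a property on a Series, so drop the parentheses
func = lambda x: x.size / float(len(frame))
data = frame.groupby('my_labels')['my_labels'].apply(func)
print(data)
```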
Accepted answer by monkut
apply takes a function to apply to each value, not the Series, and accepts kwargs.
So, the values do not have the .size() method.
Perhaps this would work:
import pandas as pd

d = {"my_label": pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E'])}
df = pd.DataFrame(d)

def as_perc(value, total):
    return value / float(total)

def get_count(values):
    return len(values)

grouped_count = df.groupby("my_label").my_label.agg(get_count)
data = grouped_count.apply(as_perc, total=df.my_label.count())
The .agg() method here takes a function that is applied to all values of the groupby object.
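The count-then-normalize idea above can be condensed into a self-contained sketch (my own condensation, with modern pd.* imports assumed; names follow the answer):

```python
import pandas as pd

df = pd.DataFrame({"my_label": ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# Count per group with .agg, then normalize by the overall count
grouped_count = df.groupby("my_label")["my_label"].agg(len)
data = grouped_count / df["my_label"].count()
print(data)
```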
Answered by Reservedegotist
Try:
import pandas as pd

g = pd.DataFrame(['A', 'B', 'A', 'C', 'D', 'D', 'E'])
# Group by the contents of column 0
gg = g.groupby(0)
# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())
# Add a new column that is the count / total number of elements
histo[1] = histo[0].astype(float) / len(g)
print(histo)
Output:
0 1
0
A 2 0.285714
B 1 0.142857
C 1 0.142857
D 2 0.285714
E 1 0.142857
Answered by Dickster
I once saw a nested-function technique for computing a weighted average on S.O.; altering that technique can solve your issue.
import pandas as pd

def group_weight(overall_size):
    def inner(group):
        return len(group) / float(overall_size)
    inner.__name__ = 'weight'
    return inner

d = {"my_label": pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E'])}
df = pd.DataFrame(d)
print(df.groupby('my_label').apply(group_weight(len(df))))
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
dtype: float64
Here is how to do a weighted average within groups:
def wavg(val_col_name, wt_col_name):
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wgt_avg'
    return inner

d = {"P": pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E']),
     "Q": pd.Series([1, 2, 3, 4, 5, 6, 7]),
     "R": pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])}
df = pd.DataFrame(d)
print(df.groupby('P').apply(wavg('Q', 'R')))
P
A 2.500000
B 2.000000
C 4.000000
D 5.545455
E 7.000000
dtype: float64
Answered by Cleb
As of Pandas version 0.22, there also exists an alternative to apply: pipe, which can be considerably faster than using apply (you can also check this question for more differences between the two functionalities).
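A small illustration of the difference (my own sketch, not from the original answer): .pipe hands the whole GroupBy object to the function exactly once, whereas .apply calls the function once per group.

```python
import pandas as pd

df = pd.DataFrame({'my_label': ['A', 'B', 'A', 'C', 'D', 'D', 'E']})
gb = df.groupby('my_label')

calls = []

def via_pipe(grp):
    calls.append('pipe')  # recorded once: pipe passes the GroupBy itself
    return grp.size() / grp.size().sum()

result = gb.pipe(via_pipe)
print(len(calls))   # the function ran exactly once
print(result['A'])
```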
For your example:
df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})
my_label
0 A
1 B
2 A
3 C
4 D
5 D
6 E
The apply version
df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
gives
my_label
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
and the pipe version
df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
yields
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
So the values are identical; however, the timings differ quite a lot (at least for this small dataframe):
%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop
and
%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 μs per loop
Wrapping it into a function is then also straightforward:
def get_perc(grp_obj):
gr_size = grp_obj.size()
return gr_size / gr_size.sum()
Now you can call
df.groupby('my_label').pipe(get_perc)
yielding
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
However, for this particular case, you do not even need a groupby; you can just use value_counts like this:
df['my_label'].value_counts(sort=False) / df.shape[0]
yielding
A 0.285714
C 0.142857
B 0.142857
E 0.142857
D 0.285714
Name: my_label, dtype: float64
For this small dataframe it is quite fast:
%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 μs per loop
As pointed out by @anmol, the last statement can also be simplified to
df['my_label'].value_counts(sort=False, normalize=True)
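A quick check (my own sketch) that normalize=True is equivalent to dividing the raw counts by the length:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E'], name='my_label')

normalized = s.value_counts(sort=False, normalize=True)
manual = s.value_counts(sort=False) / len(s)
print((normalized == manual).all())  # True
```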
Answered by Vaibhav
Regarding the issue with 'size': size is not a function on a DataFrame, it is a property. So instead of using size(), plain size should work.
Apart from that, a method like this should work:
def doCalculation(df):
    groupCount = df.size
    groupSum = df['my_labels'].notnull().sum()
    return groupCount / groupSum

dataFrame.groupby('my_labels').apply(doCalculation)
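A minimal sketch of the distinction (mine, using a hypothetical frame): on a DataFrame, size is a property holding the total number of elements, so it is accessed without parentheses.

```python
import pandas as pd

df = pd.DataFrame({'my_labels': ['A', 'B', 'A', None]})

print(df.size)                          # 4: total elements, no parentheses
print(df['my_labels'].notnull().sum())  # 3: non-null labels
```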

