Python: Apply function to pandas groupby
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/15374597/
Apply function to pandas groupby
Asked by turtle
I have a pandas DataFrame with a column called my_labels which contains the strings 'A', 'B', 'C', 'D', 'E'. I would like to count the number of occurrences of each of these strings, then divide each count by the sum of all the counts. I'm trying to do this in Pandas like this:
func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)
This code throws an error: 'DataFrame' object has no attribute 'size'. How can I apply a function to calculate this in Pandas?
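For reference, a minimal fix for the snippet above (my own sketch, not from the original thread): group the column itself so that each group is a Series, whose size property (no parentheses) gives the per-group count.

```python
import pandas as pd

# Hypothetical reconstruction of the questioner's DataFrame
frame = pd.DataFrame({'my_labels': ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# `size` is a property on a Series, so drop the parentheses
func = lambda x: x.size / float(len(frame))
data = frame.groupby('my_labels')['my_labels'].apply(func)
print(data)
```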
Accepted answer by monkut
apply takes a function to apply to each value, not the Series, and accepts kwargs.
So, the values do not have the .size() method.
Perhaps this would work:
import pandas as pd

d = {"my_label": pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E'])}
df = pd.DataFrame(d)

def as_perc(value, total):
    return value / float(total)

def get_count(values):
    return len(values)

grouped_count = df.groupby("my_label").my_label.agg(get_count)
data = grouped_count.apply(as_perc, total=df.my_label.count())
The .agg() method here takes a function that is applied to all values of the groupby object.
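The count-then-normalize idea above can be condensed into a self-contained sketch (my own condensation, with modern pd.* imports assumed; names follow the answer):

```python
import pandas as pd

df = pd.DataFrame({"my_label": ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# Count per group with .agg, then normalize by the overall count
grouped_count = df.groupby("my_label")["my_label"].agg(len)
data = grouped_count / df["my_label"].count()
print(data)
```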
Answered by Reservedegotist
Try:
import pandas as pd

g = pd.DataFrame(['A', 'B', 'A', 'C', 'D', 'D', 'E'])
# Group by the contents of column 0
gg = g.groupby(0)
# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())
# Add a new column that is the count / total number of elements
histo[1] = histo[0].astype(float) / len(g)
print(histo)
Output:
0 1
0
A 2 0.285714
B 1 0.142857
C 1 0.142857
D 2 0.285714
E 1 0.142857
Answered by Dickster
I once saw a nested-function technique for computing a weighted average on S.O.; altering that technique can solve your issue.
import pandas as pd

def group_weight(overall_size):
    def inner(group):
        return len(group) / float(overall_size)
    inner.__name__ = 'weight'
    return inner

d = {"my_label": pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E'])}
df = pd.DataFrame(d)
print(df.groupby('my_label').apply(group_weight(len(df))))
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
dtype: float64
Here is how to do a weighted average within groups:
def wavg(val_col_name, wt_col_name):
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wgt_avg'
    return inner

d = {"P": pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E']),
     "Q": pd.Series([1, 2, 3, 4, 5, 6, 7]),
     "R": pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])}
df = pd.DataFrame(d)
print(df.groupby('P').apply(wavg('Q', 'R')))
P
A 2.500000
B 2.000000
C 4.000000
D 5.545455
E 7.000000
dtype: float64
Answered by Cleb
As of Pandas version 0.22, there also exists an alternative to apply: pipe, which can be considerably faster than using apply (you can also check this question for more differences between the two functionalities).
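A small illustration of the difference (my own sketch, not from the original answer): .pipe hands the whole GroupBy object to the function exactly once, whereas .apply calls the function once per group.

```python
import pandas as pd

df = pd.DataFrame({'my_label': ['A', 'B', 'A', 'C', 'D', 'D', 'E']})
gb = df.groupby('my_label')

calls = []

def via_pipe(grp):
    calls.append('pipe')  # recorded once: pipe passes the GroupBy itself
    return grp.size() / grp.size().sum()

result = gb.pipe(via_pipe)
print(len(calls))   # the function ran exactly once
print(result['A'])
```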
For your example:
df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})
my_label
0 A
1 B
2 A
3 C
4 D
5 D
6 E
The apply version
df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
gives
my_label
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
and the pipe version
df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
yields
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
So the values are identical; however, the timings differ quite a lot (at least for this small dataframe):
%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop
and
%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 μs per loop
Wrapping it into a function is then also straightforward:
def get_perc(grp_obj):
gr_size = grp_obj.size()
return gr_size / gr_size.sum()
Now you can call
df.groupby('my_label').pipe(get_perc)
yielding
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
However, for this particular case, you do not even need a groupby; you can just use value_counts like this:
df['my_label'].value_counts(sort=False) / df.shape[0]
yielding
A 0.285714
C 0.142857
B 0.142857
E 0.142857
D 0.285714
Name: my_label, dtype: float64
For this small dataframe it is quite fast:
%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 μs per loop
As pointed out by @anmol, the last statement can also be simplified to
df['my_label'].value_counts(sort=False, normalize=True)
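A quick check (my own sketch) that normalize=True is equivalent to dividing the raw counts by the length:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C', 'D', 'D', 'E'], name='my_label')

normalized = s.value_counts(sort=False, normalize=True)
manual = s.value_counts(sort=False) / len(s)
print((normalized == manual).all())  # True
```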
Answered by Vaibhav
Regarding the issue with 'size': size is not a function on a DataFrame, it is a property. So instead of using size(), plain size should work.
Apart from that, a method like this should work:
def doCalculation(df):
    groupCount = df.size
    groupSum = df['my_labels'].notnull().sum()
    return groupCount / groupSum

dataFrame.groupby('my_labels').apply(doCalculation)
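A minimal sketch of the distinction (mine, using a hypothetical frame): on a DataFrame, size is a property holding the total number of elements, so it is accessed without parentheses.

```python
import pandas as pd

df = pd.DataFrame({'my_labels': ['A', 'B', 'A', None]})

print(df.size)                          # 4: total elements, no parentheses
print(df['my_labels'].notnull().sum())  # 3: non-null labels
```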

