Python: Pass percentiles to the pandas agg function

Question

Asked by slizb

I want to pass the numpy percentile() function through pandas' agg() function, as I do below with various other numpy statistics functions.

Right now I have a dataframe that looks like this:

AGGREGATE   MY_COLUMN
A           10
A           12
B           5
B           9
A           84
B           22

And my code looks like this:

grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])

The above code works, but I want to do something like:

column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])

That is, I want to specify various percentiles for agg() to return.

How should this be done?

Answer 1

Accepted answer by Andy Hayden

Perhaps not super efficient, but one way would be to create a function yourself:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

Then include this in your agg:

In [11]: column.agg([np.sum, np.mean, np.std, np.median,
                     np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
           sum       mean        std  median          var  amin  amax  percentile_50  percentile_95
AGGREGATE
A          106  35.333333  42.158431      12  1777.333333    10    84             12           76.8
B           36  12.000000   8.888194       9    79.000000     5    22              9           20.7

Not sure this is how it should be done, though...
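For reference, here is a self-contained sketch of the approach above, assuming the six-row dataframe from the question. The string names "sum" and "mean" stand in for np.sum and np.mean (an assumption made here to avoid deprecation warnings on recent pandas versions); the percentile factory itself is unchanged.

```python
import numpy as np
import pandas as pd

# Data from the question.
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})


def percentile(n):
    """Return an aggregator that computes the n-th percentile (n in 0..100)."""
    def percentile_(x):
        return np.percentile(x, n)
    # agg() uses the function name as the column label in its output.
    percentile_.__name__ = "percentile_%s" % n
    return percentile_


column = dataframe.groupby("AGGREGATE")["MY_COLUMN"]
result = column.agg(["sum", "mean", percentile(50), percentile(95)])
print(result)
```

Each call to percentile(n) returns a distinct function with its own name, which is what lets several percentiles sit side by side in one agg() list.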

Answer 2

Answered by prl900

To be more specific, if you just want to aggregate your pandas groupby results using the percentile function, a Python lambda offers a pretty neat solution. Using the question's notation, aggregating by the 95th percentile would be:

dataframe.groupby('AGGREGATE')['MY_COLUMN'].agg(lambda x: np.percentile(x, q=95))

You can also assign this function to a variable and use it in conjunction with other aggregation functions.
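As a rough illustration of combining the lambda with other aggregations, the sketch below binds it to a variable and passes it through named aggregation (keyword arguments to agg(), available in pandas 0.25 and later). The names p95 and summary, and the choice of named aggregation, are assumptions for this example and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Data from the question.
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

# The lambda from the answer, bound to a (hypothetical) name for reuse.
p95 = lambda x: np.percentile(x, q=95)

# Named aggregation: each keyword becomes a column in the result,
# so the lambda gets a readable label instead of "<lambda>".
summary = dataframe.groupby("AGGREGATE")["MY_COLUMN"].agg(
    mean="mean",
    p95=p95,
)
print(summary)
```

Naming the output column explicitly sidesteps the unhelpful "<lambda>" label that a bare lambda would otherwise produce.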

Answer 3

Answered by scottlittle

Try this for the 50th and 95th percentiles:

column.describe( percentiles = [ 0.5, 0.95 ] )
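A small sketch of how this might look end to end, assuming the question's data. describe() returns every summary statistic, so the example selects just the two percentile columns afterwards; the variable names stats and selected are made up for illustration.

```python
import pandas as pd

# Data from the question.
dataframe = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

column = dataframe.groupby("AGGREGATE")["MY_COLUMN"]

# percentiles are given as fractions between 0 and 1; the resulting
# columns are labelled as percentages ("50%", "95%").
stats = column.describe(percentiles=[0.5, 0.95])
selected = stats[["50%", "95%"]]
print(selected)
```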

Answer 4

Answered by Fakira

Multiple functions can be called as below:

import pandas as pd
import numpy as np
import random

C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']
A = [random.randint(0, 100) for i in range(10)]
B = [random.randint(0, 100) for i in range(10)]

df = pd.DataFrame({'field_A': A, 'field_B': B, 'field_C': C})
print(df)

d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)

I was unable to select a column named median this way (describe() reports the median as the '50%' column), but the other statistics worked.

Answer 5

Answered by Thomas

I really like the solution Andy Hayden gave; however, it had several issues for me:

If the dataframe has multiple columns, it seemed to aggregate over the columns instead of over the rows.
For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this; possibly that I am using Python 3.
It requires importing numpy as well, instead of staying within pandas (I know, numpy is imported implicitly by pandas...).

Here is an updated version that fixes these issues:

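def percentile(n):
    # n is a fraction between 0 and 1 (e.g. 0.5), as expected by Series.quantile
    def percentile_(x):
        return x.quantile(n)
    # zero-pad so that e.g. n=0.05 yields 'percentile_05' with no embedded space
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_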

Answer 6

Answered by Maksim

For situations where all you need is a subset of the describe output (typically the most commonly needed statistics), you can just index the returned pandas object without needing any extra functions.

For example, I commonly find myself just needing to present the 25th percentile, median, 75th percentile and count. This can be done in just one line like so:

column.agg('describe')[['25%', '50%', '75%', 'count']]

For specifying your own set of percentiles, the accepted answer is a good choice, but for simple use cases there is no need for extra functions.

Answer 7

Answered by Arun Nalpet

You can have agg() execute a custom function on a specified column:

# 50th Percentile
def q50(x):
    return x.quantile(0.5)

# 90th Percentile
def q90(x):
    return x.quantile(0.9)

my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})

Answer 8

Answered by jvans

I believe the idiomatic way to do this in pandas is:

df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
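To illustrate what the call above returns, here is a minimal sketch on the question's data, restricted to the single value column. The unstack() step and the variable names quantiles and wide are additions for this example: the quantile call yields a Series indexed by (group, quantile), and unstacking pivots the quantile level into columns.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "AGGREGATE": ["A", "A", "B", "B", "A", "B"],
    "MY_COLUMN": [10, 12, 5, 9, 84, 22],
})

# quantile() takes fractions between 0 and 1, unlike np.percentile (0..100).
quantiles = df.groupby("AGGREGATE")["MY_COLUMN"].quantile([0.5, 0.95])

# Pivot the quantile level into columns: one row per group.
wide = quantiles.unstack()
print(wide)
```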

Answer 9

Answered by Agredalopez

df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])

By default, the describe function gives us count, mean, std, min and max, along with the 25%, 50% and 75% percentiles.

Answer 10

Answered by magraf

Just to throw a more general solution into the ring. Assume you have a DataFrame with just one column to aggregate:

df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)), 
                    columns=['My_KEY', 'MY_COL1'])

One can aggregate and calculate basically any descriptive metric with a list of anonymous (lambda) functions like:

df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )

However, if you have multiple columns to aggregate, you have to use a named (non-anonymous) function or specify the columns explicitly:

df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)), 
                    columns=['My_KEY', 'MY_COL1', 'MY_COL2'])

# non-anonymous function
def percentil25 (x): 
    return np.percentile(x, q=25)

# type 1: call for both columns 
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ]  )

# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
                             'MY_COL2': np.size})
