pandas Groupby 给定所选 DataFrame 列值的百分位数

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/24657177/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:14:41  来源:igfitidea点击:

Groupby given percentiles of the values of the chosen DataFrame column

pythonpandasgroup-by

提问by pms

Imagine that I have a DataFramewith columns that contain only real values.

想象一下,我有一个DataFrame只包含真实值的列。

>> df        
          col1   col2      col3  
0     0.907609     82  4.207991 
1     3.743659   1523  6.488842 
2     2.358696    324  5.092592  
3     0.006793      0  0.000000  
4    19.319746  11969  7.405685 

I want to group it by quartiles (or any other percentiles specified by me) of the chosen column (e.g., col1), to perform some operations on these groups. Ideally, I would like to do something like:

我想按所选列(例如col1)的四分位数(或我指定的任何其他百分位数)对其进行分组,以对这些组执行一些操作。理想情况下,我想做类似的事情:

df.groupy( quartiles_of_col1 ).mean()  # not working, how to code quartiles_of_col1?

The output should give the mean of each of the columns for four groups corresponding to the quartiles of col1. Is this possible with the groupbycommand? What's the simplest way of achieving it?

输出应给出对应于 的四分位数的四组的每一列的平均值col1。这可以通过groupby命令实现吗?实现它的最简单方法是什么?

回答by CT Zhu

I don't have a computer to test it right now, but I think you can do it by: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). Will update after 150mins.

我现在没有电脑来测试它,但我认为你可以通过:df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). 150分钟后更新。

Some explanations:

一些解释:

In [42]:
#use np.percentile to get the bin edges of any percentile you want 
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
#Need to use include_lowest=True
print df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean()
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Or the smallest values will be skiped
print df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean()
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685

回答by biobirdman

I hope this will solve your problem. It is not pretty but I hope it will work for you

我希望这能解决你的问题。它不漂亮,但我希望它对你有用

    import pandas as pd
    import random 
    import numpy as np
    ## create a mock df as example. with column A, B, C and D
    df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

    ## select dataframe based on the quantile of column A, using the quantile method.
    df[df['A'] < df['A'].quantile(0.3)].mean()

this will print

这将打印

A   -1.157615
B    0.205529
C   -0.108263
D    0.346752
dtype: float64

回答by RickB

Pandas has a native solution, pandas.qcut, to this as well:

Pandas 也有一个原生的解决方案pandas.qcut

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html