pandas: group by percentiles of the values in a chosen DataFrame column

Question

Asked by pms

Imagine that I have a DataFrame with columns that contain only real values.

>> df        
          col1   col2      col3  
0     0.907609     82  4.207991 
1     3.743659   1523  6.488842 
2     2.358696    324  5.092592  
3     0.006793      0  0.000000  
4    19.319746  11969  7.405685

I want to group it by quartiles (or any other percentiles I specify) of a chosen column (e.g., col1), in order to perform some operations on these groups. Ideally, I would like to do something like:

df.groupby(quartiles_of_col1).mean()  # not working; how do I define quartiles_of_col1?

The output should give the mean of each column for the four groups corresponding to the quartiles of col1. Is this possible with the groupby command? What is the simplest way of achieving it?

Answer 1

Answered by CT Zhu

I don't have a computer to test it right now, but I think you can do it with: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). I will update in about 150 minutes.

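Some explanations (note that the columns here are named col0, col1 and col2 rather than col1, col2 and col3 as in the question):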

In [42]:
#use np.percentile to get the bin edges of any percentile you want 
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
# include_lowest=True is needed so that the minimum value falls into the first bin
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
# Without it, the smallest value is skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
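
For the quartile grouping the question asks about, the same pd.cut idea can be sketched as below. This is only an illustration under a few assumptions: it rebuilds the question's five-row DataFrame by hand, it uses the quartile percentiles [0, 25, 50, 75, 100] instead of the [0, 25, 75, 90, 100] edges shown above, and it passes observed=False explicitly, which newer pandas versions may expect when grouping by a categorical.

```python
import numpy as np
import pandas as pd

# The five rows from the question, typed in by hand.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# Quartile edges of col1: the 0th, 25th, 50th, 75th and 100th percentiles.
edges = np.percentile(df["col1"], [0, 25, 50, 75, 100])

# include_lowest=True closes the first bin on the left, so the minimum is kept.
bins = pd.cut(df["col1"], edges, include_lowest=True)

# observed=False keeps every bin in the result, even one with no rows.
quartile_means = df.groupby(bins, observed=False).mean()
print(quartile_means)
```

With only five rows, the first quartile bin holds two rows and each of the others holds one, so most of the "means" are single values.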

Answer 2

Answered by biobirdman

I hope this will solve your problem. It is not pretty, but I hope it will work for you.

    import numpy as np
    import pandas as pd

    # Create a mock DataFrame with columns A, B, C and D.
    df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

    # Select the rows where A is below its 30th percentile (via Series.quantile),
    # then take the mean of every column over those rows.
    df[df['A'] < df['A'].quantile(0.3)].mean()

This will print something like the following (the exact values vary because the data are random):

A   -1.157615
B    0.205529
C   -0.108263
D    0.346752
dtype: float64
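
The snippet above covers only one group (the rows below a single quantile). A rough sketch of how the same boolean-mask idea might extend to several bands follows. The 25% and 75% cut points, the fixed random seed, and the names bottom, middle, top and band_means are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))

# Cut points of column A (hypothetical choice: 25th and 75th percentiles).
lower = df["A"].quantile(0.25)
upper = df["A"].quantile(0.75)

# One boolean mask per band; the bands do not overlap and cover every row.
bottom = df[df["A"] < lower]
middle = df[(df["A"] >= lower) & (df["A"] < upper)]
top = df[df["A"] >= upper]

# Collect the per-band column means into one table, one row per band.
band_means = pd.DataFrame(
    {"bottom": bottom.mean(), "middle": middle.mean(), "top": top.mean()}
).T
print(band_means)
```

This works, but every extra band needs another hand-written mask, which is why a binning function combined with groupby tends to be less error-prone.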

Answer 3

Answered by RickB

Pandas also has a native solution for this, pandas.qcut:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
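
A minimal sketch of how pandas.qcut could be combined with groupby for the question's data is shown below. The DataFrame is retyped from the question, the labels Q1 to Q4 are names chosen here for illustration, and observed=False is passed explicitly on the assumption that newer pandas versions may expect it when grouping by a categorical.

```python
import pandas as pd

# The five rows from the question, typed in by hand.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# q=4 requests quartiles; the labels are optional names for the four bins.
quartile = pd.qcut(df["col1"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
quartile_means = df.groupby(quartile, observed=False).mean()
print(quartile_means)

# Arbitrary percentiles can be given as a list of fractions between 0 and 1.
custom = pd.qcut(df["col1"], q=[0, 0.25, 0.75, 0.9, 1.0])
custom_sizes = df.groupby(custom, observed=False).size()
print(custom_sizes)
```

Unlike pd.cut with hand-computed edges, qcut derives the bin edges from the data itself and includes the minimum value in the first bin.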
