Python Pandas: group by index value, then calculate quantile?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/35060846/

Date: 2020-08-19 15:56:03  Source: igfitidea

Pandas: group by index value, then calculate quantile?

python pandas dataframe

Asked by Richard

I have a DataFrame indexed on the month column (set using df = df.set_index('month'), in case that's relevant):

             org_code  ratio_cost   
month
2010-08-01   1847      8.685939     
2010-08-01   1848      7.883951     
2010-08-01   1849      6.798465     
2010-08-01   1850      7.352603     
2010-09-01   1847      8.778501     

I want to add a new column called quantile, which will assign a quantile value to each row, based on the value of its ratio_cost for that month.

So the example above might look like this:

             org_code  ratio_cost   quantile
month
2010-08-01   1847      8.685939     100 
2010-08-01   1848      7.883951     66.6 
2010-08-01   1849      6.798465     0  
2010-08-01   1850      7.352603     33.3
2010-09-01   1847      8.778501     100

How can I do this? I've tried this:

df['quantile'] = df.groupby('month')['ratio_cost'].rank(pct=True)

But I get KeyError: 'month'.

UPDATE: I can reproduce the bug.

Here is my CSV file: http://pastebin.com/raw/6xbjvEL0

And here is the code to reproduce the error:

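import pandas as pd

df = pd.read_csv('temp.csv')
# 'month' holds Unix timestamps in seconds
df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')
# Rank ratio_cost within each month (the index) as a fraction
df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'])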

I'm using Pandas 0.17.1 on OSX.

Accepted answer by jezrael

You have to call sort_index before rank:

import pandas as pd

df = pd.read_csv('http://pastebin.com/raw/6xbjvEL0')

df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')

# Sort by the index before ranking
df = df.sort_index()

df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'].head())

month
2010-08-01    0.2500
2010-08-01    0.6875
2010-08-01    0.6250
2010-08-01    0.9375
2010-08-01    0.7500
Name: percentile, dtype: float64
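
For readers without access to the pastebin file, here is a minimal self-contained sketch (not part of the original answer) built from the five sample rows in the question. It names the index level explicitly with groupby(level='month'), which avoids passing 'month' as if it were a column, the likely cause of the KeyError in the first attempt. The quantile column is only one possible reading of the 0 to 100 table in the question: the (rank - 1) / (n - 1) * 100 formula and the choice of 100 for single-row months are assumptions, not something the question or the answer states.

```python
import pandas as pd

# Sample rows copied from the question; the full data is in the pastebin CSV.
df = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2010-08-01", "2010-08-01", "2010-08-01", "2010-08-01", "2010-09-01"]
        ),
        "org_code": [1847, 1848, 1849, 1850, 1847],
        "ratio_cost": [8.685939, 7.883951, 6.798465, 7.352603, 8.778501],
    }
).set_index("month")

# Stable sort so rows within the same month keep their original order.
df = df.sort_index(kind="mergesort")

# Group on the index level by name instead of a column label.
grouped = df.groupby(level="month")["ratio_cost"]

# Fractional rank within each month, as in the answer above.
df["percentile"] = grouped.rank(pct=True)

# Assumed rescaling to match the question's table:
# (rank - 1) / (n - 1) * 100, with single-row months set to 100.
rank = grouped.rank()
count = grouped.transform("count")
df["quantile"] = ((rank - 1) / (count - 1) * 100).where(count > 1, 100.0)

print(df)
```

Note that rank(pct=True) on its own gives rank / n, so the lowest value in a month of four rows gets 0.25 rather than 0; the rescaling step is what pins the minimum to 0.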