Pandas: group by index value, then calculate quantile?
Source: http://stackoverflow.com/questions/35060846/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Richard
I have a DataFrame indexed on the month column (set using df = df.set_index('month'), in case that's relevant):
            org_code  ratio_cost
month
2010-08-01      1847    8.685939
2010-08-01      1848    7.883951
2010-08-01      1849    6.798465
2010-08-01      1850    7.352603
2010-09-01      1847    8.778501
I want to add a new column called quantile, which will assign a quantile value to each row, based on the value of its ratio_cost for that month.
So the example above might look like this:
            org_code  ratio_cost  quantile
month
2010-08-01      1847    8.685939       100
2010-08-01      1848    7.883951      66.6
2010-08-01      1849    6.798465         0
2010-08-01      1850    7.352603      33.3
2010-09-01      1847    8.778501       100
How can I do this? I've tried this:
df['quantile'] = df.groupby('month')['ratio_cost'].rank(pct=True)
But I get KeyError: 'month'.
UPDATE: I can reproduce the bug.
Here is my CSV file: http://pastebin.com/raw/6xbjvEL0
And here is the code to reproduce the error:
import pandas as pd
df = pd.read_csv('temp.csv')
df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')
df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'])
I'm using Pandas 0.17.1 on OSX.
Accepted answer by jezrael
You have to sort_index before rank:
import pandas as pd
df = pd.read_csv('http://pastebin.com/raw/6xbjvEL0')
# The month column holds Unix timestamps in seconds
df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')
# Sort the index before grouping on it and ranking
df = df.sort_index()
df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'].head())
month
2010-08-01 0.2500
2010-08-01 0.6875
2010-08-01 0.6250
2010-08-01 0.9375
2010-08-01 0.7500
Name: percentile, dtype: float64
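For readers who cannot fetch the CSV, here is a minimal self-contained sketch of the same idea, using only the five sample rows from the question. Two things in it are assumptions and not part of the answer above: grouping with level='month' (one way to group on the index without a 'month' column), and the rescaling to a 0-100 value, which is inferred from the example table in the question as (rank - 1) / (n - 1) * 100. The sketch also assumes ratio_cost has no missing values.

```python
import pandas as pd

# Sample rows copied from the question; the full CSV is not reproduced here.
df = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2010-08-01", "2010-08-01", "2010-08-01", "2010-08-01", "2010-09-01"]
        ),
        "org_code": [1847, 1848, 1849, 1850, 1847],
        "ratio_cost": [8.685939, 7.883951, 6.798465, 7.352603, 8.778501],
    }
).set_index("month")

# Keep the sort step from the answer; a stable sort preserves row order within a month.
df = df.sort_index(kind="mergesort")

# Assumption: group on the index level itself, so no 'month' column is needed.
grouped = df.groupby(level="month")["ratio_cost"]

pct = grouped.rank(pct=True)       # rank / n within each month, in (0, 1]
rank = grouped.rank()              # plain rank within each month, 1..n
size = grouped.transform("count")  # n, the number of rows in the month

# Fractional rank, as produced by the answer's rank(pct=True) call.
df["percentile"] = pct

# Assumed rescaling to match the question's table: lowest row 0, highest row 100.
# A month with a single row has n - 1 == 0, so that row is treated as 100.
df["quantile"] = ((rank - 1) / (size - 1).where(size > 1) * 100).fillna(100.0)

print(df)
```

On these sample rows, the August quantile values come out as 100, 66.7, 0 and 33.3 (rounded), and the single September row gets 100, which matches the table in the question. Note that rank(pct=True) alone gives 0.25 to 1.0 for a four-row month, so the extra rescaling step is only needed if the lowest row should map to 0.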