Compute percentile for pandas dataframe row based on previous years data
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/32941436/
Asked by user308827
I have the following dataframe:
datetime JD YEAR VAL
2000-01-01 1 2000 0.5
2000-01-02 2 2000 1.2
2000-01-03 3 2000 2.1
2000-01-04 4 2000 3.4
2000-01-05 5 2000 4.6
2000-01-06 6 2000 6.8
2000-01-07 7 2000 7.2
2000-01-08 8 2000 0.2
2000-01-09 9 2000 0.9
...
2010-12-31 365 2010 4.1
The first year is 2000 and the last year is 2010. Leap days are omitted (i.e. there is no row for Feb 29th), so every year has 365 rows. datetime is the index column.
I would like to compute a new dataframe, stretching from Jan 1st 2010 to Dec 31st 2010. I would like it to contain a column which gives the percentile of the Jan 1st 2010 value (VAL) within the array of 10 values from the same day in previous years (Jan 1st 2000, Jan 1st 2001, ..., Jan 1st 2009). Similarly, Jan 2nd 2010 is compared against Jan 2nd from previous years, and so on. This is what I have so far:
lyr = df.YEAR.max() # last year i.e. 2010
cdf = df[df.YEAR == lyr] # Latest year dataframe
pdf = df[df.index.year < lyr] # Previous years dataframe
pdf.groupby('JD')['VAL']
stats.percentileofscore(pdf['VAL'], cdf['VAL'])
However, I am not sure how to get the code to work. The groupby only returns a GroupBy object, whereas I need a list of values for each group.
Answered by JohnE
Set up with a small sample dataframe:
import numpy as np
import pandas as pd

np.random.seed(1234)  # fixed seed so the sample values are reproducible
df = pd.DataFrame({ 'jd': np.tile([1,2],3),
                    'yr': np.repeat([2008,2009,2010],2),
                    'val': np.random.randn(6) })
Then it's just one line:
df['pctile'] = df.groupby('jd')['val'].rank(pct=True)
Here's the output, sorted with sort_values(['jd','val']):
   jd       val    yr    pctile
4   1 -0.720589  2010  0.333333
0   1  0.471435  2008  0.666667
2   1  1.432707  2009  1.000000
1   2 -1.190976  2008  0.333333
3   2 -0.312652  2009  0.666667
5   2  0.887163  2010  1.000000
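Note that rank(pct=True) ranks each value among all years in its jd group, including its own year, whereas the question asks for the 2010 value to be scored against previous years only. The following is a minimal sketch of that variant, not part of the original answer. It reuses the sample frame above; the names last_yr, current, previous and pct_vs_previous are hypothetical, and the percentile is assumed to mean "percent of earlier values less than or equal to the score" (which, as far as I know, corresponds to the kind='weak' convention of scipy.stats.percentileofscore).

```python
import numpy as np
import pandas as pd

# Same sample frame as in the answer above.
np.random.seed(1234)
df = pd.DataFrame({'jd': np.tile([1, 2], 3),
                   'yr': np.repeat([2008, 2009, 2010], 2),
                   'val': np.random.randn(6)})

last_yr = df['yr'].max()
current = df[df['yr'] == last_yr]    # rows to be scored (latest year)
previous = df[df['yr'] < last_yr]    # reference rows (earlier years only)

def pct_vs_previous(row):
    """Percent of earlier-year values on the same day that are <= this value.

    Assumption: 'percentile' is taken in the less-than-or-equal sense.
    """
    ref = previous.loc[previous['jd'] == row['jd'], 'val']
    return 100.0 * (ref <= row['val']).mean()

# One row per day of the latest year, with its percentile against the past.
result = current.assign(pctile=current.apply(pct_vs_previous, axis=1))
print(result[['jd', 'yr', 'val', 'pctile']])
```

With this sample, the 2010 value for jd 1 is below both earlier values and the 2010 value for jd 2 is above both, so the two approaches give different numbers than the group-wide rank. For a large frame, the row-wise apply may be slow; it is used here only to keep the logic easy to follow.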

