Python: eliminating all data over a given percentile

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18580461/

Date: 2020-08-19 11:09:00  Source: igfitidea

Eliminating all data over a given percentile

Tags: python, pandas, filtering, percentile

Asked by Roy Smith

I have a pandas DataFrame called data with a column called ms. I want to eliminate all the rows where data.ms is above the 95th percentile. For now, I'm doing this:

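# Note: describe(90) relies on the percentile_width argument of older pandas releases;
# newer releases take describe(percentiles=[0.95]) instead.
limit = data.ms.describe(90)['95%']
valid_data = data[data['ms'] < limit]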

This works, but I want to generalize it to any percentile. What's the best way to do that?

Accepted answer by Phillip Cloud

Use the Series.quantile() method:

In [47]: from pandas import DataFrame; from numpy.random import randn

In [48]: cols = list('abc')

In [49]: df = DataFrame(randn(10, len(cols)), columns=cols)

In [50]: df.a.quantile(0.95)
Out[50]: 1.5776961953820687
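
For readers without an interactive session at hand, the call above can be reproduced as a standalone script. This is only a minimal sketch: the fixed seed and the names `rng`, `q95` and `several` are illustrative assumptions, not part of the original answer. It also shows that `quantile()` expects a fraction between 0 and 1, and that it accepts a list of fractions.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only so the example is repeatable.
rng = np.random.default_rng(0)
cols = list("abc")
df = pd.DataFrame(rng.standard_normal((10, len(cols))), columns=cols)

# A single fraction returns a scalar.
q95 = df["a"].quantile(0.95)

# A list of fractions returns a Series indexed by those fractions.
several = df["a"].quantile([0.05, 0.5, 0.95])

print(q95)
print(several)
```

The printed values depend on the random data, so none are shown here.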

To filter out rows of df where df.a is greater than or equal to the 95th percentile, do:

In [72]: df[df.a < df.a.quantile(.95)]
Out[72]:
       a      b      c
0 -1.044 -0.247 -1.149
2  0.395  0.591  0.764
3 -0.564 -2.059  0.232
4 -0.707 -0.736 -1.345
5  0.978 -0.099  0.521
6 -0.974  0.272 -0.649
7  1.228  0.619 -0.849
8 -0.170  0.458 -0.515
9  1.465  1.019  0.966
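
Since the question asks how to generalize this to any percentile, the filter above could be wrapped in a small helper. The sketch below is one possible way to do it, not something from the original answer: the function name `drop_above_percentile` and the toy `data` frame are hypothetical, and the helper takes the percentile in percent (0 to 100) and converts it to the fraction that `quantile()` expects.

```python
import pandas as pd

def drop_above_percentile(frame, column, pct):
    """Return the rows of `frame` whose `column` value is strictly below
    the pct-th percentile of that column (pct is in percent, 0-100)."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    limit = frame[column].quantile(pct / 100)
    return frame[frame[column] < limit]

# Hypothetical toy data: ms takes the values 1..100.
data = pd.DataFrame({"ms": range(1, 101)})
valid_data = drop_above_percentile(data, "ms", 95)
```

Because the comparison is strict (`<`), rows exactly equal to the limit are dropped as well, matching the filter shown above; use `<=` instead if those rows should be kept.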

Answer by 2diabolos.com

NumPy is much faster than pandas for this kind of thing:

np.percentile(df.a, 95)  # note: the percentile is given in percent (5 = 5%)

is equivalent to, but 3 times faster than:

df.a.quantile(.95)  # as you already noticed here it is ".95" not "95"

So for your code, that gives:

df[df.a < np.percentile(df.a, 95)]
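
One caveat on the claimed equivalence: the two calls agree only when the column has no missing values. The sketch below uses made-up series (`s` and `s_nan` are illustrative names) to show that `Series.quantile()` skips NaN while `np.percentile()` propagates it. `np.nanpercentile()` is the NumPy function that ignores NaN. No timing is measured here, so the speed figure above remains the answer author's own observation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Without missing values the two calls agree
# (both use linear interpolation by default).
agree = np.isclose(np.percentile(s, 95), s.quantile(0.95))

# With a NaN present they differ.
s_nan = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
pandas_limit = s_nan.quantile(0.95)            # computed from the 4 non-missing values
numpy_limit = np.percentile(s_nan, 95)         # nan, because NaN propagates
nan_safe_limit = np.nanpercentile(s_nan, 95)   # ignores NaN, like pandas

print(agree, pandas_limit, numpy_limit, nan_safe_limit)
```

If the column may contain NaN, filtering with `numpy_limit` would drop every row, since any comparison against NaN is false.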