pandas: convert data to quantile bins
Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14298433/
Convert data to the quantile bin
Asked by zach
I have a dataframe with numerical columns. For each column I would like to calculate quantile information and assign each row to one of the quantiles. I tried to use the qcut() method to return a list of bins but instead ended up calculating the bins individually. What I thought might exist, but couldn't find, is a method like df.to_quintile(num of quantiles). This is what I came up with, but I am wondering if there is a more succinct/pandas way of doing this.
import numpy as np
import pandas as pd

# create a dataframe
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

def quintile(df, column):
    """
    calculate quintiles and assign each sample/column to a quintile
    """
    # calculate the quintiles using pandas .quantile() here
    quintiles = [df[column].quantile(value) for value in [0.0, 0.2, 0.4, 0.6, 0.8]]
    quintiles.reverse()  # reversing makes the next loop simpler

    # function to check membership in quintile to be used with pandas apply
    def check_quintile(x, quintiles=quintiles):
        for num, level in enumerate(quintiles):
            if x >= level:
                print(x, num)  # debug output
                return num + 1

    df[column] = df[column].apply(check_quintile)

quintile(df, 'A')
thanks, zach cp
EDIT: After seeing DSM's answer, the function can be written much more simply (below). Man, that's sweet.
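def quantile(column, quantile=5):
    # levels/labels were later renamed categories/codes in pandas
    q = pd.qcut(column, quantile)
    return len(q.levels) - q.labels

df.apply(quantile)
# or, for a single column (Series.apply would call the function once per element)
quantile(df['A'])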
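The helper above relies on the `levels` and `labels` attributes of the pandas release in use when this was written. As a minimal sketch for more recent pandas, assuming the `labels=False` option of `pd.qcut` (which returns integer bin codes directly), the same reversed numbering could look like the following. The name `to_quantile_rank` is hypothetical and is not a pandas method.

```python
import numpy as np
import pandas as pd


def to_quantile_rank(column, n_quantiles=5):
    """Return 1 for the top quantile bin down to n_quantiles for the bottom one."""
    # labels=False makes qcut return integer bin codes (0 = lowest bin)
    codes = pd.qcut(column, n_quantiles, labels=False)
    # Reverse the numbering so the largest values get 1, as in the question
    return n_quantiles - codes


np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

ranked = df.apply(to_quantile_rank)   # every column at once
ranked_a = to_quantile_rank(df["A"])  # a single column
```

Passing the whole column to the helper matters here: `qcut` needs all values of a column to compute the bin edges, which is why it is called through `df.apply` (one call per column) or directly on `df["A"]`.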
Answered by DSM
I think using the labels stored inside the Categorical object returned by qcut can make this a lot simpler. For example:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1001)
>>> df = pd.DataFrame(np.random.randn(10, 2), columns=['A', 'B'])
>>> df
          A         B
0 -1.086446 -0.896065
1 -0.306299 -1.339934
2 -1.206586 -0.641727
3  1.307946  1.845460
4  0.829115 -0.023299
5 -0.208564 -0.916620
6 -1.074743 -0.086143
7  1.175839 -1.635092
8  1.228194  1.076386
9  0.394773 -0.387701
>>> q = pd.qcut(df["A"], 5)
>>> q
Categorical: A
array([[-1.207, -1.0771], (-1.0771, -0.248], [-1.207, -1.0771],
(1.186, 1.308], (0.569, 1.186], (-0.248, 0.569], (-1.0771, -0.248],
(0.569, 1.186], (1.186, 1.308], (-0.248, 0.569]], dtype=object)
Levels (5): Index([[-1.207, -1.0771], (-1.0771, -0.248],
(-0.248, 0.569], (0.569, 1.186], (1.186, 1.308]], dtype=object)
>>> q.labels
array([0, 1, 0, 4, 3, 2, 1, 3, 4, 2])
Or, to match your code:
>>> len(q.levels) - q.labels
array([5, 4, 5, 1, 2, 3, 4, 2, 1, 3])
>>> quintile(df, "A")
>>> np.array(df["A"])
array([5, 4, 5, 1, 2, 3, 4, 2, 1, 3])
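The session above reflects the pandas version of the time, where `qcut` returned a `Categorical` exposing `levels` and `labels`. As a hedged sketch for recent pandas releases, assuming `qcut` on a Series returns a categorical Series whose bin information is reached through the `.cat` accessor, the same steps might be written as:

```python
import numpy as np
import pandas as pd

np.random.seed(1001)
df = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])

q = pd.qcut(df["A"], 5)           # categorical Series of intervals

codes = q.cat.codes.to_numpy()    # counterpart of the old q.labels
n_bins = len(q.cat.categories)    # counterpart of len(q.levels)

reversed_rank = n_bins - codes    # 1 = top bin, n_bins = bottom bin
print(reversed_rank)
```

Here `codes` numbers the bins from 0 (lowest) upward, so subtracting from `n_bins` reproduces the 1-is-highest convention of the question's `quintile` function.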

