Python Pandas: Create a New Bin/Bucket Variable with pd.qcut

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/28442991/

Tags: python, pandas, bins, buckets

Asked by sfortney

How do you create a new Bin/Bucket variable using pd.qcut in Python?

This might seem elementary to experienced users, but I was not clear on it, and it was surprisingly hard to search for on Stack Overflow or Google. Some thorough searching turned up this question (Assignment of qcut as new column), but it didn't quite answer mine because it didn't take the last step of assigning each observation a bin number (i.e. 1, 2, ...).

Accepted answer by sfortney

EDIT: The answer below is only valid for Pandas versions older than 0.15.0. If you are running Pandas 0.15.0 or newer, see:

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)

Thanks to @unutbu for pointing it out. :)

Say you have some data that you want to bin (in my case, options spreads), and you want to make a new variable holding the bucket that corresponds to each observation. The link mentioned above shows that you can do this with:

print(pd.qcut(data3['spd_pct'], 40))

(0.087, 0.146]
(0.0548, 0.087]
(0.146, 0.5]
(0.146, 0.5]
(0.087, 0.146]
(0.0548, 0.087]
(0.5, 2]
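
For readers who want to reproduce this kind of interval output, here is a minimal sketch on made-up data. The spd_pct values are hypothetical stand-ins, not the data from the question, and the sketch assumes a recent Pandas version where qcut on a Series returns a categorical Series.

```python
import pandas as pd

# Hypothetical spread percentages; NOT the data from the question.
data3 = pd.DataFrame(
    {"spd_pct": [0.12, 0.07, 0.33, 0.40, 0.11, 0.06, 1.50, 0.02, 0.25, 0.09]}
)

# Split the column into 5 quantile-based bins.
# Each observation is mapped to the interval (bin) it falls into.
intervals = pd.qcut(data3["spd_pct"], 5)
print(intervals)

# The distinct bins, ordered from lowest to highest.
print(intervals.cat.categories)

# Quantile bins hold roughly equal numbers of observations.
print(intervals.value_counts())
```

Because qcut cuts at quantiles rather than at equally spaced values, each of the five bins here ends up with the same number of observations.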

which gives you the bin endpoints that correspond to each observation. However, if you would like the corresponding bin number for each observation instead, you can do this:

print(pd.qcut(data3['spd_pct'],5).labels)

[2 1 3 ..., 0 1 4] 

Putting it all together, if you would like to create a new variable containing just the bin numbers, this should suffice:

data3['bins_spd']=pd.qcut(data3['spd_pct'],5).labels

print(data3.head())

   secid      date    symbol  symbol_flag     exdate   last_date cp_flag  
0   5005  1/2/1997  099F2.37            0  1/18/1997         NaN       P   
1   5005  1/2/1997  09B0B.1B            0  2/22/1997   12/3/1996       P   
2   5005  1/2/1997  09B7C.2F            0  2/22/1997  12/11/1996       P   
3   5005  1/2/1997  09EE6.6E            0  1/18/1997  12/27/1996       C   
4   5005  1/2/1997  09F2F.CE            0  8/16/1997         NaN       P   

   strike_price  best_bid  best_offer     ...      close  volume_y    return  
0          7500     2.875      3.2500     ...        4.5     99200  0.074627   
1         10000     5.375      5.7500     ...        4.5     99200  0.074627   
2          5000     0.625      0.8750     ...        4.5     99200  0.074627   
3          5000     0.125      0.1875     ...        4.5     99200  0.074627   
4          7500     3.000      3.3750     ...        4.5     99200  0.074627   

   cfadj_y  open  cfret  shrout      mid   spd_pct  bins_spd  
0        1   4.5      1   57735  3.06250  0.122449         2  
1        1   4.5      1   57735  5.56250  0.067416         1  
2        1   4.5      1   57735  0.75000  0.333333         3  
3        1   4.5      1   57735  0.15625  0.400000         3  
4        1   4.5      1   57735  3.18750  0.117647         2  

[5 rows x 35 columns]

Hope this helps somebody else. At the very least it should be easier to search for now. :)

Answer by unutbu

In Pandas 0.15.0 or newer, pd.qcut will return a Series, not a Categorical, if the input is a Series (as it is in your case) or if labels=False. If you set labels=False, then qcut will return a Series whose values are the integer indicators of the bins.

So, to future-proof your code, you could use

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
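
As a rough illustration of the labels=False form, the sketch below runs it on made-up data. The spd_pct values are hypothetical, and the bins_spd_1based column is an illustrative addition for readers who want bins numbered from 1, as the question asks; it is not part of the original answer.

```python
import pandas as pd

# Hypothetical spread percentages; NOT the data from the question.
data3 = pd.DataFrame(
    {"spd_pct": [0.12, 0.07, 0.33, 0.40, 0.11, 0.06, 1.50, 0.02, 0.25, 0.09]}
)

# labels=False makes qcut return integer bin indicators (0 = lowest quantile bin)
# instead of interval labels.
data3["bins_spd"] = pd.qcut(data3["spd_pct"], 5, labels=False)

# Illustrative extra column: shift to 1-based bin numbers (1, 2, ..., 5).
data3["bins_spd_1based"] = data3["bins_spd"] + 1

print(data3)
```

The indicators start at 0 for the lowest quantile bin, so adding 1 is only needed if you prefer bins numbered 1 through 5.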

or pass a NumPy array to pd.qcut so that you get a Categorical as the return value. Note that the Categorical attribute labels is deprecated. Use codes instead:

data3['bins_spd'] = pd.qcut(data3['spd_pct'].values, 5).codes
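
To check that these routes agree, here is a small sketch on made-up data that compares them. The spd_pct values are hypothetical, and the third variant (using the .cat accessor on the Series result) is an additional option not mentioned in the answer, assumed to be available in recent Pandas versions.

```python
import pandas as pd

# Hypothetical spread percentages; NOT the data from the question.
data3 = pd.DataFrame(
    {"spd_pct": [0.12, 0.07, 0.33, 0.40, 0.11, 0.06, 1.50, 0.02, 0.25, 0.09]}
)

# 1) Series in, labels=False: a Series of integer bin indicators.
via_labels = pd.qcut(data3["spd_pct"], 5, labels=False)

# 2) NumPy array in: a Categorical, whose .codes hold the bin indicators.
via_array = pd.qcut(data3["spd_pct"].values, 5).codes

# 3) Series in, default labels: a categorical Series; .cat.codes gives the indicators.
#    (This variant is an assumption added for comparison, not from the answer.)
via_series = pd.qcut(data3["spd_pct"], 5).cat.codes

print(list(via_labels))
print(list(via_array))
print(list(via_series))
```

All three lists should match on data without missing values; the labels=False form is the shortest when only the bin numbers are needed.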