Python Pandas: Create New Bin/Bucket Variable with pd.qcut
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/28442991/
Asked by sfortney
How do you create a new bin/bucket variable using pd.qcut in Python?
This might seem elementary to experienced users, but I was not clear on it, and it was surprisingly hard to search for on Stack Overflow or Google. Some thorough searching turned up this question (Assignment of qcut as new column), but it didn't quite answer mine because it didn't take the last step of putting everything into numbered bins (i.e. 1, 2, ...).
Accepted answer by sfortney
EDIT: The answer below is only valid for versions of Pandas older than 0.15.0. If you are running Pandas 0.15.0 or newer, use this instead:
data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
Thanks to @unutbu for pointing it out. :)
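A minimal sketch of the labels=False form, run on a made-up miniature of data3. The values below are hypothetical and only the spd_pct column is reproduced:

```python
import pandas as pd

# Hypothetical stand-in for data3; the numbers are invented for illustration.
data3 = pd.DataFrame({
    "spd_pct": [0.122449, 0.067416, 0.333333, 0.400000, 0.117647,
                0.050000, 0.900000, 0.250000, 0.015000, 1.500000]
})

# Split into 5 quantile buckets and store the integer bucket number (0..4).
data3["bins_spd"] = pd.qcut(data3["spd_pct"], 5, labels=False)

print(data3)
# With 10 distinct values and 5 buckets, each bucket should hold 2 rows.
print(data3["bins_spd"].value_counts().sort_index())
```

Because the buckets are quantile-based, each one receives roughly the same number of rows, which is what the value_counts() call is meant to show.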
Say you have some data that you want to bin (in my case, option spreads) and you want to make a new variable holding the bucket that corresponds to each observation. The link mentioned above shows that you can do this with:
print(pd.qcut(data3['spd_pct'], 40))
(0.087, 0.146]
(0.0548, 0.087]
(0.146, 0.5]
(0.146, 0.5]
(0.087, 0.146]
(0.0548, 0.087]
(0.5, 2]
which gives you the bin endpoints that correspond to each observation. However, if you would like the corresponding bin number for each observation, you can do this:
print(pd.qcut(data3['spd_pct'], 5).labels)
[2 1 3 ..., 0 1 4]
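For readers on Pandas 0.15.0 or newer, where qcut on a Series returns a Series rather than a Categorical, the same two views (interval endpoints and bin numbers) can be sketched roughly as follows. The data is synthetic, and .cat.codes is swapped in here for the old .labels attribute:

```python
import pandas as pd

# Synthetic stand-in for the spread column; these values are made up.
spd_pct = pd.Series([0.05, 0.07, 0.09, 0.12, 0.15, 0.20, 0.33, 0.40, 0.80, 1.50])

binned = pd.qcut(spd_pct, 5)     # categorical Series of interval labels
bin_numbers = binned.cat.codes   # integer position of each interval (0..4)

print(binned.head(3))            # interval endpoints per observation
print(bin_numbers.tolist())      # bin number per observation
```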
Putting it all together: if you would like to create a new variable with just the bin numbers, this should suffice:
data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5).labels
print(data3.head())
   secid      date    symbol  symbol_flag     exdate   last_date  cp_flag
0   5005  1/2/1997  099F2.37            0  1/18/1997         NaN        P
1   5005  1/2/1997  09B0B.1B            0  2/22/1997   12/3/1996        P
2   5005  1/2/1997  09B7C.2F            0  2/22/1997  12/11/1996        P
3   5005  1/2/1997  09EE6.6E            0  1/18/1997  12/27/1996        C
4   5005  1/2/1997  09F2F.CE            0  8/16/1997         NaN        P

   strike_price  best_bid  best_offer  ...  close  volume_y    return
0          7500     2.875      3.2500  ...    4.5     99200  0.074627
1         10000     5.375      5.7500  ...    4.5     99200  0.074627
2          5000     0.625      0.8750  ...    4.5     99200  0.074627
3          5000     0.125      0.1875  ...    4.5     99200  0.074627
4          7500     3.000      3.3750  ...    4.5     99200  0.074627

   cfadj_y  open  cfret  shrout      mid   spd_pct  bins_spd
0        1   4.5      1   57735  3.06250  0.122449         2
1        1   4.5      1   57735  5.56250  0.067416         1
2        1   4.5      1   57735  0.75000  0.333333         3
3        1   4.5      1   57735  0.15625  0.400000         3
4        1   4.5      1   57735  3.18750  0.117647         2

[5 rows x 35 columns]
Hope this helps somebody else. At the very least it should be easier to search for now. :)
Answer by unutbu
In Pandas 0.15.0 or newer, pd.qcut will return a Series, not a Categorical, if the input is a Series (as it is in your case) or if labels=False. If you set labels=False, then qcut will return a Series with the integer indicators of the bins as values.
So to future-proof your code, you could use
data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
Alternatively, pass a NumPy array to pd.qcut so that you get a Categorical as the return value.
Note that the Categorical attribute labels is deprecated. Use codes instead:
data3['bins_spd'] = pd.qcut(data3['spd_pct'].values, 5).codes
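As a rough check that the two spellings agree, the sketch below bins a made-up NumPy array both ways. Here values is a hypothetical stand-in for data3['spd_pct'].values:

```python
import numpy as np
import pandas as pd

# Made-up data standing in for data3['spd_pct'].values.
values = np.array([0.12, 0.07, 0.33, 0.40, 0.11, 0.05, 0.90, 0.25, 0.02, 1.50])

categorical = pd.qcut(values, 5)              # array input gives a Categorical
from_codes = categorical.codes                # integer bin numbers via .codes
from_flag = pd.qcut(values, 5, labels=False)  # integer bin numbers directly

print(type(categorical).__name__)
print(np.array_equal(from_codes, from_flag))
```

Either form yields the same bin numbers, so the choice mostly comes down to whether you also want the interval categories for later use.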

