Qcut Pandas: ValueError: Bin edges must be unique

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/38309144/

Date: 2020-09-14 01:33:31  Source: igfitidea

Tags: python, arrays, pandas, dataframe

Asked by Arij SEDIRI

I'm using qcut from pandas to discretize my data into equal-sized buckets. I want to have price buckets. This is my DataFrame:

        productId   sell_prix   categ   popularity
11997   16758760.0  28.75        50      524137.0
11998   16758760.0  28.75        50      166795.0
13154   16782105.0  24.60        50      126890.5
13761   16790082.0  65.00        50      245437.0
13762   16790082.0  65.00        50      245242.0
15355   16792720.0  29.00        50      360219.0
15356   16792720.0  29.00        50      360100.0
15357   16792720.0  29.00        50      360027.0
15358   16792720.0  29.00        50      462850.0
15367   16792728.0  29.00        50      193030.5

And this is my code:

df['PriceBucket'] = pd.qcut(df['sell_prix'], 3)

I get this error message:

ValueError: Bin edges must be unique: array([ 24.6,  29. ,  29. ,  65. ])

In reality, I have a DataFrame with 7413 rows, so this is just a sample of the real DataFrame. The strange thing is that when I use the same code on a DataFrame with 359824 rows, with practically the same data, it works! Does the result depend on the length of the DataFrame?

Help please! Many thanks.

Answered by luca

Various solutions are discussed here, but briefly, rank the values first so that ties are broken and the bin edges become unique:

> pd.qcut(df['a'].rank(method='first'), 3)
0        [1, 2.333]
1        [1, 2.333]
2    (2.333, 3.667]
3        (3.667, 5]
4        (3.667, 5]

Or, to get integer bucket labels instead of intervals:

> pd.qcut(df['a'].rank(method='first'), 3, labels=False)
0    0
1    0
2    1
3    2
4    2
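As a rough sketch of how the ranking approach applies to the data in the question, the example below copies the ten sample prices by hand (the variable names `prices`, `ranks`, and `buckets` are illustrative only). Because rank(method='first') gives every row a distinct rank, the quantile edges computed on the ranks cannot coincide.

```python
import pandas as pd

# Prices copied by hand from the ten sample rows in the question.
prices = pd.Series(
    [28.75, 28.75, 24.60, 65.00, 65.00, 29.00, 29.00, 29.00, 29.00, 29.00],
    name="sell_prix",
)

# method='first' breaks ties by order of appearance, so all ranks are distinct.
ranks = prices.rank(method="first")

# Bucketing the ranks instead of the raw prices avoids duplicate bin edges.
buckets = pd.qcut(ranks, 3, labels=False)

print(pd.DataFrame({"sell_prix": prices, "rank": ranks, "bucket": buckets}))
print(buckets.value_counts(sort=False))
```

One consequence worth keeping in mind: rows with the same price (for example the five rows at 29.00) can land in different buckets, because the tie is broken purely by row order.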

Answered by Fortunato

The 'sell_prix' field in your smaller DataFrame doesn't have enough unique values to break into three equally sized buckets. As a result, the upper endpoints of the first and second buckets are the same, which is why you are getting an error.

Consider:

df = pd.DataFrame([[1,2,3],[1,4,5],[1,5,6],[1,3,4], [2,3,4]], columns = ['a','b','c'])
df
   a  b  c
0  1  2  3
1  1  4  5
2  1  5  6
3  1  3  4
4  2  3  4

pd.qcut(df['a'], 3)

ValueError: Bin edges must be unique: array([ 1.,  1.,  1.,  2.])
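The duplicated edges can be inspected directly, and qcut can be told to merge them. The sketch below is not part of the original answer; it assumes a pandas version in which qcut accepts the duplicates argument (added in 0.20, as far as I know), and it reuses the column 'a' from the example above.

```python
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2], name="a")

# These are the quantiles qcut uses as bin edges for q=3.
edges = a.quantile([0, 1 / 3, 2 / 3, 1]).tolist()
print(edges)

# With the default duplicates='raise', repeated edges cause the ValueError.
try:
    pd.qcut(a, 3)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# duplicates='drop' merges identical edges, so fewer than 3 buckets come back.
result, bins = pd.qcut(a, 3, duplicates="drop", retbins=True)
print(bins)
print(result.value_counts())
```

Here the three requested buckets collapse into a single one, so this option removes the error but not the underlying problem that the column has too few distinct values.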

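Try using cut instead. Note that cut makes bins of equal width rather than equal count, so the buckets may hold very different numbers of rows: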

pd.cut(df['a'], 3)

0    (0.999, 1.333]
1    (0.999, 1.333]
2    (0.999, 1.333]
3    (0.999, 1.333]
4        (1.667, 2]
Name: a, dtype: category
Categories (3, object): [(0.999, 1.333] < (1.333, 1.667] < (1.667, 2]]
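To make the trade-off concrete, here is a small sketch (again not from the original answer) that counts how many rows fall into each bin produced by cut on the same column. The name `counts` is just an illustrative variable.

```python
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2], name="a")

# cut splits the value range [1, 2] into 3 intervals of equal width.
binned = pd.cut(a, 3)

# Count rows per interval, keeping the intervals in their natural order.
counts = binned.value_counts(sort=False)
print(counts)
```

The middle interval stays empty and most rows end up in the first one, so cut avoids the error but does not give the equal-sized buckets that qcut aims for.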