Why does pandas qcut raise ValueError: Bin edges must be unique?

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/36880490/

Time: 2020-09-14 01:08:06  Source: igfitidea

python, pandas

Asked by ihsansat

I have this dataset:

recency;frequency;monetary
21;156;41879955
13;88;16850284
8;74;79150488
2;74;26733719
9;55;16162365
...;...;...

The full raw data is at http://pastebin.com/beiEeS80. I load it into a DataFrame, and here is my complete code:

df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)

df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1

but it returns an error:

df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
ValueError: Bin edges must be unique: array([   1.,    1.,    2.,    4.,    9.,  156.])

How can I solve this?

Answered by piRSquared

I ran this in Jupyter and placed exampledata.txt in the same directory as the notebook.

Please note that the first line:

df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])

loads the column 'userid' even though it isn't defined in the data file. I removed this column name.

Solution

import pandas as pd

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).argmax()
    return series.rank(pct=1).apply(f)

datas = pd.read_csv('./exampledata.txt', delimiter=';')

df = pd.DataFrame(datas, columns=['recency', 'frequency', 'monetary'])

df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)

df['recency'] = pct_rank_qcut(df.recency, 5)
df['frequency'] = pct_rank_qcut(df.frequency, 5)
df['monetary'] = pct_rank_qcut(df.monetary, 5)

Explanation

The problem you were seeing was a result of pd.qcut assuming 5 bins of equal size. In the data you provided, more than 28% of the 'frequency' values are 1. That is more than one fifth of the data, so the first two quantile edges (the minimum and the 20th percentile) are both 1, as the array in the error message shows. This broke qcut.
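To see the duplicate edges directly, here is a minimal sketch. The sample values are made up for illustration (they are not the poster's data); the only assumption that matters is that more than one fifth of them are equal.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 3 of 10 values are 1 (30%, more than one fifth).
freq = pd.Series([1, 1, 1, 2, 3, 4, 6, 9, 40, 156], dtype=float)

# The six quantile edges a 5-bin qcut needs: 0%, 20%, ..., 100%.
edges = freq.quantile(np.linspace(0, 1, 6)).to_numpy()
print(edges)

# The minimum and the 20th percentile coincide, so the edges are not unique.
has_duplicates = len(np.unique(edges)) < len(edges)
print(has_duplicates)

# qcut refuses to build bins from duplicate edges.
try:
    pd.qcut(freq, 5)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

The first two printed edges are both 1.0, which mirrors the array shown in the error message above.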

I provided a new function, pct_rank_qcut, that addresses this and pushes all 1's into the first bin.

    edges = pd.Series([float(i) / n for i in range(n + 1)])

This line defines a series of percentile edges based on the desired number of bins, n. In the case of n = 5, the edges will be [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].

    f = lambda x: (edges >= x).argmax()

This line defines a helper function to be applied to another series in the next line. edges >= x returns a series equal in length to edges, where each element is True or False depending on whether x is less than or equal to that edge. In the case of x = 0.14, the resulting (edges >= x) will be [False, True, True, True, True, True]. By taking the argmax(), I've identified the first index where the series is True, in this case 1.
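The x = 0.14 case can be checked step by step. This is only a sketch of the lookup described above; it uses to_numpy().argmax() so that the result is a position regardless of pandas version.

```python
import pandas as pd

n = 5
edges = pd.Series([float(i) / n for i in range(n + 1)])

x = 0.14  # a percentile rank, as in the example above
mask = edges >= x
print(mask.tolist())  # [False, True, True, True, True, True]

# argmax on a boolean array gives the position of the first True.
bin_number = int(mask.to_numpy().argmax())
print(bin_number)  # 1
```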

    return series.rank(pct=1).apply(f)

This line takes the input series and turns it into a percentile ranking. I can compare these rankings to the edges I've created, which is why I use apply(f). What's returned should be a series of bin numbers from 1 to n. This series of bin numbers is the same thing you were trying to get with:

pd.qcut(df['recency'].values, 5).codes + 1

This has consequences in that the bins are no longer equal and that bin 1 borrows completely from bin 2. But some choice had to be made. If you don't like this choice, use the concept to build your own ranking.

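A small sketch of that consequence, using made-up data (not the poster's file) in which 30% of the values are 1. The helper is the same function as in the answer, written with .values.argmax().

```python
import pandas as pd

def pct_rank_qcut(series, n):
    # Same helper as in the answer (the .values.argmax() variant).
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=True).apply(f)

# Hypothetical data: six 1's (30%), two 2's, then twelve distinct values.
freq = pd.Series([1] * 6 + [2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 156],
                 dtype=float)

bins = pct_rank_qcut(freq, 5)

# Number of rows that landed in each bin.
counts = bins.value_counts().sort_index()
print(counts)
```

With this data, bin 1 holds all six 1's while bin 2 holds only the two 2's, so the bins are visibly unequal.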
Demonstration

print(df.head())

   recency  frequency  monetary
0        3          5         5
1        2          5         5
2        2          5         5
3        1          5         5
4        2          5         5

Update

pd.Series.argmax() is now deprecated. Simply switch to pd.Series.values.argmax() to update!

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=1).apply(f)

Answered by luca

Various solutions are discussed here, but briefly:

If you are using pandas >= 0.20.0, qcut has an option duplicates='raise'|'drop' that controls whether to raise on duplicate edges or to drop them. Dropping them results in fewer bins than specified, with some bins larger (holding more elements) than others.
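A minimal sketch of the duplicates='drop' option, using made-up values rather than the poster's data. It passes a Series instead of .values, so the codes are read through the .cat accessor.

```python
import pandas as pd

# Hypothetical sample: 3 of 10 values are 1, so the 0% and 20% edges coincide.
freq = pd.Series([1, 1, 1, 2, 3, 4, 6, 9, 40, 156], dtype=float)

# Ask for 5 bins, but let pandas drop the duplicate edge instead of raising.
binned = pd.qcut(freq, 5, duplicates='drop')
print(binned.cat.categories)

# Only 4 bins remain because one of the 6 edges was a duplicate.
n_bins = len(binned.cat.categories)
print(n_bins)  # 4

# 1-based bin numbers, analogous to `.codes + 1` in the question.
codes = binned.cat.codes + 1
print(codes.value_counts().sort_index())
```

The first bin ends up holding four values and the other three hold two each, which is the "fewer, unequal bins" trade-off mentioned above.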

For previous pandas versions try passing the ranked values instead of the values themselves:

pd.qcut(df['frequency'].rank(method='first').values, 5).codes + 1

With this approach, identical values may end up in different quantiles. Whether that is acceptable depends on your specific needs. If it is not what you want, have a look at pandas.cut, which chooses bins that are evenly spaced over the values themselves, whereas pandas.qcut chooses bins so that each one holds the same number of records.
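The difference between the two approaches can be sketched on a small made-up sample (again, not the poster's data). A Series is passed here, so the codes come from the .cat accessor.

```python
import pandas as pd

# Hypothetical sample with three tied 1's.
freq = pd.Series([1, 1, 1, 2, 3, 4, 6, 9, 40, 156], dtype=float)

# qcut on ranks: method='first' makes every rank unique, so edges are unique,
# but the tied 1's get ranks 1, 2, 3 and are split across bins.
q_codes = pd.qcut(freq.rank(method='first'), 5).cat.codes + 1
print(q_codes.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# cut: 5 equal-width bins over the value range; ties stay together,
# but the bin sizes are very uneven for skewed data.
c_codes = pd.cut(freq, 5).cat.codes + 1
print(c_codes.tolist())  # [1, 1, 1, 1, 1, 1, 1, 1, 2, 5]
```

Note how the third 1 lands in bin 2 under the rank-based qcut, while cut keeps all the 1's together at the cost of nearly empty upper bins.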