Python: How to qcut with non-unique bin edges?

Question

Asked by geronimo

My question is the same as this previous one:

However, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest are, let's say, between 1 and 100, how would I put all the 0 values in fractile 1 and the non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), add 1 to each label (now 2 to 10), and then label all the 0 values as fractile 1 manually? Even this is tricky, because in addition to the 600 values my data set has another couple hundred that may already be NaN before I convert the 0s to NaN.
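
The idea described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the function name zero_first_fractiles and the sample data are made up, and it uses a boolean mask instead of actually converting the 0s to NaN, so values that were already NaN stay distinguishable. pd.qcut can still raise on the non-zero part if that part has heavy ties of its own.

```python
import numpy as np
import pandas as pd

def zero_first_fractiles(ser, num_fractiles=10):
    """Put every 0 in fractile 1 and qcut the other values into 2..num_fractiles."""
    # Start from all-NaN labels so values that were already NaN stay NaN.
    out = pd.Series(np.nan, index=ser.index)
    is_zero = ser == 0
    rest = ~is_zero & ser.notna()
    # labels=False returns integer codes 0..k-1; shift them to 2..num_fractiles.
    codes = pd.qcut(ser[rest], num_fractiles - 1, labels=False)
    out.loc[rest] = codes.to_numpy() + 2
    out.loc[is_zero] = 1
    return out

# Hypothetical data: 300 zeros, 300 non-zero values, and a few pre-existing NaNs.
vals = pd.Series([0] * 300 + list(range(1, 301)) + [np.nan] * 5)
fractiles = zero_first_fractiles(vals)
```

The zeros all land in fractile 1, the non-zero values are spread over fractiles 2 to 10, and the rows that were NaN to begin with keep a NaN label.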

Update 1/26/14:

I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at an edge of the distribution, it inserts an extra bin in the middle of the existing set of bins and throws everything off a little (or a lot).

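import bisect
import pandas as pd

def fractile_cut(ser, num_fractiles):
    # Count only non-NaN values (Series.valid() was removed; dropna() replaces it).
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    # Peel off every value frequent enough to fill more than one fractile by itself.
    while i < len(vcounts) and vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    # qcut the remaining values, then put the high-frequency values back as extra edges.
    curr_ser = ser[~ser.isin(high_freq)]
    qcut_bins = pd.qcut(curr_ser, remain_fractiles, retbins=True)[1]
    all_bins = list(qcut_bins)
    for val in high_freq:
        bisect.insort(all_bins, val)
    # include_lowest=True keeps the smallest edge inside the first bin.
    cut = pd.cut(ser, bins=all_bins, include_lowest=True)
    # .cat.codes replaces the old .labels attribute; NaN has code -1, so it maps to 0.
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles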

Answer 1

Answered by ashishsingal

I've had a lot of problems with qcut as well, so I used the Series.rank function and created my own bins from those results. My code is on GitHub:

https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8
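
The gist itself is not reproduced here. As a rough sketch of the general idea only (ranking first, then deriving bins from the ranks), something like the following would work. The function name rank_bins and the choice of method='average' are assumptions for illustration, not the gist's code.

```python
import numpy as np
import pandas as pd

def rank_bins(ser, n_bins):
    """Assign bin numbers 1..n_bins from percentile ranks; ties share one bin."""
    # Percentile rank in (0, 1]; tied values get the same (average) rank.
    pct = ser.rank(method="average", pct=True)
    # Scale to 1..n_bins. NaN inputs stay NaN.
    return np.ceil(pct * n_bins)

# Hypothetical data: six tied zeros, four distinct values, one NaN.
sample = pd.Series([0] * 6 + [1, 2, 3, 4] + [np.nan])
sample_bins = rank_bins(sample, 5)
```

Because tied values share a rank, all the zeros end up in the same bin. The price is that some bin numbers may be left empty and the bins are not equally sized.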

Answer 2

Answered by OYRM

You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge at the point in the pandas.tools.tile.qcut function where the bins are defined:

bins = algos.quantile(x, quantiles)

Because your data is 50% 0s, this returns bins with multiple bin edges at the value 0 for any number of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, binning all 0s, but not only 0s, in the first bin. In the second, the fractile space is divided evenly for values greater than 0, binning all 0s and only 0s in the first bin.
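
To see the repeated edges directly, here is a small sketch. It uses np.quantile rather than the pandas-internal algos.quantile, and the seeded generator is an assumption added only to make the data reproducible.

```python
import numpy as np

# Same shape as the data in the question: 300 zeros plus 300 values from 1 to 99.
rng = np.random.default_rng(0)
arr = np.concatenate((np.zeros(300), rng.integers(1, 100, size=300)))

# Decile edges: 11 quantiles from 0% to 100%.
edges = np.quantile(arr, np.linspace(0, 1, 11))

# The 0%, 10%, 20%, 30% and 40% quantiles all fall inside the run of zeros.
n_zero_edges = int((edges == 0).sum())
has_duplicates = len(np.unique(edges)) < len(edges)
```

Five of the eleven edges are 0 here, which is exactly the situation qcut refuses to bin.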

import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series

In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100:

zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr=np.concatenate((zs, rs))
ser = Series(arr)

Solution 1: bin 1 contains both 0s and low values

bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)

The result is

In[61]: result.value_counts()
Out[61]: 
[0, 9.3]        323
(27.9, 38.2]     37
(9.3, 18.6]      37
(88.7, 99]       35
(57.8, 68.1]     32
(68.1, 78.4]     31
(78.4, 88.7]     30
(38.2, 48.5]     27
(48.5, 57.8]     26
(18.6, 27.9]     22
dtype: int64

Solution 2: bin 1 contains only 0s

mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)

The result is:

In[133]: result.value_counts()
Out[133]: 
[0, 0.5]        300
(0.5, 11]        32
(11, 18.8]       28
(18.8, 29.7]     30
(29.7, 39]       35
(39, 50]         26
(50, 59]         31
(59, 71]         31
(71, 79.2]       27
(79.2, 90.2]     30
(90.2, 99]       30
dtype: int64

Solution 2 could be made a little prettier, I think, but you can see that the masked array is a useful tool for approaching your goal.
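
pd.tools.tile._bins_to_cuts is a private helper and is not available in current pandas releases. A sketch of the same two solutions using only the public pd.cut and np.quantile might look like this. The seeded generator and the variable names are assumptions, and the bin edges will differ from the outputs shown above because the random data differs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
arr = np.concatenate((np.zeros(300), rng.integers(1, 100, size=300)))
ser = pd.Series(arr)

# Solution 1: decile edges over the unique values, so the zeros cannot repeat an edge.
bins1 = np.quantile(np.unique(arr), np.linspace(0, 1, 11))
result1 = pd.cut(ser, bins1, include_lowest=True)

# Solution 2: decile edges over the non-zero values, plus a leading bin for the zeros.
nonzero = arr[arr != 0]
bins2 = np.quantile(nonzero, np.linspace(0, 1, 11))
bins2 = np.insert(bins2, 0, 0)
bins2[1] = bins2[1] / 2  # pull the second edge below the smallest non-zero value
result2 = pd.cut(ser, bins2, include_lowest=True)

zeros_only_bin_size = int((result2.cat.codes == 0).sum())
```

A plain boolean filter (arr != 0) stands in for the masked array here. In result2 the first bin holds the 300 zeros and nothing else.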

Answer 3

Answered by mgoldwasser

Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:

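import numpy as np
import pandas as pd

# list(...) is needed on Python 3, where range() cannot be added to a list.
a = pd.Series(list(range(100)) + [0] * 20)

def jitter(a_series, noise_reduction=1000000):
    # Uniform noise centered on 0, scaled to a tiny fraction of the standard deviation.
    return (np.random.random(len(a_series))*a_series.std()/noise_reduction)-(a_series.std()/(2*noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)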

We can recreate the original error using something like this:

a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
    precision=precision, include_lowest=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([  0.        ,   0.        ,   0.        ,   3.8       ,
        11.73333333,  19.66666667,  27.6       ,  35.53333333,
        43.46666667,  51.4       ,  59.33333333,  67.26666667,
        75.2       ,  83.13333333,  91.06666667,  99.        ])

Answer 4

Answered by luca

The problem is that pandas.qcut chooses the bins so that you have the same number of records in each bin/quantile, but records with the same value cannot go in different bins/quantiles.

The solutions are:

1 - Use pandas >= 0.20.0, which has this fix. It added the option duplicates='raise'|'drop' to control whether to raise on duplicated edges or to drop them. Dropping them results in fewer bins than specified, and some bins larger (with more elements) than others.

2 - Use pandas.cut, which chooses the bins to be evenly spaced according to the values themselves, whereas pandas.qcut chooses the bins so that each one holds the same number of records.

3 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.

4 - Specify a custom quantile range, e.g. [0, .50, .75, 1.], to get an unequal number of items per quantile.

5 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value (the rank) to each element in the dataframe while keeping the order of the elements (except for identical values, which are ranked in the order they appear in the array; see method='first'). This fixes the issue, but identical (pre-ranking) values may end up in different quantiles, which may or may not be correct depending on your intent.

Example:

pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(df.rank(method='first'), nbins)
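
Options 1 and 4 from the list can be sketched in the same way. The sample series s below is made up for illustration (100 distinct values plus 20 extra zeros), and the variable names are assumptions.

```python
import pandas as pd

# 100 distinct values plus 20 extra zeros, so the two lowest decile edges are both 0.
s = pd.Series(list(range(100)) + [0] * 20)

# Option 1: drop the repeated edge (needs pandas >= 0.20.0); fewer bins come back.
dropped = pd.qcut(s, 10, duplicates="drop")
n_bins = dropped.cat.categories.size

# Option 4: pass explicit quantiles that step over the repeated region.
custom = pd.qcut(s, [0, 0.5, 0.75, 1.0])
custom_counts = custom.value_counts(sort=False).tolist()
```

With duplicates='drop' the ten requested bins collapse to nine, and the custom quantiles deliberately produce bins of unequal size (half the records in the first bin, a quarter in each of the others).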

Answer 5

Answered by Jena Vint

If you want to enforce equal-size bins, even in the presence of duplicate values, you can use the following two-step process:

Rank your values, using method='first' to have pandas assign a unique rank to every record. If there is a duplicate value (i.e. a tie in the rank), this method ranks the tied records in the order in which they appear.

df['rank'] = df['value'].rank(method='first')

Use qcut on the rank to determine equal-sized quantiles. The example below creates deciles (bins=10).

df['decile'] = pd.qcut(df['rank'].values, 10).codes
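
Put together as a self-contained sketch, with a made-up frame of 50 zeros and the values 1 to 50 (the column names follow the snippets above, the data is an assumption):

```python
import pandas as pd

# Hypothetical frame: half the rows are tied at 0.
df = pd.DataFrame({"value": [0] * 50 + list(range(1, 51))})

# method='first' breaks ties by order of appearance, so every rank is unique.
df["rank"] = df["value"].rank(method="first")
# labels=False returns the integer bin numbers 0..9 directly.
df["decile"] = pd.qcut(df["rank"], 10, labels=False)

sizes = df["decile"].value_counts().sort_index().tolist()
zero_deciles = df.loc[df["value"] == 0, "decile"].nunique()
```

Every decile gets exactly 10 rows, but the 50 zeros are spread over five different deciles. That is the trade-off of this approach: equal-sized bins at the cost of splitting identical values.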
