Binning column with python pandas

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but only under the same CC BY-SA terms, and you must attribute it to the original authors (not me) on StackOverflow.

Original question: http://stackoverflow.com/questions/45273731/
Asked by Night Walker
I have a Data Frame column with numeric values:
df['percentage'].head()
46.5
44.2
100.0
42.12
I want to see the column as bin counts:
bins = [0, 1, 5, 10, 25, 50, 100]
How can I get the result as bins with their value counts?
[0, 1] bin amount
[1, 5] etc
[5, 10] etc
......
Answered by jezrael
You can use pandas.cut:
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
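A side note (my addition, not part of the original answer): pd.cut closes intervals on the right by default; passing right=False switches to left-closed intervals, which changes where edge values land:
# right=False gives left-closed intervals, e.g. [25, 50) instead of
# (25, 50]; note that 100.0 then falls outside [50, 100) and becomes NaN
df['binned'] = pd.cut(df['percentage'], bins, right=False)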
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
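Also worth knowing (my addition): labels=False makes cut return plain zero-based integer codes instead of intervals or custom labels:
# labels=False returns the zero-based index of each bin;
# 46.5 falls in the fifth interval (25, 50], i.e. code 4
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=False)
print (df)
   percentage  binned
0       46.50       4
1       44.20       4
2      100.00       5
3       42.12       4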
import numpy as np

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
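A caveat worth adding (my addition): np.searchsorted returns insertion positions rather than validated bin codes. With the default side='left', values on a bin edge happen to match pd.cut's right-closed intervals here, but out-of-range values silently get code 0 or len(bins) instead of NaN:
np.searchsorted(bins, [1.0, 100.0, 150.0])
# array([1, 6, 7])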
...and then use value_counts, or groupby and aggregate with size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
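One more detail (my addition): value_counts sorts by count, which is why the bins above appear out of order; appending sort_index() restores bin order:
s = pd.cut(df['percentage'], bins=bins).value_counts().sort_index()
print (s)
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
Name: percentage, dtype: int64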
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default, cut returns a categorical. Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data; see the pandas documentation on operations with categoricals.
Answered by Erfan
Using the numba module for a speed-up.
On big datasets (> 500k rows), pd.cut can be quite slow for binning data.
I wrote my own function in numba with just-in-time compilation, which is roughly 16x faster:
import numpy as np
from numba import njit

@njit
def cut(arr):
    # assign each value a 1-based bin code using left-closed ranges,
    # with a catch-all bin 7 for anything below 0 or at/above 100
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 5):
            bins[idx] = 2
        elif (x >= 5) & (x < 10):
            bins[idx] = 3
        elif (x >= 10) & (x < 25):
            bins[idx] = 4
        elif (x >= 25) & (x < 50):
            bins[idx] = 5
        elif (x >= 50) & (x < 100):
            bins[idx] = 6
        else:
            bins[idx] = 7
    return bins
cut(df['percentage'].to_numpy())
# array([5., 5., 7., 5.])
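One difference from pd.cut worth noting (my addition): the ranges above are left-closed, so a value that sits exactly on a bin edge is assigned differently than by pd.cut's right-closed default:
# 100.0 is not < 100, so it falls through to the catch-all bin 7,
# whereas pd.cut would place it in (50, 100]
cut(np.array([100.0]))
# array([7.])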
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())

conversion_dict = {1: 'bin1',
                   2: 'bin2',
                   3: 'bin3',
                   4: 'bin4',
                   5: 'bin5',
                   6: 'bin6',
                   7: 'bin7'}
bins = list(map(conversion_dict.get, a))
# ['bin5', 'bin5', 'bin7', 'bin5']
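Equivalently (my addition), the lookup can go through pandas to get a Series back instead of a list:
pd.Series(a).map(conversion_dict)
# 0    bin5
# 1    bin5
# 2    bin7
# 3    bin5
# dtype: object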
Speed comparison:
# create dataframe of 8 million rows for testing
dfbig = pd.concat([df]*2000000, ignore_index=True)
dfbig.shape
# (8000000, 1)
%%timeit
cut(dfbig['percentage'].to_numpy())
# 38 ms ± 616 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
pd.cut(dfbig['percentage'], bins=bins, labels=labels)
# 215 ms ± 9.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
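A middle ground worth trying (my addition, not benchmarked here): np.digitize does the same binning in plain vectorized NumPy with no extra dependency, and its default left-closed behaviour matches the numba function above:
bins = [0, 1, 5, 10, 25, 50, 100]
np.digitize(df['percentage'].to_numpy(), bins)
# array([5, 5, 7, 5])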