Creating bins of a column and getting the count in pandas

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/46803556/

Date: 2020-09-14 04:39:27 | Source: igfitidea

Creating bins of a column and getting the count in pandas

pandas

Asked by Shubham R

I have a pandas DataFrame:


item_code    price
   1           15
   1           30
   1           60
   2           50
   3           90
   4           110
   5           130
   4           150

We can see that the max price is 150. I want to divide the prices into 5 bins of width 30 each (as new columns) and get the count of occurrences of each item_code in each price bin.
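For anyone who wants to run the answers below, the sample frame can be rebuilt like this (a minimal sketch; column names and values taken from the question):

```python
import pandas as pd

# Rebuild the sample data shown in the question
df = pd.DataFrame({
    'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
    'price':     [15, 30, 60, 50, 90, 110, 130, 150],
})
print(df['price'].max())  # 150, so 5 bins of width 30 cover the range
```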


Desired final df:


item_code    0-30    31-60    61-90    91-120    121-150
    1         2         1       0         0          0
    2         0         1       0         0          0
    3         0         0       1         0          0
    4         0         0       0         1          1
    5         0         0       0         0          1

i.e. item_code 1 falls twice in the price range 0-30, so under column 0-30 the count is 2. item_code 1 falls once in the price range 31-60, so that count is 1. Similarly for the other item codes.


I tried using pd.cut:


bins = [0, 30, 60, 90, 120,150]
df2 = pd.cut(df['price'], bins)

But it did not work.
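Strictly speaking, pd.cut did work: it returns one interval label per row, not a count table. The missing step is aggregating those labels per item_code, which is what the answers below do. A quick sketch of what the attempt actually produces:

```python
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})

bins = [0, 30, 60, 90, 120, 150]
df2 = pd.cut(df['price'], bins)

# df2 is a per-row Series of intervals: (0, 30], (0, 30], (30, 60], ...
print(df2.head(3))
```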


Answer by piRSquared

Setup


cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]


Option 1
Use pd.get_dummies and pd.DataFrame.join:


df[['item_code']].join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))

   item_code  0-30  31-60  61-90  91-120  121-150
0          1     1      0      0       0        0
1          1     1      0      0       0        0
2          1     0      1      0       0        0
3          2     0      1      0       0        0
4          3     0      0      1       0        0
5          4     0      0      0       1        0
6          5     0      0      0       0        1
7          4     0      0      0       0        1
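The table above still has one row per original record. To reach the per-item_code table the question asks for, one extra groupby/sum collapses it (a sketch; this step is not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})
cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]

# Row-level dummies as in Option 1, then summed per item_code
dummies = pd.get_dummies(pd.cut(df.price, bins, labels=cats))
out = dummies.groupby(df.item_code).sum().reset_index()
print(out)
```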


Option 2
Using NumPy's searchsorted and some string-array addition:


import numpy as np
import pandas as pd
from numpy.core.defchararray import add  # np.char.add in newer NumPy

bins = np.arange(30, 121, 30)

b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

df[['item_code']].join(pd.get_dummies(cats[bins.searchsorted(df.price)]))

   item_code  0-30  120-150  30-60  60-90  90-120
0          1     1        0      0      0       0
1          1     1        0      0      0       0
2          1     0        0      1      0       0
3          2     0        0      1      0       0
4          3     0        0      0      1       0
5          4     0        0      0      0       1
6          5     0        1      0      0       0
7          4     0        1      0      0       0


If you are looking to sum the like-valued item_codes, you can use groupby instead of join:


from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)

b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

pd.get_dummies(cats[bins.searchsorted(df.price)]).groupby(df.item_code).sum().reset_index()

   item_code  0-30  120-150  30-60  60-90  90-120
0          1     2        0      1      0       0
1          2     0        0      1      0       0
2          3     0        0      0      1       0
3          4     0        1      0      0       1
4          5     0        1      0      0       0
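Note the column order above: get_dummies sorts the string labels lexicographically, so '120-150' lands before '30-60'. A reindex over cats restores numeric bin order (a sketch; np.char.add is used here in place of the deprecated numpy.core.defchararray import):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})

bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = np.char.add(np.char.add(np.append('0', b), '-'), np.append(b, '150'))

out = (pd.get_dummies(cats[bins.searchsorted(df.price)])
         .groupby(df.item_code).sum()
         .reindex(columns=cats)   # back to numeric bin order
         .reset_index())
print(out.columns.tolist())
```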


Option 3
A very fast approach using pd.factorize and np.bincount:


from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)

b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

j, c = pd.factorize(bins.searchsorted(df.price))
i, r = pd.factorize(df.item_code.values)
n, m = c.size, r.size

pd.DataFrame(
    np.bincount(i * m + j, minlength=n * m).reshape(n, m),
    r, cats).rename_axis('item_code').reset_index()

   item_code  0-30  30-60  60-90  90-120  120-150
0          1     2      1      0       0        0
1          2     0      1      0       0        0
2          3     0      0      1       0        0
3          4     0      0      0       1        1
4          5     0      0      0       0        1
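The speed comes from encoding every (item, bin) pair as a single flat integer, so one np.bincount pass fills the whole table. A standalone sketch of that encoding (the answer's `i * m + j` works here because the number of items happens to equal the number of bins; this sketch uses the bin count explicitly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})

bins = np.arange(30, 121, 30)
j = bins.searchsorted(df.price)               # bin index per row, 0..4
i, items = pd.factorize(df.item_code.values)  # item index per row
n_bins = len(bins) + 1

# One flat code per (item, bin) pair; bincount counts every cell at once
flat = i * n_bins + j
table = np.bincount(flat, minlength=len(items) * n_bins).reshape(len(items), n_bins)
print(table)
```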

Answer by jezrael

Add the parameter labels to cut, then groupby and aggregate size:


cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
         .size()
         .unstack(fill_value=0))
print (df2)
price      0-30  31-60  61-90  91-120  121-150
item_code                                     
1             2      1      0       0        0
2             0      1      0       0        0
3             0      0      1       0        0
4             0      0      0       1        1
5             0      0      0       0        1

EDIT: If you want a general solution, add reindex:


print (df)
   item_code  price
0          1     15
1          1     30
2          1     60
3          2     50
4          3     90
5          4    110

cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=cats, fill_value=0))
print (df2)
price      0-30  31-60  61-90  91-120  121-150
item_code                                     
1             2      1      0       0        0
2             0      1      0       0        0
3             0      0      1       0        0
4             0      0      0       1        0

Answer by Zero

Use groupby and unstack:


In [3835]: bins = np.array(bins)  # for dynamic labels 

In [3836]: labels = list(map('{0[0]}-{0[1]}'.format, zip(1+bins[:-1], bins[1:])))  # list() needed on Python 3

In [3837]: (df.groupby(['item_code', pd.cut(df['price'], bins=bins, labels=labels)])
              .size().unstack(fill_value=0))
Out[3837]:
price      1-30  31-60  61-90  91-120  121-150
item_code
1             2      1      0       0        0
2             0      1      0       0        0
3             0      0      1       0        0
4             0      0      0       1        1
5             0      0      0       0        1

Answer by cs95

Using cut + pivot_table:


bins = [0, 30, 60, 90, 120,150]
labels = ['0-30', '31-60', '61-90', '91-120', '121-150']

df = df.assign(bins=pd.cut(df.price, bins, labels=labels))\
       .pivot_table('price', 'item_code', 'bins', 'count').fillna(0).astype(int)

print(df)
bins       0-30  31-60  61-90  91-120   121-150
item_code                                      
1             2      1      0       0         0
2             0      1      0       0         0
3             0      0      1       0         0
4             0      0      0       1         1
5             0      0      0       0         1
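As a footnote not covered in the answers above: pd.crosstab produces the same table in a single call (a sketch under the same bins and labels):

```python
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})

bins = [0, 30, 60, 90, 120, 150]
labels = ['0-30', '31-60', '61-90', '91-120', '121-150']

# Cross-tabulate item_code against the binned price
out = pd.crosstab(df.item_code, pd.cut(df.price, bins, labels=labels))
print(out)
```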