Creating bins of a column and getting the count in pandas
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/46803556/
Asked by Shubham R
I have a pandas DataFrame:
item_code price
1 15
1 30
1 60
2 50
3 90
4 110
5 130
4 150
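For readers who want to run the snippets below, the sample frame can be rebuilt like this (a sketch; column names taken from the question):

```python
import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
    'price': [15, 30, 60, 50, 90, 110, 130, 150],
})
print(df['price'].max())  # 150
```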
We can see that the max price is 150. I want to divide it into 5 bins of 30 each (as new columns) and get the count of occurrences of each item_code in each price bin.
final df =
item_code 0-30 31-60 61-90 91-120 121-150
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
i.e. item_code 1 falls twice in the price range 0-30, so its count under column 0-30 is 2; it falls once in the range 31-60, so that count is 1, and similarly for the other item codes.
I tried using pd.cut:
bins = [0, 30, 60, 90, 120, 150]
df2 = pd.cut(df['price'], bins)
But it did not work.
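A likely reason the bare pd.cut call "did not work": it only labels each row with its interval; it performs no counting. A minimal sketch of the missing aggregation step (sample data assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})
bins = [0, 30, 60, 90, 120, 150]

binned = pd.cut(df['price'], bins)   # one Interval per row, e.g. (0, 30]
# Counting still requires a group/aggregate step:
counts = df.groupby(['item_code', binned], observed=False).size()
```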
Answered by piRSquared
Setup
cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]
Option 1
Use pd.get_dummies and pd.DataFrame.join:
df[['item_code']].join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))
item_code 0-30 31-60 61-90 91-120 121-150
0 1 1 0 0 0 0
1 1 1 0 0 0 0
2 1 0 1 0 0 0
3 2 0 1 0 0 0
4 3 0 0 1 0 0
5 4 0 0 0 1 0
6 5 0 0 0 0 1
7 4 0 0 0 0 1
Option 2
Use numpy's searchsorted plus some string-array concatenation:
from numpy.core.defchararray import add
bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))
df[['item_code']].join(pd.get_dummies(cats[bins.searchsorted(df.price)]))
item_code 0-30 120-150 30-60 60-90 90-120
0 1 1 0 0 0 0
1 1 1 0 0 0 0
2 1 0 0 1 0 0
3 2 0 0 1 0 0
4 3 0 0 0 1 0
5 4 0 0 0 0 1
6 5 0 1 0 0 0
7 4 0 1 0 0 0
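The cats construction above is dense. A hedged breakdown of what it builds, using np.char.add (the non-deprecated spelling of numpy.core.defchararray.add):

```python
import numpy as np

bins = np.arange(30, 121, 30)          # array([ 30,  60,  90, 120])
b = bins.astype(str)
# Concatenate left edges, '-', and right edges into labels
cats = np.char.add(np.char.add(np.append('0', b), '-'), np.append(b, '150'))
print(cats)   # ['0-30' '30-60' '60-90' '90-120' '120-150']

# searchsorted maps each price to its label index; with the default
# side='left', a price equal to a boundary stays in the lower bin
print(bins.searchsorted([15, 30, 31, 120, 150]))  # [0 0 1 3 4]
```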
If you want to sum over like-valued item_codes, use groupby instead of join:
from numpy.core.defchararray import add
bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))
pd.get_dummies(cats[bins.searchsorted(df.price)]).groupby(df.item_code).sum().reset_index()
item_code 0-30 120-150 30-60 60-90 90-120
0 1 2 0 1 0 0
1 2 0 0 1 0 0
2 3 0 0 0 1 0
3 4 0 1 0 0 1
4 5 0 1 0 0 0
Option 3
A very fast approach using pd.factorize and np.bincount:
from numpy.core.defchararray import add
bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))
j, c = pd.factorize(bins.searchsorted(df.price))
i, r = pd.factorize(df.item_code.values)
n, m = c.size, r.size
pd.DataFrame(
    np.bincount(i * m + j, minlength=n * m).reshape(n, m),
    r, cats).rename_axis('item_code').reset_index()
item_code 0-30 30-60 60-90 90-120 120-150
0 1 2 1 0 0 0
1 2 0 1 0 0 0
2 3 0 0 1 0 0
3 4 0 0 0 1 1
4 5 0 0 0 0 1
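The bincount trick works by flattening each (item, bin) pair into a single integer i * m + j, counting with one np.bincount call, and reshaping back into a table. A sketch on the sample data, with the shape fixed explicitly to (number of items, number of bin labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})
bins = np.arange(30, 121, 30)
cats = ['0-30', '30-60', '60-90', '90-120', '120-150']

i, r = pd.factorize(df.item_code.values)  # i: row -> item slot, r: unique items
j = bins.searchsorted(df.price)           # j: row -> bin slot
n, m = r.size, len(cats)

# Flatten the pair (i, j) to i*m + j, count occurrences, reshape to n x m
table = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
out = pd.DataFrame(table, index=r, columns=cats).rename_axis('item_code')
```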
Answered by jezrael
Add the labels parameter to cut, then groupby and aggregate with size:
cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
         .size()
         .unstack(fill_value=0))
print (df2)
price 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
EDIT: For a general solution that always shows all bins, add reindex:
print (df)
item_code price
0 1 15
1 1 30
2 1 60
3 2 50
4 3 90
5 4 110
cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
         .size()
         .unstack(fill_value=0)
         .reindex(columns=cats, fill_value=0))
print (df2)
price 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
Answered by Zero
Use groupby and unstack:
In [3835]: bins = np.array(bins)  # for dynamic labels
In [3836]: labels = list(map('{0[0]}-{0[1]}'.format, zip(1 + bins[:-1], bins[1:])))  # list() so pd.cut gets a list, not a map iterator (Python 3)
In [3837]: (df.groupby(['item_code', pd.cut(df['price'], bins=bins, labels=labels)])
              .size().unstack(fill_value=0))
Out[3837]:
price 1-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
Answered by cs95
Using cut + pivot_table:
bins = [0, 30, 60, 90, 120, 150]
labels = ['0-30', '31-60', '61-90', '91-120', '121-150']
df = df.assign(bins=pd.cut(df.price, bins, labels=labels))\
       .pivot_table('price', 'item_code', 'bins', 'count').fillna(0).astype(int)
print(df)
bins 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
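Not from the answers above, but worth noting: pd.crosstab collapses the cut-then-count pattern into a single call (same bins and labels assumed; all five bins are observed in this data, so every column appears):

```python
import pandas as pd

df = pd.DataFrame({'item_code': [1, 1, 1, 2, 3, 4, 5, 4],
                   'price': [15, 30, 60, 50, 90, 110, 130, 150]})
bins = [0, 30, 60, 90, 120, 150]
labels = ['0-30', '31-60', '61-90', '91-120', '121-150']

# Rows: item_code, columns: price bin, values: counts
out = pd.crosstab(df['item_code'], pd.cut(df['price'], bins, labels=labels))
```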