pandas: grouping data by value ranges
原文地址: http://stackoverflow.com/questions/16349389/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Grouping data by value ranges
Asked by PrestonDocks
I have a csv file that shows parts on order. The columns include days late, qty and commodity.
I need to group the data by days late and commodity, with a sum of the qty. However, the days late need to be grouped into ranges:
>56
>35 and <= 56
>14 and <= 35
>0 and <=14
I was hoping I could use a dict somehow. Something like this:
{'Red':'>56','Amber':'>35 and <= 56','Yellow':'>14 and <= 35','White':'>0 and <=14'}
I am looking for a result like this:
        Red  Amber  Yellow  White
STRSUB   56     60      74     40
BOTDWG   20     67      87     34
I am new to pandas, so I don't know if this is possible at all. Could anyone provide some advice?
Thanks
Answered by unutbu
Suppose you start with this data:
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
#        ID  Days Late  quantity
# 0  STRSUB         60        56
# 1  BOTDWG         60        20
# 2  STRSUB         50        60
# 3  BOTDWG         50        67
# 4  STRSUB         20        74
# 5  BOTDWG         20        87
# 6  STRSUB         10        40
# 7  BOTDWG         10        34
Then you can find the status category using pd.cut. Note that by default (right=True), pd.cut splits the Series df['Days Late'] into half-open intervals that exclude the left edge and include the right edge: (-1, 14], (14, 35], (35, 56], (56, 365]:
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
#        ID  quantity  status
# 0  STRSUB        56     Red
# 1  BOTDWG        20     Red
# 2  STRSUB        60   Amber
# 3  BOTDWG        67   Amber
# 4  STRSUB        74  Yellow
# 5  BOTDWG        87  Yellow
# 6  STRSUB        40   White
# 7  BOTDWG        34   White
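To make the half-open behaviour concrete, here is a small sketch comparing the default right=True with right=False on values that sit exactly on the bin edges. The sample values and the names days, bins and names are made up for illustration; only pd.cut and its bins, labels and right parameters come from pandas.

```python
import pandas as pd

# Hypothetical values chosen to sit on or next to the bin edges
days = pd.Series([0, 14, 15, 35, 36, 56, 57])
bins = [-1, 14, 35, 56, 365]
names = ['White', 'Yellow', 'Amber', 'Red']

# right=True (the default): each bin is (low, high], so 14 stays in White
closed_right = pd.cut(days, bins=bins, labels=names)

# right=False: each bin is [low, high), so 14 moves up to Yellow
closed_left = pd.cut(days, bins=bins, labels=names, right=False)

print(pd.DataFrame({'days': days,
                    'right=True': closed_right,
                    'right=False': closed_left}))
```

The default matches the ranges in the question (">14 and <= 35" and so on). One difference to keep in mind: starting the first bin at -1 also places a value of 0 in White, whereas the question asked for ">0".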
Now use pivot to get the DataFrame in the desired form:
df = df.pivot(index='ID', columns='status', values='quantity')
and use reindex to obtain the desired order for the rows and columns:
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)
yields
        Red  Amber  Yellow  White
ID
STRSUB   56     60      74     40
BOTDWG   20     67      87     34
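The sample data above has exactly one row per ID and status, which is why pivot works. The question asks for a sum of the quantities, and pivot raises an error when an ID/status pair occurs more than once. A possible variation, sketched below on made-up data, passes the label names straight to pd.cut and lets pivot_table do the summing. The data and the names names and table are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: STRSUB has two rows that both fall in the Red band
df = pd.DataFrame({'ID': ['STRSUB', 'STRSUB', 'BOTDWG', 'BOTDWG', 'STRSUB'],
                   'Days Late': [60, 70, 50, 20, 10],
                   'quantity': [56, 4, 67, 87, 40]})

names = ['White', 'Yellow', 'Amber', 'Red']

# Pass the labels directly to pd.cut, then convert to plain strings
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365],
                      labels=names).astype(str)

# pivot_table sums duplicate ID/status pairs; missing pairs become 0
table = df.pivot_table(index='ID', columns='status', values='quantity',
                       aggfunc='sum', fill_value=0)

# Put the columns in the order from the question: Red first
table = table.reindex(columns=names[::-1], fill_value=0)
print(table)
```

Converting the categories to strings keeps the column index simple. If a band never occurs in the data, the final reindex still adds it as a column of zeros.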
Answered by mtadd
You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.
import numpy
import pandas
df = pandas.DataFrame({'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
                       'Days Late': numpy.random.randn(8)*20+30})
print(df)
    ID  Days Late
0  foo  30.746244
1  bar  16.234267
2  foo  14.771567
3  bar  33.211626
4  foo   3.497118
5  bar  52.482879
6  foo  11.695231
7  foo  47.350269
Create a helper function to transform the data of the Days Late column and add a column called Code.
def days_late_xform(dl):
    if dl > 56:
        return 'Red'
    elif 35 < dl <= 56:
        return 'Amber'
    elif 14 < dl <= 35:
        return 'Yellow'
    elif 0 < dl <= 14:
        return 'White'
    else:
        return 'None'
df["Code"] = df['Days Late'].map(days_late_xform)
print(df)
    ID  Days Late    Code
0  foo  30.746244  Yellow
1  bar  16.234267  Yellow
2  foo  14.771567  Yellow
3  bar  33.211626  Yellow
4  foo   3.497118   White
5  bar  52.482879   Amber
6  foo  11.695231   White
7  foo  47.350269   Amber
Lastly, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:
g = df.groupby(["ID","Code"]).size()
print(g)
ID   Code
bar  Amber     1
     Yellow    2
foo  Amber     1
     White     2
     Yellow    2
df2 = g.unstack()
print(df2)
Code  Amber  White  Yellow
ID
bar       1    NaN       2
foo       1      2       2
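The counts above come from size(). The question asked for a sum of quantities instead, which the same groupby can produce by selecting a quantity column and calling sum(). The sketch below assumes a qty column, which the random sample data in this answer does not have, so the data here is made up for illustration.

```python
import pandas as pd

def days_late_xform(dl):
    # Same banding rules as the helper above
    if dl > 56:
        return 'Red'
    elif 35 < dl <= 56:
        return 'Amber'
    elif 14 < dl <= 35:
        return 'Yellow'
    elif 0 < dl <= 14:
        return 'White'
    return 'None'

# Hypothetical sample with a qty column added
df = pd.DataFrame({'ID': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'Days Late': [30.7, 16.2, 14.8, 52.5, 3.5],
                   'qty': [5, 7, 11, 2, 3]})
df['Code'] = df['Days Late'].map(days_late_xform)

# Sum qty per (ID, Code); fill_value=0 replaces missing combinations
totals = df.groupby(['ID', 'Code'])['qty'].sum().unstack(fill_value=0)
print(totals)
```

Passing fill_value=0 to unstack avoids the NaN that appears in the count table when an ID has no rows in a given band.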
Answered by choldgraf
I know this is coming a bit late, but I had the same problem as you and wanted to share the function np.digitize. It sounds like exactly what you want.
import numpy as np
a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print(a)
[35 76 83 62 57 50 24 0 14 40 21 3 45 30 79 32 29 80 90 38 2 77 50 73 51
71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15 2 33 17 88 26 17]
print(np.digitize(a, grps))
[ 4 8 9 7 6 6 3 1 2 5 3 1 5 4 8 4 3 9 10 4 1 8 6 8 6
8 3 6 8 2 10 5 2 4 5 8 3 10 5 3 3 5 4 2 1 4 2 9 3 2]
print(np.digitize(a, grps2))
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
4 4 2 3 4 3 1 1 3 1 4 3 1]
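Applied to the ranges from the question, np.digitize might be used as in the following sketch. The sample values and the names days_late, edges and labels are made up for illustration. Passing right=True makes each bin include its upper edge, which matches the "<= 14", "<= 35" and "<= 56" bounds in the question.

```python
import numpy as np

# Hypothetical days-late values, including ones exactly on the edges
days_late = np.array([60, 56, 50, 35, 20, 14, 10, 1])
edges = [14, 35, 56]   # upper bounds of White, Yellow and Amber
labels = np.array(['White', 'Yellow', 'Amber', 'Red'])

# right=True: index i means edges[i-1] < x <= edges[i]
idx = np.digitize(days_late, edges, right=True)

# Use the bin indices to look up the colour names
print(labels[idx])
```

Note that with these edges any value of 14 or below, including zero and negative numbers, lands in White, so rows that are not late would need to be filtered out beforehand.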

