Original question: http://stackoverflow.com/questions/47159554/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Sum unique values by group with pandas
Asked by Stefano Frassetto
I have a dataframe like this:
import pandas as pd
data = {
    'YEAR': [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON': ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE': ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET': [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000]
}
df = pd.DataFrame(data)
'''
BUDGET CODE QUANTITY SEASON YEAR
0 500 A 1000 SPRING 2018
1 200 A 1000 SPRING 2018
2 300 A 1000 WINTER 2017
3 4000 B 2000 SPRING 2018
4 700 C 300 SPRING 2018
5 0 D 4000 SPRING 2018
'''
For each CODE I have its correct BUDGET, but unfortunately the QUANTITY column holds the total quantity for that code within each [YEAR, SEASON], so the same total is repeated on every row of that code.
I'm working on a function that aggregates my dataframe at different levels given as input. For example, I pass the function a list like
my_list = [
    ['YEAR'],
    ['YEAR', 'SEASON']
]
and the function will output a series of dataframes grouped by each sublist.
The problem is that I can aggregate CODE with pd.Series.nunique and I can sum the BUDGET column, but if I also sum the QUANTITY column I will obviously sum more than I want. What I need is some sort of sumUniques function that sums QUANTITY over unique (YEAR, SEASON, CODE) combinations.
def sumUniques(x):
    return '???'
print(df.groupby(['YEAR', 'SEASON']).agg({
'CODE': pd.Series.nunique,
'BUDGET': sum,
'QUANTITY' : sumUniques
}))
'''
CODE BUDGET QUANTITY
YEAR SEASON
2017 WINTER 1 300 ???
2018 SPRING 4 5400 ???
--> EXPECTED RESULT:
CODE BUDGET QUANTITY
YEAR SEASON
2017 WINTER 1 300 1000
2018 SPRING 4 5400 7300
'''
I wondered what the best way to achieve this would be and came across Zero's answer to "Pandas: sum values from column to unique values". I tried it out, but it seems that either I'm not applying it correctly or it does not apply to my problem, as it raises a KeyError:
print(df.groupby(['YEAR', 'SEASON']).agg({
'CODE': pd.Series.nunique,
'BUDGET': sum,
'QUANTITY' : lambda x: x.groupby('CODE').QUANTITY.first().sum()
}))
'''
KeyError: 'CODE'
'''
I wonder what's the best way to make this work, hope this will be of help for others too!
Answered by cs95
Based on your comments, a slightly more involved procedure is required to get your result. The solution for QUANTITY is very similar to the one in jezrael's answer with apply, so thanks to him.
df
BUDGET CODE QUANTITY SEASON YEAR
0 500 A 1000 SPRING 2018
1 200 A 1000 SPRING 2018
2 300 A 1000 WINTER 2017
3 4000 B 2000 SPRING 2018
4 700 C 300 SPRING 2018
5 0 D 4000 SPRING 2018
6 500 E 1000 SPRING 2018
f = {
'CODE' : 'nunique',
'BUDGET' : 'sum'
}
g = df.groupby(['YEAR', 'SEASON'])
v1 = g.agg(f)
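# apply passes each group as a DataFrame, so the CODE column is available
v2 = g.apply(lambda x: x.drop_duplicates('CODE', keep='first').QUANTITY.sum())
# axis must be given by keyword in recent pandas versions
df = pd.concat([v1, v2.to_frame('QUANTITY')], axis=1)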
df
CODE BUDGET QUANTITY
YEAR SEASON
2017 WINTER 1 300 1000
2018 SPRING 5 5900 8300
Answered by jezrael
Use groupby + apply with a custom function:
def f(x):
    a = x['CODE'].nunique()
    b = x['BUDGET'].sum()
    c = x.drop_duplicates('CODE').QUANTITY.sum()
    # Or:
    # c = x.groupby('CODE').QUANTITY.first().sum()
    return pd.Series([a, b, c], index=['CODE', 'BUDGET', 'QUANTITY'])
print(df.groupby(['YEAR', 'SEASON']).apply(f))
CODE BUDGET QUANTITY
YEAR SEASON
2017 WINTER 1 300 1000
2018 SPRING 4 5400 7300
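As a side note on why the lambda in the question raised KeyError: 'CODE' while the same expression works inside f: with a dict passed to agg, each function only receives its own column as a Series (here QUANTITY), so there is no CODE column to group by. With apply, f receives the whole group as a DataFrame. If you want to stay with agg, one possible workaround, sketched below, is to look up the matching CODE values through the row labels. The helper name sum_unique_quantity is made up for this example, and it assumes the outer df is still in scope and has a unique index.

```python
import pandas as pd

df = pd.DataFrame({
    'YEAR': [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON': ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE': ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET': [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000],
})

def sum_unique_quantity(q):
    # q is the QUANTITY Series of one group (hypothetical helper).
    # It keeps the original row labels, so the CODE values of the same
    # rows can be fetched from the outer df (assumes a unique index).
    codes = df.loc[q.index, 'CODE']
    # Take the first QUANTITY per CODE, then add them up.
    return q.groupby(codes).first().sum()

result = df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': 'nunique',
    'BUDGET': 'sum',
    'QUANTITY': sum_unique_quantity,
})
print(result)
```

Relying on a variable from the enclosing scope is fragile, so the apply version above is probably the cleaner choice; this is only meant to show where the KeyError comes from.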
Another solution:
df1 = df.groupby(['YEAR', 'SEASON']).agg({ 'CODE' : 'nunique', 'BUDGET' : 'sum'})
s = df.drop_duplicates(['YEAR', 'SEASON','CODE']).groupby(['YEAR', 'SEASON'])['QUANTITY'].sum()
df = df1.join(s.rename('QUANTITY'))
print (df)
BUDGET CODE QUANTITY
YEAR SEASON
2017 WINTER 300 1 1000
2018 SPRING 5900 5 8300
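Since the question asks for a function that takes a list of grouping levels, the second solution could be wrapped roughly as follows. This is only a sketch: the names aggregate_levels and unique_keys are invented here, and it assumes QUANTITY should always be counted once per (YEAR, SEASON, CODE), whichever level you group by.

```python
import pandas as pd

def aggregate_levels(df, levels, unique_keys=('YEAR', 'SEASON', 'CODE')):
    """Return one aggregated DataFrame per grouping level (hypothetical helper)."""
    # Keep one row per (YEAR, SEASON, CODE) so QUANTITY is counted once.
    deduped = df.drop_duplicates(subset=list(unique_keys))
    out = []
    for keys in levels:
        # CODE and BUDGET are aggregated on the full data.
        base = df.groupby(keys).agg({'CODE': 'nunique', 'BUDGET': 'sum'})
        # QUANTITY is summed on the de-duplicated rows only.
        qty = deduped.groupby(keys)['QUANTITY'].sum()
        out.append(base.join(qty))
    return out

df = pd.DataFrame({
    'YEAR': [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON': ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE': ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET': [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000],
})
my_list = [['YEAR'], ['YEAR', 'SEASON']]

by_year, by_year_season = aggregate_levels(df, my_list)
print(by_year)
print(by_year_season)
```

Note that when grouping by YEAR alone, a code that appears in two seasons of the same year contributes its quantity once per season. If that is not what you want, pass a different unique_keys.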
Timings:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 1000000
a = ['WINTER', 'AUTUMN', 'SUMMER', 'SPRING']
b = list('ABCDEFGHIJKL')
c = range(1990, 2018)
data = {
'YEAR' : np.random.choice(c, N),
'SEASON': np.random.choice(a, N),
'CODE': np.random.choice(b, N),
'BUDGET': np.random.randint(1000,size= N),
'QUANTITY': np.random.randint(1000,size= N)
}
df = pd.DataFrame(data)
print (df.head())
BUDGET CODE QUANTITY SEASON YEAR
0 92 L 95 SUMMER 2003
1 961 A 696 SPRING 1992
2 481 G 351 WINTER 1992
3 296 A 51 SPRING 1996
4 896 G 58 AUTUMN 2007
def cols(df):
    f = {
        'CODE' : 'nunique',
        'BUDGET' : 'sum'
    }
    g = df.groupby(['YEAR', 'SEASON'])
    v1 = g.agg(f)
    # apply passes each group as a DataFrame, so the CODE column is available
    v2 = g.apply(lambda x: x.drop_duplicates('CODE', keep='first').QUANTITY.sum())
    return pd.concat([v1, v2.to_frame('QUANTITY')], axis=1)
def jez2(df):
    df1 = df.groupby(['YEAR', 'SEASON']).agg({ 'CODE' : 'nunique', 'BUDGET' : 'sum'})
    s = df.drop_duplicates(['YEAR', 'SEASON','CODE']).groupby(['YEAR', 'SEASON'])['QUANTITY'].sum()
    return df1.join(s.rename('QUANTITY'))
def f(x):
    a = x['CODE'].nunique()
    b = x['BUDGET'].sum()
    c = x.groupby('CODE').QUANTITY.first().sum()
    return pd.Series([a,b,c], index=['CODE','BUDGET','QUANTITY'])
print (df.groupby(['YEAR', 'SEASON']).apply(f))
print (jez2(df))
print (cols(df))
In [46]: %timeit (df.groupby(['YEAR', 'SEASON']).apply(f))
1 loop, best of 3: 674 ms per loop
In [47]: %timeit (jez2(df))
1 loop, best of 3: 1.31 s per loop
In [48]: %timeit (cols(df))
1 loop, best of 3: 1.88 s per loop
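The numbers above come from IPython's %timeit and will differ with the pandas version and the machine. If you want to re-run the comparison as a plain script, something along these lines should work with the standard timeit module. It is a rough sketch: N is reduced to 100,000 to keep it quick, only the apply and jez2 variants are timed, and the wrapper name with_apply is made up for the example.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller N than in the answer so the script finishes quickly (assumption).
rng = np.random.RandomState(123)
N = 100_000
df = pd.DataFrame({
    'YEAR': rng.choice(range(1990, 2018), N),
    'SEASON': rng.choice(['WINTER', 'AUTUMN', 'SUMMER', 'SPRING'], N),
    'CODE': rng.choice(list('ABCDEFGHIJKL'), N),
    'BUDGET': rng.randint(1000, size=N),
    'QUANTITY': rng.randint(1000, size=N),
})

KEYS = ['YEAR', 'SEASON']

def f(x):
    a = x['CODE'].nunique()
    b = x['BUDGET'].sum()
    c = x.groupby('CODE').QUANTITY.first().sum()
    return pd.Series([a, b, c], index=['CODE', 'BUDGET', 'QUANTITY'])

def with_apply(df):
    # Selecting the value columns first keeps the grouping keys out of f.
    return df.groupby(KEYS)[['CODE', 'BUDGET', 'QUANTITY']].apply(f)

def jez2(df):
    df1 = df.groupby(KEYS).agg({'CODE': 'nunique', 'BUDGET': 'sum'})
    s = df.drop_duplicates(KEYS + ['CODE']).groupby(KEYS)['QUANTITY'].sum()
    return df1.join(s.rename('QUANTITY'))

timings = {}
for name, func in [('apply', with_apply), ('jez2', jez2)]:
    # One call per run, three runs, keep the best (like "best of 3").
    runs = timeit.repeat(lambda: func(df), number=1, repeat=3)
    timings[name] = min(runs)
    print(f'{name}: best of 3: {timings[name] * 1000:.1f} ms')
```

Both variants keep the first QUANTITY seen for each code in a group, so they should return the same QUANTITY column; checking that before comparing speed is a cheap sanity test.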