Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/47159554/

Time: 2020-09-14 04:44:35  Source: igfitidea

Sum unique values by group with pandas

python pandas dataframe group-by pandas-groupby

Asked by Stefano Frassetto

I got a dataframe like this:

import pandas as pd

data = {
    'YEAR' :    [2018,2018,2017,2018,2018,2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET':   [500,200,300,4000,700,0],
    'QUANTITY': [1000,1000,1000,2000,300,4000]
}

df = pd.DataFrame(data)

'''
   BUDGET CODE  QUANTITY  SEASON  YEAR
0     500    A      1000  SPRING  2018
1     200    A      1000  SPRING  2018
2     300    A      1000  WINTER  2017
3    4000    B      2000  SPRING  2018
4     700    C       300  SPRING  2018
5       0    D      4000  SPRING  2018
'''

For each CODE I got its correct BUDGET amount, but unfortunately in the QUANTITY column I got the total quantity for that code within every [Year, Season].

I'm working on a function to aggregate my dataframe at different levels given as input: for example I'm giving the function a list like

my_list = [
  ['YEAR'],
  ['YEAR', 'SEASON']
]

and the function will output a series of dataframes grouped by each sublist.

The problem here is that I can aggregate CODE with pd.Series.nunique and I can sum the BUDGET column, but if I also sum the QUANTITY column I will obviously sum up more than I want. What I need is some sort of sumUniques function that sums by unique YEAR, SEASON, CODE.

def sumUniques(x):
    return '???'

print(df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': pd.Series.nunique,
    'BUDGET': sum,
    'QUANTITY' : sumUniques
}))

'''
             CODE  BUDGET QUANTITY
YEAR SEASON                       
2017 WINTER     1     300      ???
2018 SPRING     4    5400      ???


--> EXPECTED RESULT:
             CODE  BUDGET QUANTITY
YEAR SEASON                       
2017 WINTER     1     300      1000
2018 SPRING     4    5400      7300

'''

I've been asking myself what the best way to achieve this could be, and I came upon Zero's answer to "Pandas: sum values from column to unique values". I've tried it out, but it seems that either I'm not applying it correctly or it does not apply to my problem, as it raises a KeyError:

print(df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': pd.Series.nunique,
    'BUDGET': sum,
    'QUANTITY' : lambda x: x.groupby('CODE').QUANTITY.first().sum()
}))

'''
KeyError: 'CODE'
'''
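
The KeyError most likely arises because, when a dict is passed to agg, each function receives only its own column as a Series, so the lambda sees the QUANTITY values of one group with no CODE column to group on. A minimal sketch of one possible workaround follows, assuming the original df is still in scope and has a unique index; the helper name sum_uniques is hypothetical and only used for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'YEAR':     [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET':   [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000],
})

def sum_uniques(x):
    # x is the QUANTITY Series of a single (YEAR, SEASON) group.
    # Assumption: df is the outer frame and its index is unique, so the
    # matching CODE values can be looked up through x.index.
    codes = df.loc[x.index, 'CODE']
    # Keep the first QUANTITY per CODE, then add those values up.
    return x.groupby(codes).first().sum()

result = df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': 'nunique',
    'BUDGET': 'sum',
    'QUANTITY': sum_uniques,
})
print(result)
```

Relying on a variable from the enclosing scope is fragile, so this is meant to illustrate the cause of the error rather than to recommend a pattern.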

I wonder what's the best way to make this work, hope this will be of help for others too!

Answered by cs95

Based on your comments, a slightly more involved procedure is required to get your result. The solution for QUANTITY is very similar to the one in jezrael's answer with apply, so thanks to him.

df

   BUDGET CODE  QUANTITY  SEASON  YEAR
0     500    A      1000  SPRING  2018
1     200    A      1000  SPRING  2018
2     300    A      1000  WINTER  2017
3    4000    B      2000  SPRING  2018
4     700    C       300  SPRING  2018
5       0    D      4000  SPRING  2018
6     500    E      1000  SPRING  2018

f = {
        'CODE' : 'nunique', 
        'BUDGET' : 'sum'
}

g = df.groupby(['YEAR', 'SEASON'])
v1 = g.agg(f)
# apply (rather than agg) passes each group's full DataFrame, so CODE is available
v2 = g.apply(lambda x: x.drop_duplicates('CODE', keep='first').QUANTITY.sum())

df = pd.concat([v1, v2.to_frame('QUANTITY')], axis=1)

df

             CODE  BUDGET  QUANTITY
YEAR SEASON                        
2017 WINTER     1     300      1000
2018 SPRING     5    5900      8300

Answered by jezrael

Use groupby + apply with a custom function:

def f(x):
   a = x['CODE'].nunique()
   b =  x['BUDGET'].sum()
   c = x.drop_duplicates('CODE').QUANTITY.sum()
   #Or:
   #c = x.groupby('CODE').QUANTITY.first().sum()
   return pd.Series([a,b,c], index=['CODE','BUDGET','QUANTITY'])


print (df.groupby(['YEAR', 'SEASON']).apply(f) )

             CODE  BUDGET  QUANTITY
YEAR SEASON                        
2017 WINTER     1     300      1000
2018 SPRING     4    5400      7300

Another solution:

df1 = df.groupby(['YEAR', 'SEASON']).agg({ 'CODE' : 'nunique', 'BUDGET' : 'sum'})
s = df.drop_duplicates(['YEAR', 'SEASON','CODE']).groupby(['YEAR', 'SEASON'])['QUANTITY'].sum()

df = df1.join(s.rename('QUANTITY'))
print (df)
             BUDGET  CODE  QUANTITY
YEAR SEASON                        
2017 WINTER     300     1      1000
2018 SPRING    5900     5      8300
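
The drop_duplicates approach above could be extended to the list-of-levels function described in the question. The following is only a sketch under stated assumptions: the helper name aggregate_levels and its unique_keys parameter are hypothetical, and QUANTITY is assumed to be unique per (YEAR, SEASON, CODE) as the question states.

```python
import pandas as pd

def aggregate_levels(df, levels, unique_keys=('YEAR', 'SEASON', 'CODE')):
    """Return a dict mapping tuple(level) to the aggregated DataFrame."""
    # Keep one row per unique key combination so QUANTITY is counted once.
    deduped = df.drop_duplicates(list(unique_keys))
    out = {}
    for level in levels:
        agg = df.groupby(level).agg({'CODE': 'nunique', 'BUDGET': 'sum'})
        qty = deduped.groupby(level)['QUANTITY'].sum()
        out[tuple(level)] = agg.join(qty)
    return out

df = pd.DataFrame({
    'YEAR':     [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET':   [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000],
})

results = aggregate_levels(df, [['YEAR'], ['YEAR', 'SEASON']])
for level, frame in results.items():
    print(level)
    print(frame)
```

Deduplicating once, outside the loop, keeps the unique-row definition identical across every grouping level.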

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
a = ['WINTER', 'AUTUMN', 'SUMMER', 'SPRING']
b = list('ABCDEFGHIJKL')
c = range(1990, 2018)

data = {
    'YEAR' :    np.random.choice(c, N),
    'SEASON':   np.random.choice(a, N),
    'CODE':     np.random.choice(b, N),
    'BUDGET':    np.random.randint(1000,size= N),
    'QUANTITY': np.random.randint(1000,size= N)
}

df = pd.DataFrame(data)
print (df.head())
   BUDGET CODE  QUANTITY  SEASON  YEAR
0      92    L        95  SUMMER  2003
1     961    A       696  SPRING  1992
2     481    G       351  WINTER  1992
3     296    A        51  SPRING  1996
4     896    G        58  AUTUMN  2007


def cols(df):
    f = {
        'CODE' : 'nunique', 
        'BUDGET' : 'sum'
        }

    g = df.groupby(['YEAR', 'SEASON'])
    v1 = g.agg(f)
    # apply (rather than agg) passes each group's full DataFrame, so CODE is available
    v2 = g.apply(lambda x: x.drop_duplicates('CODE', keep='first').QUANTITY.sum())

    return pd.concat([v1, v2.to_frame('QUANTITY')], axis=1)

def jez2(df):
    df1 = df.groupby(['YEAR', 'SEASON']).agg({ 'CODE' : 'nunique', 'BUDGET' : 'sum'})
    s = df.drop_duplicates(['YEAR', 'SEASON','CODE']).groupby(['YEAR', 'SEASON'])['QUANTITY'].sum()
    return df1.join(s.rename('QUANTITY'))


def f(x):
   a = x['CODE'].nunique()
   b =  x['BUDGET'].sum()
   c = x.groupby('CODE').QUANTITY.first().sum()
   return pd.Series([a,b,c], index=['CODE','BUDGET','QUANTITY'])


print (df.groupby(['YEAR', 'SEASON']).apply(f))

print (jez2(df))
print (cols(df))


In [46]: %timeit (df.groupby(['YEAR', 'SEASON']).apply(f))
1 loop, best of 3: 674 ms per loop

In [47]: %timeit (jez2(df))
1 loop, best of 3: 1.31 s per loop

In [48]: %timeit (cols(df))
1 loop, best of 3: 1.88 s per loop
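
Before comparing timings, it may be worth confirming that the approaches return the same QUANTITY totals. The sketch below does so on a smaller random frame; the size N = 10_000 and the variable names via_apply and via_dedup are assumptions made for illustration, not part of the original benchmark.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10_000
df = pd.DataFrame({
    'YEAR':     np.random.choice(np.arange(1990, 2018), N),
    'SEASON':   np.random.choice(['WINTER', 'AUTUMN', 'SUMMER', 'SPRING'], N),
    'CODE':     np.random.choice(list('ABCDEFGHIJKL'), N),
    'QUANTITY': np.random.randint(1000, size=N),
})
keys = ['YEAR', 'SEASON']

# Per-group approach: drop repeated codes inside each group, then sum.
via_apply = df.groupby(keys)[['CODE', 'QUANTITY']].apply(
    lambda x: x.drop_duplicates('CODE')['QUANTITY'].sum()
)

# Whole-frame approach: drop repeated (YEAR, SEASON, CODE) rows once, then sum.
via_dedup = df.drop_duplicates(keys + ['CODE']).groupby(keys)['QUANTITY'].sum()

same_index = via_apply.index.equals(via_dedup.index)
same_values = np.array_equal(via_apply.to_numpy(), via_dedup.to_numpy())
print(same_index, same_values)
```

Both variants keep the first occurrence of each code within a group, which is why they are expected to agree even when QUANTITY differs between repeated rows.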