pandas: sum unique values by group

Question

Asked by Stefano Frassetto

I got a dataframe like this:

import pandas as pd

data = {
    'YEAR' :    [2018,2018,2017,2018,2018,2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET':   [500,200,300,4000,700,0],
    'QUANTITY': [1000,1000,1000,2000,300,4000]
}

df = pd.DataFrame(data)

'''
   BUDGET CODE  QUANTITY  SEASON  YEAR
0     500    A      1000  SPRING  2018
1     200    A      1000  SPRING  2018
2     300    A      1000  WINTER  2017
3    4000    B      2000  SPRING  2018
4     700    C       300  SPRING  2018
5       0    D      4000  SPRING  2018
'''

For each CODE I have its correct BUDGET amount, but unfortunately the QUANTITY column holds the total quantity for that code within each [YEAR, SEASON], repeated on every row of that code.

I'm working on a function to aggregate my dataframe at different levels given as input: for example I'm giving the function a list like

my_list = [
  ['YEAR'],
  ['YEAR', 'SEASON']
]

and the function will output a series of dataframes grouped by each sublist.

The problem here is that I can aggregate CODE with pd.Series.nunique and I can sum the BUDGET column, but if I also sum the QUANTITY column I will obviously sum up more than I want. What I need is some sort of sumUniques function that sums QUANTITY once per unique YEAR, SEASON, CODE.

def sumUniques(x):
    return '???'

print(df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': pd.Series.nunique,
    'BUDGET': sum,
    'QUANTITY' : sumUniques
}))

'''
             CODE  BUDGET QUANTITY
YEAR SEASON                       
2017 WINTER     1     300      ???
2018 SPRING     4    5400      ???


--> EXPECTED RESULT:
             CODE  BUDGET QUANTITY
YEAR SEASON                       
2017 WINTER     1     300      1000
2018 SPRING     4    5400      7300

'''

I was wondering what the best way to achieve this would be, and I came upon Zero's answer to "Pandas: sum values from column to unique values". I've tried it out, but it seems that either I'm not applying it correctly or it does not apply to my problem, as it raises a KeyError:

print(df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': pd.Series.nunique,
    'BUDGET': sum,
    'QUANTITY' : lambda x: x.groupby('CODE').QUANTITY.first().sum()
}))

'''
KeyError: 'CODE'
'''
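
The KeyError comes from the way agg works with a dict: each function receives only the Series of its own column, so inside the lambda there is no CODE column to group by. One possible workaround, sketched below, is to look the codes up through the index of that Series. The helper name sum_uniques is made up for illustration, and the sketch assumes the original frame is reachable as df from the enclosing scope.

```python
import pandas as pd

df = pd.DataFrame({
    'YEAR':     [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET':   [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000],
})

def sum_uniques(quantity):
    # `quantity` is only the QUANTITY Series of one group. Its index still
    # points at the original rows, so the matching codes come from `df`.
    codes = df.loc[quantity.index, 'CODE']
    # Keep the first quantity seen for each code, then add them up.
    return quantity.groupby(codes).first().sum()

result = df.groupby(['YEAR', 'SEASON']).agg({
    'CODE': 'nunique',
    'BUDGET': 'sum',
    'QUANTITY': sum_uniques,
})
print(result)
```

Because the helper reads df as a global, it only fits cases where the grouped frame and df are the same object; the answers below avoid that coupling.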

I wonder what's the best way to make this work, hope this will be of help for others too!

Answer 1

Answered by cs95

Based on your comments, a slightly more involved procedure is required to get your result. The solution for QUANTITY is very similar to the one in jezrael's answer with apply, so thanks to him.

df

   BUDGET CODE  QUANTITY  SEASON  YEAR
0     500    A      1000  SPRING  2018
1     200    A      1000  SPRING  2018
2     300    A      1000  WINTER  2017
3    4000    B      2000  SPRING  2018
4     700    C       300  SPRING  2018
5       0    D      4000  SPRING  2018
6     500    E      1000  SPRING  2018

f = {
        'CODE' : 'nunique', 
        'BUDGET' : 'sum'
}

g = df.groupby(['YEAR', 'SEASON'])
v1 = g.agg(f)
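# apply() hands each group over as a DataFrame, so CODE is visible here
# (recent pandas versions call an agg() lambda column by column instead)
v2 = g[['CODE', 'QUANTITY']].apply(
    lambda x: x.drop_duplicates('CODE', keep='first').QUANTITY.sum())

df = pd.concat([v1, v2.to_frame('QUANTITY')], axis=1)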

df

             CODE  BUDGET  QUANTITY
YEAR SEASON                        
2017 WINTER     1     300      1000
2018 SPRING     5    5900      8300
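
A variation on the same idea, offered only as a sketch: instead of a per-group lambda, flag the first row of every (YEAR, SEASON, CODE) combination with duplicated, zero out QUANTITY on the repeated rows, and run a single named aggregation. It assumes a pandas version that supports named aggregation (0.25 or later); the variable names keys and first_seen are illustrative. The data is the seven-row frame shown in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'YEAR':     [2018, 2018, 2017, 2018, 2018, 2018, 2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D', 'E'],
    'BUDGET':   [500, 200, 300, 4000, 700, 0, 500],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000, 1000],
})

keys = ['YEAR', 'SEASON']
# True only for the first row of every (YEAR, SEASON, CODE) combination.
first_seen = ~df.duplicated(keys + ['CODE'])

result = (
    # Repeated rows contribute 0, so a plain sum counts each code once.
    df.assign(QUANTITY=df['QUANTITY'].where(first_seen, 0))
      .groupby(keys)
      .agg(CODE=('CODE', 'nunique'),
           BUDGET=('BUDGET', 'sum'),
           QUANTITY=('QUANTITY', 'sum'))
)
print(result)
```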

Answer 2

Answered by jezrael

Use groupby + apply with a custom function:

def f(x):
   a = x['CODE'].nunique()
   b =  x['BUDGET'].sum()
   c = x.drop_duplicates('CODE').QUANTITY.sum()
   #Or:
   #c = x.groupby('CODE').QUANTITY.first().sum()
   return pd.Series([a,b,c], index=['CODE','BUDGET','QUANTITY'])


print (df.groupby(['YEAR', 'SEASON']).apply(f) )

             CODE  BUDGET  QUANTITY
YEAR SEASON                        
2017 WINTER     1     300      1000
2018 SPRING     4    5400      7300

Another solution:

df1 = df.groupby(['YEAR', 'SEASON']).agg({ 'CODE' : 'nunique', 'BUDGET' : 'sum'})
s = df.drop_duplicates(['YEAR', 'SEASON','CODE']).groupby(['YEAR', 'SEASON'])['QUANTITY'].sum()

df = df1.join(s.rename('QUANTITY'))
print (df)
             BUDGET  CODE  QUANTITY
YEAR SEASON                        
2017 WINTER     300     1      1000
2018 SPRING    5900     5      8300
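
The drop_duplicates approach extends naturally to the list of grouping levels described in the question. The following is a rough sketch rather than a definitive implementation: the function name aggregate_levels and its unique_keys parameter are made up for illustration, and it assumes QUANTITY should always be counted once per (YEAR, SEASON, CODE), whichever level is being grouped.

```python
import pandas as pd

def aggregate_levels(df, levels, unique_keys=('YEAR', 'SEASON', 'CODE')):
    """Return a dict with one aggregated frame per grouping level."""
    # Keep QUANTITY once per (YEAR, SEASON, CODE), whatever the level.
    deduped = df.drop_duplicates(list(unique_keys))
    out = {}
    for level in levels:
        base = df.groupby(level).agg({'CODE': 'nunique', 'BUDGET': 'sum'})
        quantity = deduped.groupby(level)['QUANTITY'].sum()
        out[tuple(level)] = base.join(quantity)
    return out

df = pd.DataFrame({
    'YEAR':     [2018, 2018, 2017, 2018, 2018, 2018],
    'SEASON':   ['SPRING', 'SPRING', 'WINTER', 'SPRING', 'SPRING', 'SPRING'],
    'CODE':     ['A', 'A', 'A', 'B', 'C', 'D'],
    'BUDGET':   [500, 200, 300, 4000, 700, 0],
    'QUANTITY': [1000, 1000, 1000, 2000, 300, 4000],
})

frames = aggregate_levels(df, [['YEAR'], ['YEAR', 'SEASON']])
for level, frame in frames.items():
    print(level)
    print(frame)
```

Note that under this assumption a code present in two seasons of the same year is counted twice at the YEAR level, once per season. If it should be counted once per year instead, the duplicates would need to be dropped on the level's own columns plus CODE.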

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
a = ['WINTER', 'AUTUMN', 'SUMMER', 'SPRING']
b = list('ABCDEFGHIJKL')
c = range(1990, 2018)

data = {
    'YEAR' :    np.random.choice(c, N),
    'SEASON':   np.random.choice(a, N),
    'CODE':     np.random.choice(b, N),
    'BUDGET':    np.random.randint(1000,size= N),
    'QUANTITY': np.random.randint(1000,size= N)
}

df = pd.DataFrame(data)
print (df.head())
   BUDGET CODE  QUANTITY  SEASON  YEAR
0      92    L        95  SUMMER  2003
1     961    A       696  SPRING  1992
2     481    G       351  WINTER  1992
3     296    A        51  SPRING  1996
4     896    G        58  AUTUMN  2007

def cols(df):
    f = {
        'CODE' : 'nunique', 
        'BUDGET' : 'sum'
        }

    g = df.groupby(['YEAR', 'SEASON'])
    v1 = g.agg(f)
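    # apply() hands each group over as a DataFrame, so CODE is visible here
    v2 = g[['CODE', 'QUANTITY']].apply(
        lambda x: x.drop_duplicates('CODE', keep='first').QUANTITY.sum())

    return pd.concat([v1, v2.to_frame('QUANTITY')], axis=1)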

def jez2(df):
    df1 = df.groupby(['YEAR', 'SEASON']).agg({ 'CODE' : 'nunique', 'BUDGET' : 'sum'})
    s = df.drop_duplicates(['YEAR', 'SEASON','CODE']).groupby(['YEAR', 'SEASON'])['QUANTITY'].sum()
    return df1.join(s.rename('QUANTITY'))


def f(x):
   a = x['CODE'].nunique()
   b =  x['BUDGET'].sum()
   c = x.groupby('CODE').QUANTITY.first().sum()
   return pd.Series([a,b,c], index=['CODE','BUDGET','QUANTITY'])


print (df.groupby(['YEAR', 'SEASON']).apply(f))

print (jez2(df))
print (cols(df))

In [46]: %timeit (df.groupby(['YEAR', 'SEASON']).apply(f))
1 loop, best of 3: 674 ms per loop

In [47]: %timeit (jez2(df))
1 loop, best of 3: 1.31 s per loop

In [48]: %timeit (cols(df))
1 loop, best of 3: 1.88 s per loop
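
The %timeit lines above need IPython, and the numbers depend on the machine and the pandas version. As a hedged sketch, the comparison can be re-run with the standard timeit module while also checking that two of the approaches return the same values. The function names via_apply and via_join are illustrative, and the frame is deliberately much smaller than the 1,000,000 rows timed above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(123)
N = 10000  # far fewer rows than in the timings above
df = pd.DataFrame({
    'YEAR':     rng.choice(np.arange(1990, 2018), N),
    'SEASON':   rng.choice(['WINTER', 'AUTUMN', 'SUMMER', 'SPRING'], N),
    'CODE':     rng.choice(list('ABCDEFGHIJKL'), N),
    'BUDGET':   rng.randint(1000, size=N),
    'QUANTITY': rng.randint(1000, size=N),
})
keys = ['YEAR', 'SEASON']

def via_apply(df):
    # One custom function per group, as in the groupby + apply solution.
    def f(x):
        return pd.Series({
            'CODE': x['CODE'].nunique(),
            'BUDGET': x['BUDGET'].sum(),
            'QUANTITY': x.groupby('CODE')['QUANTITY'].first().sum(),
        })
    return df.groupby(keys)[['CODE', 'BUDGET', 'QUANTITY']].apply(f)

def via_join(df):
    # Built-in aggregations plus a de-duplicated sum, as in the join solution.
    df1 = df.groupby(keys).agg({'CODE': 'nunique', 'BUDGET': 'sum'})
    s = df.drop_duplicates(keys + ['CODE']).groupby(keys)['QUANTITY'].sum()
    return df1.join(s)

a, b = via_apply(df), via_join(df)
same = bool((a[b.columns].values == b.values).all())
print('results agree:', same)

for fn in (via_apply, via_join):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(fn.__name__, round(seconds, 4), 's per run')
```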
