Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), citing the original post: http://stackoverflow.com/questions/18066269/


Group by and aggregate the values of a list of dictionaries in Python

python dictionary itertools

Asked by Kyle Getrost

I'm trying to write a function, in an elegant way, that will group a list of dictionaries and aggregate (sum) the values of like-keys.

Example:

import datetime

my_dataset = [
    {
        'date': datetime.date(2013, 1, 1),
        'id': 99,
        'value1': 10,
        'value2': 10
    },
    {
        'date': datetime.date(2013, 1, 1),
        'id': 98,
        'value1': 10,
        'value2': 10
    },
    {
        'date': datetime.date(2013, 1, 2),
        'id': 99,
        'value1': 10,
        'value2': 10
    }
]

group_and_sum_dataset(my_dataset, 'date', ['value1', 'value2'])

"""
Should return:
[
    {
        'date': datetime.date(2013, 1, 1),
        'value1': 20,
        'value2': 20
    },
    {
        'date': datetime.date(2013, 1, 2),
        'value1': 10,
        'value2': 10
    }
]
"""

I've tried doing this using itertools for the groupby and summing each like-key value pair, but am missing something here. Here's what my function currently looks like:

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):
    keyfunc = operator.itemgetter(group_by_key)
    dataset.sort(key=keyfunc)
    new_dataset = []
    for key, index in itertools.groupby(dataset, keyfunc):
        d = {group_by_key: key}
        d.update({k:sum([item[k] for item in index]) for k in sum_value_keys})
        new_dataset.append(d)
    return new_dataset
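
One likely reason the function above misbehaves: each group yielded by itertools.groupby is a one-shot iterator, so the first key in sum_value_keys consumes it and every later key sums an empty sequence (giving 0). A minimal sketch of a fix, assuming it is acceptable to copy each group into a list first. The name group_and_sum_sorted is illustrative, not from the original post, and sorted() is used in place of dataset.sort() so the caller's list is left untouched:

```python
import datetime
import itertools
import operator


def group_and_sum_sorted(dataset, group_by_key, sum_value_keys):
    keyfunc = operator.itemgetter(group_by_key)
    new_dataset = []
    # groupby only merges adjacent items, so the input must be sorted by the key.
    for key, group in itertools.groupby(sorted(dataset, key=keyfunc), keyfunc):
        rows = list(group)  # materialize: the group iterator can be read only once
        d = {group_by_key: key}
        d.update({k: sum(row[k] for row in rows) for k in sum_value_keys})
        new_dataset.append(d)
    return new_dataset


my_dataset = [
    {'date': datetime.date(2013, 1, 1), 'id': 99, 'value1': 10, 'value2': 10},
    {'date': datetime.date(2013, 1, 1), 'id': 98, 'value1': 10, 'value2': 10},
    {'date': datetime.date(2013, 1, 2), 'id': 99, 'value1': 10, 'value2': 10},
]

result = group_and_sum_sorted(my_dataset, 'date', ['value1', 'value2'])
print(result)
```

This keeps the sort-then-group structure of the question, so it is still O(N log N); the dict-based answers avoid the sort.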

Accepted answer by Ashwini Chaudhary

You can use collections.Counter and collections.defaultdict.

Using a dict, this can be done in O(N) time, while sorting requires O(N log N) time.

from collections import defaultdict, Counter
def solve(dataset, group_by_key, sum_value_keys):
    dic = defaultdict(Counter)
    for item in dataset:
        key = item[group_by_key]
        vals = {k:item[k] for k in sum_value_keys}
        dic[key].update(vals)
    return dic
... 
>>> d = solve(my_dataset, 'date', ['value1', 'value2'])
>>> d
defaultdict(<class 'collections.Counter'>,
{
 datetime.date(2013, 1, 2): Counter({'value2': 10, 'value1': 10}),
 datetime.date(2013, 1, 1): Counter({'value2': 20, 'value1': 20})
})

The advantage of Counter is that it automatically sums the values of matching keys:

Example:

>>> c = Counter(**{'value1': 10, 'value2': 5})
>>> c.update({'value1': 7, 'value2': 3})
>>> c
Counter({'value1': 17, 'value2': 8})
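
For contrast, here is a small sketch of how the same update call behaves on a plain dict versus a Counter. The variable names plain and counted are only for illustration:

```python
from collections import Counter

plain = {'value1': 10, 'value2': 5}
plain.update({'value1': 7, 'value2': 3})    # dict.update replaces the old values

counted = Counter({'value1': 10, 'value2': 5})
counted.update({'value1': 7, 'value2': 3})  # Counter.update adds to the old values

print(plain)          # {'value1': 7, 'value2': 3}
print(dict(counted))  # {'value1': 17, 'value2': 8}
```

That additive update is what lets defaultdict(Counter) accumulate per-group totals in a single pass.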

Answer by Kyle Getrost

Thanks, I forgot about Counter. I still wanted to maintain the output format and sorting of my returned dataset, so here's what my final function looks like:

from collections import Counter, defaultdict

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):

    container = defaultdict(Counter)

    for item in dataset:
        key = item[group_by_key]
        values = {k:item[k] for k in sum_value_keys}
        container[key].update(values)

    # Rebuild a list of plain dicts. On Python 3, dict.items() returns a view,
    # so it is wrapped in list() before being concatenated to a list.
    new_dataset = [
        dict([(group_by_key, item[0])] + list(item[1].items()))
            for item in container.items()
    ]
    new_dataset.sort(key=lambda item: item[group_by_key])

    return new_dataset

Answer by pylang

Here's an approach using more_itertools, where you simply focus on how to construct the output.

Given

import datetime
import collections as ct

import more_itertools as mit


dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10}
]

Code

# Step 1: Build helper functions    
kfunc = lambda d: d["date"]
vfunc = lambda d: {k:v for k, v in d.items() if k.startswith("val")}
rfunc = lambda lst: sum((ct.Counter(d) for d in lst), ct.Counter())

# Step 2: Build a dict    
reduced = mit.map_reduce(dataset, keyfunc=kfunc, valuefunc=vfunc, reducefunc=rfunc)
reduced

Output

defaultdict(None,
            {datetime.date(2013, 1, 1): Counter({'value1': 20, 'value2': 20}),
             datetime.date(2013, 1, 2): Counter({'value1': 10, 'value2': 10})})

The items are grouped by date and pertinent values are reduced as Counters.

Details

Steps

  1. build helper functions to customize construction of keys, values and reduced values in the final defaultdict. Here we want to:
    • group by date (kfunc)
    • build dicts keeping the "value*" parameters (vfunc)
    • aggregate the dicts (rfunc) by converting them to collections.Counter objects and summing them. See an equivalent rfunc below (+).
  2. pass in the helper functions to more_itertools.map_reduce.
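
To make the two steps above concrete, here is a rough standard-library sketch of the group, transform, reduce pattern. It is not the more_itertools implementation; map_reduce_sketch is a hypothetical stand-in written only to show what the three helper functions are used for:

```python
import datetime
from collections import Counter, defaultdict


def map_reduce_sketch(iterable, keyfunc, valuefunc, reducefunc):
    """Group items by keyfunc, transform each with valuefunc, reduce each group."""
    groups = defaultdict(list)
    for item in iterable:
        groups[keyfunc(item)].append(valuefunc(item))
    return {key: reducefunc(values) for key, values in groups.items()}


dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10},
]

kfunc = lambda d: d["date"]
vfunc = lambda d: {k: v for k, v in d.items() if k.startswith("val")}
rfunc = lambda lst: sum((Counter(d) for d in lst), Counter())

reduced_sketch = map_reduce_sketch(dataset, kfunc, vfunc, rfunc)
print(reduced_sketch)
```

The sketch returns a plain dict rather than a defaultdict, which is enough to see how the pieces fit together.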

Simple Groupby

... say in that example you wanted to group by id and date?

No problem.

>>> kfunc2 = lambda d: (d["date"], d["id"])
>>> mit.map_reduce(dataset, keyfunc=kfunc2, valuefunc=vfunc, reducefunc=rfunc)
defaultdict(None,
            {(datetime.date(2013, 1, 1),
              99): Counter({'value1': 10, 'value2': 10}),
             (datetime.date(2013, 1, 1),
              98): Counter({'value1': 10, 'value2': 10}),
             (datetime.date(2013, 1, 2),
              99): Counter({'value1': 10, 'value2': 10})})

Customized Output

While the resulting data structure clearly and concisely presents the outcome, the OP's expected output can be rebuilt as a simple list of dicts:

>>> [{**dict(date=k), **v} for k, v in reduced.items()]
[{'date': datetime.date(2013, 1, 1), 'value1': 20, 'value2': 20},
 {'date': datetime.date(2013, 1, 2), 'value1': 10, 'value2': 10}]
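
The same flattening idea should carry over to the (date, id) grouping, where each key is a tuple. A minimal sketch using only the standard library; group_keys and sum_keys are hypothetical names, and the fourth row is made up so that one (date, id) pair actually repeats:

```python
import datetime
from collections import Counter, defaultdict

dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 5, "value2": 1},  # made-up row
]

group_keys = ("date", "id")       # columns to group by
sum_keys = ("value1", "value2")   # columns to sum

totals = defaultdict(Counter)
for row in dataset:
    key = tuple(row[k] for k in group_keys)
    totals[key].update({k: row[k] for k in sum_keys})

# Pair each tuple key back up with its column names, then merge in the sums.
flat = [{**dict(zip(group_keys, key)), **sums} for key, sums in totals.items()]
print(flat)
```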

For more on map_reduce, see the docs. Install via > pip install more_itertools.

+An equivalent reducing function:

import typing


def rfunc(lst: typing.List[dict]) -> ct.Counter:
    """Return reduced mappings from map-reduce values."""
    c = ct.Counter()
    for d in lst:
        c += ct.Counter(d)
    return c
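
One caveat worth knowing when reducing with Counter addition: + and += on Counters keep only strictly positive totals, while Counter.update keeps everything. The rows below are made up (the question's data is all positive) purely to show the difference:

```python
from collections import Counter

# Made-up rows with negative values, to show the difference.
rows = [
    {"value1": 10, "value2": -4},
    {"value1": -10, "value2": 1},
]

# Counter addition (as in rfunc) drops any total that is not strictly positive,
# including intermediate ones.
added = sum((Counter(r) for r in rows), Counter())

# Counter.update keeps every total, including zero and negative ones.
updated = Counter()
for r in rows:
    updated.update(r)

print(dict(added))    # {'value2': 1}
print(dict(updated))  # {'value1': 0, 'value2': -3}
```

If the values being summed can be zero or negative, an update-based reducer is probably the safer choice.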