Python group by

Question

Asked by Hellnar

Assume that I have a set of data pairs where index 0 is the value and index 1 is the type:

input = [
          ('11013331', 'KAT'), 
          ('9085267',  'NOT'), 
          ('5238761',  'ETH'), 
          ('5349618',  'ETH'), 
          ('11788544', 'NOT'), 
          ('962142',   'ETH'), 
          ('7795297',  'ETH'), 
          ('7341464',  'ETH'), 
          ('9843236',  'KAT'), 
          ('5594916',  'ETH'), 
          ('1550003',  'ETH')
        ]

I want to group them by their type (the string at index 1), like this:

result = [ 
           { 
             'type': 'KAT', 
             'items': ['11013331', '9843236'] 
           },
           {
             'type': 'NOT', 
             'items': ['9085267', '11788544'] 
           },
           {
             'type': 'ETH', 
             'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'] 
           }
         ]

How can I achieve this in an efficient way?

Answer 1

Accepted answer by kennytm

Do it in 2 steps. First, create a dictionary.

>>> input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')]
>>> from collections import defaultdict
>>> res = defaultdict(list)
>>> for v, k in input: res[k].append(v)
...

Then, convert that dictionary into the expected format.

>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}]

It is also possible with itertools.groupby but it requires the input to be sorted first.

>>> from itertools import groupby
>>> from operator import itemgetter
>>> sorted_input = sorted(input, key=itemgetter(1))
>>> groups = groupby(sorted_input, key=itemgetter(1))
>>> [{'type':k, 'items':[x[0] for x in v]} for k, v in groups]
[{'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}]

Note that on older Python versions neither of these preserves the original order of the keys (since Python 3.7, dicts keep insertion order, so the defaultdict version does). You need an OrderedDict if you need to keep the order on older versions.

>>> from collections import OrderedDict
>>> res = OrderedDict()
>>> for v, k in input:
...   if k in res: res[k].append(v)
...   else: res[k] = [v]
... 
>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}]
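
On Python 3.7 and later, plain dicts preserve insertion order, so the same order-preserving grouping can probably be done without OrderedDict. The following is only a minimal sketch under that assumption; the names pairs and grouped are illustrative and not from the answer (pairs is used to avoid shadowing the built-in input).

```python
# Sketch assuming Python 3.7+, where plain dicts keep insertion order.
pairs = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
         ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'),
         ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'),
         ('5594916', 'ETH'), ('1550003', 'ETH')]

grouped = {}
for value, kind in pairs:
    # setdefault creates the list the first time a type is seen
    grouped.setdefault(kind, []).append(value)

# Types come out in the order they first appeared in pairs.
result = [{'type': kind, 'items': items} for kind, items in grouped.items()]
```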

Answer 2

Answered by PaulMcG

Python's built-in itertools module actually has a groupby function, but it only groups adjacent elements, so the list must first be sorted so that elements sharing a key are contiguous:

from operator import itemgetter
sortkeyfn = itemgetter(1)
input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), 
 ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), 
 ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')] 
input.sort(key=sortkeyfn)

Now input looks like:

[('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'),
 ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH'), ('11013331', 'KAT'),
 ('9843236', 'KAT'), ('9085267', 'NOT'), ('11788544', 'NOT')]
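
To see why the sort matters: groupby only merges neighbouring elements with equal keys, so on unsorted data the same key can come back more than once. A small sketch using a shortened, hypothetical list named pairs:

```python
from itertools import groupby
from operator import itemgetter

pairs = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
         ('5349618', 'ETH'), ('11788544', 'NOT')]

# Without sorting, 'NOT' starts a new group each time it reappears.
unsorted_keys = [key for key, _ in groupby(pairs, key=itemgetter(1))]

# After sorting by the same key, each type appears exactly once.
sorted_keys = [key for key, _ in groupby(sorted(pairs, key=itemgetter(1)),
                                         key=itemgetter(1))]
```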

groupby returns an iterator of 2-tuples of the form (key, values_iterator). What we want is to turn this into a list of dicts where 'type' is the key and 'items' is a list of the 0th elements of the tuples returned by the values_iterator. Like this:

from itertools import groupby
result = []
for key,valuesiter in groupby(input, key=sortkeyfn):
    result.append(dict(type=key, items=list(v[0] for v in valuesiter)))

Now result contains the list of dicts you asked for in your question.

You might consider, though, just making a single dict out of this, keyed by type, and each value containing the list of values. In your current form, to find the values for a particular type, you'll have to iterate over the list to find the dict containing the matching 'type' key, and then get the 'items' element from it. If you use a single dict instead of a list of 1-item dicts, you can find the items for a particular type with a single keyed lookup into the master dict. Using groupby, this would look like:

result = {}
for key,valuesiter in groupby(input, key=sortkeyfn):
    result[key] = list(v[0] for v in valuesiter)

result now contains this dict (this is similar to the intermediate res defaultdict in @KennyTM's answer):

{'NOT': ['9085267', '11788544'], 
 'ETH': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 
 'KAT': ['11013331', '9843236']}
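
The difference in lookup effort described above can be sketched as follows. The names as_list and as_dict are illustrative only and hold the two shapes with a shortened data set:

```python
as_list = [{'type': 'KAT', 'items': ['11013331', '9843236']},
           {'type': 'NOT', 'items': ['9085267', '11788544']}]
as_dict = {'KAT': ['11013331', '9843236'],
           'NOT': ['9085267', '11788544']}

# List of dicts: scan until the dict with the matching 'type' is found.
from_list = next(d['items'] for d in as_list if d['type'] == 'NOT')

# Single dict: one keyed lookup.
from_dict = as_dict['NOT']
```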

If you want to reduce this to a one-liner, you can:

result = dict((key, list(v[0] for v in valuesiter))
              for key,valuesiter in groupby(input, key=sortkeyfn))

or using the newfangled dict-comprehension form:

result = {key:list(v[0] for v in valuesiter)
              for key,valuesiter in groupby(input, key=sortkeyfn)}

Answer 3

Answered by mmj

The following function will quickly (no sorting required) group tuples of any length by a key at any index:

# given a sequence of tuples like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# returns a dict grouping tuples by idx-th element - with idx=1 we have:
# if merge is True {'c':(3,6,88,4),     'a':(7,2,45,0)}
# if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by(seqs,idx=0,merge=True):
    d = dict()
    for seq in seqs:
        k = seq[idx]
        v = d.get(k,tuple()) + (seq[:idx]+seq[idx+1:] if merge else (seq[:idx]+seq[idx+1:],))
        d.update({k:v})
    return d

In the case of your question, the index of the key you want to group by is 1, therefore:

group_by(input,1)

gives

{'ETH': ('5238761','5349618','962142','7795297','7341464','5594916','1550003'),
 'KAT': ('11013331', '9843236'),
 'NOT': ('9085267', '11788544')}

which is not exactly the output you asked for, but may suit your needs just as well.
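
If the exact list-of-dicts format from the question is needed, a dict of this shape can be converted afterwards. This is only a sketch: grouped is a hypothetical name standing in for the value returned by group_by(input, 1).

```python
# Stand-in for the dict returned by group_by(input, 1) above.
grouped = {'ETH': ('5238761', '5349618', '962142', '7795297',
                   '7341464', '5594916', '1550003'),
           'KAT': ('11013331', '9843236'),
           'NOT': ('9085267', '11788544')}

# Turn each key/tuple pair into the {'type': ..., 'items': [...]} form.
result = [{'type': kind, 'items': list(items)}
          for kind, items in grouped.items()]
```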

Answer 4

Answered by akiva

I also like pandas' simple grouping. It's powerful, simple, and well suited to large data sets. Note that .groups maps each type to the row labels of its members, not to the values themselves:

import pandas
result = pandas.DataFrame(input).groupby(1).groups

Answer 5

Answered by akiva

result = []
# Make a set of your "types":
input_set = set(tpl[1] for tpl in input)
# input_set is now {'ETH', 'KAT', 'NOT'}
# Iterate over the input_set
for type_ in input_set:
    # a dict to gather things:
    D = {}
    # filter all tuples from your input with the same type as type_
    tuples = filter(lambda tpl: tpl[1] == type_, input)
    # write them in the D:
    D["type"] = type_
    D["items"] = [tpl[0] for tpl in tuples]
    # append D to results:
    result.append(D)

# result is now (the order of the groups may vary, since sets are unordered):
# [{'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}]

Answer 6

Answered by ronen

This answer is similar to @PaulMcG's answer but doesn't require sorting the input.

For those into functional programming, groupBy can be written in one line (not including imports!), and unlike itertools.groupby it doesn't require the input to be sorted:

from functools import reduce # import needed for python3; builtin in python2
from collections import defaultdict

def groupBy(key, seq):
 return reduce(lambda grp, val: grp[key(val)].append(val) or grp, seq, defaultdict(list))

(The reason for ... or grp in the lambda is that for this reduce() to work, the lambda needs to return its first argument; because list.append() always returns None, the or will always return grp. I.e. it's a hack to get around Python's restriction that a lambda can only evaluate a single expression.)
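
The trick can be checked in isolation. A minimal sketch, where the names returned and step are illustrative only:

```python
from collections import defaultdict

grp = defaultdict(list)

# list.append mutates the list in place and returns None ...
returned = grp['KAT'].append('11013331')

# ... so `None or grp` evaluates to grp, which is what reduce() needs back.
step = grp['KAT'].append('9843236') or grp
```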

This returns a dict whose keys are found by evaluating the given function and whose values are lists of the original items in the original order. For the OP's example, calling this as groupBy(lambda pair: pair[1], input) will return this dict:

{'KAT': [('11013331', 'KAT'), ('9843236', 'KAT')],
 'NOT': [('9085267', 'NOT'), ('11788544', 'NOT')],
 'ETH': [('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH')]}

And as per @PaulMcG's answer, the values can be reduced to just the first element of each pair by wrapping that in a dict comprehension. So this will do it:

result = {key: [pair[0] for pair in values]
          for key, values in groupBy(lambda pair: pair[1], input).items()}
