Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/1829340/
pythonic way to aggregate arrays (numpy or not)
Asked by Louis
I would like to write a nice function to aggregate data in an array (it's a NumPy record array, but that does not change anything).
You have an array of data that you want to aggregate along one axis: for example, an array of dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)], and you want the mean income per job.
I wrote this function; in the example it should be called as aggregate(data, 'job', 'income', mean):
def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
The problem is that I don't find it very nice; I would like to have it in one line. Do you have any ideas?
Thanks for your answer, Louis
PS: I would like to keep func in the call so that you can also ask for the median, minimum, ...
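For concreteness, here is the question's function exercised end to end on a small made-up record array (the sample data and the statistics import are illustrative additions, not from the question):

```python
import statistics
import numpy as np

def aggregate(data, key, value, func):
    # group data[value] by the distinct entries of data[key], then reduce with func
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key:
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key]

data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3), ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', 'u4')])

print(aggregate(data, 'job', 'income', np.mean))            # mean income per job
print(aggregate(data, 'job', 'income', statistics.median))  # any reducer works
```

Passing the reducer as an argument is what keeps the PS satisfied: mean, median, min, etc. all plug in unchanged.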
Answered by Hank Gay
Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict. Here's a version that uses defaultdict to get rid of the existence check:
import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
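Since the question notes that the data being a NumPy record array "does not change anything", this defaultdict version also works unchanged on a plain dict of equal-length lists; a small sketch with made-up data:

```python
import collections

def aggregate(data, key, value, func):
    # defaultdict(list) creates the empty list on first access,
    # so no explicit existence check is needed
    data_per_key = collections.defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key]

data = {'job': ['Digger', 'Planter', 'Digger'], 'income': [1, 2, 7]}
print(aggregate(data, 'job', 'income', min))   # [('Digger', 1), ('Planter', 2)]
```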
Answered by unutbu
Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:
import numpy as np
import matplotlib.mlab

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

result = matplotlib.mlab.rec_groupby(data, ('job',), (('income', np.mean, 'avg_income'),))
yields
('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)
matplotlib.mlab.rec_groupby returns a recarray:
print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]
You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
Answered by Michael
Here is a recipe which emulates the functionality of MATLAB's accumarray quite well. It uses Python's iterators quite nicely; nevertheless, performance-wise it sucks compared to the MATLAB implementation. As I had the same problem, I had written an implementation using scipy.weave. You can find it here: https://github.com/ml31415/accumarray
Answered by caiohamamura
The best flexibility and readability are obtained using pandas:
import numpy as np
import pandas

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

df = pandas.DataFrame(data)
result = df.groupby('job').mean()
yields:
         income
job
Digger      4.0
Planter     2.5
Waterer     3.0
Pandas DataFrame is a great class to work with, and you can get your results back in whatever form you need:
result.to_records()
result.to_dict()
result.to_csv()
And so on...
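One caveat worth flagging: on recent pandas versions (2.0 and later), groupby(...).mean() raises on non-numeric columns such as name, so the income column has to be selected explicitly. A sketch of the conversions mentioned above under that assumption:

```python
import numpy as np
import pandas as pd

data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3), ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', 'u4')])

# select the numeric column explicitly; modern pandas no longer
# silently drops non-numeric columns when averaging
result = pd.DataFrame(data).groupby('job')['income'].mean()

print(result.to_dict())   # plain dict, keyed by job
```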
Answered by caiohamamura
The best performance is achieved using ndimage.mean from scipy. This will be twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs:
import numpy as np
from scipy import ndimage

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

unique = np.unique(data['job'])
result = np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])
yields:
array([[['Digger', '4.0'],
['Planter', '2.5'],
['Waterer', '3.0']]],
dtype='|S32')
EDIT: with bincount (faster!)
This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100000 times, it will be around 8.5x faster:
unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
means = np.bincount(uniqueInd, data['income']) / uniqueCount
result = np.dstack([unique, means])
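Put together as a runnable snippet (the data array repeated from the earlier answers; note that np.unique's return_counts argument requires NumPy >= 1.9):

```python
import numpy as np

data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3), ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', 'u4')])

# uniqueInd maps each row to its (sorted) group index; bincount with
# weights then sums income per group in one vectorized pass
unique, uniqueInd, uniqueCount = np.unique(
    data['job'], return_inverse=True, return_counts=True)
means = np.bincount(uniqueInd, weights=data['income']) / uniqueCount

print(dict(zip(unique.tolist(), means.tolist())))
# {'Digger': 4.0, 'Planter': 2.5, 'Waterer': 3.0}
```

Packing the result into a dict here is just for display; np.dstack([unique, means]) as in the answer works too, though it casts everything to strings.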
Answered by Skylar Saveland
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method
should help to make it a little prettier, more pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.
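The dict idiom that handout points at is setdefault, which collapses the "create the list if the key is missing" dance from the question's function into a single call. A minimal sketch with made-up data:

```python
# setdefault returns the existing list for k, or inserts and returns a new one
data_per_key = {}
for k, v in [('Digger', 1), ('Planter', 2), ('Digger', 7)]:
    data_per_key.setdefault(k, []).append(v)

print(data_per_key)   # {'Digger': [1, 7], 'Planter': [2]}
```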