Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/1829340/
pythonic way to aggregate arrays (numpy or not)
Asked by Louis
I would like to write a nice function to aggregate data in an array (it's a NumPy record array, but that does not change anything).
You have an array of data that you want to aggregate along one axis: for example, an array of dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)], and you want the mean income per job.
I wrote this function; in the example it should be called as aggregate(data, 'job', 'income', mean):
def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
The problem is that I don't find it very nice; I would like to have it in one line. Do you have any ideas?
Thanks for your answer, Louis
PS: I would like to keep func in the call so that you can also ask for the median, minimum, ...
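For concreteness, here is the question's function exercised end to end on a small made-up record array (the sample data and the statistics import are illustrative additions, not from the question):

```python
import statistics
import numpy as np

def aggregate(data, key, value, func):
    # group data[value] by the distinct entries of data[key], then reduce with func
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key:
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key]

data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3), ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', 'u4')])

print(aggregate(data, 'job', 'income', np.mean))            # mean income per job
print(aggregate(data, 'job', 'income', statistics.median))  # any reducer works
```

Passing the reducer as an argument is what keeps the PS satisfied: mean, median, min, etc. all plug in unchanged.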
Answered by Hank Gay
Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict. Here's a version that uses defaultdict to get rid of the existence check:
import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
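Since the question notes that the data being a NumPy record array "does not change anything", this defaultdict version also works unchanged on a plain dict of equal-length lists; a small sketch with made-up data:

```python
import collections

def aggregate(data, key, value, func):
    # defaultdict(list) creates the empty list on first access,
    # so no explicit existence check is needed
    data_per_key = collections.defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key]

data = {'job': ['Digger', 'Planter', 'Digger'], 'income': [1, 2, 7]}
print(aggregate(data, 'job', 'income', min))   # [('Digger', 1), ('Planter', 2)]
```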
Answered by unutbu
Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:
import numpy as np
import matplotlib.mlab

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

result = matplotlib.mlab.rec_groupby(data, ('job',), (('income', np.mean, 'avg_income'),))
yields
('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)
matplotlib.mlab.rec_groupby returns a recarray:
print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]
You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
Answered by Michael
Here is a recipe which emulates the functionality of MATLAB's accumarray quite well. It uses Python's iterators quite nicely; nevertheless, performance-wise it sucks compared to the MATLAB implementation. As I had the same problem, I had written an implementation using scipy.weave. You can find it here: https://github.com/ml31415/accumarray
Answered by caiohamamura
The best flexibility and readability are obtained using pandas:
import numpy as np
import pandas

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

df = pandas.DataFrame(data)
result = df.groupby('job').mean()
yields:
         income
job
Digger      4.0
Planter     2.5
Waterer     3.0
Pandas DataFrame is a great class to work with, and you can get your results back in whatever form you need:
result.to_records()
result.to_dict()
result.to_csv()
And so on...
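One caveat worth flagging: on recent pandas versions (2.0 and later), groupby(...).mean() raises on non-numeric columns such as name, so the income column has to be selected explicitly. A sketch of the conversions mentioned above under that assumption:

```python
import numpy as np
import pandas as pd

data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3), ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', 'u4')])

# select the numeric column explicitly; modern pandas no longer
# silently drops non-numeric columns when averaging
result = pd.DataFrame(data).groupby('job')['income'].mean()

print(result.to_dict())   # plain dict, keyed by job
```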
Answered by caiohamamura
The best performance is achieved using ndimage.mean from scipy. This will be twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs:
import numpy as np
from scipy import ndimage

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

unique = np.unique(data['job'])
result = np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])
yields:
array([[['Digger', '4.0'],
['Planter', '2.5'],
['Waterer', '3.0']]],
dtype='|S32')
EDIT: with bincount (faster!)
This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100000 times, it will be around 8.5x faster:
unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
means = np.bincount(uniqueInd, data['income']) / uniqueCount
result = np.dstack([unique, means])
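Put together as a runnable snippet (the data array repeated from the earlier answers; note that np.unique's return_counts argument requires NumPy >= 1.9):

```python
import numpy as np

data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3), ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', 'u4')])

# uniqueInd maps each row to its (sorted) group index; bincount with
# weights then sums income per group in one vectorized pass
unique, uniqueInd, uniqueCount = np.unique(
    data['job'], return_inverse=True, return_counts=True)
means = np.bincount(uniqueInd, weights=data['income']) / uniqueCount

print(dict(zip(unique.tolist(), means.tolist())))
# {'Digger': 4.0, 'Planter': 2.5, 'Waterer': 3.0}
```

Packing the result into a dict here is just for display; np.dstack([unique, means]) as in the answer works too, though it casts everything to strings.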
Answered by Skylar Saveland
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method
should help to make it a little prettier, more pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.
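The dict idiom that handout points at is setdefault, which collapses the "create the list if the key is missing" dance from the question's function into a single call. A minimal sketch with made-up data:

```python
# setdefault returns the existing list for k, or inserts and returns a new one
data_per_key = {}
for k, v in [('Digger', 1), ('Planter', 2), ('Digger', 7)]:
    data_per_key.setdefault(k, []).append(v)

print(data_per_key)   # {'Digger': [1, 7], 'Planter': [2]}
```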