Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/601477/
Best way to create a NumPy array from a dictionary?
Asked by Parand
I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
[10,20,30,?,?],
[50,60,?,?,?],
[100,200,300,400,500]
]
I'm going to do some basic statistics on each row, e.g.:
deviations = numpy.std(data, axis=1)
Questions:
- What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large: a couple of million keys, each with ~20 items.
- The number of values for each 'row' is different. If I understand correctly, numpy wants a uniform size, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the Python techniques are reasonable (e.g. looping over a few million items is fast), they are constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
Accepted answer by Mapad
You don't need to create numpy arrays to call numpy.std(). You can call numpy.std() in a loop over all the values of your dictionary. Each list will be converted to a numpy array on the fly to compute the standard deviation.
The downside of this method is that the main loop will be in Python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory, as you won't have to store 0 values to pad the variable-size arrays.
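A minimal sketch of this straightforward loop, using the dictionary from the question (Python 2, to match the answers below):

import numpy

d = {1: [10, 20, 30], 2: [50, 60], 3: [100, 200, 300, 400, 500]}

deviations = {}
for key, row in d.iteritems():
    # each list is converted to a numpy array on the fly inside std()
    deviations[key] = numpy.std(row)

# deviations == {1: ~8.16, 2: 5.0, 3: ~141.42}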
- If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once.
- If you find that this is still too slow, try to use Psyco to optimize the python loop.
- If this is still too slow, try using Cython together with the numpy module. This tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with the sum function).
- An alternative to Cython would be to use SWIG with numpy.i.
- If you want to use only numpy and have everything computed at C level, try grouping all the records of the same size together in different arrays and call numpy.std() on each of them. It should look like the following example.
Example with O(N) complexity:
import numpy

# separate the rows by length, so that each group forms a rectangular array
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)

list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis=1)
std_2 = numpy.std(list_size_2, axis=1)
Answered by Maleev
While there are already some pretty reasonable ideas present here, I believe the following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc.).
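A quick numerical illustration of this point, using the first row of the question's data:

import numpy

row = [10, 20, 30]
padded = [10, 20, 30, 0, 0]   # zero-filled to the length of the longest row

numpy.std(row)      # ~8.16
numpy.std(padded)   # ~11.66 - the padding distorts the statistic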
Evidently that's why Mapad proposed the nice trick of grouping same-sized records. The problem with it (assuming no a priori data on record lengths is at hand) is that it involves even more computations than the straightforward solution:
- at least O(N*logN) 'len' calls and comparisons for sorting with an efficient algorithm
- O(N) checks on a second pass through the list to obtain the groups (their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification: don't generate the whole list, but iterate through the dictionary, converting each row into a numpy.array and performing the required computations. Like this:
import numpy

for row in data.itervalues():
    np_row = numpy.array(row)
    this_row_std = numpy.std(np_row)
    # compute any other statistical descriptors needed, then save them to some list
In any case a few million loops in python won't take as long as one might expect. Besides, this doesn't look like a routine computation, so who cares if it takes an extra second/minute when it is run once in a while, or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std

def get_statistical_descriptors(a):
    # compute along the last axis, so each row gets its own statistics
    ax = len(a.shape) - 1
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]

def process_long_list_stats(data):
    # group the keys of rows that have the same length
    groups = {}
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]
    # process each group as a rectangular array and map the results back to keys
    results = []
    for gr_keys in groups.itervalues():
        gr_rows = array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))
    return dict(results)
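A hypothetical usage sketch with the question's dictionary (again Python 2, matching the iteritems/itervalues calls above):

data = {1: [10, 20, 30], 2: [50, 60], 3: [100, 200, 300, 400, 500]}
stats = process_long_list_stats(data)
# each key now maps to its (mean, std) tuple, e.g.
# stats[2] == (55.0, 5.0)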
Answered by Davoud Taghawi-Nejad
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np

dd = {'a': 1, 'b': 2, 'c': 3}
# build a structured dtype with one float field per dictionary key
dtype = [(key, float) for key in dd.keys()]
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])