Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16149803/


Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Tags: python, arrays, numpy, scipy, bigdata

Asked by Ekgren

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as pairwise distance between all of the points and do clustering on all of the points. I have implemented working algorithms that perform what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do; creating the matrix for pairwise distances on 200k+ data takes a lot of memory.

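To make the scale concrete, a back-of-the-envelope estimate, assuming a dense float64 n x n distance matrix:

n = 200000
bytes_needed = n * n * 8           # 8 bytes per float64 entry of the n x n distance matrix
print(bytes_needed / 1024**3)      # roughly 298 GiB -- far beyond the RAM of an ordinary machine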

Here comes the catch: I would really like to do this on crappy computers with low amounts of ram.


Is there a feasible way for me to make this work within the constraints of low RAM? That it will take a much longer time is really not a problem, as long as the time requirements don't go to infinity!


I would like to be able to put my algorithms to work and then come back an hour or five later and not find them stuck because they ran out of RAM! I would like to implement this in python, and be able to use the numpy, scipy, sklearn and networkx libraries. I would like to be able to calculate the pairwise distances between all my points, etc.

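Not part of the original thread, but worth noting for readers with this exact problem: newer scikit-learn versions (0.20+) provide sklearn.metrics.pairwise_distances_chunked, which yields the distance matrix in row blocks sized to a memory budget instead of materialising it all at once. A minimal sketch, where X is assumed to be your (200000, 1000) data array:

from sklearn.metrics import pairwise_distances_chunked

for chunk in pairwise_distances_chunked(X, working_memory=64):  # cap each block at ~64 MiB
    # chunk is a (rows_in_block, 200000) slice of the full distance matrix;
    # reduce it or write it to disk here instead of keeping everything in RAM
    pass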

Is this feasible? And how would I go about it, what can I start to read up on?


Best regards // Mesmer


Accepted answer by Saullo G. P. Castro

Using numpy.memmap you create arrays directly mapped into a file:


import numpy
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a 762MB file created in your working directory    
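(That figure comes from 200000 × 1000 elements × 4 bytes per float32, which is about 763 MiB.)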

You can treat it as a conventional array: a += 1000.


It is even possible to assign more arrays to the same file, controlling them from different sources if needed. But I've experienced some tricky things here. To open the full array you have to "close" the previous one first, using del:


del a    
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

But opening only part of the array makes it possible to control both views simultaneously:


b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print a[1,5]
#123456.0

Great! a was changed together with b. And the changes are already written on disk.

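One caveat to add (not from the original answer): the operating system may buffer those writes for a while, and numpy.memmap provides flush() if you want to force pending changes out to the file:

b.flush()   # push any buffered changes for this view through to test.mymemmap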

The other important thing worth mentioning is the offset. Suppose you want to take not the first 2 lines in b, but lines 150000 and 150001.


b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32/8)
b[1,2] = 999999.
print a[150001,2]
#999999.0

Now you can access and update any part of the array with simultaneous operations. Note the byte size that goes into the offset calculation: for a 'float64' this example would be 150000*1000*64/8.
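To avoid doing that byte arithmetic by hand, you can let numpy supply the element size; a small convenience sketch (not part of the original answer):

import numpy
row_offset = 150000 * 1000 * numpy.dtype('float32').itemsize   # same value as 150000*1000*32/8
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2, 1000),
                 offset=row_offset)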

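Putting this together for the pairwise-distance problem from the question, one possible sketch (not part of the original answer; the file names, block size and float32 dtype are only illustrative) is to fill a memmap-backed distance matrix block by block, so only small blocks ever sit in RAM:

import numpy
from scipy.spatial.distance import cdist

n, d, block = 200000, 1000, 1000           # block = rows/columns handled per step
data = numpy.memmap('data.mymemmap', dtype='float32', mode='r', shape=(n, d))
# the full n x n float32 result is about 160 GB on disk -- make sure you have the space
dist = numpy.memmap('dist.mymemmap', dtype='float32', mode='w+', shape=(n, n))

for i in range(0, n, block):
    for j in range(0, n, block):
        # only two (block, d) data slices and one (block, block) result are in RAM here
        dist[i:i + block, j:j + block] = cdist(data[i:i + block], data[j:j + block])
dist.flush()

This is much slower than an in-memory computation, but it never needs more than a few tens of megabytes of RAM at a time, which matches the "longer time is fine" constraint in the question.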

Other references:
