Python: Numpy histogram of large arrays

Disclaimer: this page is an English/Chinese translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2464871/

Date: 2020-11-04 00:43:24  Source: igfitidea

Numpy histogram of large arrays

Tags: python, numpy, scipy, histogram

Asked by garageàtrois

I have a bunch of csv datasets, about 10 GB in size each. I'd like to generate histograms from their columns. But it seems like the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.


Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.


Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.


Accepted answer by mtrw

As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:


import numpy as np

datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)  # numbins edges -> numbins-1 bins
myhist = np.zeros(numbins - 1, dtype='int32')
for i in range(100):
    d = np.random.randn(1000, 1)                 # stand-in for one chunk of your file
    htemp, _ = np.histogram(d, mybins)           # histogram the chunk against fixed bins
    myhist += htemp                              # accumulate into the running totals

I'm guessing performance will be an issue with such large files, and the overhead of calling histogram on each line might be too slow. @doug's suggestion of a generator seems like a good way to address that problem.

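To show what the chunked variant of this might look like, here is a sketch that streams a CSV in blocks of lines rather than one line at a time, amortizing the per-call cost of np.histogram. The file name, chunk size, and bin range are assumptions; the demo writes its own small CSV so the sketch is self-contained:

```python
import itertools
import numpy as np

# Demo data: write a small CSV so the sketch runs on its own.
rng = np.random.default_rng(0)
with open("mydata.csv", "w") as f:
    for x in rng.normal(size=5000):
        f.write(f"{x:.6f},other,columns\n")

# Fixed bin edges chosen up front; adjust to your data's range.
edges = np.linspace(-5, 5, 21)              # 21 edges -> 20 bins
hist = np.zeros(len(edges) - 1, dtype=np.int64)

# Stream the file in modest chunks so at most chunk_size values are
# in memory at once, while keeping np.histogram's overhead low.
chunk_size = 1000
with open("mydata.csv") as f:
    while True:
        chunk = [float(line.split(",")[0])  # histogram the first column
                 for line in itertools.islice(f, chunk_size)]
        if not chunk:
            break
        counts, _ = np.histogram(chunk, edges)
        hist += counts

print(hist.sum())  # at most 5000; values outside [-5, 5] are dropped
```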

Answered by doug

Here's a way to bin your values directly:


import numpy as NP

column_of_values = NP.random.randint(10, 99, 10)

# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])

binned_values = NP.digitize(column_of_values, bins)

'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.


'bincount' will give you (obviously) the bin counts:


NP.bincount(binned_values)
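To make digitize's indexing convention concrete, here is a small self-contained example (the values are made up):

```python
import numpy as np

values = np.array([3, 18, 42, 60, 80, 95])
bins = np.array([0.0, 20.0, 50.0, 75.0])

idx = np.digitize(values, bins)
# idx == [1, 1, 2, 3, 4, 4]: 3 and 18 fall in [0, 20), 42 in [20, 50),
# 60 in [50, 75); 80 and 95 are past the last edge (index == len(bins)).
print(idx)

# minlength guarantees a slot for every possible index, even empty ones;
# counts[0] would hold values below the first edge (0.0).
counts = np.bincount(idx, minlength=len(bins) + 1)
print(counts)  # [0 2 1 1 2]
```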

Given the size of your data set, using Numpy's 'loadtxt' to build a generator might be useful:


data_array = NP.loadtxt("data_file.txt", delimiter=",")

def fnx():
    for i in range(data_array.shape[1]):
        yield data_array[:, i]        # yield one column at a time

Answered by Dan H

Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)


I'm posting a second answer to the same question since this approach is very different, and addresses different issues.


What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up in to quartiles or deciles.


For small datasets, the answer is easy: load the data in to an array, then sort, then read off the values at any given percentile by jumping to the index that percentage of the way through the array.

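As a quick sketch of that small-dataset approach, using random data as a stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=10_000)

# Sort once, then a percentile is just an index into the sorted array.
data.sort()
p90 = data[int(0.90 * (len(data) - 1))]

# NumPy packages the same idea (with interpolation) as np.percentile.
print(p90, np.percentile(data, 90))
```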

For large datasets where the memory size to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".


I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.


I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)

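As an illustration of the idea (not the author's actual code), here is a minimal Fenwick tree over integer bins: add() counts one sample into its bin, prefix() gives cumulative counts, and kth() binary-searches the implicit prefix sums for the bin holding the k-th smallest sample. The bin count and demo data are assumptions:

```python
import random

class Fenwick:
    """Counts over integer bins 0..n-1, with O(log n) update and query."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)       # 1-based internally

    def add(self, i, delta=1):          # count one sample in bin i (0-based)
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):                # total count in bins 0..i
        i += 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def kth(self, k):                   # smallest bin whose prefix count >= k
        pos, rem = 0, k
        for step in (1 << j for j in range(self.n.bit_length(), -1, -1)):
            if pos + step <= self.n and self.tree[pos + step] < rem:
                pos += step
                rem -= self.tree[pos]
        return pos                      # 0-based bin index

# Example: stream integer samples in 0..999 and find the median bin
# without ever storing the samples themselves.
random.seed(0)
ft = Fenwick(1000)
n = 100_001
for _ in range(n):
    ft.add(random.randrange(1000))
median_bin = ft.kth((n + 1) // 2)
print(median_bin)
```

With real float data you would first shift and scale values into this integer bin range, as the answer notes.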

More on Fenwick Trees:


Answered by Dan H

Binning with Generators (large dataset; fixed-width bins; float data)


If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write, and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:


from math import floor

binwidth = 20
counts = dict()
filename = "mydata.csv"
for val in next_value_from_file(filename):
    # Round down to the left edge of the bin the value falls in.
    binname = int(floor(val / binwidth) * binwidth)
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1
print(counts)

The values can be floats, but this is assuming you use an integer binwidth; you may need to tweak this a bit if you want to use a binwidth of some float value.

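One possible tweak for a float binwidth is to keep the dictionary keys as exact integer bin indices, so floating-point rounding cannot split one bin across two nearly-equal float keys; a sketch (the sample values are made up):

```python
from math import floor

binwidth = 0.25                      # a float bin width
counts = {}

for val in [0.1, 0.2, 0.26, 0.9, -0.3]:
    k = floor(val / binwidth)        # integer bin index: exact as a dict key
    counts[k] = counts.get(k, 0) + 1

# Recover each bin's left edge only for display: k * binwidth.
print({k * binwidth: c for k, c in sorted(counts.items())})
```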

As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator, or an object with an __iter__() method, to do this efficiently. The pseudocode for such a generator would be this:


def next_value_from_file(filename):
    with open(filename) as f:
        for line in f:
            # parse out from the line the value or values you need
            val = parse_the_value_from_the_line(line)
            yield val

If a given line has multiple values, then make parse_the_value_from_the_line() either return a list or itself be a generator, and use this pseudocode:


def next_value_from_file(filename):
    with open(filename) as f:
        for line in f:
            for val in parse_the_values_from_the_line(line):
                yield val