C++: Searching for a fast/efficient histogram algorithm (with pre-specified bins)
Original question: http://stackoverflow.com/questions/4515874/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Searching for a fast/efficient histogram algorithm (with pre-specified bins)
Asked by ggkmath
I don't do much coding outside of Matlab, but I have a need to export my Matlab code to another language, most likely C. My Matlab code includes a histogram function, histc(), that places my input data (which is double-precision, not integer) into a specified array of bins, to form a histogram.
I'm sure I can piece together a couple nested loops to generate a histogram function, but I need this function to be fast and memory-light, as it will be accessed repeatedly and often.
To avoid re-inventing the wheel, does anyone know if the C language has any existing histogram function(s) available for use, or whether people needing such a thing generally create it themselves?
Anyone know an efficient algorithm for creating a histogram? Pseudo-code is fine.
Thanks in advance.
Accepted answer by Kyle Lutz
GSL (GNU Scientific Library) contains a histogram implementation.
Here is the documentation: http://www.gnu.org/software/gsl/manual/html_node/Histograms.html.
And here is an example use: http://www.gnu.org/software/gsl/manual/html_node/Example-programs-for-histograms.html.
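If pulling in a library is not an option, binning against pre-specified (possibly unequal) edges can be sketched in plain C with a binary search per sample. This is only an illustration and not GSL's code: the name `histogram_edges` and its parameters are made up, the edges are assumed to be sorted in ascending order, and samples outside the edges (or NaN) are simply skipped. MATLAB's histc() also has a final bin for values exactly equal to the last edge, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bin k covers [edges[k], edges[k+1]).
 * edges holds n_edges ascending values; counts holds n_edges - 1
 * counters that the caller has zeroed. Out-of-range samples are skipped. */
void histogram_edges(const double *data, size_t n,
                     const double *edges, size_t n_edges,
                     unsigned *counts)
{
    assert(n_edges >= 2);
    for (size_t i = 0; i < n; ++i) {
        double x = data[i];
        /* Written as a negated test so that NaN is skipped as well. */
        if (!(x >= edges[0] && x < edges[n_edges - 1]))
            continue;
        /* Binary search; invariant: edges[lo] <= x < edges[hi]. */
        size_t lo = 0, hi = n_edges - 1;
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (x >= edges[mid])
                lo = mid;
            else
                hi = mid;
        }
        ++counts[lo];
    }
}
```

Each sample costs O(log n_edges) comparisons and no extra memory is allocated. When the bins all have the same width, the division-based indexing discussed in the other answers avoids the search entirely.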
Answer by Tom
The "ideal" histogram algorithm will depend upon the range you expect to capture. Generally any histogram algorithm will look like this:
    const int NSAMPLES = whatever;
    double samples[NSAMPLES] = { 1.0, 3.93, 1e30, ... };  // your data set
    const int NBUCKETS = 10;  // or whatever
    int counts[NBUCKETS] = { 0 };
    for (int i = 0; i != NSAMPLES; ++i) {
        counts[TRANSFER(samples[i])]++;
    }
where TRANSFER() is some function that maps your inputs to a bin (with the 0th or Nth bin mapping to "out of range", if applicable).
The exact implementation of TRANSFER() depends a lot on the expected distribution of your sample and where you are interested in detail. Some common approaches I have seen:
- uniform distribution in range [a,b] (requires linear transform)
- logarithmic distribution of unsigned integer values (best when combined with some bit twiddling hacks to quickly determine the nearest power-of-two or similar).
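As a rough sketch of what those two TRANSFER() variants might look like in C (the names `transfer_linear` and `transfer_log2` and the bin layout are assumptions for illustration, not something from the answer): the linear version reserves bin 0 for samples below the range and bin nbins + 1 for samples at or above it, so the counts array needs nbins + 2 entries. The logarithmic version uses a plain shift loop to stay portable; a compiler intrinsic such as GCC/Clang's `__builtin_clz` could replace the loop if that is available to you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical linear transfer for samples expected in [a, b).
 * Returns 0 for x < a (and for NaN), nbins + 1 for x >= b,
 * and 1..nbins for the equal-width bins that split [a, b). */
size_t transfer_linear(double x, double a, double b, size_t nbins)
{
    assert(nbins > 0 && b > a);
    if (!(x >= a))
        return 0;                 /* below range, or NaN */
    if (x >= b)
        return nbins + 1;         /* above range */
    size_t i = (size_t)((x - a) / (b - a) * (double)nbins);
    if (i >= nbins)
        i = nbins - 1;            /* guard against rounding up near b */
    return i + 1;
}

/* Hypothetical logarithmic transfer for unsigned integers.
 * Bucket 0 holds the value 0; bucket k (k >= 1) holds [2^(k-1), 2^k).
 * A 32-bit input therefore needs 33 buckets. */
size_t transfer_log2(uint32_t v)
{
    size_t k = 0;
    while (v != 0) {
        v >>= 1;
        ++k;
    }
    return k;
}
```

With a = 0, b = 10 and nbins = 10, the sample 3.93 lands in bin 4 and 1e30 lands in the overflow bin 11.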
If you don't know the distribution up-front, then you really can't have an efficient mechanism to bin them effectively: you'll either have to guess (biased or uninformative results) or store everything and sort it at the end, binning into equal-sized buckets (poor performance).
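One way to read the "store everything and sort it at the end" option is to pick bucket edges so that each bucket ends up with roughly the same number of samples. The sketch below takes that reading; the name `equal_count_edges` is made up, the input is assumed to contain no NaN values, and the sort is what makes this O(n log n) rather than a single pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort on doubles (assumes no NaN). */
int cmp_double(const void *pa, const void *pb)
{
    double a = *(const double *)pa;
    double b = *(const double *)pb;
    return (a > b) - (a < b);
}

/* Hypothetical helper: sorts samples in place, then writes nbuckets + 1
 * edges. Bucket k starts at sorted position k * n / nbuckets, and the
 * final edge is the largest sample. */
void equal_count_edges(double *samples, size_t n,
                       size_t nbuckets, double *edges)
{
    assert(n > 0 && nbuckets > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    for (size_t k = 0; k < nbuckets; ++k)
        edges[k] = samples[k * n / nbuckets];
    edges[nbuckets] = samples[n - 1];
}
```

The cost is the memory to hold every sample plus the sort, which matches the "poor performance" caveat above.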
Answer by dwc
I've written my own histogram code in C, as it's simple enough that I didn't even think to look for a library. Normally you just need to create an array with one counter per bin [num_bins = (int)(val_max - val_min + 1);], and as you encounter each sample you divide its offset from the minimum by the bin width [bin_idx = (int)((value - val_min) / bin_width);] (where bin_width = (val_max - val_min) / num_bins) to find where it belongs, and then increment that bin's counter. This is an easy, fast, single pass through the data. Do check my arithmetic above for edge cases.
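A minimal sketch of that single pass, with the edge cases made explicit. The function name `histogram_uniform` and the choice to skip out-of-range samples are assumptions for illustration. The case to watch is value == val_max: the division yields num_bins, one past the last valid index, so it is clamped into the last bin here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts must hold num_bins counters, zeroed by
 * the caller. Samples outside [val_min, val_max] (and NaN) are skipped;
 * the return value is how many were skipped. */
size_t histogram_uniform(const double *data, size_t n,
                         double val_min, double val_max,
                         unsigned *counts, size_t num_bins)
{
    assert(num_bins > 0 && val_max > val_min);
    const double bin_width = (val_max - val_min) / (double)num_bins;
    size_t skipped = 0;
    for (size_t i = 0; i < n; ++i) {
        double value = data[i];
        if (!(value >= val_min && value <= val_max)) {
            ++skipped;
            continue;
        }
        size_t bin_idx = (size_t)((value - val_min) / bin_width);
        if (bin_idx >= num_bins)
            bin_idx = num_bins - 1;   /* value == val_max */
        ++counts[bin_idx];
    }
    return skipped;
}
```

The only storage is the counts array, and each sample costs one subtraction, one division and one increment.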
The problem you might encounter is that the domain of your input might not be known. Having 100 bins over the whole range of double isn't going to be much good if all your data is within only a small fraction of that. The solution is to make a first pass over the data to find the min/max of your range. There's really no quick fix to this and most libraries will ask for min/max up front.
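The two-pass idea might look roughly like this. The names `find_min_max` and `histogram_two_pass` are made up for the sketch, and the input is assumed to be non-empty and free of NaN values. If every sample is identical, the range has zero width, so everything is put in bin 0 rather than dividing by zero.

```c
#include <assert.h>
#include <stddef.h>

/* First pass: find the smallest and largest sample. */
void find_min_max(const double *data, size_t n,
                  double *out_min, double *out_max)
{
    assert(n > 0);
    double lo = data[0], hi = data[0];
    for (size_t i = 1; i < n; ++i) {
        if (data[i] < lo) lo = data[i];
        if (data[i] > hi) hi = data[i];
    }
    *out_min = lo;
    *out_max = hi;
}

/* Second pass: equal-width bins over [min, max]. counts must hold
 * num_bins counters, zeroed by the caller. */
void histogram_two_pass(const double *data, size_t n,
                        unsigned *counts, size_t num_bins)
{
    assert(n > 0 && num_bins > 0);
    double lo, hi;
    find_min_max(data, n, &lo, &hi);
    if (hi == lo) {               /* zero-width range: all samples equal */
        counts[0] += (unsigned)n;
        return;
    }
    const double scale = (double)num_bins / (hi - lo);
    for (size_t i = 0; i < n; ++i) {
        size_t bin = (size_t)((data[i] - lo) * scale);
        if (bin >= num_bins)
            bin = num_bins - 1;   /* the maximum itself lands here */
        ++counts[bin];
    }
}
```

This reads the data twice, so it only works when the samples can be kept in memory or re-read; for streaming input the range has to be supplied up front.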