How does the HyperLogLog algorithm work?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12327004/



database, algorithm, math, data-structures, hyperloglog

Asked by K2xL

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is the HyperLogLog algorithm, which estimates how many unique items are in a list.


This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).


So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:


function countUniqueAlgo1(arr) {
    var Table = Object.create(null); // lookup table of values already seen (no inherited keys)
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue; // this value was already counted
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}

But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).


I've been reading this paper about how to count duplicates in a list in O(n) time using minimal memory.


It explains that by hashing and counting bits, or something along those lines, one can estimate, within a certain probability (assuming the list is evenly distributed), the number of unique items in a list.


I've read the paper, but I can't seem to understand it. Can someone give a layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.


Answered by Juan Lopes

The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).


That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.


(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number in most processors.)
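
As a rough sanity check of those percentages, here is a small JavaScript sketch (the helper fractionWithLeadingZeros is made up for this illustration; Math.clz32 is the built-in leading-zero count) that samples random 32-bit integers and measures how often they start with a given number of zero bits:

function fractionWithLeadingZeros(k, samples) {
    // Fraction of random 32-bit integers whose binary form starts with exactly k zeros, then a 1
    var hits = 0;
    for (var i = 0; i < samples; i++) {
        var x = Math.floor(Math.random() * 4294967296); // random 32-bit integer
        if (Math.clz32(x) === k) hits++;
    }
    return hits / samples;
}

console.log(fractionWithLeadingZeros(0, 100000)); // ~0.5   (starts with "1")
console.log(fractionWithLeadingZeros(1, 100000)); // ~0.25  (starts with "01")
console.log(fractionWithLeadingZeros(2, 100000)); // ~0.125 (starts with "001")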


Of course, if you observe just one integer, the chance that this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of the "00...1" prefix seen in each substream. Then it estimates the final value by taking the mean of the values of the substreams.
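
To make that concrete, here is a very rough JavaScript sketch of the idea described above. It is not the real HyperLogLog: it uses a toy FNV-style hash, a plain arithmetic mean instead of the paper's estimator, and none of the corrections, so treat it only as an illustration of the substream idea:

function hash32(str) {
    // FNV-1a style 32-bit hash; any reasonably uniform hash would do
    var h = 2166136261;
    for (var i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 16777619);
    }
    return h >>> 0;
}

function estimateCardinality(items, substreamBits) {
    var m = 1 << substreamBits;            // number of substreams ("m")
    var maxRank = new Array(m).fill(0);    // longest "00...1" prefix seen in each substream
    for (var i = 0; i < items.length; i++) {
        var h = hash32(String(items[i]));
        var substream = h >>> (32 - substreamBits);   // top bits pick the substream
        var rest = (h << substreamBits) >>> 0;        // remaining bits are the "random" part
        var rank = Math.min(Math.clz32(rest) + 1, 33 - substreamBits);
        if (rank > maxRank[substream]) maxRank[substream] = rank;
    }
    // Plain mean of the per-substream observations; the paper applies a smarter estimator
    // and correction constants on top of this.
    var mean = maxRank.reduce(function (a, b) { return a + b; }, 0) / m;
    return m * Math.pow(2, mean);
}

// Example: estimateCardinality(largeArrayOfStrings, 10) gives a rough distinct count;
// duplicates hash to the same value, so they are only "seen" once.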


That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible English.


Answered by Salvador Dali

A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (keeping a set and adding elements to it), it does this in an approximate way.


Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element can be 1, another element "is this big string". So if you have a huge list (or a huge stream of elements) it will take a lot of memory.




Probabilistic Counting


How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it will start with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you encounter a string with k leading zeros, you have approximately looked through 2^k elements. So this is a good starting point. Having a list of elements that are evenly distributed between 0 and 2^k - 1, you can count the maximum length of the prefix of zeros in the binary representation, and this will give you a reasonable estimate.


The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, almost never evenly distributed, and can take any values). But using a good hashing function you can assume that the output bits are evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements (up to a maximum cardinality of 2^k, for k-bit hashes) by storing only one number of size log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (its estimate is a little bit smarter, but we are still close).
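
A minimal sketch of that single-number estimator in JavaScript, assuming the inputs have already been run through a good 32-bit hash function (this is the high-variance version; it is not LogLog yet):

function estimateDistinct(hashedValues) {
    // hashedValues: 32-bit integers assumed to come from a uniform hash function
    var maxZeros = 0;
    for (var i = 0; i < hashedValues.length; i++) {
        var zeros = Math.clz32(hashedValues[i] >>> 0); // leading zeros in the 32-bit form
        if (zeros > maxZeros) maxZeros = zeros;
    }
    // Seeing k leading zeros suggests roughly 2^k distinct elements have passed by
    return Math.pow(2, maxZeros);
}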


LogLog


Before moving further, we have to understand why our first estimate is not that great. The reason is that a single random occurrence of an element whose hash has a long prefix of zeros can spoil everything. One way to improve it is to use many hash functions, count the max for each of the hash functions, and average them out at the end. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).


They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements, and your hash function, which gives values from 0 to 2^10, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:


0101 011000  bucket 5 will store 1
0110 000011  bucket 6 will store 4

By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
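
A sketch of that split for the example above in JavaScript: with 10-bit hash values and 16 buckets, the top 4 bits select the bucket and the remaining 6 bits are scanned for leading zeros:

function splitHash(h) {
    // h is a 10-bit hash value (0..1023); 16 buckets, so the top 4 bits are the bucket index
    var bucket = h >> 6;       // top 4 bits
    var rest = h & 0x3F;       // low 6 bits
    var zeros = 0;             // leading zeros among those 6 bits
    for (var bit = 5; bit >= 0; bit--) {
        if ((rest >> bit) & 1) break;
        zeros++;
    }
    return { bucket: bucket, zeros: zeros };
}

console.log(splitHash(344)); // { bucket: 5, zeros: 1 }
console.log(splitHash(387)); // { bucket: 6, zeros: 4 }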


HyperLogLog


HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers found that if you remove the 30% biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging the numbers. The paper is math-heavy.
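
For reference, that "another algorithm for averaging" is a harmonic mean. Here is a sketch of the raw HyperLogLog estimate computed from the per-bucket registers, ignoring the small- and large-range corrections in the paper; the constant is the bias-correction factor alpha_m, shown in its form for m >= 128:

function hyperLogLogEstimate(registers) {
    // registers[j] holds the largest rank (position of the first 1 bit) seen in bucket j
    var m = registers.length;
    var alpha = 0.7213 / (1 + 1.079 / m);  // bias-correction constant (this form assumes m >= 128)
    var sum = 0;
    for (var j = 0; j < m; j++) {
        sum += Math.pow(2, -registers[j]); // harmonic mean instead of a plain average
    }
    return alpha * m * m / sum;
}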




And I want to finish with a recent paper, which shows an improved version of the HyperLogLog algorithm (up until now I haven't had time to fully understand it, but maybe later I will improve this answer).


Answered by Wai Yip Tung

The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should be distributed evenly over a range. Let's say the range is 10 bits, representing values up to 1024. Then observe the minimum value. Let's say it is 10. Then the cardinality will be estimated to be about 100 (10 × 100 ≈ 1024).
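
A toy version of that intuition (purely illustrative; it assumes the values are spread evenly over the range and that none of them is zero):

function estimateFromMinimum(values, range) {
    // If n values are spread evenly over [0, range), the smallest one sits near range / n,
    // so range / min is a crude estimate of n
    var min = Math.min.apply(null, values);
    return range / min;
}

// e.g. a minimum value of 10 over a range of 1024 suggests roughly 1024 / 10 ≈ 100 distinct values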


Read the paper for the real logic of course.


Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog
