java 布隆过滤器实现

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/4546447/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-30 06:53:46  来源:igfitidea点击:

Bloom Filter Implementation

javaalgorithmdata-structuresspace-complexitybloom-filter

提问by UVM

Using Bloom filter, we will be getting space optimization. The cassandra framework also has an implementation of Bloom Filter. But in detail, how is this space optimization achieved?

使用布隆过滤器,我们将获得空间优化。cassandra 框架也有布隆过滤器的实现。但具体来说,这种空间优化是如何实现的呢?

采纳答案by SyntaxT3rr0r

A bloom filter isn't a "framework". It's really more like simply an algorithm. The implementation ain't very long.

布隆过滤器不是“框架”。它更像是一个简单的算法。实施时间不长。

Here's one in Java I've tried (.jar, source code and JavaDoc being all available):

这是我尝试过的 Java 中的一个(.jar,源代码和 JavaDoc 都可用):

"Stand alone Java implementations of Cuckoo Hashing and Bloom Filters"(you may want to Google for this in case the following link ain't working anymore):

“Cuckoo Hashing 和 Bloom Filters 的独立 Java 实现”(如果以下链接不再起作用,您可能需要谷歌搜索):

http://lmonson.com/blog/?page_id=99

http://lmonson.com/blog/?page_id=99

回答by Tarun

You can understand how it saves space using this example : Lets say I work for Google, in the Chrome team, and I want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. So I have a dataset of about 1 million malicious URLs, the size of this file being around 25MB. Since the size is quite big, (big in comparison to the size of the browser itself), I store this data on a remote server.

您可以使用此示例了解它如何节省空间:假设我在 Chrome 团队的 Google 工作,并且我想向浏览器添加一项功能,该功能会通知用户他输入的 url 是否为恶意 URL。所以我有一个大约 100 万个恶意 URL 的数据集,这个文件的大小约为 25MB。由于大小相当大(与浏览器本身的大小相比大),我将这些数据存储在远程服务器上。

Case 1 : I use a hash function with a hash table. I decide on an efficient hashing function, and run all the 1 million urls through the hashing function to get hash keys. I then make a hash table (an array), where the hash key would give me the index to place that URL. So now once I have hashed and filled the hashing table, I check its size. I have stored all 1 million URLs in the hash table along with they're keys. So the size is at least 25 MB. This hash table, due to its size will be stored on a remote server. When a user comes along and enters a url in the address bar, I need to check if it malicious. Thus I run the url through the hash function (the browser itself can do this) and I get a hash key for that URL. I now have to make a request to my remote server with that hash key, to check the if the particular URL in my hash table with that particular key, is the same as what the user has entered. If yes then it is malicious and if no then it is not malicious. Thus every time the user enters a URL, a request to the remote server has to be made to check if it is a malicious URL. This would take a lot of time and thus make my browser slow.

案例 1:我使用带有哈希表的哈希函数。我决定使用一个高效的散列函数,并通过散列函数运行所有 100 万个 url 以获取散列键。然后我创建一个哈希表(一个数组),其中哈希键会给我放置该 URL 的索引。所以现在一旦我散列并填充了哈希表,我就会检查它的大小。我已将所有 100 万个 URL 连同它们的键一起存储在哈希表中。所以大小至少为 25 MB。这个哈希表,由于其大小将被存储在远程服务器上。当用户出现并在地址栏中输入 url 时,我需要检查它是否是恶意的。因此,我通过哈希函数运行 url(浏览器本身可以执行此操作),并获得该 URL 的哈希键。我现在必须使用该哈希键向我的远程服务器发出请求,检查具有该特定键的哈希表中的特定 URL 是否与用户输入的相同。如果是,那么它是恶意的,如果不是,那么它不是恶意的。因此,每次用户输入 URL 时,都必须向远程服务器发出请求以检查它是否是恶意 URL。这会花费很多时间,从而使我的浏览器变慢。

Case 2 : I use a bloom filter. The entire list of 1 million URLs are run through the bloom filter using multiple hash functions and the respective positions are marked as 1, in a huge array of 0s. Lets say we want a false positive rate of 1%, using a bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01) , we get the size of the bloom filter required as only 1.13 MB. This small size is expected as, even though the size of the array is huge, we are only storing 1s or 0s and not the URLs as in case of the hash table.This array can be treated as a bit array. That is, since we have only two values 1 and 0, we can set individual bits instead of bytes. This would reduce the space taken by 8 times. This 1.13 MB bloom filter, due to its small size, can be stored in the web browser itself !! Thus when a user comes along and enters a URL, we simply apply the required hash functions (in browser itself), and check all the positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs and the user can proceed freely. Thus we did not make a call to the server and hence saved time. A value of 1 tells us that the url MIGHT be in the list of malicious URLS. In these cases we make a call to the remote server and over there we can use some other hash function with some hash table as in the first case to retrieve and check if the url is actually present. Since most of the times, a url is not likely to be a malicious one, the small bloom filter in the browser figures that out and hence saves time by avoiding calls to the remote server. Only in some cases, if the bloom filter tells us that the url MIGHT be malicious , only in those cases we make a call to the server. That 'MIGHT' is 99% right.

案例 2:我使用布隆过滤器。整个 100 万个 URL 列表使用多个哈希函数通过布隆过滤器运行,各自的位置标记为 1,在一个巨大的 0 数组中。假设我们想要 1% 的误报率,使用布隆过滤器计算器(http://hur.st/bloomfilter?n=1000000&p=0.01) ,我们得到所需的布隆过滤器的大小仅为 1.13 MB。这个小尺寸是预期的,因为即使数组的大小很大,我们也只存储 1 或 0 而不是哈希表中的 URL。这个数组可以被视为一个位数组。也就是说,由于我们只有两个值 1 和 0,我们可以设置单独的位而不是字节。这将减少 8 倍的空间占用。这个 1.13 MB 的布隆过滤器由于体积小,可以存储在 Web 浏览器本身中!!因此,当用户出现并输入 URL 时,我们只需应用所需的哈希函数(在浏览器本身中),并检查布隆过滤器(存储在浏览器中)中的所有位置。任何位置的值 0 告诉我们此 URL 绝对不在恶意 URL 列表中,用户可以自由继续。因此,我们没有调用服务器,因此节省了时间。值 1 告诉我们该 url 可能在恶意 URL 列表中。在这些情况下,我们调用远程服务器,在那里我们可以使用一些其他散列函数和一些散列表,如第一种情况一样,以检索和检查 url 是否实际存在。由于大多数情况下,一个 url 不太可能是恶意的,浏览器中的小型布隆过滤器会发现这一点,从而通过避免调用远程服务器来节省时间。只有在某些情况下,如果布隆过滤器告诉我们 url 可能是恶意的,只有在这些情况下我们才会调用服务器。那个“可能”是 99% 正确的。在这些情况下,我们调用远程服务器,在那里我们可以使用一些其他散列函数和一些散列表,如第一种情况一样,以检索和检查 url 是否实际存在。由于大多数情况下,一个 url 不太可能是恶意的,浏览器中的小型布隆过滤器会发现这一点,从而通过避免调用远程服务器来节省时间。只有在某些情况下,如果布隆过滤器告诉我们 url 可能是恶意的,只有在这些情况下我们才会调用服务器。那个“可能”是 99% 正确的。在这些情况下,我们调用远程服务器,在那里我们可以使用一些其他散列函数和一些散列表,如第一种情况一样,以检索和检查 url 是否实际存在。由于大多数情况下,一个 url 不太可能是恶意的,浏览器中的小型布隆过滤器会发现这一点,从而通过避免调用远程服务器来节省时间。只有在某些情况下,如果布隆过滤器告诉我们 url 可能是恶意的,只有在这些情况下我们才会调用服务器。那个“可能”是 99% 正确的。浏览器中的小型布隆过滤器可以计算出这一点,从而通过避免调用远程服务器来节省时间。只有在某些情况下,如果布隆过滤器告诉我们 url 可能是恶意的,只有在这些情况下我们才会调用服务器。那个“可能”是 99% 正确的。浏览器中的小型布隆过滤器可以计算出这一点,从而通过避免调用远程服务器来节省时间。只有在某些情况下,如果布隆过滤器告诉我们 url 可能是恶意的,只有在这些情况下我们才会调用服务器。那个“可能”是 99% 正确的。

So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every url entered.

因此,通过在浏览器中使用小型布隆过滤器,我们节省了大量时间,因为我们不需要为输入的每个 url 进行服务器调用。

回答by siemanko

So I have seen this question before, and I used advice above and it turned out to be way to slow for me. So I wrote my own. It is not fully general, but I am sure if somebody is desperate for performance like I am they will make it more general by themselves :)

所以我之前看过这个问题,我使用了上面的建议,结果证明它对我来说很慢。所以我写了我自己的。这并不完全通用,但我相信如果有人像我一样渴望性能,他们会自己让它变得更通用:)

I used Murmur hash implementation that you can download here: http://d3s.mff.cuni.cz/~holub/sw/javamurmurhash/

我使用了 Murmur 哈希实现,你可以在这里下载:http: //d3s.mff.cuni.cz/~holub/sw/javamurmurhash/

The code: package uk.ac.cam.cl.ss958.SpringBoardSimulation;

代码:包uk.ac.cam.cl.ss958.SpringBoardSimulation;

    import ie.ucd.murmur.MurmurHash;

    import java.util.BitSet;
    import java.util.Random;

    public class FastBloomFilter {

        private final BitSet bs;

        final int [] hashSeeds;

        final int capacity;

        public FastBloomFilter(int slots, int hashFunctions) {
            bs = new BitSet(slots);
            Random r = new Random(System.currentTimeMillis());
            hashSeeds = new int[hashFunctions];
            for (int i=0; i<hashFunctions; ++i) {
                hashSeeds[i] = r.nextInt();
            }
            capacity = slots;
        }

        public void add(int value) {
            byte [] b = new byte[] {
                    (byte)(value >>> 24),
                    (byte)(value >>> 16),
                    (byte)(value >>> 8),
                    (byte)value};
            for (int i=0; i<hashSeeds.length; ++i) {
                int h = MurmurHash.hash32(b, 4, hashSeeds[i]);
                bs.set(Math.abs(h)%capacity, true);
            }
        }

        public void clear() {
            bs.clear();
        }

        public boolean mightContain(int value) {
            byte [] b = new byte[] {
                    (byte)(value >>> 24),
                    (byte)(value >>> 16),
                    (byte)(value >>> 8),
                    (byte)value};
            for (int i=0; i<hashSeeds.length; ++i) {
                int h = MurmurHash.hash32(b, 4, hashSeeds[i]);

                if(!bs.get(Math.abs(h)%capacity)) {
                    return false;


            }

            return true;
        }


        public static void main(String [] args) {
            FastBloomFilter bf = new FastBloomFilter(1000, 10);
            System.out.println("Query for 2000: " + bf.mightContain(2000));
            System.out.println("Adding 2000");
            bf.add(2000);
            System.out.println("Query for 2000: " + bf.mightContain(2000));


        }
    }

回答by Nikita Koksharov

You can use Bloom filter based on Redisserver with Redissonlib. Based on 128-bits HighwayHash. Here is an example:

您可以使用基于Redis服务器的Bloom 过滤器和Redisson库。基于 128 位HighwayHash。下面是一个例子:

RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample");

// initialize bloom filter once with 
// expectedInsertions = 55000000
// falseProbability = 0.03
bloomFilter.tryInit(55000000L, 0.03);

bloomFilter.add(new SomeObject(someStateHere1));
bloomFilter.add(new SomeObject(someStateHere2));
// does it contain object?
bloomFilter.contains(new SomeObject(someStateHere3));

回答by Nikita Koksharov

I wrote a short postabout implementing a bloom filter using Java 8 features, that I hope is relevant to the issue of space savings. I went a bit furtherto discuss how to bit slice a collection of bloom filters, when some information retrieval systems would do this, which is relevant to efficiencies when you have lots of bloom filters.

我写了一篇关于使用 Java 8 特性实现布隆过滤器的简短文章,我希望它与节省空间的问题有关。我进一步讨论了如何对布隆过滤器集合进行位切片,当某些信息检索系统会这样做时,这与当您拥有大量布隆过滤器时的效率相关。