Note: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18745892/

Time: 2020-08-12 10:40:39  Source: igfitidea

Implementation of Basic Sliding Window Algorithm in Java

Tags: java, algorithm

Asked by littleK

I am attempting to implement the following Basic Sliding Window algorithm in Java. I get the basic idea of it, but I am a bit confused by some of the wording, specifically the sentence in bold:

A sliding window of fixed width w is moved across the file, and at every position k in the file, the fingerprint of its content is computed. Let k be a chunk boundary (i.e., F_k mod n = 0). Instead of taking the hash of the entire chunk, we choose the numerically smallest fingerprint of a sliding window within this chunk. Then we compute a hash of this randomly chosen window within the chunk. Intuitively, this approach would permit small edits within the chunks to have less impact on the similarity computation. This method produces a variable length document signature, where the number of fingerprints in the signature is proportional to the document length.

Please see my code/results below. Am I understanding the basic idea of the algorithm? As per the text in bold, what does it mean to "choose the numerically smallest fingerprint of a sliding window within its chunk"? I am currently just hashing the entire chunk.

code:

import java.util.ArrayList;
import java.util.List;

public class BSW {

    /**
     * @param args
     */
    public static void main(String[] args) {
        int w = 15; // fixed width of sliding window
        // Java string literals cannot span lines, so the text is concatenated.
        char[] chars = ("Once upon a time there lived in a certain village a little "
                + "country girl, the prettiest creature who was ever seen. Her mother was "
                + "excessively fond of her; and her grandmother doted on her still more. This "
                + "good woman had a little red riding hood made for her. It suited the girl so "
                + "extremely well that everybody called her Little Red Riding Hood.")
                .toCharArray();

        List<String> fingerprints = new ArrayList<String>();

        // Step through the text w characters at a time.
        for (int i = 0; i < chars.length; i = i + w) {

            StringBuilder sb = new StringBuilder();

            // Take w characters, or whatever remains at the end of the text.
            int len = Math.min(w, chars.length - i);
            sb.append(chars, i, len);
            System.out.println(i + ". " + sb.toString());

            fingerprints.add(hash(sb));

        }

    }

    private static String hash(StringBuilder sb) {
        // Placeholder: implement hash (MD5)
        return sb.toString();
    }

}

results:

0. Once upon a tim
15. e there lived i
30. n a certain vil
45. lage a little c
60. ountry girl, th
75. e prettiest cre
90. ature who was e
105. ver seen. Her m
120. other was exces
135. sively fond of 
150. her; and her gr
165. andmother doted
180.  on her still m
195. ore. This good 
210. woman had a lit
225. tle red riding 
240. hood made for h
255. er. It suited t
270. he girl so extr
285. emely well that
300.  everybody call
315. ed her Little R
330. ed Riding Hood.

Accepted answer by Summer_More_More_Tea

The short answer, as I understand it, is no. (I studied the sliding window algorithm years ago, so I remember the principles but not all of the details. Correct me if you have a better understanding.)

As the name 'Sliding Window' suggests, your window should slide, not jump. The passage you quoted says:

at every position k in the file, the fingerprint of its content is computed

That is to say, the window slides one character at a time.

As far as I know, chunks and windows are distinct concepts. So are fingerprints and hashes, although they can be the same thing. Since computing a full hash as the fingerprint is too expensive, I think a Rabin fingerprint is a more suitable choice. A chunk is a large block of text in the document, and a window highlights a small portion of a chunk. IIRC, the sliding window algorithm works like this:

  1. The text file is first split into chunks;
  2. For each chunk, you slide the window (a 15-character block in your case) and compute a fingerprint for each window of text;
  3. You now have the fingerprint of the chunk, whose length is proportional to the length of the chunk.
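
The steps above, together with the "numerically smallest fingerprint" selection from the quoted passage, could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the original: chunks have a fixed size (the quoted text derives boundaries from the content instead), and the fingerprint is a simple polynomial rolling hash standing in for a true Rabin fingerprint. The class name `MinFingerprintSketch`, the methods `windowFingerprints` and `signature`, and the constants `BASE` and `MOD` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MinFingerprintSketch {

    // Hypothetical parameters, chosen only for illustration.
    static final long BASE = 257L;
    static final long MOD = 1_000_000_007L;

    /**
     * Fingerprints of every width-w window inside text[start, end),
     * sliding one character at a time. Uses a polynomial rolling hash
     * (an assumption; not a true Rabin fingerprint).
     */
    public static List<Long> windowFingerprints(String text, int start, int end, int w) {
        List<Long> result = new ArrayList<>();
        if (w <= 0 || end - start < w) {
            return result; // the chunk is too short to hold a full window
        }
        long high = 1; // BASE^(w-1) mod MOD, the weight of the outgoing character
        for (int i = 1; i < w; i++) {
            high = high * BASE % MOD;
        }
        long fp = 0;
        for (int i = start; i < start + w; i++) {
            fp = (fp * BASE + text.charAt(i)) % MOD;
        }
        result.add(fp);
        for (int i = start + w; i < end; i++) {
            long outgoing = text.charAt(i - w) * high % MOD;
            // Drop the leftmost character, shift, and append the new one.
            fp = ((fp - outgoing + MOD) * BASE + text.charAt(i)) % MOD;
            result.add(fp);
        }
        return result;
    }

    /** One value per chunk: the numerically smallest window fingerprint in it. */
    public static List<Long> signature(String text, int chunkSize, int w) {
        List<Long> sig = new ArrayList<>();
        // Assumption: fixed-size chunks, purely to keep the sketch short.
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(start + chunkSize, text.length());
            List<Long> fps = windowFingerprints(text, start, end, w);
            if (!fps.isEmpty()) {
                sig.add(Collections.min(fps));
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        String text = "Once upon a time there lived in a certain village a little country girl";
        System.out.println(signature(text, 30, 15));
    }
}
```

Because only the smallest window fingerprint of each chunk is kept, an edit elsewhere in the chunk changes the signature only if it touches that window or produces an even smaller fingerprint, which seems to be the intuition behind the quoted passage.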

The next step is using the fingerprints to compute the similarity between documents, which is beyond my knowledge. Could you give us a pointer to the article you quoted in the OP? In exchange, I recommend this paper, which introduces a variant of the sliding window algorithm for computing document similarity.

Winnowing: local algorithms for document fingerprinting

Another application you can refer to is rsync, which is a data synchronisation tool with block-level (corresponding to your chunk) deduplication. See this short article for how it works.

Answer by Jim Garrison

That is not a sliding window. All you have done is break up the input into disjoint chunks. An example of a sliding window would be

Once upon a time
upon a time there
a time there lived
etc. 
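
A word-level window like the one in this example could be sketched as below. The class name `WordWindowSketch`, the method `wordWindows`, and the choice of four words per window are assumptions made for illustration; the answer itself does not specify an implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordWindowSketch {

    /** Returns every run of wordsPerWindow consecutive words, advancing one word at a time. */
    public static List<String> wordWindows(String text, int wordsPerWindow) {
        String[] words = text.trim().split("\\s+");
        List<String> windows = new ArrayList<>();
        // Stop as soon as a full window no longer fits.
        for (int i = 0; i + wordsPerWindow <= words.length; i++) {
            String[] slice = Arrays.copyOfRange(words, i, i + wordsPerWindow);
            windows.add(String.join(" ", slice));
        }
        return windows;
    }

    public static void main(String[] args) {
        // Prints the three windows listed in the example above.
        for (String window : wordWindows("Once upon a time there lived", 4)) {
            System.out.println(window);
        }
    }
}
```

Consecutive windows overlap in all but one word, which is what separates a sliding window from disjoint chunks.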

Answer by umang shukla

package com.perturbation;

import java.util.ArrayList;
import java.util.List;

public class BSW {

    /**
     * @param args
     */
    public static void main(String[] args) {
        int w = 2; // fixed width of sliding window
        char[] chars = "umang shukla"
                .toCharArray();

        List<String> fingerprints = new ArrayList<String>();

        // Advance one character at a time and stop once a full window no longer
        // fits, so append() never receives an out-of-range offset or length.
        for (int i = 0; i + w <= chars.length; i++) {

            StringBuilder sb = new StringBuilder();

            sb.append(chars, i, w);
            System.out.println(i + ". " + sb.toString());

            fingerprints.add(hash(sb));

        }

    }

    private static String hash(StringBuilder sb) {
        // Placeholder: implement hash (MD5)
        return sb.toString();
    }

}

This program may help you. Please try to make it more efficient.