C: How to understand Locality Sensitive Hashing?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12952729/

How to understand Locality Sensitive Hashing?

Tags: c, machine-learning, hashmap, nearest-neighbor, locality-sensitive-hash

Asked by WoooHaaaa

I noticed that LSH seems like a good way to find similar items with high-dimensional properties.

After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused by those formulas.

Does anyone know of a blog or article that explains it in an easy way?

Answered by greeness

The best tutorial I have seen for LSH is in the book: Mining of Massive Datasets. Check Chapter 3 - Finding Similar Items http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

I also recommend these slides: http://www.cs.jhu.edu/%7Evandurme/papers/VanDurmeLallACL10-slides.pdf. The example in the slides helped me a lot in understanding hashing for cosine similarity.

I borrowed two slides from Benjamin Van Durme & Ashwin Lall, ACL 2010, and will try to explain the intuition of LSH families for cosine distance a bit.

[Slide 1: two data points, a pool of random hyperplanes (gray lines), and the resulting bit signatures]

  • In the figure, there are two circles colored red and yellow, representing two two-dimensional data points. We are trying to find their cosine similarity using LSH.
  • The gray lines are some uniformly randomly picked planes.
  • Depending on whether the data point lies above or below a gray line, we mark this relation as 0/1.
  • In the upper-left corner, there are two rows of white/black squares, representing the signatures of the two data points respectively. Each square corresponds to a bit: 0 (white) or 1 (black).
  • So once you have a pool of planes, you can encode the data points by their locations relative to the planes. Imagine that the more planes we have in the pool, the closer the angular difference encoded in the signature is to the actual difference, because only planes that lie between the two points give the two points different bit values. (A minimal code sketch of this encoding follows right after this list.)
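To make the picture above concrete, here is a minimal sketch of that random-hyperplane signature (my own illustration in NumPy with made-up point coordinates, not the answerer's gist): plane normals are drawn at random, and each bit of a signature records which side of a plane the point falls on.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_planes(num_planes, dim):
        # Normals of `num_planes` random hyperplanes through the origin.
        return rng.normal(size=(num_planes, dim))

    def signature(point, planes):
        # Bit i is 1 if the point lies on the positive side of plane i, else 0.
        return (planes @ point >= 0).astype(int)

    planes = random_planes(num_planes=6, dim=2)   # the six "gray lines"
    red = np.array([1.0, 0.2])                    # hypothetical red point
    yellow = np.array([0.8, 0.6])                 # hypothetical yellow point
    print(signature(red, planes), signature(yellow, planes))

With more planes, the fraction of positions where the two signatures disagree concentrates around the true angle between the points divided by pi.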

[Slide 2: comparing the two 6-bit signatures]

  • Now we look at the signatures of the two data points. As in the example, we use only 6 bits (squares) to represent each data point. This is the LSH hash for the original data we have.
  • The Hamming distance between the two hashed values is 1, because their signatures differ by only 1 bit.
  • Considering the length of the signature, we can calculate their angular similarity as shown in the graph (see the sketch right after this list).
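Continuing the sketch above (the exact bit patterns here are made up; only the 1-bit difference matches the slide), the angle can be estimated from the Hamming distance: a random plane separates the two points with probability theta/pi, so the fraction of differing bits estimates theta/pi, and its cosine approximates the cosine similarity.

    import numpy as np

    def estimated_cosine(sig_a, sig_b):
        # Fraction of differing bits estimates theta / pi.
        hamming = np.count_nonzero(sig_a != sig_b)
        theta_hat = np.pi * hamming / len(sig_a)
        return np.cos(theta_hat)

    # Two 6-bit signatures that differ in exactly 1 bit, as in the slide:
    sig_a = np.array([1, 0, 1, 1, 0, 1])
    sig_b = np.array([1, 0, 1, 1, 0, 0])
    print(estimated_cosine(sig_a, sig_b))   # cos(pi / 6) ≈ 0.866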

I have some sample code (just 50 lines) in Python here which uses cosine similarity: https://gist.github.com/94a3d425009be0f94751

Answered by mvogiatzis

Tweets in vector space can be a great example of high dimensional data.

Check out my blog post on applying Locality Sensitive Hashing to tweets to find similar ones.

http://micvog.com/2013/09/08/storm-first-story-detection/

And because a picture is worth a thousand words, check out the picture below:

[Image: http://micvog.files.wordpress.com/2013/08/lsh1.png]

Hope it helps. @mvogiatzis

Answered by nilsi

Here's a presentation from Stanford that explains it. It made a big difference for me. Part two is more about LSH, but part one covers it as well.

A picture of the overview (there is much more in the slides):

[Overview slide from the presentation]

Near Neighbor Search in High Dimensional Data - Part1: http://www.stanford.edu/class/cs345a/slides/04-highdim.pdf

Near Neighbor Search in High Dimensional Data - Part2: http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf

Answered by Carlos Teixeira

  • LSH is a procedure that takes as input a set of documents/images/objects and outputs a kind of hash table.
  • The indexes of this table contain the documents, such that documents that land on the same index are considered similar and those on different indexes are "dissimilar".
  • What counts as similar depends on the metric used and also on a threshold similarity s, which acts like a global parameter of LSH.
  • It is up to you to define the adequate threshold s for your problem. (A small sketch of this bucket view follows right after this list.)
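Here is a rough sketch of that hash-table view (my own illustration using the standard banding trick from the Mining of Massive Datasets chapter linked in the first answer, not this answerer's code): each signature is split into bands, each band becomes a bucket key, and documents that share any bucket are treated as candidate similar pairs. The choice of the number of bands b and rows r is what effectively sets the threshold s (roughly s ≈ (1/b)^(1/r)).

    from collections import defaultdict

    def candidate_pairs(signatures, bands, rows):
        # signatures: {doc_id: tuple of bits}, with length == bands * rows.
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])  # band index + band contents
                buckets[key].append(doc_id)
        pairs = set()
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    pairs.add((docs[i], docs[j]))
        return pairs

    sigs = {"d1": (1, 0, 1, 1, 0, 1), "d2": (1, 0, 1, 0, 1, 1), "d3": (0, 1, 0, 0, 1, 0)}
    print(candidate_pairs(sigs, bands=3, rows=2))   # {('d1', 'd2')}: they share their first band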

It is important to underline that different similarity measures have different implementations of LSH.

In my blog, I tried to thoroughly explain LSH for the cases of minHashing (Jaccard similarity measure) and simHashing (cosine distance measure). I hope you find it useful: https://aerodatablog.wordpress.com/2017/11/29/locality-sensitive-hashing-lsh/

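For the Jaccard flavour mentioned above, here is a minimal MinHash sketch (my own illustration, not the blog's code; the hash functions and example documents are made up): each of k random hash functions keeps the minimum value it takes over a document's token set, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the two sets.

    import random

    def minhash_signature(tokens, k=100, seed=42):
        rng = random.Random(seed)
        prime = 2**31 - 1
        # k random (a, b) pairs defining hash functions h(x) = (a*x + b) % prime.
        params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(k)]
        return [min((a * hash(t) + b) % prime for t in tokens) for a, b in params]

    def estimated_jaccard(sig_a, sig_b):
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    doc1 = {"locality", "sensitive", "hashing", "is", "useful"}
    doc2 = {"locality", "sensitive", "hashing", "is", "fast"}
    print(estimated_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
    # Should land near the true Jaccard similarity 4/6 ≈ 0.67.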

Answered by Philippe Ombredanne

I am a visual person. Here is what works for me as an intuition.

Say each of the things you want to search for approximately is a physical object such as an apple, a cube, or a chair.

My intuition for LSH is that it is similar to taking the shadows of these objects. If you take the shadow of a 3D cube, you get a 2D square-like shape on a piece of paper, and a 3D sphere will give you a circle-like shadow on a piece of paper.

In the end, there are many more than three dimensions in a search problem (where each word in a text could be one dimension), but the shadow analogy is still very useful to me.

Now we can efficiently compare strings of bits in software. A fixed-length bit string is, more or less, like a line in a single dimension.

So with an LSH, I eventually project the shadows of objects as points (0 or 1) on a single fixed-length line/bit string.

The whole trick is to take the shadows such that they still make sense in the lower dimension, e.g. they resemble the original object closely enough to be recognized.

A 2D drawing of a cube in perspective tells me this is a cube. But without perspective I cannot easily distinguish a 2D square from the shadow of a 3D cube: they both look like a square to me.

How I present my object to the light determines whether or not I get a good, recognizable shadow. So I think of a "good" LSH as one that turns my objects in front of a light such that their shadows are best recognizable as representing my objects.

So to recap: I think of the things to index with an LSH as physical objects like a cube, a table, or a chair, and I project their shadows into 2D and eventually along a line (a bit string). And a "good" LSH "function" is how I present my objects in front of a light to get an approximately distinguishable shape in the 2D flatland and, later, in my bit string.

Finally, when I want to check whether an object I have is similar to some objects that I indexed, I take the shadow of this "query" object in the same way I presented my other objects in front of the light (eventually ending up with a bit string too). Now I can compare how similar that bit string is to all my other indexed bit strings, which is a proxy for searching over my whole objects, provided I found a good and recognizable way to present my objects to my light.

Answered by Guillaume Chevalier

As a very short, tl;dr answer:

An example of locality-sensitive hashing could be to first place planes randomly (with a rotation and offset) in your space of inputs, then drop each point you want to hash into that space, and for each plane record whether the point is above or below it (e.g. 0 or 1); the resulting bits are the hash. Points that are similar in space (as measured by cosine distance before hashing) will then have similar hashes (as measured by Hamming distance after hashing).

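A tiny sketch of that recipe (my own illustration, not the code in the linked repository; the dimensions, points, and noise level are made up), with both the random rotation (plane normal) and the random offset:

    import numpy as np

    rng = np.random.default_rng(1)
    dim, n_planes = 5, 16
    normals = rng.normal(size=(n_planes, dim))   # random plane orientations
    offsets = rng.normal(size=n_planes)          # random plane offsets

    def lsh_hash(x):
        # One bit per plane: is the point above (1) or below (0) the plane?
        return ((normals @ x + offsets) > 0).astype(int)

    x = rng.normal(size=dim)
    nearby = x + 0.01 * rng.normal(size=dim)     # a point close to x
    print(lsh_hash(x))
    print(lsh_hash(nearby))                      # should agree on most bits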

You could read this example, which uses scikit-learn: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
