Java 如何计算两个向量的余弦相似度?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/520241/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How do I calculate the cosine similarity of two vectors?
提问by
How do I find the cosine similarity between vectors?
如何找到向量之间的余弦相似度?
I need to find the similarity to measure the relatedness between two lines of text.
我需要找到相似度来衡量两行文本之间的相关性。
For example, I have two sentences like:
例如,我有两个句子,例如:
system for user interface
user interface machine
用户界面系统
用户界面机
…?and their respective vectors after tF-idf, followed by normalisation using LSI, for example
[1,0.5]
and [0.5,1]
.
...? 和它们各自在 tF-idf 之后的向量,然后使用 LSI 进行归一化,例如
[1,0.5]
和[0.5,1]
。
How do I measure the smiliarity between these vectors?
如何衡量这些向量之间的相似度?
采纳答案by Mark Davidson
public class CosineSimilarity extends AbstractSimilarity {
@Override
protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
double eucledianDist = sourceDoc.normF() * targetDoc.normF();
return dotProduct / eucledianDist;
}
}
I did some tf-idf stuff recently for my Information Retrieval unit at University. I used this Cosine Similarity method which uses Jama: Java Matrix Package.
我最近为我在大学的信息检索单元做了一些 tf-idf 的东西。我使用了这个 Cosine Similarity 方法,它使用Jama: Java Matrix Package。
For the full source code see IR Math with Java : Similarity Measures, really good resource that covers a good few different similarity measurements.
对于完整的源代码,请参阅IR Math with Java : Similarity Measures,非常好的资源,涵盖了一些不同的相似度测量。
回答by Toon Krijthe
Have a look at: http://en.wikipedia.org/wiki/Cosine_similarity.
看看:http: //en.wikipedia.org/wiki/Cosine_similarity。
If you have vectors A and B.
如果你有向量 A 和 B。
The similarity is defined as:
相似度定义为:
cosine(theta) = A . B / ||A|| ||B||
For a vector A = (a1, a2), ||A|| is defined as sqrt(a1^2 + a2^2)
For vector A = (a1, a2) and B = (b1, b2), A . B is defined as a1 b1 + a2 b2;
So for vector A = (a1, a2) and B = (b1, b2), the cosine similarity is given as:
(a1 b1 + a2 b2) / sqrt(a1^2 + a2^2) sqrt(b1^2 + b2^2)
Example:
例子:
A = (1, 0.5), B = (0.5, 1)
cosine(theta) = (0.5 + 0.5) / sqrt(5/4) sqrt(5/4) = 4/5
回答by Anonymous
When I was working with text mining some time ago, I was using the SimMetricslibrary which provides an extensive range of different metrics in Java. If it happened that you need more, then there is always R and CRANto look at.
前段时间我从事文本挖掘工作时,我使用了SimMetrics库,该库在 Java 中提供了广泛的不同指标。如果碰巧您需要更多,那么总有R 和 CRAN可以查看。
But coding it from the description in the Wikipedia is rather trivial task, and can be a nice exercise.
但是根据维基百科中的描述对其进行编码是一项相当微不足道的任务,并且可以是一个很好的练习。
回答by Nick Fortescue
For matrix code in Java I'd recommend using the Coltlibrary. If you have this, the code looks like (not tested or even compiled):
对于 Java 中的矩阵代码,我建议使用Colt库。如果你有这个,代码看起来像(未经测试甚至编译):
DoubleMatrix1D a = new DenseDoubleMatrix1D(new double[]{1,0.5}});
DoubleMatrix1D b = new DenseDoubleMatrix1D(new double[]{0.5,1}});
double cosineDistance = a.zDotProduct(b)/Math.sqrt(a.zDotProduct(a)*b.zDotProduct(b))
The code above could also be altered to use one of the Blas.dnrm2()
methods or Algebra.DEFAULT.norm2()
for the norm calculation. Exactly the same result, which is more readable depends on taste.
也可以更改上面的代码以使用其中一种Blas.dnrm2()
方法或Algebra.DEFAULT.norm2()
用于范数计算。完全相同的结果,哪个更具可读性取决于品味。
回答by Alphaaa
If you want to avoid relying on third-party libraries for such a simple task, here is a plain Java implementation:
如果你想避免依赖第三方库来完成这样一个简单的任务,这里是一个普通的 Java 实现:
public static double cosineSimilarity(double[] vectorA, double[] vectorB) {
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vectorA.length; i++) {
dotProduct += vectorA[i] * vectorB[i];
normA += Math.pow(vectorA[i], 2);
normB += Math.pow(vectorB[i], 2);
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Note that the function assumes that the two vectors have the same length. You may want to explictly check it for safety.
请注意,该函数假定两个向量具有相同的长度。为了安全起见,您可能需要明确检查它。
回答by Thamme Gowda
For the sparse representation of vectors using Map(dimension -> magnitude)
Here is a scala version (You can do similar stuff in Java 8)
对于使用Map(dimension -> magnitude)
这里是scala版本的向量的稀疏表示(您可以在Java 8中做类似的事情)
def cosineSim(vec1:Map[Int,Int],
vec2:Map[Int,Int]): Double ={
val dotProduct:Double = vec1.keySet.intersect(vec2.keySet).toList
.map(dim => vec1(dim) * vec2(dim)).sum
val norm1:Double = vec1.values.map(mag => mag * mag).sum
val norm2:Double = vec2.values.map(mag => mag * mag).sum
return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2))
}
回答by u10437407
def cosineSimilarity(vectorA: Vector[Double], vectorB: Vector[Double]):Double={
var dotProduct = 0.0
var normA = 0.0
var normB = 0.0
var i = 0
for(i <- vectorA.indices){
dotProduct += vectorA(i) * vectorB(i)
normA += Math.pow(vectorA(i), 2)
normB += Math.pow(vectorB(i), 2)
}
dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}
def main(args: Array[String]): Unit = {
val vectorA = Array(1.0,2.0,3.0).toVector
val vectorB = Array(4.0,5.0,6.0).toVector
println(cosineSimilarity(vectorA, vectorA))
println(cosineSimilarity(vectorA, vectorB))
}
scala version
Scala版本