Java Cosine Similarity
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1997750/
Cosine Similarity
Asked by user238384
I calculated tf/idf values of two documents. The following are the tf/idf values:
1.txt
0.0
0.5
2.txt
0.0
0.5
The documents are like:
1.txt = > dog cat
2.txt = > cat elephant
How can I use these values to calculate cosine similarity?
I know that I should calculate the dot product, then find distance and divide dot product by it. How can I calculate this using my values?
One more question: is it important that both documents have the same number of words?
Answered by Yin Zhu
sim(a, b) = (a · b) / (|a| * |b|)
where a · b is the dot product and |a| is the Euclidean norm.
Some details:
import math

def dot(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def norm(a):
    total = 0
    for i in range(len(a)):
        total += a[i] * a[i]
    return math.sqrt(total)

def cossim(a, b):
    return dot(a, b) / (norm(a) * norm(b))
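Applied to the question's two documents, a self-contained usage sketch (the weights below are illustrative tf-idf values over the vocabulary [cat, dog, elephant], not the exact numbers from the question):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cossim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# vocabulary order: [cat, dog, elephant]; weights are illustrative
doc1 = [0.5, 0.5, 0.0]   # "dog cat"
doc2 = [0.5, 0.0, 0.5]   # "cat elephant"
print(cossim(doc1, doc2))  # prints 0.5
```

The two documents share one of their two terms each, so the similarity lands halfway between orthogonal (0) and identical (1).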
Yes, to some extent: a and b must have the same length. But a and b usually have sparse representations; you only need to store the non-zero entries, and then you can compute the norm and dot product faster.
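That sparse idea can be sketched with Python dicts holding only the non-zero weights (illustrative code, not part of the original answer; the Java answer below does the same thing with Maps):

```python
import math

def sparse_cossim(a, b):
    # a, b: dicts mapping term -> weight, storing only non-zero entries
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict for the dot product
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

v1 = {"dog": 0.5, "cat": 0.5}
v2 = {"cat": 0.5, "elephant": 0.5}
print(sparse_cossim(v1, v2))  # prints 0.5
```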
Answered by yura
A simple Java implementation:
static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // Sets.newHashSet is from Google Guava; new HashSet<>(v1.keySet()) also works
    Set<String> both = Sets.newHashSet(v1.keySet());
    both.retainAll(v2.keySet());
    double scalar = 0, norm1 = 0, norm2 = 0;
    for (String k : both) scalar += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return scalar / Math.sqrt(norm1 * norm2);
}
Answered by Yogesh Yadav
1) Calculate tf-idf (generally better than tf alone, but it depends entirely on your data set and requirements).
From Wikipedia (regarding idf):
An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
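To make that concrete, a minimal hand-rolled sketch (using the plain log(N/df) idf variant; real libraries add smoothing) for the question's two documents:

```python
import math

docs = [["dog", "cat"], ["cat", "elephant"]]
vocab = sorted({t for d in docs for t in d})  # ['cat', 'dog', 'elephant']

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

tfidf = [[tf(t, d) * idf(t, docs) for t in vocab] for d in docs]
# "cat" appears in every document, so its idf is log(2/2) = 0
# and its weight vanishes; the rarer terms keep positive weight
```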
2) No, it is not important that both documents have the same number of words.
3) Nowadays you can compute tf-idf or cosine similarity in any language by calling a machine learning library function. I prefer Python.
Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# example dataset
from sklearn.datasets import fetch_20newsgroups

# replace with your method to get data
example_data = fetch_20newsgroups(subset='all').data

max_features_for_tfidf = 10000
is_idf = True

vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                             min_df=2, stop_words='english',
                             use_idf=is_idf)
X_Mat = vectorizer.fit_transform(example_data)

# calculate cosine similarity between samples in X with samples in Y
cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)
4) You might be interested in truncated Singular Value Decomposition (SVD).
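A hedged sketch of what truncated SVD does, hand-rolled with NumPy for illustration (in practice you would reach for a library implementation such as scikit-learn's TruncatedSVD):

```python
import numpy as np

def truncated_svd(X, k):
    # full SVD, then keep only the k largest singular triplets
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # rows of X projected into a k-dimensional space

rng = np.random.default_rng(0)
X = rng.random((6, 4))       # e.g. a small dense tf-idf matrix
Z = truncated_svd(X, k=2)
print(Z.shape)               # prints (6, 2)
```

Cosine similarity can then be computed between the reduced k-dimensional rows, which is cheaper and often less noisy than working in the full term space.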

