Java Cosine Similarity
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1997750/
Cosine Similarity
Asked by user238384
I calculated tf/idf values of two documents. The following are the tf/idf values:
1.txt
0.0
0.5
2.txt
0.0
0.5
The documents are like:
1.txt = > dog cat
2.txt = > cat elephant
How can I use these values to calculate cosine similarity?
I know that I should calculate the dot product, then find distance and divide dot product by it. How can I calculate this using my values?
One more question: is it important that both documents have the same number of words?
Answered by Yin Zhu
sim(a, b) = (a · b) / (|a| * |b|)
where a · b is the dot product and |a| is the Euclidean norm.
Some details:
import math

def dot(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def norm(a):
    total = 0
    for i in range(len(a)):
        total += a[i] * a[i]
    return math.sqrt(total)

def cossim(a, b):
    return dot(a, b) / (norm(a) * norm(b))
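Applied to the question's two documents, a self-contained usage sketch (the weights below are illustrative tf-idf values over the vocabulary [cat, dog, elephant], not the exact numbers from the question):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cossim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# vocabulary order: [cat, dog, elephant]; weights are illustrative
doc1 = [0.5, 0.5, 0.0]   # "dog cat"
doc2 = [0.5, 0.0, 0.5]   # "cat elephant"
print(cossim(doc1, doc2))  # prints 0.5
```

The two documents share one of their two terms each, so the similarity lands halfway between orthogonal (0) and identical (1).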
Yes, to some extent: a and b must have the same length. But a and b usually have sparse representations; you only need to store the non-zero entries, and then you can compute the norm and dot product faster.
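That sparse idea can be sketched with Python dicts holding only the non-zero weights (illustrative code, not part of the original answer; the Java answer below does the same thing with Maps):

```python
import math

def sparse_cossim(a, b):
    # a, b: dicts mapping term -> weight, storing only non-zero entries
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict for the dot product
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

v1 = {"dog": 0.5, "cat": 0.5}
v2 = {"cat": 0.5, "elephant": 0.5}
print(sparse_cossim(v1, v2))  # prints 0.5
```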
Answered by yura
A simple Java implementation:
static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // Sets.newHashSet is from Google Guava; new HashSet<>(v1.keySet()) also works
    Set<String> both = Sets.newHashSet(v1.keySet());
    both.retainAll(v2.keySet());
    double scalar = 0, norm1 = 0, norm2 = 0;
    for (String k : both) scalar += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return scalar / Math.sqrt(norm1 * norm2);
}
Answered by Yogesh Yadav
1) Calculate tf-idf (generally better than tf alone, but it depends entirely on your data set and requirements).
From Wikipedia (regarding idf):
An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
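To make that concrete, a minimal hand-rolled sketch (using the plain log(N/df) idf variant; real libraries add smoothing) for the question's two documents:

```python
import math

docs = [["dog", "cat"], ["cat", "elephant"]]
vocab = sorted({t for d in docs for t in d})  # ['cat', 'dog', 'elephant']

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

tfidf = [[tf(t, d) * idf(t, docs) for t in vocab] for d in docs]
# "cat" appears in every document, so its idf is log(2/2) = 0
# and its weight vanishes; the rarer terms keep positive weight
```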
2) No, it is not important that both documents have the same number of words.
3) Nowadays you can compute tf-idf or cosine similarity in any language by calling a machine learning library function. I prefer Python.
Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# example dataset
from sklearn.datasets import fetch_20newsgroups

# replace with your method to get data
example_data = fetch_20newsgroups(subset='all').data

max_features_for_tfidf = 10000
is_idf = True

vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                             min_df=2, stop_words='english',
                             use_idf=is_idf)
X_Mat = vectorizer.fit_transform(example_data)

# calculate cosine similarity between samples in X with samples in Y
cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)
4) You might be interested in truncated Singular Value Decomposition (SVD).
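A hedged sketch of what truncated SVD does, hand-rolled with NumPy for illustration (in practice you would reach for a library implementation such as scikit-learn's TruncatedSVD):

```python
import numpy as np

def truncated_svd(X, k):
    # full SVD, then keep only the k largest singular triplets
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # rows of X projected into a k-dimensional space

rng = np.random.default_rng(0)
X = rng.random((6, 4))       # e.g. a small dense tf-idf matrix
Z = truncated_svd(X, k=2)
print(Z.shape)               # prints (6, 2)
```

Cosine similarity can then be computed between the reduced k-dimensional rows, which is cheaper and often less noisy than working in the full term space.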

