Any tutorial or code for Tf Idf in java

Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and note the original source: StackOverFlow.
Original question: http://stackoverflow.com/questions/1960333/
Asked by user238384
I am looking for a simple java class that can compute tf-idf. I want to do a similarity test on 2 documents. I found many BIG APIs that use a tf-idf class, but I do not want to use a big jar file just to do my simple test. Please help! Or at least, if someone can tell me how to find TF and IDF, I will calculate the results :) Or if you can point me to a good java tutorial for this. Please do not tell me to search Google; I already did for 3 days and couldn't find anything :( Please also do not refer me to Lucene :(
Answered by danben
Term Frequency is the square root of the number of times a term occurs in a particular document.
Inverse Document Frequency is (the log of (the total number of documents divided by the number of documents containing the term)) plus one in case the term occurs zero times -- if it does, obviously don't try to divide by zero.
If it isn't clear from that answer, there is a TF per term per document, and an IDF per term.
And then TF-IDF(term, document) = TF(term, document) * IDF(term)
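A minimal sketch of these definitions in plain Java might look like the following; the class and method names are mine, and `rawCount`, `totalDocs`, and `docsContainingTerm` are illustrative parameters, not from any library:

```java
public class TfIdf {

    // TF: square root of the number of times the term occurs in the document.
    static double tf(int rawCount) {
        return Math.sqrt(rawCount);
    }

    // IDF: log(totalDocs / docsContainingTerm) + 1, guarding against a
    // division by zero when no document contains the term.
    static double idf(int totalDocs, int docsContainingTerm) {
        if (docsContainingTerm == 0) {
            return 1.0; // the term occurs nowhere; skip the division entirely
        }
        return Math.log((double) totalDocs / docsContainingTerm) + 1.0;
    }

    // TF-IDF(term, document) = TF(term, document) * IDF(term)
    static double tfIdf(int rawCount, int totalDocs, int docsContainingTerm) {
        return tf(rawCount) * idf(totalDocs, docsContainingTerm);
    }
}
```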
Finally, you use the vector space model to compare documents, where each term is a new dimension and the "length" of the part of the vector pointing in that dimension is the TF-IDF calculation. Each document is a vector, so compute the two vectors and then compute the distance between them.
So to do this in Java, read the file in, one line at a time, with a FileReader or something, and split on spaces or whatever other delimiters you want to use - each word is a term. Count the number of times each term appears in each file, and the number of files each term appears in. Then you have everything you need to do the above calculations.
And since I have nothing else to do, I looked up the vector distance formula. Here you go:
D = sqrt((x2-x1)^2 + (y2-y1)^2 + ... + (n2-n1)^2)
For this purpose, x1 is the TF-IDF for term x in document 1.
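A sketch of that distance computation, assuming each document's TF-IDF weights are kept in a `Map<String, Double>` keyed by term (a representation I am choosing for illustration; a term absent from a document gets weight 0):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorDistance {

    // Euclidean distance between two documents represented as term -> TF-IDF maps.
    static double distance(Map<String, Double> doc1, Map<String, Double> doc2) {
        // Every term that appears in either document is a dimension.
        Set<String> allTerms = new HashSet<>(doc1.keySet());
        allTerms.addAll(doc2.keySet());
        double sum = 0.0;
        for (String term : allTerms) {
            double diff = doc2.getOrDefault(term, 0.0) - doc1.getOrDefault(term, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```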
Edit: in response to your question about how to count the words in a document:
- Read the file in line by line with a reader, like `new BufferedReader(new FileReader(filename))` - you can call `BufferedReader.readLine()` in a while loop, checking for null each time.
- For each line, call `line.split("\\s")` - that will split your line on whitespace and give you an array of all of the words.
- For each word, add 1 to the word's count for the current document. This could be done using a `HashMap` (see the sketch after this list).
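A sketch putting those steps together, assuming plain text files (the class and method names are mine):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCounter {

    // Returns a map from each word to the number of times it occurs in the file.
    static Map<String, Integer> countWords(String filename) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String line;
            // readLine() returns null at end of file, so check each time.
            while ((line = reader.readLine()) != null) {
                // "\\s+" splits on runs of whitespace; skip any empty tokens.
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum); // add 1 to this word's count
                    }
                }
            }
        }
        return counts;
    }
}
```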
Now, after computing D for each document, you will have X values, where X is the number of documents. Comparing all documents against each other is only X^2 comparisons - that shouldn't take particularly long for 10,000. Remember that two documents are MORE similar if the absolute value of the difference between their D values is lower. So you could compute the difference between the Ds of every pair of documents and store it in a priority queue or some other sorted structure, such that the most similar documents bubble up to the top. Make sense?
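A sketch of that ranking idea using Java's `PriorityQueue`, reusing the hypothetical `VectorDistance.distance` from the earlier sketch; note that this version computes the distance between each pair of vectors directly rather than differencing precomputed per-document values (record types need Java 16+):

```java
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class SimilarityRanker {

    // A pair of document indices plus the distance between their TF-IDF vectors.
    record DocPair(int doc1, int doc2, double distance) {}

    // Orders all pairs so the smallest distance (most similar pair) is at the head.
    static PriorityQueue<DocPair> rankPairs(List<Map<String, Double>> docVectors) {
        PriorityQueue<DocPair> queue =
                new PriorityQueue<>((a, b) -> Double.compare(a.distance(), b.distance()));
        for (int i = 0; i < docVectors.size(); i++) {
            for (int j = i + 1; j < docVectors.size(); j++) {
                double d = VectorDistance.distance(docVectors.get(i), docVectors.get(j));
                queue.add(new DocPair(i, j, d));
            }
        }
        return queue; // queue.poll() yields the most similar pair first
    }
}
```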
Answered by Shashikant Kore
While you specifically asked not to refer to Lucene, please allow me to point you to the exact class. The class you are looking for is DefaultSimilarity. It has an extremely simple API to calculate TF and IDF. See the java code here. Or you could just implement it yourself, as specified in the DefaultSimilarity documentation:
TF = sqrt(freq)
and
IDF = log(numDocs/(docFreq+1)) + 1.
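A direct transcription of those two formulas, if you implement them yourself as suggested (this is not the Lucene API, just the documented math; the class name is mine):

```java
public class LuceneStyleScoring {

    // TF = sqrt(freq)
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // IDF = log(numDocs / (docFreq + 1)) + 1
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }
}
```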
The log and sqrt functions are used to damp the actual values. Using the raw values can skew results dramatically.
Answered by Yuval F
agazerboy, Sujit Pal's blog post gives a thorough description of calculating TF and IDF. With regard to verifying results, I suggest you start with a small corpus (say, 100 documents) so that you can easily see whether you are correct. For 10,000 documents, using Lucene begins to look like a really rational choice.

