java - tf*idf implementation?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/10210856/

Tags: java, relevance, tf-idf

Asked by Aravind Chinta

I am creating a search engine and want to implement tf*idf to rank my XML documents against a search query. How do I implement it? Where do I start? Any help is appreciated.

Answered by daveb

I did this in the past, and I used Lucene to get the TF*IDF data.

It took a fair amount of fiddling around though, so if people know of other solutions that are easier, use them.

Start by looking at TermFreqVector and other classes in org.apache.lucene.index.

Answered by W.P. McNeill

tfidf is a standalone Java package that calculates Tf-Idf.
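
Independent of any library, the basic tf*idf computation is small enough to sketch in plain Java. The following is a minimal illustration, not the API of the tfidf package or of any other library mentioned in this thread. The class name TfIdfSketch, its method names, the whitespace-style tokenizer, and the weighting choices (tf as count divided by document length, idf as the natural log of N/df, a query score as the sum of tf*idf over query terms) are all assumptions made for this example; other tf*idf variants exist.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of tf*idf ranking; names and weighting are illustrative assumptions.
public class TfIdfSketch {

    // Lowercases the text and splits on non-word characters.
    // A real search engine would also strip XML markup and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: log(N / df), or 0 if no document contains the term.
    static double idf(List<List<String>> docs, String term) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) docs.size() / df);
    }

    // Score of one document for a query: sum of tf * idf over the query terms.
    static double score(List<List<String>> docs, List<String> doc, List<String> query) {
        double sum = 0.0;
        for (String term : query) {
            sum += tf(doc, term) * idf(docs, term);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("the quick brown fox"));
        docs.add(tokenize("the lazy dog"));
        docs.add(tokenize("the fox jumps over the lazy dog"));

        List<String> query = tokenize("quick fox");
        for (int i = 0; i < docs.size(); i++) {
            System.out.printf("doc %d: %.4f%n", i, score(docs, docs.get(i), query));
        }
    }
}
```

With these sample documents, the first document scores highest for the query "quick fox", the third scores lower, and the second scores zero because it contains neither query term. A word such as "the" that appears in every document gets an idf of zero and so contributes nothing to any score. This sketch rescans the corpus for every term; for a real collection you would precompute document frequencies, for example in an inverted index.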

Answered by shark8me

It is surprising that the Weka library hasn't been mentioned here. Weka's StringToWordVector class implements TF-IDF.

Answered by Sridhar Sarnobat

Apache Mahout:

https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/vectorizer/TFIDF.java

I believe it requires a Hadoop File System, which is a bit of extra work. But it works great.
