Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/667389/

Time: 2020-10-29 13:19:00  Source: igfitidea

Get term frequencies in Lucene

java full-text-search lucene

Asked by Ilija

Is there a fast and easy way of getting term frequencies from a Lucene index, without doing it through the TermVectorFrequencies class, since that takes an awful lot of time for large collections?

What I mean is, is there something like TermEnum which has not just the document frequency but the term frequency as well?

UPDATE: Using TermDocs is way too slow.

Answered by erickson

Use TermDocs to get the term frequency for a given document. Like the document frequency, you get the term documents from an IndexReader, using the term of interest.

You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.

If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order (skips are alright, but you can't jump back and forth in the document list efficiently).
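
To make the "iterate in order, skips are alright" advice concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API: the class names PostingsCursor and PostingsDemo, the freqsFor helper and the sample numbers are all made up for illustration. Only the method names next, skipTo, doc and freq are borrowed from TermDocs. The assumption is that one term's postings are stored as document IDs in ascending order, each paired with a frequency.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (NOT Lucene's API) of one term's postings: document IDs in
 * ascending order, each paired with the term frequency in that document.
 */
final class PostingsCursor {
    private final int[] docs;   // ascending document IDs
    private final int[] freqs;  // freqs[i] = occurrences of the term in docs[i]
    private int pos = -1;       // index of the current posting; -1 before the first next()

    PostingsCursor(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have the same length");
        }
        this.docs = docs;
        this.freqs = freqs;
    }

    /** Advances to the next posting; returns false once the list is exhausted. */
    boolean next() {
        if (pos + 1 >= docs.length) {
            pos = docs.length;
            return false;
        }
        pos++;
        return true;
    }

    /** Moves forward to the first later posting whose document ID is >= target. */
    boolean skipTo(int target) {
        while (next()) {
            if (docs[pos] >= target) {
                return true;
            }
        }
        return false;
    }

    int doc()  { return docs[pos]; }
    int freq() { return freqs[pos]; }
}

final class PostingsDemo {
    /** Term frequency for each wanted document; wantedDocs must be ascending. */
    static List<Integer> freqsFor(PostingsCursor cursor, int[] wantedDocs) {
        List<Integer> result = new ArrayList<>();
        boolean hasCurrent = cursor.next();  // position on the first posting, if any
        for (int wanted : wantedDocs) {
            // Only ever move forward: skip ahead when the cursor is behind the wanted doc.
            if (hasCurrent && cursor.doc() < wanted) {
                hasCurrent = cursor.skipTo(wanted);
            }
            boolean match = hasCurrent && cursor.doc() == wanted;
            result.add(match ? cursor.freq() : 0);  // 0 when the term is absent from the doc
        }
        return result;
    }

    public static void main(String[] args) {
        PostingsCursor cursor = new PostingsCursor(new int[]{1, 4, 7, 9}, new int[]{3, 1, 2, 5});
        System.out.println(freqsFor(cursor, new int[]{0, 4, 5, 9, 12}));  // prints [0, 1, 0, 5, 0]
    }
}
```

Because the wanted documents are visited in ascending order, the cursor makes a single forward pass over the postings. Asking for document 9 and then document 4 would need a fresh cursor, which is the "jumping back and forth" cost mentioned above.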

Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally I would look for a better algorithm to achieve my objective, or provide better hardware—lots of memory, either to hold a RAMDirectory, or to give to the OS for use on its own file-caching system.

Answered by Michael McCandless

The trunk version of Lucene (to be 4.0, eventually) now exposes the totalTermFreq() for each term from the TermsEnum. This is the total number of times this term appeared in all content (but, like docFreq, does not take into account deletions).
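
As a rough illustration of how these two statistics relate, here is a small sketch in plain Java. It is a toy model rather than Lucene code: TermStatsSketch, liveTotalTermFreq and the sample data are hypothetical names and values. The assumption is that each posting is a (document, frequency) pair, so docFreq is the number of pairs and totalTermFreq is the sum of the frequencies. The third method shows what taking deletions into account would cost: a walk over every posting.

```java
import java.util.Collections;
import java.util.Set;

/** Toy model (NOT Lucene's API) of per-term statistics over a postings list. */
final class TermStatsSketch {
    /** Number of documents containing the term: one posting per document. */
    static int docFreq(int[] freqs) {
        return freqs.length;
    }

    /** Total occurrences of the term: the sum of the per-document frequencies. */
    static long totalTermFreq(int[] freqs) {
        long total = 0;
        for (int f : freqs) {
            total += f;
        }
        return total;
    }

    /** Same sum restricted to live documents; this has to visit every posting. */
    static long liveTotalTermFreq(int[] docs, int[] freqs, Set<Integer> deletedDocs) {
        long total = 0;
        for (int i = 0; i < docs.length; i++) {
            if (!deletedDocs.contains(docs[i])) {
                total += freqs[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docs = {1, 4, 7, 9};
        int[] freqs = {3, 1, 2, 5};
        System.out.println(docFreq(freqs));        // prints 4
        System.out.println(totalTermFreq(freqs));  // prints 11
        // Document 7 is marked deleted, so its 2 occurrences are excluded.
        System.out.println(liveTotalTermFreq(docs, freqs, Collections.singleton(7)));  // prints 9
    }
}
```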

Answered by Kai Chan

TermDocs gives the TF of a given term in each document that contains the term. You can get the DF by iterating through each <document, frequency> pair and counting the number of pairs, although TermEnums should be faster. IndexReader has a termDocs(Term) method that returns a TermDocs for the given Term and index.
