Java: get highest frequency terms from a Lucene index

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2821903/

Time: 2020-10-29 23:00:02  Source: igfitidea

Get highest frequency terms from Lucene index

java lucene full-text-search indexing frequency

Asked by Julia

I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.

So, I want to get maybe the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.

So for the proposed solutions, (needless to say, maybe) speed is not important, since I would do static analysis. I would put the accent on simplicity of implementation, because I'm not so skilled with Lucene and can't wrap my mind around some of its concepts.

I cannot find any code samples for anything similar, so I would appreciate any concrete advice (code, pseudocode, links to code samples...).

Thank you!

Accepted answer by mindas

Have a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html

The class on this page has a computeTopTermQuery method, which you should easily be able to retrofit to go over multiple indexes.
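
For the "multiple indexes" part, here is a rough sketch of just the merging step, in plain Java with no Lucene calls. It assumes you have already read a term-to-count map out of each index (how you do that depends on your Lucene version, so it is left out here). The class name `TopTermsAcrossIndexes`, the `TermTotal` holder and the `topTerms` method are made up for this illustration; they are not from the linked blog post or from Lucene. Recent Lucene versions also ship a `HighFreqTerms` utility in the misc module that may cover the single-index case; check the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTermsAcrossIndexes {

    /** Hypothetical holder: a term, its summed count, and its count in each index. */
    public static final class TermTotal {
        public final String term;
        public final long total;
        public final Map<String, Long> perIndex;

        TermTotal(String term, long total, Map<String, Long> perIndex) {
            this.term = term;
            this.total = total;
            this.perIndex = perIndex;
        }
    }

    /**
     * countsByIndex maps an index name to that index's term -> count map
     * (assumed to have been extracted from Lucene beforehand).
     * Returns at most n terms, highest summed count first; ties are broken alphabetically.
     */
    public static List<TermTotal> topTerms(Map<String, Map<String, Long>> countsByIndex, int n) {
        Map<String, Long> totals = new HashMap<>();
        Map<String, Map<String, Long>> perTerm = new HashMap<>();

        // Sum each term's counts across indexes, remembering the per-index contribution.
        for (Map.Entry<String, Map<String, Long>> index : countsByIndex.entrySet()) {
            for (Map.Entry<String, Long> e : index.getValue().entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
                perTerm.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>())
                       .put(index.getKey(), e.getValue());
            }
        }

        // Orders the weakest candidate first: lowest total, then alphabetically last.
        Comparator<TermTotal> weakestFirst = (a, b) -> {
            int byTotal = Long.compare(a.total, b.total);
            return byTotal != 0 ? byTotal : b.term.compareTo(a.term);
        };

        // Min-heap capped at n entries: whenever it overflows, drop the weakest.
        PriorityQueue<TermTotal> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            heap.add(new TermTotal(e.getKey(), e.getValue(), perTerm.get(e.getKey())));
            if (heap.size() > n) {
                heap.poll();
            }
        }

        // Sort the survivors strongest first.
        List<TermTotal> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }
}
```

Calling `topTerms(countsByIndex, 30)` would give the top 30 terms by summed count, and each result's `perIndex` map holds the per-index counts asked about in the question. Since speed is not a concern here, sorting all the totals would work just as well; the capped heap only avoids keeping every term sorted.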

Answer by Pascal Dimassimo

A very simple way would be to use Luke. On the 'Overview' tab, there is a 'Show top terms' button that can be used for what you need.