Java: getting the highest-frequency terms from a Lucene index
Original question: http://stackoverflow.com/questions/2821903/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Get highest frequency terms from Lucene index
Asked by Julia
I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.
So, I want to get maybe the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.
For the proposed solutions, (needless to say, maybe) speed is not important, since I will be doing static analysis. I would put the emphasis on simplicity of implementation, because I'm not very skilled with Lucene and can't wrap my mind around some of its concepts.
I cannot find any code samples for anything similar, so I would appreciate any concrete advice (code, pseudocode, links to code samples...).
Thank you!
Accepted answer by mindas
Have a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html
The class on that page has a computeTopTermQuery method, which you should easily be able to retrofit to go over multiple indexes.
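As a rough illustration of the "go over multiple indexes" part, here is a minimal sketch in plain Java. It is not the code from the linked blog post and it does not call Lucene at all: it assumes the term counts of each index have already been read into a Map (how you enumerate terms depends on your Lucene version, so that step is left out). The class name TopTerms and the methods mergeCounts and topTerms are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopTerms {

    // Adds up the per-index counts so that each term has one total across all indexes.
    // Assumption: each map holds term -> frequency for one index, filled in elsewhere.
    public static Map<String, Long> mergeCounts(List<Map<String, Integer>> perIndexCounts) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Integer> counts : perIndexCounts) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue().longValue(), Long::sum);
            }
        }
        return totals;
    }

    // Returns up to n terms with the largest totals; ties are broken alphabetically.
    // A full sort is used for simplicity, since speed is not a concern here.
    public static List<String> topTerms(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // descending by total
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for two indexes.
        Map<String, Integer> indexA = new HashMap<>();
        indexA.put("lucene", 5);
        indexA.put("index", 3);
        Map<String, Integer> indexB = new HashMap<>();
        indexB.put("lucene", 2);
        indexB.put("term", 4);

        List<Map<String, Integer>> all = new ArrayList<>();
        all.add(indexA);
        all.add(indexB);

        Map<String, Long> totals = mergeCounts(all);
        for (String term : topTerms(totals, 30)) {
            // Print the total plus the count in each index (0 if the term is absent there).
            System.out.println(term + " total=" + totals.get(term)
                    + " A=" + indexA.getOrDefault(term, 0)
                    + " B=" + indexB.getOrDefault(term, 0));
        }
    }
}
```

Ranking by the summed total and then looking the per-index counts up again keeps the logic simple, which seems to match what the question asks for. If you would rather have a top 30 per index, calling topTerms on each index's own counts (converted to totals via mergeCounts on a one-element list) should work the same way.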

