java: Getting the highest-frequency terms from a Lucene index

Question

Asked by Julia

I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.

So I want to get maybe the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.

So for the proposed solutions, (needless to say, maybe) speed is not important, since I would be doing static analysis. I would put the emphasis on simplicity of implementation, because I'm not very skilled with Lucene and can't wrap my mind around some of its concepts.

I cannot find any code samples for anything similar, so I would appreciate any concrete advice (code, pseudocode, links to code samples...).

Thank you!

Answer 1

Accepted answer by mindas

Have a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html

The class on that page has a computeTopTermQuery method, which you should easily be able to retrofit to go over multiple indexes.
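
Going over multiple indexes mostly comes down to merging the per-index counts once you have them. Here is a minimal sketch of that merging step only, in plain Java with no Lucene calls. It assumes you have already read a term-to-count map out of each index by whatever means you prefer; the names TopTermsMerger, TermCount and topTerms, and the index names in main, are made up for illustration and do not come from the linked page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: merges per-index term counts and keeps the top N terms. */
public class TopTermsMerger {

    /** A term with its summed count and its count in each index. */
    public static final class TermCount {
        public final String term;
        public final long total;
        public final Map<String, Long> perIndex; // index name -> count

        TermCount(String term, long total, Map<String, Long> perIndex) {
            this.term = term;
            this.total = total;
            this.perIndex = perIndex;
        }

        @Override
        public String toString() {
            return term + " total=" + total + " perIndex=" + perIndex;
        }
    }

    /**
     * @param countsByIndex index name -> (term -> count), assumed to be
     *                      extracted from each Lucene index beforehand
     * @param n             how many top terms to keep (e.g. 30)
     */
    public static List<TermCount> topTerms(Map<String, Map<String, Long>> countsByIndex, int n) {
        // Regroup the input as term -> (index name -> count).
        Map<String, Map<String, Long>> byTerm = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> index : countsByIndex.entrySet()) {
            for (Map.Entry<String, Long> e : index.getValue().entrySet()) {
                byTerm.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>())
                      .merge(index.getKey(), e.getValue(), Long::sum);
            }
        }

        // Sum the per-index counts to get each term's overall count.
        List<TermCount> all = new ArrayList<>();
        for (Map.Entry<String, Map<String, Long>> e : byTerm.entrySet()) {
            long total = 0;
            for (long c : e.getValue().values()) {
                total += c;
            }
            all.add(new TermCount(e.getKey(), total, e.getValue()));
        }

        // Highest total first; ties are broken alphabetically by term.
        all.sort(Comparator.comparingLong((TermCount t) -> t.total).reversed()
                .thenComparing(t -> t.term));

        int limit = Math.max(0, Math.min(n, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }

    public static void main(String[] args) {
        // Made-up sample counts standing in for two real indexes.
        Map<String, Long> a = new LinkedHashMap<>();
        a.put("lucene", 5L);
        a.put("index", 3L);
        Map<String, Long> b = new LinkedHashMap<>();
        b.put("lucene", 2L);
        b.put("query", 6L);

        Map<String, Map<String, Long>> countsByIndex = new LinkedHashMap<>();
        countsByIndex.put("indexA", a);
        countsByIndex.put("indexB", b);

        for (TermCount t : topTerms(countsByIndex, 2)) {
            System.out.println(t);
        }
    }
}
```

Because the per-index counts are kept next to the total, you can change the threshold later (30 or otherwise) without going back to the indexes.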

Answer 2

Answered by Pascal Dimassimo

A very simple way would be to use Luke. On the 'Overview' tab, there is a 'Show top terms' button that can be used for what you need.
