用于从输入文本中提取关键字的 Java 库
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17447045/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Java library for keywords extraction from input text
提问by Shay
I'm looking for a Java library to extract keywords from a block of text.
我正在寻找一个 Java 库来从文本块中提取关键字。
The process should be as follows:
该过程应如下所示:
stop word cleaning -> stemming -> searching for keywords based on English linguistics statistical information - meaning if a word appears more times in the text than in the English language in terms of probability than it's a keyword candidate.
停止词清理 -> 词干提取 -> 根据英语语言学统计信息搜索关键字 - 意思是如果一个词在文本中出现的次数比在英语中出现的次数多于它的候选关键词。
Is there a library that performs this task?
是否有执行此任务的库?
回答by sp00m
Here is a possible solution using Apache Lucene. I didn't use the last version but the 3.6.2 one, since this is the one I know the best. Besides the /lucene-core-x.x.x.jar
, don't forget to add the /contrib/analyzers/common/lucene-analyzers-x.x.x.jar
from the downloaded archive to your project: it contains the language-specific analyzers (especially the English one in your case).
这是使用Apache Lucene的可能解决方案。我没有使用上一个版本,而是使用了3.6.2版本,因为这是我最了解的版本。除了/lucene-core-x.x.x.jar
,不要忘记/contrib/analyzers/common/lucene-analyzers-x.x.x.jar
从下载的档案中添加到您的项目:它包含特定于语言的分析器(尤其是在您的情况下的英语分析器)。
Note that this will onlyfind the frequencies of the input text words based on their respective stem. Comparing these frequencies with the English language statistics shall be done afterwards (this answermay help by the way).
请注意,这只会根据它们各自的词干找到输入文本单词的频率。之后将这些频率与英语统计数据进行比较(这个答案可能会有所帮助)。
The data model
数据模型
One keyword for one stem. Different words may have the same stem, hence the terms
set. The keyword frequency is incremented every time a new term is found (even if it has been already found - a set automatically removes duplicates).
一字一字。不同的词可能具有相同的词干,因此是terms
集合。每次找到新术语时,关键字频率都会增加(即使已经找到了 - 一组会自动删除重复项)。
public class Keyword implements Comparable<Keyword> {
private final String stem;
private final Set<String> terms = new HashSet<String>();
private int frequency = 0;
public Keyword(String stem) {
this.stem = stem;
}
public void add(String term) {
terms.add(term);
frequency++;
}
@Override
public int compareTo(Keyword o) {
// descending order
return Integer.valueOf(o.frequency).compareTo(frequency);
}
@Override
public boolean equals(Object obj) {
if (this == obj) {
return true;
} else if (!(obj instanceof Keyword)) {
return false;
} else {
return stem.equals(((Keyword) obj).stem);
}
}
@Override
public int hashCode() {
return Arrays.hashCode(new Object[] { stem });
}
public String getStem() {
return stem;
}
public Set<String> getTerms() {
return terms;
}
public int getFrequency() {
return frequency;
}
}
Utilities
公用事业
To stem a word:
词干:
public static String stem(String term) throws IOException {
TokenStream tokenStream = null;
try {
// tokenize
tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
// stem
tokenStream = new PorterStemFilter(tokenStream);
// add each token in a set, so that duplicates are removed
Set<String> stems = new HashSet<String>();
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
stems.add(token.toString());
}
// if no stem or 2+ stems have been found, return null
if (stems.size() != 1) {
return null;
}
String stem = stems.iterator().next();
// if the stem has non-alphanumerical chars, return null
if (!stem.matches("[a-zA-Z0-9-]+")) {
return null;
}
return stem;
} finally {
if (tokenStream != null) {
tokenStream.close();
}
}
}
To search into a collection (will be used by the list of potential keywords):
要搜索集合(将由潜在关键字列表使用):
public static <T> T find(Collection<T> collection, T example) {
for (T element : collection) {
if (element.equals(example)) {
return element;
}
}
collection.add(example);
return example;
}
Core
核
Here is the main input method:
下面是主要的输入法:
public static List<Keyword> guessFromString(String input) throws IOException {
TokenStream tokenStream = null;
try {
// hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
input = input.replaceAll("-+", "-0");
// replace any punctuation char but apostrophes and dashes by a space
input = input.replaceAll("[\p{Punct}&&[^'-]]+", " ");
// replace most common english contractions
input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\b", "");
// tokenize input
tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
// to lowercase
tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
// remove dots from acronyms (and "'s" but already done manually above)
tokenStream = new ClassicFilter(tokenStream);
// convert any char to ASCII
tokenStream = new ASCIIFoldingFilter(tokenStream);
// remove english stop words
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());
List<Keyword> keywords = new LinkedList<Keyword>();
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
String term = token.toString();
// stem each term
String stem = stem(term);
if (stem != null) {
// create the keyword or get the existing one if any
Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
// add its corresponding initial token
keyword.add(term.replaceAll("-0", "-"));
}
}
// reverse sort by frequency
Collections.sort(keywords);
return keywords;
} finally {
if (tokenStream != null) {
tokenStream.close();
}
}
}
Example
例子
Using the guessFromString
method on the Java wikipedia article introduction part, here are the first 10 most frequent keywords (i.e. stems) that were found:
使用guessFromString
的方法的Java Wikipedia文章引言部分,这里是第10个最常见的关键字(即茎)中发现:
java x12 [java]
compil x5 [compiled, compiler, compilers]
sun x5 [sun]
develop x4 [developed, developers]
languag x3 [languages, language]
implement x3 [implementation, implementations]
applic x3 [application, applications]
run x3 [run]
origin x3 [originally, original]
gnu x3 [gnu]
Iterate over the output list to know which were the original found wordsfor each stem by getting the terms
sets (displayed between brackets [...]
in the above example).
通过获取集合(在上面的示例中显示在括号之间),迭代输出列表以了解每个词干的原始发现单词。terms
[...]
What's next
下一步是什么
Compare the stem frequency / frequencies sumratios with the English language statistics ones, and keep me in the loop if your managed it: I could be quite interested too :)
将词干频率/频率总和比率与英语统计数据进行比较,如果您管理它,请让我参与其中:我也可能非常感兴趣:)
回答by Mike B.
An updated and ready-to-use version of the code proposed above.
This code is compatible with Apache Lucene
5.x…6.x.
上面建议的代码的更新和随时可用的版本。
此代码与Apache Lucene
5.x…6.x兼容。
CardKeywordclass:
CardKeyword类:
import java.util.HashSet;
import java.util.Set;
/**
* Keyword card with stem form, terms dictionary and frequency rank
*/
class CardKeyword implements Comparable<CardKeyword> {
/**
* Stem form of the keyword
*/
private final String stem;
/**
* Terms dictionary
*/
private final Set<String> terms = new HashSet<>();
/**
* Frequency rank
*/
private int frequency;
/**
* Build keyword card with stem form
*
* @param stem
*/
public CardKeyword(String stem) {
this.stem = stem;
}
/**
* Add term to the dictionary and update its frequency rank
*
* @param term
*/
public void add(String term) {
this.terms.add(term);
this.frequency++;
}
/**
* Compare two keywords by frequency rank
*
* @param keyword
* @return int, which contains comparison results
*/
@Override
public int compareTo(CardKeyword keyword) {
return Integer.valueOf(keyword.frequency).compareTo(this.frequency);
}
/**
* Get stem's hashcode
*
* @return int, which contains stem's hashcode
*/
@Override
public int hashCode() {
return this.getStem().hashCode();
}
/**
* Check if two stems are equal
*
* @param o
* @return boolean, true if two stems are equal
*/
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (!(o instanceof CardKeyword)) return false;
CardKeyword that = (CardKeyword) o;
return this.getStem().equals(that.getStem());
}
/**
* Get stem form of keyword
*
* @return String, which contains getStemForm form
*/
public String getStem() {
return this.stem;
}
/**
* Get terms dictionary of the stem
*
* @return Set<String>, which contains set of terms of the getStemForm
*/
public Set<String> getTerms() {
return this.terms;
}
/**
* Get stem frequency rank
*
* @return int, which contains getStemForm frequency
*/
public int getFrequency() {
return this.frequency;
}
}
KeywordsExtractorclass:
关键字提取器类:
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.IOException;
import java.io.StringReader;
import java.util.*;
/**
* Keywords extractor functionality handler
*/
class KeywordsExtractor {
/**
* Get list of keywords with stem form, frequency rank, and terms dictionary
*
* @param fullText
* @return List<CardKeyword>, which contains keywords cards
* @throws IOException
*/
static List<CardKeyword> getKeywordsList(String fullText) throws IOException {
TokenStream tokenStream = null;
try {
// treat the dashed words, don't let separate them during the processing
fullText = fullText.replaceAll("-+", "-0");
// replace any punctuation char but apostrophes and dashes with a space
fullText = fullText.replaceAll("[\p{Punct}&&[^'-]]+", " ");
// replace most common English contractions
fullText = fullText.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\b", "");
StandardTokenizer stdToken = new StandardTokenizer();
stdToken.setReader(new StringReader(fullText));
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();
List<CardKeyword> cardKeywords = new LinkedList<>();
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
while (tokenStream.incrementToken()) {
String term = token.toString();
String stem = getStemForm(term);
if (stem != null) {
CardKeyword cardKeyword = find(cardKeywords, new CardKeyword(stem.replaceAll("-0", "-")));
// treat the dashed words back, let look them pretty
cardKeyword.add(term.replaceAll("-0", "-"));
}
}
// reverse sort by frequency
Collections.sort(cardKeywords);
return cardKeywords;
} finally {
if (tokenStream != null) {
try {
tokenStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
/**
* Get stem form of the term
*
* @param term
* @return String, which contains the stemmed form of the term
* @throws IOException
*/
private static String getStemForm(String term) throws IOException {
TokenStream tokenStream = null;
try {
StandardTokenizer stdToken = new StandardTokenizer();
stdToken.setReader(new StringReader(term));
tokenStream = new PorterStemFilter(stdToken);
tokenStream.reset();
// eliminate duplicate tokens by adding them to a set
Set<String> stems = new HashSet<>();
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
while (tokenStream.incrementToken()) {
stems.add(token.toString());
}
// if stem form was not found or more than 2 stems have been found, return null
if (stems.size() != 1) {
return null;
}
String stem = stems.iterator().next();
// if the stem form has non-alphanumerical chars, return null
if (!stem.matches("[a-zA-Z0-9-]+")) {
return null;
}
return stem;
} finally {
if (tokenStream != null) {
try {
tokenStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
/**
* Find sample in collection
*
* @param collection
* @param sample
* @param <T>
* @return <T> T, which contains the found object within collection if exists, otherwise the initially searched object
*/
private static <T> T find(Collection<T> collection, T sample) {
for (T element : collection) {
if (element.equals(sample)) {
return element;
}
}
collection.add(sample);
return sample;
}
}
The call of function:
函数调用:
String text = "…";
List<CardKeyword> keywordsList = KeywordsExtractor.getKeywordsList(text);