Tokenize, remove stop words using Lucene with Java

Note: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17625385/
Asked by whyname
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:
public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}
My main looks like this:
String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;
Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));
This is giving me an error but I can't figure out why.
Answered by user692704
I had the same problem. To remove stop words with Lucene, you can either use its default stop set via EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop-word list.
The code below shows a corrected version of your removeStopWords():
public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    tokenStream.end();   // release resources once the stream is fully consumed
    tokenStream.close();
    return sb.toString();
}
To use a custom list of stop words, use the following:
//CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); //this is Lucene set
final List<String> stop_Words = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
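Lucene aside, the core tokenize-and-filter idea can be illustrated with plain Java. This is only a minimal sketch under the assumption of whitespace-separated input; Lucene's analyzers additionally handle punctuation, lowercasing, and Unicode, so it is not a substitute for them:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class SimpleStopFilter {
    // Split text on whitespace and drop tokens found in the stop set
    // (comparison is case-insensitive via toLowerCase()).
    static String removeStopWords(String text, Set<String> stopWords) {
        StringJoiner result = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            if (!stopWords.contains(token.toLowerCase())) {
                result.add(token);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "an", "i"));
        System.out.println(removeStopWords("the quick brown fox", stops));
        // prints: quick brown fox
    }
}
```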
Answered by user3370153
You may try calling tokenStream.reset() before calling tokenStream.incrementToken().