Tokenize, remove stop words using Lucene with Java

Note: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17625385/
Asked by whyname
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:
public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}
My main looks like this:
String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;
Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));
This is giving me an error but I can't figure out why.
Answered by user692704
I had the same problem. To remove stop words with Lucene, you can either use its default stop set via EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop-word list.
The code below shows a corrected version of your removeStopWords():
public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    tokenStream.end();   // release resources once the stream is fully consumed
    tokenStream.close();
    return sb.toString();
}
To use a custom list of stop words, use the following:
//CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); //this is Lucene set
final List<String> stop_Words = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
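Lucene aside, the core tokenize-and-filter idea can be illustrated with plain Java. This is only a minimal sketch under the assumption of whitespace-separated input; Lucene's analyzers additionally handle punctuation, lowercasing, and Unicode, so it is not a substitute for them:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class SimpleStopFilter {
    // Split text on whitespace and drop tokens found in the stop set
    // (comparison is case-insensitive via toLowerCase()).
    static String removeStopWords(String text, Set<String> stopWords) {
        StringJoiner result = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            if (!stopWords.contains(token.toLowerCase())) {
                result.add(token);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "an", "i"));
        System.out.println(removeStopWords("the quick brown fox", stops));
        // prints: quick brown fox
    }
}
```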
Answered by user3370153
You may try calling tokenStream.reset() before calling tokenStream.incrementToken().