Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but any reuse must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/1664489/

Date: 2020-08-12 18:52:35  Source: igfitidea

Tokenizer, Stop Word Removal, Stemming in Java

Tags: java, tokenize, stemming, stop-words

Asked by Phil

I am looking for a class or method that takes a long string of many hundreds of words, tokenizes it, removes the stop words, and stems the remaining tokens for use in an IR system.

For example:

"The big fat cat, said 'your funniest guy i know' to the kangaroo..."

the tokenizer would remove the punctuation and return an ArrayList of words

the stop word remover would remove words like "the", "to", etc.

the stemmer would reduce each word to its 'root'; for example, 'funniest' would become 'funny'

Many thanks in advance.

Accepted answer by jitter

AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop word removal. In combination with the Lucene contrib-snowball project (which includes work from Snowball) you can do the stemming too.

But for stemming, also consider this answer: Stemming algorithm that produces real words

Answered by peter.murray.rust

These are standard requirements in Natural Language Processing so I would look in such toolkits. Since you require Java I'd start with OpenNLP: http://opennlp.sourceforge.net/

If you can look at other languages, there is also NLTK (Python).

Note that "your funniest guy i know" is not standard syntax and this makes it harder to process than "You're the funniest guy I know". Not impossible, but much harder. I don't know of any system that would equate "your" to "you are".

Answered by msha

Here is a comprehensive list of NLP tools. Sometimes it makes sense to build these yourself, as they will be lighter-weight and you will have more control over the inner workings: use a simple regular expression for tokenization. For stop words, just push the list below (or some other list) into a HashSet:

common-english-words.txt

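The regex-plus-HashSet approach described above can be sketched as follows. This is a minimal illustration, not production code: the stop-word set here is a tiny hand-picked subset standing in for the full common-english-words.txt list, and the class and method names are my own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimplePipeline {

    // Tiny illustrative subset; in practice, load common-english-words.txt into this set.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "to", "a", "an", "i", "said", "your"));

    // Tokenize with a simple regular expression: lowercase the text,
    // split on any run of non-letter characters, and drop empty tokens.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stop-word removal: keep only the tokens that are not in the set.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(
                "The big fat cat, said 'your funniest guy i know' to the kangaroo...");
        System.out.println(removeStopWords(tokens));
        // Prints: [big, fat, cat, funniest, guy, know, kangaroo]
    }
}
```

The HashSet gives O(1) stop-word lookups, which is why a set is the right structure here rather than a list.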
Here is one of many Java implementations of the Porter stemmer.

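To give a feel for what a stemmer does, here is a crude suffix-stripping sketch of my own. It is emphatically NOT the real Porter algorithm (which has measure-based conditions and several rule phases); it just applies the first matching rule from a handful of hard-coded ones:

```java
public class CrudeStemmer {

    // Apply the first matching suffix rule and return; otherwise return the word unchanged.
    // The length checks are a rough guard against mangling short words like "red" or "is".
    public static String stem(String word) {
        if (word.endsWith("iest")) {
            return word.substring(0, word.length() - 4) + "y"; // funniest -> funny
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 3) + "y"; // ponies -> pony
        }
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);       // jumping -> jump
        }
        if (word.endsWith("ed") && word.length() > 4) {
            return word.substring(0, word.length() - 2);       // jumped -> jump
        }
        if (word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);       // cats -> cat
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("funniest")); // funny
    }
}
```

For anything beyond a toy, use the linked Porter implementation or Lucene's Snowball support instead; naive rules like these overstem and understem constantly.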
Answered by demongolem

I have dealt with this issue on a number of tasks I have worked on, so let me give a tokenizer suggestion. As I do not see it given directly as an answer, I often use edu.northwestern.at.utils.corpuslinguistics.tokenizer.* as my family of tokenizers. In a number of cases I used the PennTreebankTokenizer class. Here is how you use it:

    WordTokenizer wordTokenizer = new PennTreebankTokenizer();
    List<String> words = wordTokenizer.extractWords(text);

The link to this work is here. Just a disclaimer: I have no affiliation with Northwestern, the group, or the work they do. I am just someone who uses the code occasionally.
