Original question: http://stackoverflow.com/questions/6122545/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Stop words and stemmer in java
Asked by N00programmer
I'm thinking of adding stop word removal to my similarity program, followed by a stemmer (Porter 1 or 2, depending on which is easiest to implement).
I read my text from files as whole lines and save it as one long string, so suppose I have two strings, e.g.:
String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";
Now that I have those strings:
Stemming: Can I just run the stemmer algorithm directly on the string, save the result as a String, and then continue working on the similarity as I did before adding the stemmer, with something like one.stem()?
Stop words: How does this work? O.o Do I just use one.replaceAll("I", ""), or is there a specific way to handle this process? I want to keep working with a string, and end up with a string, before running the similarity algorithms on it. The wiki doesn't say a lot.
Hope you can help me out! Thanks.
Edit: This is for a school-related project where I'm writing a paper on similarity between different algorithms, so I don't think I'm allowed to use Lucene or other libraries that do the work for me. Plus, I would like to understand how it works before I start using libraries like Lucene and co. Hope it's not too much of a bother ^^
Answer by WhiteFang34
If you're not implementing this for academic reasons, you should consider using the Lucene library. In either case it might be good for reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:
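import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Package locations as of Lucene 3.0; later releases moved some of these classes.
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndStem(String input) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    // Build the chain: tokenize, drop stop words, then stem what is left.
    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    // Join the surviving terms with single spaces.
    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}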
If you use it on your strings like this:
public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}
it yields this output:
decid bui someth from shop
Nevertheless decidedli bought someth from shop
Answer by tucuxi
Yes, you can wrap any stemmer so that you can write something like:
String stemmedString = stemmer.stemAndRemoveStopwords(inputString, stopWordList);
Internally, your stemAndRemoveStopwords would:
- place all stop words in a hash-based Set (or Map) for fast lookup
- initialize an empty StringBuilder to hold the output string
- iterate over all words in the input string, and for each word:
  - search for it among the stop words; if found, continue to the top of the loop
  - otherwise, stem it using your preferred stemmer, and add it to the output string
- return the output string
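The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name StopwordStemSketch, the extra stemmer parameter, and toyStem are all hypothetical, and toyStem is a placeholder suffix stripper, not the Porter algorithm. Words are split on whitespace only, so punctuation stays attached to tokens.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

class StopwordStemSketch {

    // Hypothetical wrapper: drops stop words, stems the rest, rejoins with spaces.
    static String stemAndRemoveStopwords(String input,
                                         Collection<String> stopWordList,
                                         UnaryOperator<String> stemmer) {
        // Copy into a HashSet so each lookup is constant time on average.
        Set<String> stopWords = new HashSet<>(stopWordList);
        StringBuilder out = new StringBuilder();

        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty() || stopWords.contains(word)) {
                continue; // stop word (or empty token): skip it
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stemmer.apply(word));
        }
        return out.toString();
    }

    // Placeholder stemmer (an assumption, NOT Porter): strips a trailing "ed" or "ly".
    static String toyStem(String word) {
        if (word.length() > 4 && (word.endsWith("ed") || word.endsWith("ly"))) {
            return word.substring(0, word.length() - 2);
        }
        return word;
    }

    public static void main(String[] args) {
        String one = "I decided buy something from the shop.";
        // prints: decid buy something from shop.
        System.out.println(stemAndRemoveStopwords(
                one, Arrays.asList("a", "I", "the"), StopwordStemSketch::toyStem));
    }
}
```

Passing the stemmer in as a function keeps the wrapper independent of whichever stemming algorithm you end up implementing; a real Porter implementation could be dropped in where toyStem is used.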
Answer by Eser Aygün
You don't have to deal with the whole text at once. Just split it into words, apply your stop word filter and stemming algorithm to each word, then build the string again using a StringBuilder:
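StringBuilder builder = new StringBuilder(text.length());
String[] words = text.split("\\s+"); // Split on runs of whitespace.
for (String word : words) {
    if (stopwordFilter.check(word)) { // Apply stopword filter (assumed true when the word is kept).
        word = stemmer.stem(word); // Apply stemming algorithm.
        if (builder.length() > 0) {
            builder.append(' '); // Keep the words separated in the output.
        }
        builder.append(word);
    }
}
text = builder.toString();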
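One thing to watch with a plain whitespace split: punctuation and case stay attached to the words, so "shop." and "shop" (or "I" and "i") would not match each other or your stop word list. A minimal sketch of a normalization step you could run before filtering and stemming is shown below; TokenNormalizer and tokenize are hypothetical names, and stripping every non-letter, non-digit character is just one possible policy (it also removes apostrophes, turning "don't" into "dont").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenNormalizer {

    // Splits on whitespace, lower-cases each token, and removes every
    // character that is not a letter or a decimal digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT)
                              .replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String two = "Nevertheless I decidedly bought something from a shop.";
        // prints: [nevertheless, i, decidedly, bought, something, from, a, shop]
        System.out.println(tokenize(two));
    }
}
```

If you normalize like this, store the stop words in lower case as well ("i" rather than "I") so the lookups still match.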