Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/24647400/
What is the best stemming method in Python?
Asked by PeYoTlL
I tried all the nltk methods for stemming, but they give me weird results with some words.
Examples
It often cuts the end of words when it shouldn't:
- poodle => poodl
- article => articl
or doesn't stem very well:
- easily and easy are not stemmed to the same word
- leaves, grows, fairly are not stemmed
Do you know other stemming libs in python, or a good dictionary?
Thank you
Accepted answer by Stephen Lin
Python implementations of the Porter, Porter2, Paice-Husk, and Lovins stemming algorithms for English are available in the stemming package.
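A minimal sketch of how that package is used, going by its PyPI documentation (the porter2 module name and the example output are taken from there, not re-verified here):

from stemming.porter2 import stem   # porter, paicehusk and lovins modules are also provided
stem('factionally')
'faction'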
Answer by Spaceghost
The results you are getting are (generally) expected for a stemmer in English. You say you tried "all the nltk methods" but when I try your examples, that doesn't seem to be the case.
Here are some examples using the PorterStemmer:
import nltk
ps = nltk.stem.PorterStemmer()
ps.stem('grows')
'grow'
ps.stem('leaves')
'leav'
ps.stem('fairly')
'fairli'
The results are 'grow', 'leav' and 'fairli' which, even if they aren't what you wanted, are stemmed versions of the original word.
If we switch to the Snowball stemmer, we have to provide the language as a parameter.
import nltk
sno = nltk.stem.SnowballStemmer('english')
sno.stem('grows')
'grow'
sno.stem('leaves')
'leav'
sno.stem('fairly')
'fair'
The results are as before for 'grows' and 'leaves', but 'fairly' is stemmed to 'fair'.
So in both cases (and there are more than two stemmers available in nltk), words that you say are not stemmed, in fact, are. The LancasterStemmer will return 'easy' when provided with 'easily' or 'easy' as input.
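For instance, following the same pattern as above (a quick check, assuming the same NLTK setup):

import nltk
lan = nltk.stem.LancasterStemmer()
lan.stem('easily')
'easy'
lan.stem('easy')
'easy'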
Maybe you really wanted a lemmatizer? That would return 'article' and 'poodle' unchanged.
import nltk
lemma = nltk.stem.WordNetLemmatizer()
lemma.lemmatize('article')
'article'
lemma.lemmatize('leaves')
'leaf'
Answer by 0xF
All the stemmers that have been discussed here are algorithmic stemmers, hence they can always produce unexpected results such as:
In [3]: from nltk.stem.porter import *
In [4]: stemmer = PorterStemmer()
In [5]: stemmer.stem('identified')
Out[5]: u'identifi'
In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'
To correctly get the root words, one needs a dictionary-based stemmer such as the Hunspell stemmer. There is a Python implementation of it in the following link. Example code is here:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']
Answer by sarvesh Kumar
In my chatbot project I have used PorterStemmer; however, LancasterStemmer also serves the purpose. The ultimate objective is to stem the word down to its root so that we can search and compare it with the search-word inputs.
For example:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # stop_words was not defined in the original snippet; NLTK's list is assumed here

def SrchpattrnStmmed(self):
    KeyWords = []
    SrchpattrnTkn = word_tokenize(self.input)  # self.input holds the user's search string
    for token in SrchpattrnTkn:
        if token not in stop_words:
            KeyWords.append(ps.stem(token))  # keep only stemmed, non-stopword tokens
    return KeyWords
Hope this will help.
Answer by Ritveak
Stemming is all about removing suffixes (usually only suffixes; as far as I have tried, none of the nltk stemmers could remove a prefix, forget about infixes). So we can clearly call stemming a dumb / not-so-intelligent program. It doesn't check whether a word has a meaning before or after stemming. For example, if you try to stem "xqaing", although it is not a word, it will remove "-ing" and give you "xqa".
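A quick sketch with NLTK's PorterStemmer illustrates that behaviour on the made-up word above:

from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem('xqaing')   # not a real word, but the '-ing' suffix is stripped anyway
'xqa'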
So, in order to use a smarter system, one can use lemmatizers. Lemmatizers use well-formed lemmas (words), in the form of WordNet and dictionaries. So they always accept and return a proper word. However, they are slow, because they go through all words in order to find the relevant one.
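For comparison, a small sketch with NLTK's WordNetLemmatizer (the pos argument is an extra detail: the lemma returned depends on the part of speech you ask for, defaulting to noun):

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
wnl.lemmatize('leaves')            # treated as a noun by default
'leaf'
wnl.lemmatize('leaves', pos='v')   # treated as a verb instead
'leave'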
Answer by Daniel Mahler
Stemmers vary in their aggressiveness. Porter is one of the most aggressive stemmers for English. I find it usually hurts more than it helps. On the lighter side, you can either use a lemmatizer instead, as already suggested, or a lighter algorithmic stemmer. The limitation of lemmatizers is that they cannot handle unknown words.
Personally I like the Krovetz stemmer, which is a hybrid solution, combining a dictionary lemmatizer and a lightweight stemmer for out-of-vocabulary words. Krovetz is also the kstem or light_stemmer option in Elasticsearch. There is a Python implementation on PyPI, https://pypi.org/project/KrovetzStemmer/, though that is not the one that I have used.
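Going by that project's page, usage looks roughly like the following sketch (the module and class names are assumptions taken from its README, and, as said, it is not the implementation I have used):

import krovetzstemmer               # assumed module name from the PyPI project page
ks = krovetzstemmer.Stemmer()       # assumed class name
ks.stem('walked')                   # expected to return something like 'walk'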
Another option is the lemmatizer in spaCy. After processing with spaCy, every token has a lemma_ attribute. (Note the underscore: lemma without it holds a numerical identifier of the lemma_.) - https://spacy.io/api/token
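A short sketch with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('The leaves grow easily')
[(t.text, t.lemma_) for t in doc]   # lemma_ is the string form; t.lemma is the numeric identifier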
Here are some papers comparing various stemming algorithms:
- https://www.semanticscholar.org/paper/A-Comparative-Study-of-Stemming-Algorithms-Ms-.-Jivani/1c0c0fa35d4ff8a2f925eb955e48d655494bd167
- https://www.semanticscholar.org/paper/Stemming-Algorithms%3A-A-Comparative-Study-and-their-Sharma/c3efc7d586e242d6a11d047a25b67ecc0f1cce0c?navId=citing-papers
- https://www.semanticscholar.org/paper/Comparative-Analysis-of-Stemming-Algorithms-for-Web/3e598cda5d076552f4a9f89aaa9d79f237882afd
- https://scholar.google.com/scholar?q=related:MhDEzHAUtZ8J:scholar.google.com/&scioq=comparative+stemmers&hl=en&as_sdt=0,5