Python NLTK WordNet Lemmatizer: shouldn't it lemmatize all inflections of a word?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/25534214/
NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?
Asked by sanjeev mk
I'm using the NLTK WordNet Lemmatizer for a Part-of-Speech tagging project by first modifying each word in the training corpus to its stem (in place modification), and then training only on the new corpus. However, I found that the lemmatizer is not functioning as I expected it to.
For example, the word loves is lemmatized to love, which is correct, but the word loving remains loving even after lemmatization. Here loving is used as in the sentence "I'm loving it".
Isn't love the stem of the inflected word loving? Similarly, many other 'ing' forms remain as they are after lemmatization. Is this the correct behavior?
What are some other lemmatizers that are accurate? (They need not be in NLTK.) Are there morphological analyzers or lemmatizers that also take a word's part-of-speech tag into account when deciding the stem? For example, the word killing should have kill as the stem if killing is used as a verb, but it should have killing as the stem if it is used as a noun (as in "the killing was done by xyz").
Accepted answer by Fred Foo
The WordNet lemmatizer does take the POS tag into account, but it doesn't magically determine it:
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'
Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you're passing it the noun "loving" (as in "sweet loving").
Answered by bogs
The best way to troubleshoot this is to actually look in WordNet. Take a look here: loving in WordNet. As you can see, there is actually an adjective "loving" present in WordNet. As a matter of fact, there is even the adverb "lovingly": lovingly in WordNet. Because WordNet doesn't actually know which part of speech you want, it defaults to noun ('n' in WordNet). If you are using the Penn Treebank tag set, here are some handy functions for converting Penn tags to WN tags:
from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return None
Hope this helps.
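As a sanity check, the mapping above sends each Penn tag family to NLTK's WordNet POS constants, which are the single characters 'a', 'n', 'r', and 'v'. A standalone sketch of the same logic with those constants hardcoded, so it runs without NLTK or its data files (not a replacement for the version above):

```python
# Standalone Penn-to-WordNet mapping: in NLTK, wn.ADJ/wn.NOUN/wn.ADV/wn.VERB
# are the strings 'a'/'n'/'r'/'v'.
WN_ADJ, WN_NOUN, WN_ADV, WN_VERB = 'a', 'n', 'r', 'v'

def penn_to_wn(tag):
    if tag in ('JJ', 'JJR', 'JJS'):
        return WN_ADJ
    if tag in ('NN', 'NNS', 'NNP', 'NNPS'):
        return WN_NOUN
    if tag in ('RB', 'RBR', 'RBS'):
        return WN_ADV
    if tag in ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'):
        return WN_VERB
    return None

print(penn_to_wn('VBG'))  # -> v, so lemmatize('loving', 'v') would give 'love'
print(penn_to_wn('IN'))   # -> None, prepositions have no WordNet POS
```

A `None` result is the signal to fall back to the lemmatizer's default noun behavior (or skip lemmatization) for closed-class words.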
Answered by Joe Zhow
It's clearer and more effective than enumeration:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def penn_to_wn(tag):
    return get_wordnet_pos(tag)
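Assuming the standard Penn Treebank tags for the four open word classes, here is a quick check that this prefix shortcut agrees with the explicit enumeration in the accepted answer (WordNet constants hardcoded as 'a'/'v'/'n'/'r' so the sketch runs without NLTK):

```python
# Prefix-based mapping as above, with WordNet POS constants hardcoded.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        return ''

# Compare against the enumerated tag lists from the accepted answer.
expected = {'a': ['JJ', 'JJR', 'JJS'],
            'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
            'n': ['NN', 'NNS', 'NNP', 'NNPS'],
            'r': ['RB', 'RBR', 'RBS']}
for pos, tags in expected.items():
    for tag in tags:
        assert get_wordnet_pos(tag) == pos
```

One caveat: the prefix shortcut is slightly looser than the enumeration. For example, 'RP' (particle) also starts with 'R' and would be mapped to the adverb POS, whereas the enumerated version returns None for it.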
Answered by Kiran Racherla
As an extension to the accepted answer from @Fred Foo above:
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
from nltk.corpus import wordnet  # needed for the adverb branch below

lem = WordNetLemmatizer()
word = input("Enter word:\t")

# Get the single-character POS constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()

# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
#             'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
#             'r': ['RB', 'RBR', 'RBS'],
#             'a': ['JJ', 'JJR', 'JJS']}

if pos_label == 'j':
    pos_label = 'a'  # 'j' <--> 'a' reassignment

if pos_label in ['r']:  # For adverbs it's a bit different
    print(wordnet.synset(word + '.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']:  # For adjectives and verbs
    print(lem.lemmatize(word, pos=pos_label))
else:  # For nouns and everything else, as noun is the default kwarg
    print(lem.lemmatize(word))
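The `pos_label` line above compresses several steps of indexing; a self-contained sketch of the same extraction, using a hardcoded example of what `pos_tag(word_tokenize('loving'))` would typically return (the VBG tag here is an assumption, since the real tagger needs its data files):

```python
# Simulated pos_tag output for the word "loving" (assumed VBG, a gerund).
tagged = [('loving', 'VBG')]

# [0] -> the first (word, tag) pair
# [1] -> the tag string, e.g. 'VBG'
# [0] -> its first character, 'V', lowercased to 'v'
pos_label = tagged[0][1][0].lower()
print(pos_label)  # -> v

# 'j' is remapped to 'a' because WordNet uses 'a' for adjectives,
# while the Penn adjective tags (JJ, JJR, JJS) start with 'J'.
if pos_label == 'j':
    pos_label = 'a'
```

The resulting single character is exactly the `pos=` argument the lemmatizer expects, which is why 'j' must be remapped before the call.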

