Python NLTK WordNet Lemmatizer: shouldn't it lemmatize all inflections of a word?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/25534214/
NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?
Asked by sanjeev mk
I'm using the NLTK WordNet Lemmatizer for a Part-of-Speech tagging project by first modifying each word in the training corpus to its stem (in place modification), and then training only on the new corpus. However, I found that the lemmatizer is not functioning as I expected it to.
For example, the word loves is lemmatized to love, which is correct, but the word loving remains loving even after lemmatization. Here loving is used as in the sentence "I'm loving it".
Isn't love the stem of the inflected word loving? Similarly, many other 'ing' forms remain as they are after lemmatization. Is this the correct behavior?
What are some other lemmatizers that are accurate? (They need not be in NLTK.) Are there morphological analyzers or lemmatizers that also take a word's part-of-speech tag into account when deciding the stem? For example, the word killing should have kill as the stem if killing is used as a verb, but it should have killing as the stem if it is used as a noun (as in "the killing was done by xyz").
Accepted answer by Fred Foo
The WordNet lemmatizer does take the POS tag into account, but it doesn't magically determine it:
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'
Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you're passing it the noun "loving" (as in "sweet loving").
Answered by bogs
The best way to troubleshoot this is to actually look in WordNet. Take a look here: loving in WordNet. As you can see, there is actually an adjective "loving" present in WordNet. As a matter of fact, there is even the adverb "lovingly": lovingly in WordNet. Because WordNet doesn't actually know which part of speech you want, it defaults to noun ('n' in WordNet). If you are using the Penn Treebank tag set, here are some handy functions for converting Penn tags to WN tags:
from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return None
Hope this helps.
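As a sanity check, the mapping above sends each Penn tag family to NLTK's WordNet POS constants, which are the single characters 'a', 'n', 'r', and 'v'. A standalone sketch of the same logic with those constants hardcoded, so it runs without NLTK or its data files (not a replacement for the version above):

```python
# Standalone Penn-to-WordNet mapping: in NLTK, wn.ADJ/wn.NOUN/wn.ADV/wn.VERB
# are the strings 'a'/'n'/'r'/'v'.
WN_ADJ, WN_NOUN, WN_ADV, WN_VERB = 'a', 'n', 'r', 'v'

def penn_to_wn(tag):
    if tag in ('JJ', 'JJR', 'JJS'):
        return WN_ADJ
    if tag in ('NN', 'NNS', 'NNP', 'NNPS'):
        return WN_NOUN
    if tag in ('RB', 'RBR', 'RBS'):
        return WN_ADV
    if tag in ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'):
        return WN_VERB
    return None

print(penn_to_wn('VBG'))  # -> v, so lemmatize('loving', 'v') would give 'love'
print(penn_to_wn('IN'))   # -> None, prepositions have no WordNet POS
```

A `None` result is the signal to fall back to the lemmatizer's default noun behavior (or skip lemmatization) for closed-class words.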
Answered by Joe Zhow
It's clearer and more effective than enumeration:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def penn_to_wn(tag):
    return get_wordnet_pos(tag)
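Assuming the standard Penn Treebank tags for the four open word classes, here is a quick check that this prefix shortcut agrees with the explicit enumeration in the accepted answer (WordNet constants hardcoded as 'a'/'v'/'n'/'r' so the sketch runs without NLTK):

```python
# Prefix-based mapping as above, with WordNet POS constants hardcoded.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        return ''

# Compare against the enumerated tag lists from the accepted answer.
expected = {'a': ['JJ', 'JJR', 'JJS'],
            'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
            'n': ['NN', 'NNS', 'NNP', 'NNPS'],
            'r': ['RB', 'RBR', 'RBS']}
for pos, tags in expected.items():
    for tag in tags:
        assert get_wordnet_pos(tag) == pos
```

One caveat: the prefix shortcut is slightly looser than the enumeration. For example, 'RP' (particle) also starts with 'R' and would be mapped to the adverb POS, whereas the enumerated version returns None for it.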
Answered by Kiran Racherla
As an extension to the accepted answer from @Fred Foo above:
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
from nltk.corpus import wordnet  # needed for the adverb branch below

lem = WordNetLemmatizer()
word = input("Enter word:\t")

# Get the single-character POS constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()

# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
#             'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
#             'r': ['RB', 'RBR', 'RBS'],
#             'a': ['JJ', 'JJR', 'JJS']}

if pos_label == 'j':
    pos_label = 'a'  # 'j' <--> 'a' reassignment

if pos_label in ['r']:  # For adverbs it's a bit different
    print(wordnet.synset(word + '.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']:  # For adjectives and verbs
    print(lem.lemmatize(word, pos=pos_label))
else:  # For nouns and everything else, as noun is the default kwarg
    print(lem.lemmatize(word))
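The `pos_label` line above compresses several steps of indexing; a self-contained sketch of the same extraction, using a hardcoded example of what `pos_tag(word_tokenize('loving'))` would typically return (the VBG tag here is an assumption, since the real tagger needs its data files):

```python
# Simulated pos_tag output for the word "loving" (assumed VBG, a gerund).
tagged = [('loving', 'VBG')]

# [0] -> the first (word, tag) pair
# [1] -> the tag string, e.g. 'VBG'
# [0] -> its first character, 'V', lowercased to 'v'
pos_label = tagged[0][1][0].lower()
print(pos_label)  # -> v

# 'j' is remapped to 'a' because WordNet uses 'a' for adjectives,
# while the Penn adjective tags (JJ, JJR, JJS) start with 'J'.
if pos_label == 'j':
    pos_label = 'a'
```

The resulting single character is exactly the `pos=` argument the lemmatizer expects, which is why 'j' must be remapped before the call.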

