WordNet lemmatization and POS tagging in Python

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/15586721/

Date: 2020-08-18 20:28:53  Source: igfitidea


Tags: python, nltk, wordnet, lemmatization

Asked by user1946217

I want to use the WordNet lemmatizer in Python, and I have learnt that the default POS tag is NOUN, so it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.


My question is: what is the best approach to perform the above lemmatization accurately?


I did the POS tagging using nltk.pos_tag, and I am lost in converting the Treebank POS tags to WordNet-compatible POS tags. Please help.


import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)  # tokens: a list of word strings, e.g. from nltk.word_tokenize

I get output tags such as NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?


Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data?


Accepted answer by Suzana

First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER (note: this attribute exists in older NLTK releases; it was removed in later versions when the default tagger changed):


>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.

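For reference, the Treebank tags relevant here group neatly by their first letter, which is why a mapping can key on the first character alone. (These tag strings come from the Penn Treebank tag set; one caveat is that the particle tag RP also starts with 'R', so a first-letter mapping will treat particles as adverbs.)

```python
# Penn Treebank tags grouped by their first letter
verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
adj_tags = ['JJ', 'JJR', 'JJS']
adv_tags = ['RB', 'RBR', 'RBS']

assert all(t.startswith('V') for t in verb_tags)
assert all(t.startswith('N') for t in noun_tags)
assert all(t.startswith('J') for t in adj_tags)
assert all(t.startswith('R') for t in adv_tags)
```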

The following function maps the Treebank tags to WordNet part-of-speech names:


from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:


from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'

Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.

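A minimal sketch of that guard (lemmatize_safely and StubLemmatizer are hypothetical names; the stub only stands in for nltk.stem.wordnet.WordNetLemmatizer so the example is self-contained):

```python
def lemmatize_safely(lemmatizer, word, wn_tag):
    # only pass the POS when the mapping produced one;
    # an empty string would raise a KeyError inside WordNet
    if wn_tag:
        return lemmatizer.lemmatize(word, wn_tag)
    return lemmatizer.lemmatize(word)  # default POS is noun

# stub standing in for a real WordNetLemmatizer
class StubLemmatizer:
    def lemmatize(self, word, pos='n'):
        return (word, pos)

assert lemmatize_safely(StubLemmatizer(), 'going', 'v') == ('going', 'v')
assert lemmatize_safely(StubLemmatizer(), 'table', '') == ('table', 'n')
```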

Answer by pg2455

As in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):


#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
POS_LIST = [NOUN, VERB, ADJ, ADV]
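Because these constants are plain one-character strings, the Treebank-to-WordNet mapping can be sketched without importing NLTK at all (TREEBANK_TO_WN and to_wordnet_pos are hypothetical names, not NLTK API):

```python
# WordNet POS constants, as defined in nltk.corpus.reader.wordnet
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'

# map the first letter of a Treebank tag to the WordNet constant
TREEBANK_TO_WN = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}

def to_wordnet_pos(treebank_tag, default=NOUN):
    return TREEBANK_TO_WN.get(treebank_tag[0], default)

assert to_wordnet_pos('VBG') == 'v'
assert to_wordnet_pos('DT') == 'n'   # unmapped tags fall back to noun
```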

Answer by Haha TTpro

@Suzana_K's answer works, but there are some cases that result in a KeyError, as @Clock Slave mentioned.


Convert Treebank tags to WordNet tags:


from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None # for easy if-statement 

Now, we pass a POS to the lemmatize function only if we have a WordNet tag:


import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)  # tokens: a list of word strings
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:  # do not supply a tag in case of None
        lemma = lemmatizer.lemmatize(word)
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag)

Answer by Deepak

Steps to convert: Document -> Sentences -> Tokens -> POS -> Lemmas


import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

class Splitter(object):
    """
    split the document into sentences and tokenize each sentence
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):
        """
        out : ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
        """
        # split into single sentence
        sentences = self.splitter.tokenize(text)
        # tokenization in each sentences
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        return the WordNet POS tag (a, n, r, v) used by the WordNet lemmatizer
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # As default pos in lemmatization is Noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # find the POS tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using the POS tag
        # convert into a feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...], i.e. [original word, lemmatized word, POS tag]
        pos_tokens = [ [(word, lemmatizer.lemmatize(word,self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word,pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

#step 1 split document into sentence followed by tokenization
tokens = splitter.split(text)

#step 2 lemmatization using pos tagger 
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)

Answer by wordsforthewise

You can do this in one line:


wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'

Then use wnpos(nltk_pos) to get the POS to pass to .lemmatize(). In your case: lmtzr.lemmatize(word=tagged[0][0], pos=wnpos(tagged[0][1])).

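A quick sanity check of the one-liner (note that the membership test must include 'j'; otherwise adjectives silently fall back to the noun tag):

```python
# first letter of the Treebank tag -> WordNet POS, defaulting to noun
wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'

assert wnpos('JJ') == 'a'
assert wnpos('VBD') == 'v'
assert wnpos('RB') == 'r'
assert wnpos('NNS') == 'n'
assert wnpos('DT') == 'n'   # anything unmapped defaults to noun
```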

Answer by Shuchita Banthia

You can create a map using Python's collections.defaultdict and take advantage of the fact that the lemmatizer's default tag is noun.


from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "Another way of achieving this task"
tokens = word_tokenize(text)
lmtzr = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)
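The useful property here is that any first letter missing from the map (determiners, pronouns, conjunctions, ...) silently yields the noun tag. A standalone check using the raw one-character POS codes ('n' is the value of wn.NOUN, and so on):

```python
from collections import defaultdict

tag_map = defaultdict(lambda: 'n')              # wn.NOUN == 'n'
tag_map.update({'J': 'a', 'V': 'v', 'R': 'r'})  # wn.ADJ, wn.VERB, wn.ADV

assert tag_map['V'] == 'v'
assert tag_map['D'] == 'n'   # 'DT' (determiner) falls back to noun
```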

Answer by Marco Ottina

After searching the internet, I found this solution: go from a sentence to a "bag of words" via splitting, POS tagging, lemmatizing, and cleaning (removing punctuation and stop words). Here is my code:


import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

punctuation = u",.?!()-_\"\'\n\r\t;:+*<>@#§^$%&|/"
stop_words_eng = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tag_dict = {"J": wn.ADJ,
            "N": wn.NOUN,
            "V": wn.VERB,
            "R": wn.ADV}

def extract_wnpostag_from_postag(tag):
    # take the first letter of the tag
    # the second argument is the default, returned when the key is missing from the dictionary
    return tag_dict.get(tag[0].upper(), None)

def lemmatize_tupla_word_postag(tupla):
    """
    given a tuple of the form (wordString, posTagString), like ('guitar', 'NN'), return the lemmatized word
    """
    tag = extract_wnpostag_from_postag(tupla[1])    
    return lemmatizer.lemmatize(tupla[0], tag) if tag is not None else tupla[0]

def bag_of_words(sentence, stop_words=None):
    if stop_words is None:
        stop_words = stop_words_eng
    original_words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(original_words) #returns a list of tuples: (word, tagString) like ('And', 'CC')
    original_words = None
    lemmatized_words = [ lemmatize_tupla_word_postag(ow) for ow in tagged_words ]
    tagged_words = None
    cleaned_words = [ w for w in lemmatized_words if (w not in punctuation) and (w not in stop_words) ]
    lemmatized_words = None
    return cleaned_words

sentence = "Two electric guitar rocks players, and also a better bass player, are standing off to two sides reading corpora while walking"
print(sentence, "\n\n bag of words:\n", bag_of_words(sentence) )
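The final cleaning step in bag_of_words is just two membership filters; a standalone sketch with hypothetical toy data (toy subsets of the punctuation and stop-word lists, no NLTK required):

```python
# characters and words to drop, mirroring the filters in bag_of_words
punctuation = set(",.?!()-_\"'\n\r\t;:+*<>@#^$%&|/")
stop_words = {"a", "the", "and", "to", "are", "while", "off"}

# pretend these tokens already came out of the lemmatization step
lemmatized = ["two", "electric", "guitar", "rock", "player", ",", "and", "walk"]
cleaned = [w for w in lemmatized if w not in punctuation and w not in stop_words]
assert cleaned == ["two", "electric", "guitar", "rock", "player", "walk"]
```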