Python NLTK: tagging Spanish words using a corpus
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/14732465/
NLTK Tagging Spanish words using a corpus
Asked by dm03514
I am trying to learn how to tag Spanish words using NLTK.
From the NLTK book, it is quite easy to tag English words using their examples. But because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.
I have downloaded the cess_esp corpus. Is there a way to specify a corpus for nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything suggesting I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By "manually" I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.
Accepted answer by alvas
First you need to read the tagged sentences from a corpus. NLTK provides a nice interface so you don't have to bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml.
Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.
Then you can use the tagger to tag the sentences you want. Here's some example code:
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list;
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()

# Train the unigram tagger.
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# The tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))

# Split the corpus into training and testing sets.
train = int(len(cess_sents) * 90 / 100)  # 90%

# Train a bigram tagger with only the training data.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10% of testing data.
# (Note: cess_sents[train:], not cess_sents[train+1:],
# which would silently skip one sentence.)
bi_tag.evaluate(cess_sents[train:])

# Using the tagger.
bi_tag.tag(sentence.split(" "))
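The snippet above trains the unigram and bigram taggers independently. A common refinement is to chain them with a backoff, so the bigram tagger falls back to the unigram tagger (and ultimately to a default tag) for unseen contexts. A minimal sketch, using a tiny hand-tagged sample in place of the full cess_esp corpus so it runs without downloads (the sample sentences and tags are assumptions for illustration only):

```python
from nltk import BigramTagger, DefaultTagger, UnigramTagger

# Tiny hand-tagged sample standing in for cess_esp.tagged_sents().
train = [
    [('Hola', 'I'), ('amigo', 'ncms000'), ('.', 'Fp')],
    [('el', 'da0ms0'), ('amigo', 'ncms000'), ('habla', 'vmip3s0'), ('.', 'Fp')],
]

default = DefaultTagger('nc0s000')           # last resort: guess a common tag
uni = UnigramTagger(train, backoff=default)  # unknown word -> default
bi = BigramTagger(train, backoff=uni)        # unseen bigram context -> unigram

print(bi.tag('el amigo habla .'.split()))
```

With real corpora this backoff chain usually scores noticeably higher than a bigram tagger alone, because a bare bigram tagger returns None whenever the tag context was never seen in training.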
Training a tagger on a large corpus may take significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger to a file for later reuse.
Please look at the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
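The train-once / load-later pattern described there can be sketched with the standard pickle module. To keep the sketch self-contained it trains on a tiny hand-tagged sample rather than the full cess_esp corpus, and the file name is an arbitrary choice:

```python
import pickle

from nltk import UnigramTagger

# Tiny hand-tagged sample standing in for cess_esp.tagged_sents().
train = [
    [('Hola', 'I'), ('amigo', 'ncms000'), ('.', 'Fp')],
    [('Hola', 'I'), ('mundo', 'ncms000'), ('.', 'Fp')],
]

# Train once and save the tagger to disk.
tagger = UnigramTagger(train)
with open('cess_unigram.pickle', 'wb') as out:
    pickle.dump(tagger, out)

# Later: reload the trained tagger instead of retraining.
with open('cess_unigram.pickle', 'rb') as inp:
    tagger = pickle.load(inp)

print(tagger.tag('Hola mundo .'.split()))
```

Loading a pickled tagger takes a fraction of a second, versus re-reading and re-training over the whole corpus on every run.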
Answer by alvas
Given the tutorial in the previous answer, here's a more object-oriented approach from the spaghetti tagger: https://github.com/alvations/spaghetti-tagger
#-*- coding: utf8 -*-
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
from cPickle import dump, load

def loadtagger(taggerfilename):
    infile = open(taggerfilename, 'rb')
    tagger = load(infile)
    infile.close()
    return tagger

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename, tagger):
        outfile = open(tagfilename, 'wb')
        dump(tagger, outfile, -1)
        outfile.close()
        return
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname + '_unigram.tagger', uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname + '_bigram.tagger', bi_tag)
    print "Tagger trained with", corpusname, "using " + \
          "UnigramTagger and BigramTagger."
    return

# Function to unchunk the corpus (split multi-word expressions).
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_", " ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self, mwe=True):
        self.mwe = mwe
        # Train the taggers if used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print "*** First-time use of cess tagger ***"
            print "Training tagger ..."
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess', cess_sents)
            # Train the taggers with no MWEs.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe', tagged_cess_nomwe)
            print
        # Load the taggers.
        if self.mwe == True:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        elif self.mwe == False:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.batch_tag(sentences)

tagger = cesstag()
print tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split())
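Note that this answer's code targets Python 2 (cPickle, print statements), and `batch_tag` was renamed `tag_sents` in NLTK 3.x. A minimal Python 3 sketch of tagging several sentences at once, again using a tiny hand-tagged sample (an assumption to keep the sketch self-contained) in place of cess_esp:

```python
from nltk import UnigramTagger

# Tiny hand-tagged stand-in for cess_esp.tagged_sents().
train = [[('Hola', 'I'), ('amigo', 'ncms000'), ('.', 'Fp')]]
uni = UnigramTagger(train)

# batch_tag was renamed tag_sents in NLTK 3.x; it takes a list of
# token lists and returns one tagged list per input sentence.
print(uni.tag_sents([['Hola', '.'], ['amigo', '.']]))
```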
Answer by alemol
The following script gives you a quick way to get a "bag of words" from Spanish sentences. Note that if you want to do it correctly you must tokenize the sentences before tagging, so 'religiosas.' must be separated into two tokens: 'religiosas' and '.'.
#-*- coding: utf8 -*-
# about the tagger: http://nlp.stanford.edu/software/tagger.shtml
# about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html
import nltk
from nltk.tag.stanford import POSTagger

spanish_postagger = POSTagger('models/spanish.tagger',
                              'stanford-postagger.jar', encoding='utf8')

sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.',
             'Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']

# AnCora-style noun tags start with 'n' (nc..., np...);
# this helper was missing from the original snippet.
def isNoun(tag):
    return tag.startswith('n')

for sent in sentences:
    words = sent.split()
    tagged_words = spanish_postagger.tag(words)
    nouns = []
    for (word, tag) in tagged_words:
        print (word + ' ' + tag).encode('utf8')
        if isNoun(tag):
            nouns.append(word)
    print(nouns)
Gives:
El da0000
copal nc0s000
se p0000000
usa vmip000
principalmente rg
para sp000
sahumar vmn0000
en sp000
distintas di0000
ocasiones nc0p000
como cs
lo pp000000
son vsip000
las da0000
fiestas nc0p000
religiosas. np00000
[u'copal', u'ocasiones', u'fiestas', u'religiosas.']
Las da0000
flores, np00000
hojas nc0p000
y cc
frutos nc0p000
se p0000000
usan vmip000
para sp000
aliviar vmn0000
la da0000
tos nc0s000
y cc
también rg
se p0000000
emplea vmip000
como cs
sedante. nc0s000
[u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']
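As the output above shows ('religiosas.' mis-tagged np00000, 'flores,' kept whole), splitting on whitespace leaves punctuation glued to words. One way to tokenize without any extra model downloads is NLTK's Toktok tokenizer, sketched below; nltk.word_tokenize is another option but requires the punkt model:

```python
from nltk.tokenize.toktok import ToktokTokenizer

# Rule-based tokenizer; separates punctuation such as a
# sentence-final period from the preceding word.
tokenizer = ToktokTokenizer()
print(tokenizer.tokenize('las fiestas religiosas.'))
```

Feeding tokens like 'religiosas' and '.' separately to the tagger avoids the spurious np00000 analyses seen above.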
Answer by Koot6133
I ended up here searching for POS taggers for languages other than English. Another option for your problem is the spaCy library, which offers POS tagging for multiple languages such as Dutch, German, French, Portuguese, Spanish, Norwegian, Italian, Greek and Lithuanian.
From the spaCy documentation:
import es_core_news_sm
nlp = es_core_news_sm.load()
doc = nlp("El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.")
print([(w.text, w.pos_) for w in doc])
leads to:
[('El', 'DET'), ('copal', 'NOUN'), ('se', 'PRON'), ('usa', 'VERB'), ('principalmente', 'ADV'), ('para', 'ADP'), ('sahumar', 'VERB'), ('en', 'ADP'), ('distintas', 'DET'), ('ocasiones', 'NOUN'), ('como', 'SCONJ'), ('lo', 'PRON'), ('son', 'AUX'), ('las', 'DET'), ('fiestas', 'NOUN'), ('religiosas', 'ADJ'), ('.', 'PUNCT')]
and to visualize in a notebook:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})