POS tagging in German with Python

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1639855/

Date: 2020-11-03 22:46:13  Source: igfitidea

POS tagging in German

python, nlp, nltk

Asked by Johannes Meier

I am using NLTK to extract nouns from a text-string starting with the following command:


tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))

It works fine in English. Is there an easy way to make it work for German as well?


(I have no experience with natural language programming, but I managed to use the python nltk library which is great so far.)


Accepted answer by Mike Atlas

Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.


See nltk.corpus.europarl_raw and this answer for example configuration.


Also, consider tagging this question with "nlp".


Answered by Suzana

The Pattern library includes a function for parsing German sentences, and the result includes the part-of-speech tags. The following is copied from their documentation:


from pattern.de import parse, split
s = parse('Die Katze liegt auf der Matte.')
s = split(s)
print(s.sentences[0])

>>>   Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O'
     'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')

If you prefer the STTS tag set, you can set the optional parameter tagset="STTS".


Update: another option is spaCy; there is a quick example in this blog article:


import spacy

nlp = spacy.load('de')  # in newer spaCy versions, load a named model such as 'de_core_news_sm'
doc = nlp(u'Ich bin ein Berliner.')

# show universal pos tags
print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
# output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT

Answered by alvas

Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. I have also compiled some Python recipes for German NLP, which you can access at http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html


# -*- coding: utf-8 -*-

import os

def installStanfordTag():
    # Download and unpack the Stanford POS tagger if it isn't present yet.
    if not os.path.exists('stanford-postagger-full-2013-06-20'):
        os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
        os.system('unzip stanford-postagger-full-2013-06-20.zip')

def tag(infile):
    # Run the tagger shell script on a file and return the tagged lines.
    cmd = "./stanford-postagger.sh " + models[m] + " " + infile
    tagout = os.popen(cmd).readlines()
    return [i.strip() for i in tagout]

def taglinebyline(sents):
    # Tag one sentence at a time by writing it to a temporary file.
    tagged = []
    for ss in sents:
        os.popen("echo '''" + ss + "''' > stanfordtemp.txt")
        tagged.append(tag('stanfordtemp.txt')[0])
    return tagged

installStanfordTag()
stagdir = './stanford-postagger-full-2013-06-20/'
models = {'fast': 'models/german-fast.tagger',
          'dewac': 'models/german-dewac.tagger',
          'hgc': 'models/german-hgc.tagger'}
os.chdir(stagdir)
print(os.getcwd())


m = 'fast'  # It's best to use the fast German tagger if your data is small.

sentences = ['Ich bin schwanger .', 'Ich bin wieder schwanger .', 'Ich verstehe nur Bahnhof .']

tagged_sents = taglinebyline(sentences)  # Call the Stanford tagger

for sent in tagged_sents:
    print(sent)

Answered by mjv

Part-of-Speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given context. Most (but not all) of these taggers use a statistical model of sorts as the main or sole device to "do the trick". Such taggers require some "training data" upon which to build this statistical representation of the language, and the training data comes in the form of corpora.


The NLTK "distribution" itself includes many of these corpora, as well as a set of "corpus readers" which provide an API to read different types of corpora. I don't know the current state of affairs in NLTK proper, or whether it includes any German corpus. You can, however, locate some free corpora which you'll then need to convert to a format that satisfies the proper NLTK corpus reader, and then you can use this to train a POS tagger for the German language.

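The training idea can be illustrated with a toy example; the two hand-tagged sentences below use STTS-style labels of my own and stand in for a real German corpus:

```python
# Toy illustration: train an NLTK unigram tagger on two hand-tagged
# German sentences (a real setup would use thousands of tagged sentences).
import nltk

train_sents = [
    [('Die', 'ART'), ('Katze', 'NN'), ('schläft', 'VVFIN'), ('.', '$.')],
    [('Der', 'ART'), ('Hund', 'NN'), ('bellt', 'VVFIN'), ('.', '$.')],
]
tagger = nltk.UnigramTagger(train_sents)
result = tagger.tag(['Die', 'Hund', 'schläft', '.'])
print(result)
```

Every token in the test sentence was seen during training, so the unigram tagger can label all of them; unseen words would come back tagged `None`, which is why real training data needs to be large.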

You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you gotta find ways of bribing and otherwise coercing students to do that for you ;-)


Answered by Philipp

I have written a blog-post about how to convert the German annotated TIGER Corpus in order to use it with the NLTK. Have a look at it here.

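The general shape of such a conversion can be sketched with NLTK's ConllCorpusReader. The tiny hand-made file below stands in for a real TIGER export (the file name and column layout here are illustrative assumptions, not the actual TIGER format):

```python
# Sketch of the idea: read a CoNLL-style file, keeping only the word
# form and POS columns, so NLTK sees it as a tagged corpus.
from nltk.corpus.reader import ConllCorpusReader

# Tiny stand-in for the real corpus file: one token per line,
# tab-separated columns, blank line between sentences.
sample = (
    "1\tDie\tdie\tART\tART\n"
    "2\tKatze\tKatze\tNN\tNN\n"
    "3\tschläft\tschlafen\tVVFIN\tVVFIN\n"
    "4\t.\t--\t$.\t$.\n"
)
with open('tiger_sample.conll', 'w', encoding='utf-8') as f:
    f.write(sample)

# Column types: keep only the word form (column 2) and POS tag (column 5).
corp = ConllCorpusReader('.', 'tiger_sample.conll',
                         ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                         encoding='utf-8')
tagged = corp.tagged_sents()[0]
print(tagged)
```

The resulting tagged sentences can then be fed straight into an NLTK tagger's training routine.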