What do Python spaCy's part-of-speech and dependency tags mean?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original: http://stackoverflow.com/questions/40288323/


What do spaCy's part-of-speech and dependency tags mean?

Tags: python, nlp, spacy

Asked by Mark Amery

spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ properties) and a syntactic dependency to its .head token (stored in the dep and dep_ properties).
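
To make that concrete, here is a minimal sketch that prints all three annotation layers for every token. It assumes a current spaCy release with the small English model installed under the name en_core_web_sm (older releases used spacy.load('en') instead):

import spacy

nlp = spacy.load('en_core_web_sm')  # assumption: the small English model is installed
doc = nlp("I shot a man in Reno just to watch him die.")
for token in doc:
    # pos_/tag_/dep_ are human-readable strings; pos/tag/dep hold the integer IDs
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)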

Some of these tags are self-explanatory, even to somebody like me without a linguistics background:

>>> import spacy
>>> en_nlp = spacy.load('en')
>>> document = en_nlp("I shot a man in Reno just to watch him die.")
>>> document[1]
shot
>>> document[1].pos_
'VERB'

Others... are not:

>>> document[1].tag_
'VBD'
>>> document[2].pos_
'DET'
>>> document[3].dep_
'dobj'

Worse, the official docs don't contain even a list of the possible tags for most of these properties, nor the meanings of any of them. They sometimes mention what tokenization standard they use, but these claims aren't currently entirely accurate and on top of that the standards are tricky to track down.

What are the possible values of the tag_, pos_, and dep_ properties, and what do they mean?

Answered by Mark Amery

tl;dr answer

Just expand the lists at https://spacy.io/api/annotation.

Longer answer

The docs have greatly improved since I first asked this question, and spaCy now documents this much better.

Part-of-speech tags

The pos and tag attributes are tabulated at https://spacy.io/api/annotation#pos-tagging, and the origin of those lists of values is described. At the time of this (January 2020) edit, the docs say of the pos attribute that:

spaCy maps all language-specific part-of-speech tags to a small, fixed set of word type tags following the Universal Dependencies scheme. The universal tags don't code for any morphological features and only cover the word type. They're available as the Token.pos and Token.pos_ attributes.
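
As a side note on the pos vs. pos_ pairing the quote mentions: the underscore-less attribute is an integer ID that can be compared against the constants in spacy.symbols, while the underscored attribute is the readable string. A small sketch, again assuming the en_core_web_sm model is installed:

import spacy
from spacy.symbols import VERB  # integer constant for the universal VERB tag

nlp = spacy.load('en_core_web_sm')
doc = nlp("I shot a man in Reno just to watch him die.")
token = doc[1]            # "shot"
print(token.pos)          # an integer ID
print(token.pos_)         # 'VERB' - the string form of the same tag
print(token.pos == VERB)  # True: comparing against the ID avoids string lookups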

As for the tag attribute, the docs say:

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Universal Dependencies v2 POS tag set.

and

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Universal Dependencies v2 POS tag set.

You thus have a choice between using a coarse-grained tag set that is consistent across languages (.pos), or a fine-grained tag set (.tag) that is specific to a particular treebank, and hence a particular language.
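
The practical difference is easy to see on a pair of verb forms: both get the same coarse-grained .pos_, but different fine-grained Penn Treebank .tag_ values. A sketch, assuming the en_core_web_sm English model:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("She ran yesterday and is running again today.")
for token in doc:
    if token.pos_ == 'VERB':
        # e.g. "ran" -> VERB/VBD (past tense), "running" -> VERB/VBG (gerund/participle)
        print(token.text, token.pos_, token.tag_)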

.pos_ tag list

The docs list the following coarse-grained tags used for the pos and pos_ attributes:

  • ADJ: adjective, e.g. big, old, green, incomprehensible, first
  • ADP: adposition, e.g. in, to, during
  • ADV: adverb, e.g. very, tomorrow, down, where, there
  • AUX: auxiliary, e.g. is, has (done), will (do), should (do)
  • CONJ: conjunction, e.g. and, or, but
  • CCONJ: coordinating conjunction, e.g. and, or, but
  • DET: determiner, e.g. a, an, the
  • INTJ: interjection, e.g. psst, ouch, bravo, hello
  • NOUN: noun, e.g. girl, cat, tree, air, beauty
  • NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV
  • PART: particle, e.g. 's, not,
  • PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
  • PROPN: proper noun, e.g. Mary, John, London, NATO, HBO
  • PUNCT: punctuation, e.g. ., (, ), ?
  • SCONJ: subordinating conjunction, e.g. if, while, that
  • SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :), 😝
  • VERB: verb, e.g. run, runs, running, eat, ate, eating
  • X: other, e.g. sfpksdpsxmsa
  • SPACE: space, e.g.

Note that the docs are lying slightly when they say that this list follows the Universal Dependencies Scheme; there are two tags listed above that aren't part of that scheme.

One of those is CONJ, which used to exist in the Universal POS Tags scheme but has been split into CCONJ and SCONJ since spaCy was first written. Based on the mappings of tag->pos in the docs, it would seem that spaCy's current models don't actually use CONJ, but it still exists in spaCy's code and docs for some reason - perhaps backwards compatibility with old models.

The second is SPACE, which isn't part of the Universal POS Tags scheme (and never has been, as far as I know) and is used by spaCy for any spacing besides single normal ASCII spaces (which don't get their own token):

>>> document = en_nlp("This\nsentence\thas      some weird spaces in\n\n\n\n\t\t   it.")
>>> for token in document:
...   print('%r (%s)' % (str(token), token.pos_))
... 
'This' (DET)
'\n' (SPACE)
'sentence' (NOUN)
'\t' (SPACE)
'has' (VERB)
'     ' (SPACE)
'some' (DET)
'weird' (ADJ)
'spaces' (NOUN)
'in' (ADP)
'\n\n\n\n\t\t   ' (SPACE)
'it' (PRON)
'.' (PUNCT)

I'll omit the full list of .tag_ tags (the finer-grained ones) from this answer, since they're numerous, well-documented now, different for English and German, and probably more likely to change between releases. Instead, look at the list in the docs (e.g. https://spacy.io/api/annotation#pos-en for English) which lists every possible tag, the .pos_ value it maps to, and a description of what it means.
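
Alternatively, if you'd rather pull the tag inventory out of whatever model you actually have loaded instead of the docs, the trained pipeline components expose their label sets. A sketch, assuming a model whose pipeline contains a 'tagger' component (true of the stock English models):

import spacy

nlp = spacy.load('en_core_web_sm')
# The fine-grained .tag_ values this particular model can emit
print(nlp.get_pipe('tagger').labels)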

Dependency tokens

There are now three different schemes that spaCy uses for dependency tagging: one for English, one for German, and one for everything else. Once again, the list of values is huge and I won't reproduce it in full here. Every dependency has a brief definition next to it, but unfortunately, many of them - like "appositional modifier" or "clausal complement" - are terms of art that are rather alien to an everyday programmer like me. If you're not a linguist, you'll simply have to research the meanings of those terms of art to make sense of them.

I can at least provide a starting point for that research for people working with English text, though. If you'd like to see some examples of the CLEAR dependencies (used by the English model) in real sentences, check out the 2012 work of Jinho D. Choi: either his Optimization of Natural Language Processing Components for Robustness and Scalability or his Guidelines for the CLEAR Style Constituent to Dependency Conversion (which seems to just be a subsection of the former paper). Both list all the CLEAR dependency labels that existed in 2012 along with definitions and example sentences. (Unfortunately, the set of CLEAR dependency labels has changed a little since 2012, so some of the modern labels are not listed or exemplified in Choi's work - but it remains a useful resource despite being slightly outdated.)
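
If you just want a quick gloss for each dependency label your loaded English model can emit, without working through those papers, one option is to combine the parser's label set with spacy.explain(). A sketch, again assuming en_core_web_sm with a 'parser' component in its pipeline:

import spacy

nlp = spacy.load('en_core_web_sm')
for label in nlp.get_pipe('parser').labels:
    # spacy.explain returns None for labels it has no description for
    print(label, '->', spacy.explain(label))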

Answered by Nuhil Mehdy

Just a quick tip about getting the detailed meaning of the short forms. You can use the explain method like the following:

spacy.explain('pobj')

which will give you output like:

'object of preposition'

Answered by Silveri

The official documentation now provides much more detail for all those annotations at https://spacy.io/api/annotation (and the list of other attributes for tokens can be found at https://spacy.io/api/token).

As the documentation shows, their parts-of-speech (POS) and dependency tags have both Universal and specific variations for different languages and the explain() function is a very useful shortcut to get a better description of a tag's meaning without the documentation, e.g.

spacy.explain("VBD")

which gives "verb, past tense".

Answered by rebeccabilbro

At present, dependency parsing and tagging in SpaCy appears to be implemented only at the word level, and not at the phrase (other than noun phrase) or clause level. This means SpaCy can be used to identify things like nouns (NN, NNS), adjectives (JJ, JJR, JJS), and verbs (VB, VBD, VBG, etc.), but not adjective phrases (ADJP), adverbial phrases (ADVP), or questions (SBARQ, SQ).
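
Noun phrases are the one phrase-level grouping you do get out of the box, via the doc.noun_chunks iterator. A sketch, assuming the en_core_web_sm English model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Which way is the bus going?")
for chunk in doc.noun_chunks:
    # each chunk is a Span; its root token carries the dependency label
    print(chunk.text, '->', chunk.root.dep_)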

For illustration, when you use SpaCy to parse the sentence "Which way is the bus going?", we get the following tree.
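
The original screenshot of that tree doesn't reproduce here, but you can recover the same information in text form, or render it in a browser with displacy. A sketch, assuming the en_core_web_sm English model:

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Which way is the bus going?")
for token in doc:
    # every token points at exactly one head - a flat, word-level tree
    print(token.text, token.dep_, '<-', token.head.text)
# displacy.serve(doc, style='dep')  # optional: view the arcs interactively in a browser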

By contrast, if you use the Stanford parser you get a much more deeply structured syntax tree.