Python - How to get the dependency tree with spaCy?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/36610179/

Date: 2020-08-19 18:07:02  Source: igfitidea

How to get the dependency tree with spaCy?

Tags: python, spacy

Asked by Nicolas Joseph

I have been trying to find how to get the dependency tree with spaCy but I can't find anything on how to get the tree, only on how to navigate the tree.


Accepted answer by Nicolas Joseph

It turns out, the tree is available through the tokens in a document.


If you want to find the root of the tree, you can just go through the document:


def find_root(docu):
    # the root is the only token whose head is itself
    for token in docu:
        if token.head is token:
            return token

To then navigate the tree, the tokens have an API for accessing their children.


Answer by Christos Baziotis

In case someone wants to easily view the dependency tree produced by spacy, one solution would be to convert it to an nltk.tree.Tree and use the nltk.tree.Tree.pretty_print method. Here is an example:


import spacy
from nltk import Tree


en_nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3

doc = en_nlp("The quick brown fox jumps over the lazy dog.")

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_


[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]

Output:


                jumps                  
  ________________|____________         
 |    |     |     |    |      over     
 |    |     |     |    |       |        
 |    |     |     |    |      dog      
 |    |     |     |    |    ___|____    
The quick brown  fox   .  the      lazy


Edit: For changing the token representation you can do this:


def tok_format(tok):
    return "_".join([tok.orth_, tok.tag_])


def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)

Which results in:


                         jumps_VBZ                           
   __________________________|___________________             
  |       |        |         |      |         over_IN        
  |       |        |         |      |            |            
  |       |        |         |      |          dog_NN        
  |       |        |         |      |     _______|_______     
The_DT quick_JJ brown_JJ   fox_NN  ._. the_DT         lazy_JJ

Answer by Mark Amery

The tree isn't an object in itself; you just navigate it via the relationships between tokens. That's why the docs talk about navigating the tree, but not 'getting' it.


First, let's parse some text to get a Doc object:


>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('First, I wrote some sentences. Then spaCy parsed them. Hooray!')

doc is a Sequence of Token objects:


>>> doc[0]
First
>>> doc[1]
,
>>> doc[2]
I
>>> doc[3]
wrote

But it doesn't have a single root token. We parsed a text made up of three sentences, so there are three distinct trees, each with its own root. If we want to start our parsing from the root of each sentence, it will help to get the sentences as distinct objects first. Fortunately, doc exposes these to us via the .sents property:


>>> sentences = list(doc.sents)
>>> for sentence in sentences:
...     print(sentence)
... 
First, I wrote some sentences.
Then spaCy parsed them.
Hooray!

Each of these sentences is a Span with a .root property pointing to its root token. Usually, the root token will be the main verb of the sentence (although this may not be true for unusual sentence structures, such as sentences without a verb):


>>> for sentence in sentences:
...     print(sentence.root)
... 
wrote
parsed
Hooray

With the root token found, we can navigate down the tree via the .children property of each token. For instance, let's find the subject and object of the verb in the first sentence. The .dep_ property of each child token describes its relationship with its parent; for instance, a dep_ of 'nsubj' means that a token is the nominal subject of its parent.


>>> root_token = sentences[0].root
>>> for child in root_token.children:
...     if child.dep_ == 'nsubj':
...         subj = child
...     if child.dep_ == 'dobj':
...         obj = child
... 
>>> subj
I
>>> obj
sentences

We can likewise keep going down the tree by viewing the children of one of these tokens:


>>> list(obj.children)
[some]

Thus with the properties above, you can navigate the entire tree. If you want to visualise some dependency trees for example sentences to help you understand the structure, I recommend playing with displaCy.


Answer by Rohan

You can use the library below to view your dependency tree; I found it extremely helpful!


import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')

To generate a svg file:


from pathlib import Path
output_path = Path("yourpath/.svg")
svg = displacy.render(doc, style='dep')
with output_path.open("w", encoding="utf-8") as fh:
    fh.write(svg)

Answer by Christopher Reiss

I don't know if this is a new API call or what, but there's a .print_tree() method on the Document class that makes quick work of this.


https://spacy.io/api/doc#print_tree


It dumps the dependency tree to JSON. It deals with multiple sentence roots and all that:


import spacy
nlp = spacy.load('en')
doc1 = nlp(u'This is the way the world ends.  So you say.')
print(doc1.print_tree(light=True))

The name print_tree is a bit of a misnomer: the method itself doesn't print anything; rather, it returns a list of dicts, one for each sentence.

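A note for newer releases: as far as I can tell, print_tree was deprecated in spaCy 2.1 and removed in v3 in favour of Doc.to_json, which likewise returns per-token head and dep fields. A small sketch that flattens that JSON into (child, head, dep) arcs (the helper name arcs is my own):

```python
def arcs(doc_json):
    """Turn the dict returned by Doc.to_json() into (child, head, dep) index triples."""
    return [(t["id"], t["head"], t["dep"]) for t in doc_json["tokens"]]

# Usage (assumes a parser-equipped model such as en_core_web_sm is installed):
#   import spacy
#   doc = spacy.load("en_core_web_sm")("This is the way the world ends.  So you say.")
#   for child, head, dep in arcs(doc.to_json()):
#       print(doc[child].text, "->", doc[head].text, f"({dep})")
```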

Answer by Krzysiek

I also needed to do this, so here is the full code:


import sys
def showTree(sent):
    def __showTree(token):
        sys.stdout.write("{")
        [__showTree(t) for t in token.lefts]
        sys.stdout.write("%s->%s(%s)" % (token,token.dep_,token.tag_))
        [__showTree(t) for t in token.rights]
        sys.stdout.write("}")
    return __showTree(sent.root)

And if you want spacing for the terminal:


def showTree(sent):
    def __showTree(token, level):
        tab = "\t" * level
        sys.stdout.write("\n%s{" % (tab))
        [__showTree(t, level+1) for t in token.lefts]
        sys.stdout.write("\n%s\t%s [%s] (%s)" % (tab,token,token.dep_,token.tag_))
        [__showTree(t, level+1) for t in token.rights]
        sys.stdout.write("\n%s}" % (tab))
    return __showTree(sent.root, 1)
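For reuse outside a terminal, here is a variant of the bracketed printer above that builds and returns the string instead of writing to stdout. It only assumes the Token attributes .lefts, .rights, .dep_ and .tag_, so it works on any parsed spaCy sentence:

```python
def tree_to_string(token):
    """Same bracketed {left subtrees, token->dep(tag), right subtrees} layout
    as showTree above, but returned as a string instead of printed."""
    left = "".join(tree_to_string(t) for t in token.lefts)
    right = "".join(tree_to_string(t) for t in token.rights)
    return "{%s%s->%s(%s)%s}" % (left, token, token.dep_, token.tag_, right)

# Usage (assumes a parsed Doc from a model such as en_core_web_sm):
#   for sent in doc.sents:
#       print(tree_to_string(sent.root))
```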

Answer by lilienfa

I do not have enough knowledge about parsing yet. However, my literature study shows that spaCy uses a shift-reduce dependency parsing algorithm, which parses the question/sentence and produces a parse tree. To visualize it, you can use displaCy, a combination of CSS and JavaScript that works with Python and Cython. Furthermore, you can parse using the spaCy library and import the Natural Language Toolkit (NLTK). Hope this helps.
