How to get the dependency tree with spaCy?
Original question: http://stackoverflow.com/questions/36610179/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors: Stack Overflow
Asked by Nicolas Joseph
I have been trying to find how to get the dependency tree with spaCy but I can't find anything on how to get the tree, only on how to navigate the tree.
Accepted answer by Nicolas Joseph
It turns out the tree is available through the tokens in a document.
If you want to find the root of the tree, you can just go through the document:
def find_root(docu):
    for token in docu:
        if token.head is token:
            return token
To then navigate the tree, each token exposes an API for walking through its children.
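As a rough illustration of walking down from the root, here is a minimal sketch of a depth-first traversal that relies only on a `children` attribute. `FakeToken` and `walk` are hypothetical names introduced for this example (they are not part of spaCy); the stand-in class is there only so the snippet runs without a language model installed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (text and children only)."""
    text: str
    children: List["FakeToken"] = field(default_factory=list)


def walk(token, depth: int = 0) -> Iterator[Tuple[int, object]]:
    """Yield (depth, token) pairs depth-first, starting at `token`."""
    yield depth, token
    for child in token.children:
        yield from walk(child, depth + 1)


# Hand-built tree: "wrote" has children "I" and "sentences"; "sentences" has "some"
root = FakeToken("wrote", [FakeToken("I"), FakeToken("sentences", [FakeToken("some")])])

for depth, tok in walk(root):
    print("  " * depth + tok.text)
```

With a real parse, passing the token returned by `find_root(doc)` to `walk` should work the same way, since `walk` only reads `children`.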
Answer by Christos Baziotis
In case someone wants to easily view the dependency tree produced by spaCy, one solution would be to convert it to an nltk.tree.Tree and use the nltk.tree.Tree.pretty_print method. Here is an example:
import spacy
from nltk import Tree

# 'en_core_web_sm' replaces the old 'en' shortcut, which spaCy v3 no longer accepts
en_nlp = spacy.load('en_core_web_sm')
doc = en_nlp("The quick brown fox jumps over the lazy dog.")

def to_nltk_tree(node):
    # Tokens with children become subtrees; leaf tokens become plain strings
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_

for sent in doc.sents:
    to_nltk_tree(sent.root).pretty_print()
Output:
                jumps
  ________________|____________
 |    |     |     |    |      over
 |    |     |     |    |       |
 |    |     |     |    |      dog
 |    |     |     |    |    ___|____
The quick brown  fox   .  the      lazy
Edit: To change the token representation, you can do this:
def tok_format(tok):
    return "_".join([tok.orth_, tok.tag_])

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)
Which results in:
                         jumps_VBZ
   __________________________|___________________
  |       |        |         |      |         over_IN
  |       |        |         |      |            |
  |       |        |         |      |          dog_NN
  |       |        |         |      |     _______|_______
The_DT quick_JJ brown_JJ   fox_NN  ._. the_DT         lazy_JJ
Answer by Mark Amery
The tree isn't an object in itself; you just navigate it via the relationships between tokens. That's why the docs talk about navigating the tree, but not 'getting' it.
First, let's parse some text to get a Doc object:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('First, I wrote some sentences. Then spaCy parsed them. Hooray!')
doc is a Sequence of Token objects:
>>> doc[0]
First
>>> doc[1]
,
>>> doc[2]
I
>>> doc[3]
wrote
But it doesn't have a single root token. We parsed a text made up of three sentences, so there are three distinct trees, each with its own root. If we want to start navigating from the root of each sentence, it helps to first get the sentences as distinct objects. Fortunately, doc exposes these to us via the .sents property:
>>> sentences = list(doc.sents)
>>> for sentence in sentences:
...     print(sentence)
...
First, I wrote some sentences.
Then spaCy parsed them.
Hooray!
Each of these sentences is a Span with a .root property pointing to its root token. Usually, the root token will be the main verb of the sentence (although this may not be true for unusual sentence structures, such as sentences without a verb):
>>> for sentence in sentences:
...     print(sentence.root)
...
wrote
parsed
Hooray
With the root token found, we can navigate down the tree via the .children property of each token. For instance, let's find the subject and object of the verb in the first sentence. The .dep_ property of each child token describes its relationship with its parent; for instance, a dep_ of 'nsubj' means that a token is the nominal subject of its parent.
>>> root_token = sentences[0].root
>>> for child in root_token.children:
...     if child.dep_ == 'nsubj':
...         subj = child
...     if child.dep_ == 'dobj':
...         obj = child
...
>>> subj
I
>>> obj
sentences
We can likewise keep going down the tree by viewing the children of one of these tokens:
>>> list(obj.children)
[some]
Thus with the properties above, you can navigate the entire tree. If you want to visualise some dependency trees for example sentences to help you understand the structure, I recommend playing with displaCy.
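The walk above goes downward via .children; going upward works through .head. Below is a minimal sketch that collects the chain of heads from a token up to its sentence root, assuming (as in the accepted answer) that the root token is its own head. `FakeToken` and `path_to_root` are hypothetical names made up for this example, and the stand-in class exists only so the snippet runs without a model.

```python
class FakeToken:
    """Hypothetical stand-in for a spaCy Token: text plus a head pointer."""

    def __init__(self, text, head=None):
        self.text = text
        # Assumption taken from the thread: a root token is its own head
        self.head = head if head is not None else self


def path_to_root(token):
    """Return the tokens from `token` up to the sentence root, inclusive."""
    path = [token]
    while token.head != token:
        token = token.head
        path.append(token)
    return path


# Hand-built chain: some -> sentences -> wrote (root)
wrote = FakeToken("wrote")
sentences = FakeToken("sentences", head=wrote)
some = FakeToken("some", head=sentences)

print([t.text for t in path_to_root(some)])
```

spaCy tokens also expose an .ancestors property that covers a similar need, so a helper like this is mostly useful when you want the starting token included or want to stop early.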
Answer by Rohan
You can use the library below to view your dependency tree; I found it extremely helpful!
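import spacy
from spacy import displacy
# 'en_core_web_sm' replaces the old 'en' shortcut, which spaCy v3 no longer accepts
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')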
To generate an SVG file:
from pathlib import Path
output_path = Path("yourpath/.svg")
svg = displacy.render(doc, style='dep')
with output_path.open("w", encoding="utf-8") as fh:
    fh.write(svg)
Answer by Christopher Reiss
I don't know if this is a new API call or what, but there's a .print_tree() method on the Doc class that makes quick work of this.
https://spacy.io/api/doc#print_tree
It dumps the dependency tree to JSON. It deals with multiple sentence roots and all that:
import spacy
nlp = spacy.load('en')
doc1 = nlp(u'This is the way the world ends. So you say.')
print(doc1.print_tree(light=True))
The name print_tree is a bit of a misnomer: the method itself doesn't print anything; rather, it returns a list of dicts, one for each sentence.
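As far as I know, newer spaCy releases no longer ship print_tree (Doc.to_json covers a related need), so it may help to see how a similar nested structure can be built by hand. The sketch below is not the library's implementation: `tree_to_dict` and `tok` are hypothetical helpers, the key names are my own choice, and the stand-in tokens are there only so the snippet runs without a model.

```python
import json
from types import SimpleNamespace


def tree_to_dict(token):
    """Recursively convert a token and its subtree into plain dicts."""
    return {
        "word": token.text,
        "arc": token.dep_,
        "modifiers": [tree_to_dict(child) for child in token.children],
    }


def tok(text, dep, children=()):
    """Hypothetical stand-in for a spaCy Token, used only for this demo."""
    return SimpleNamespace(text=text, dep_=dep, children=list(children))


# Hand-built parse of "So you say." (dependency labels chosen for illustration)
root = tok("say", "ROOT", [tok("So", "advmod"), tok("you", "nsubj"), tok(".", "punct")])

print(json.dumps(tree_to_dict(root), indent=2))
```

With a real Doc, something like `[tree_to_dict(sent.root) for sent in doc.sents]` would give one dict per sentence, which handles multiple sentence roots in the same spirit.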
Answer by Krzysiek
I also needed to do this, so here is the full code:
import sys

def showTree(sent):
    def __showTree(token):
        sys.stdout.write("{")
        for t in token.lefts:
            __showTree(t)
        sys.stdout.write("%s->%s(%s)" % (token, token.dep_, token.tag_))
        for t in token.rights:
            __showTree(t)
        sys.stdout.write("}")
    return __showTree(sent.root)
And if you want spacing for the terminal:
def showTree(sent):
    def __showTree(token, level):
        tab = "\t" * level
        sys.stdout.write("\n%s{" % (tab))
        for t in token.lefts:
            __showTree(t, level+1)
        sys.stdout.write("\n%s\t%s [%s] (%s)" % (tab, token, token.dep_, token.tag_))
        for t in token.rights:
            __showTree(t, level+1)
        sys.stdout.write("\n%s}" % (tab))
    return __showTree(sent.root, 1)
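If you would rather get the bracketed form back as a string (to log it or compare it in a test) instead of writing to stdout, a variant along these lines might work. This is only a sketch: `bracketed` and `tok` are hypothetical names, and the stand-in tokens (with just text, dep_, tag_, lefts and rights) are there so the snippet runs without a model.

```python
from types import SimpleNamespace


def bracketed(token):
    """Return the subtree under `token` in a {left word->dep(tag) right} form."""
    left = "".join(bracketed(t) for t in token.lefts)
    right = "".join(bracketed(t) for t in token.rights)
    return "{%s%s->%s(%s)%s}" % (left, token.text, token.dep_, token.tag_, right)


def tok(text, dep, tag, lefts=(), rights=()):
    """Hypothetical stand-in for a spaCy Token, used only for this demo."""
    return SimpleNamespace(text=text, dep_=dep, tag_=tag,
                           lefts=list(lefts), rights=list(rights))


# Hand-built parse of "The fox jumps." (labels chosen for illustration)
the = tok("The", "det", "DT")
fox = tok("fox", "nsubj", "NN", lefts=[the])
dot = tok(".", "punct", ".")
jumps = tok("jumps", "ROOT", "VBZ", lefts=[fox], rights=[dot])

print(bracketed(jumps))
```

Building the string recursively keeps the function free of side effects, so the caller decides whether to print it, write it to a file, or compare it against an expected value.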
Answer by lilienfa
I do not know much about parsing yet. However, from my reading of the literature, spaCy uses a shift-reduce dependency parsing algorithm. It parses the question or sentence and produces a parse tree. To visualize this, you can use displaCy, a combination of CSS and JavaScript that works with Python and Cython. Furthermore, you can parse with the spaCy library and import the Natural Language Toolkit (NLTK). Hope this helps.