Original source: http://stackoverflow.com/questions/6572207/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Stanford Core NLP - understanding coreference resolution
Asked by pnsilva
I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation:
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}
I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.
Thank you
Accepted answer by Skarab
The first number is a cluster ID (a cluster groups the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():
public String toString(){
return position.toString();
}
where position is a set of position pairs for the entity's mentions (to get the mentions themselves, use CorefChain.getCorefMentions()). Here is a complete example (in Groovy) that shows how to get from positions to tokens:
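// Package names as in the dcoref-era CoreNLP releases; adjust them for your version.
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);
        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
        println aText
        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // startIndex/endIndex are applied to the raw text here, as in the original answer
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}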
Output (I do not understand where 's' comes from):
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention: basic unit
Mentions: basic unit |
ClusterId: 8
Representative Mention: unit
Mentions: unit |
ClusterId: 10
Representative Mention: it
Mentions: it |
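One possible explanation for the stray 's', offered as a sketch rather than a confirmed diagnosis: startIndex and endIndex look like 1-based token positions within the sentence, not character offsets into the text. The token spans used below, (1,3), (4,9) and (10,11), are assumptions inferred from the mentions "The atom", "a basic unit of matter" and "it"; the class name OffsetDemo and the hand-written token list are hypothetical, and the code does not call CoreNLP at all. Passing those spans to substring gives back the fragments of cluster 1 seen above, while reading them as token positions gives whole mentions:

```java
import java.util.Arrays;
import java.util.List;

public class OffsetDemo {
    static final String TEXT =
        "The atom is a basic unit of matter, it consists of a dense central nucleus "
        + "surrounded by a cloud of negatively charged electrons.";

    // Hand-written tokens for the start of the sentence (assumed tokenization).
    static final List<String> TOKENS = Arrays.asList(
        "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it");

    // Assumed (startIndex, endIndex) pairs: 1-based, end exclusive.
    static final int[][] SPANS = { {1, 3}, {4, 9}, {10, 11} };

    // What the Groovy example effectively does: indices used as character offsets.
    public static String byCharOffset(int start, int end) {
        return TEXT.substring(start, end);
    }

    // Indices read as 1-based token positions, joined with single spaces.
    public static String byTokenIndex(int start, int end) {
        return String.join(" ", TOKENS.subList(start - 1, end - 1));
    }

    public static void main(String[] args) {
        for (int[] s : SPANS) {
            System.out.println("[" + byCharOffset(s[0], s[1]) + "] vs ["
                + byTokenIndex(s[0], s[1]) + "]");
        }
    }
}
```

Under these assumptions the character-offset reading yields "he", "atom " and "s", so the 's' would simply be the eleventh character of the sentence (the last letter of "is").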
Answer by user1084563
I've been working with the coreference dependency graph, and I started by using the other answer to this question. After a while, though, I realized that the algorithm above is not exactly correct. The output it produced was not even close to what my modified version produces.
For anyone else who uses this post, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself and a lot of mentions only reference themselves.
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);
for(Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();
    //this is because it prints out a lot of self references which aren't that useful
    if(c.getCorefMentions().size() <= 1)
        continue;
    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
    for(int i = cm.startIndex-1; i < cm.endIndex-1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");
    for(CorefMention m : c.getCorefMentions()){
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
        for(int i = m.startIndex-1; i < m.endIndex-1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        //don't need the self mention
        if(clust.equals(clust2))
            continue;
        System.out.println("\t" + clust2);
    }
}
And the final output for your example sentence is the following:
representative mention: "a basic unit of matter" is mentioned by:
The atom
it
Usually "The atom" would end up being the representative mention, but surprisingly, in this case it does not. Another example, with slightly more accurate output, is the following sentence:
The Revolutionary War occurred during the 1700s and it was the first war in the United States.
produces the following output:
representative mention: "The Revolutionary War" is mentioned by:
it
the first war in the United States
Answer by Purvanshi
These are recent results from the annotator:
- [1, 1] 1 The atom
- [1, 2] 1 a basic unit of matter
- [1, 3] 1 it
- [1, 6] 6 negatively charged electrons
- [1, 5] 5 a cloud of negatively charged electrons
The notation is as follows:
[Sentence number,'id'] Cluster_no Text_Associated
Text belonging to the same cluster refers to the same entity.
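To make that reading concrete, here is a small sketch that groups the five lines listed above by their cluster number. It uses plain Java collections only; the Mention holder class, the ClusterGrouping name and the sample() helper are hypothetical names introduced for illustration and are not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterGrouping {
    /** One listing line: [sentence, id] cluster text (hypothetical holder class). */
    public static final class Mention {
        final int sentence;
        final int id;
        final int cluster;
        final String text;

        Mention(int sentence, int id, int cluster, String text) {
            this.sentence = sentence;
            this.id = id;
            this.cluster = cluster;
            this.text = text;
        }
    }

    /** The five lines from the listing above, copied by hand. */
    public static List<Mention> sample() {
        return Arrays.asList(
            new Mention(1, 1, 1, "The atom"),
            new Mention(1, 2, 1, "a basic unit of matter"),
            new Mention(1, 3, 1, "it"),
            new Mention(1, 6, 6, "negatively charged electrons"),
            new Mention(1, 5, 5, "a cloud of negatively charged electrons"));
    }

    /** Collects mention texts per cluster number, keeping first-seen order. */
    public static Map<Integer, List<String>> groupByCluster(List<Mention> mentions) {
        Map<Integer, List<String>> clusters = new LinkedHashMap<>();
        for (Mention m : mentions) {
            clusters.computeIfAbsent(m.cluster, k -> new ArrayList<>()).add(m.text);
        }
        return clusters;
    }

    public static void main(String[] args) {
        groupByCluster(sample()).forEach(
            (cluster, texts) -> System.out.println(cluster + " -> " + texts));
    }
}
```

With this data, cluster 1 collects "The atom", "a basic unit of matter" and "it", while clusters 5 and 6 each hold a single mention.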