Disclaimer: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30664677/
Extract list of Persons and Organizations using Stanford NER Tagger in NLTK
Asked by user1680859
I am trying to extract list of persons and organizations using Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
r=st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)
the output is:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
what I want is to extract from this list all persons and organizations in this form:
Rami Eid
Stony Brook University
I tried to loop through the list of tuples:
for x, y in r:
    if y == 'ORGANIZATION':
        print(x)
But this code prints each entity word on its own line:
Stony
Brook
University
With real data there can be more than one organization or person in a single sentence. How can I put boundaries between different entities?
Accepted answer by alexis
Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:
Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)
You have the following options:
1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. "New York, Boston [and] Baltimore" is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.
2. Use nltk.ne_recognize(). It doesn't use the Stanford recognizer but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools, or the NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.
Edit: If all you want is to pull out runs of continuous named entities (option 1 above), you should use itertools.groupby:
from itertools import groupby

for tag, chunk in groupby(netagged_words, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))
If netagged_words is the list of (word, type) tuples in your question, this produces:
PERSON Rami Eid
ORGANIZATION Stony Brook University
LOCATION NY
Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore
is about three cities, not one.
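That limitation can be demonstrated with a minimal sketch. The tagged list below is hypothetical (invented for illustration): two distinct locations sit next to each other with no "O" token between them, so groupby merges them into one chunk:

```python
from itertools import groupby

# Hypothetical tagger output: "New York" and "Boston" are two distinct
# locations, but no 'O'-tagged token separates them.
tagged = [('New', 'LOCATION'), ('York', 'LOCATION'), ('Boston', 'LOCATION'),
          ('is', 'O'), ('cold', 'O')]

chunks = []
for tag, group in groupby(tagged, lambda x: x[1]):
    if tag != 'O':
        chunks.append((' '.join(w for w, t in group), tag))

print(chunks)  # the two cities are wrongly merged into a single chunk
```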
Answer by alvas
IOB/BIO stands for Inside, Outside, Beginning (IOB), sometimes also called Beginning, Inside, Outside (BIO).
The Stanford NE tagger returns IOB/BIO-style tags, e.g.
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
The ('Rami', 'PERSON'), ('Eid', 'PERSON') are tagged as PERSON: "Rami" is the Beginning of an NE chunk and "Eid" is the Inside. And then you see that any non-NE token is tagged with "O".
The idea of extracting a continuous NE chunk is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:
def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []

    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)

    return continuous_chunk
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print(named_entities_str)
print(named_entities_str_tag)
[out]:
[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]
['Rami Eid', 'Stony Brook University', 'NY']
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
But please note the limitation that if two NEs are contiguous this might be wrong; nevertheless, I still can't think of an example where two NEs are contiguous without any "O" between them.
As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:
from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O":  # O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O":  # Begin NE
            bio_tagged_sent.append((token, "B-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag:  # Inside NE
            bio_tagged_sent.append((token, "I-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag:  # Adjacent NE
            bio_tagged_sent.append((token, "B-" + tag))
            prev_tag = tag
    return bio_tagged_sent

def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in
                      zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
('in', 'O'), ('NY', 'LOCATION')]
ne_tree = stanfordNE2tree(ne_tagged_sent)
print(ne_tree)
[out]:
(S
(PERSON Rami/NNP Eid/NNP)
is/VBZ
studying/VBG
at/IN
(ORGANIZATION Stony/NNP Brook/NNP University/NNP)
in/IN
(LOCATION NY/NNP))
Then:
ne_in_sent = []
for subtree in ne_tree:
    if type(subtree) == Tree:  # If subtree is a noun chunk, i.e. NE != "O"
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)
[out]:
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
Answer by Kanishk Gandharv
Not exactly what the topic author asked to print, but maybe this can be of some help:
listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
def parser(n, string):
    for i in listx[n]:
        if i == string:
            pass
        else:
            return i

name = parser(0, 'PERSON')
lname = parser(1, 'PERSON')
org1 = parser(5, 'ORGANIZATION')
org2 = parser(6, 'ORGANIZATION')
org3 = parser(7, 'ORGANIZATION')

print(name, lname)
print(org1, org2, org3)
Output would be something like this:
Rami Eid
Stony Brook University
Answer by Abhishek Bisht
Use the pycorenlp wrapper from Python and then use 'entitymentions' as a key to get the continuous chunk of a person or organization in a single string.
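A sketch of that approach, assuming a CoreNLP server is running locally (the server URL and sample text are assumptions about your setup; the 'entitymentions' key is produced by the 'ner' annotator). To keep the extraction step self-contained, the annotate() call is shown in a comment and a trimmed sample of the returned JSON shape stands in for a live response:

```python
# Sketch, assuming a CoreNLP server is running, e.g.:
#   from pycorenlp import StanfordCoreNLP
#   nlp = StanfordCoreNLP('http://localhost:9000')
#   output = nlp.annotate('Rami Eid is studying at Stony Brook University in NY',
#                         properties={'annotators': 'ner', 'outputFormat': 'json'})

# Trimmed sample of the JSON shape returned with the 'ner' annotator: each
# sentence dict carries an 'entitymentions' list whose items already span
# the full multi-token entity.
output = {
    'sentences': [{
        'entitymentions': [
            {'text': 'Rami Eid', 'ner': 'PERSON'},
            {'text': 'Stony Brook University', 'ner': 'ORGANIZATION'},
            {'text': 'NY', 'ner': 'LOCATION'},
        ]
    }]
}

entities = [(m['text'], m['ner'])
            for sent in output['sentences']
            for m in sent['entitymentions']]
print(entities)
```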
Answer by Akash Tyagi
Try using the "enumerate" method.
When you apply NER to the list of words, once tuples are created of (word,type), enumerate this list using the enumerate(list). This would assign an index to every tuple in the list.
So later, when you extract PERSON/ORGANISATION/LOCATION from the list, each entry will have an index attached to it.
1 Hussein
2 Obama
3 II
6 James
7 Naismith
21 Naismith
19 Tony
20 Hinkle
0 Frank
1 Mahan
14 Naismith
0 Naismith
0 Mahan
0 Mahan
0 Naismith
Now, on the basis of consecutive indices, a single name can be filtered out.
Hussein Obama II, James Naismith, Tony Hinkle, Frank Mahan
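The consecutive-index grouping described above can be sketched like this (the indexed pairs are hypothetical, mirroring the example output):

```python
# Hypothetical (index, word) pairs for PERSON-tagged tokens, as produced
# by enumerate() over the tagged sentence.
indexed = [(1, 'Hussein'), (2, 'Obama'), (3, 'II'), (6, 'James'), (7, 'Naismith')]

names = []
current = []
prev_idx = None
for idx, word in indexed:
    if prev_idx is not None and idx == prev_idx + 1:
        current.append(word)          # consecutive index: same entity
    else:
        if current:                   # gap in indices: start a new entity
            names.append(' '.join(current))
        current = [word]
    prev_idx = idx
if current:                           # flush the last entity
    names.append(' '.join(current))

print(names)
```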
Answer by yunus
WARNING: Even if you get the model "all.3class.distsim.crf.ser.gz", please don't use it, because:
- 1st reason: For this model the Stanford NLP people have openly apologized for its bad accuracy.
- 2nd reason: It has bad accuracy because it is case sensitive.
- SOLUTION: Use the model called "english.all.3class.caseless.distsim.crf.ser.gz" instead.