Extract list of Persons and Organizations using Stanford NER Tagger in NLTK (Python)

Disclaimer: this page is a copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/30664677/



Tags: python, nltk, stanford-nlp, named-entity-recognition

Asked by user1680859

I am trying to extract list of persons and organizations using Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:


from nltk.tag.stanford import NERTagger  # renamed StanfordNERTagger in newer NLTK releases

st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
r = st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)

the output is:


[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

What I want is to extract from this list all persons and organizations, in this form:


Rami Eid
Stony Brook University

I tried to loop through the list of tuples:


for x, y in r:
    if y == 'ORGANIZATION':
        print(x)

But this code only prints each entity word on its own line:


Stony
Brook
University

With real data there can be more than one organization and more than one person in a single sentence; how can I put boundaries between the different entities?


Accepted answer by alexis

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:


Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012).


You have the following options:


  1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.

  2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer but it does chunk entities. (It's a wrapper around an IOB named entity tagger; see the sketch after this list.)

  3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.

  4. Train your own IOB named entity chunker (using the Stanford tools, or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.

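A minimal sketch of option 2, using NLTK's own chunker rather than the Stanford tagger (it needs the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources from nltk.download(), and it uses slightly different labels, e.g. GPE for places):

from nltk import ne_chunk, pos_tag, word_tokenize

sent = 'Rami Eid is studying at Stony Brook University in NY'
tree = ne_chunk(pos_tag(word_tokenize(sent)))  # an nltk.Tree with NE subtrees
for subtree in tree:
    # named entities come back as subtrees labeled PERSON, ORGANIZATION, GPE, ...
    if hasattr(subtree, 'label'):
        print(subtree.label(), " ".join(token for token, pos in subtree.leaves()))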

Edit: If all you want is to pull out runs of contiguous named entities (option 1 above), you should use itertools.groupby:


from itertools import groupby

# group adjacent (word, tag) pairs by tag; each non-"O" run is one entity
for tag, chunk in groupby(netagged_words, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:


PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY
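
If you only want the persons and organizations the question asks for, a small variation of the same loop (still assuming netagged_words holds the (word, type) tuples) collects just those strings:

from itertools import groupby

entities = []
for tag, chunk in groupby(netagged_words, lambda x: x[1]):
    if tag in ("PERSON", "ORGANIZATION"):
        entities.append(" ".join(w for w, t in chunk))
print(entities)  # ['Rami Eid', 'Stony Brook University']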

Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.


Answer by alvas

IOB/BIO stands for Inside, Outside, Beginning (IOB), sometimes also known as Beginning, Inside, Outside (BIO).


The Stanford NE tagger returns IOB/BIO style tags, e.g.


[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

The tokens ('Rami', 'PERSON') and ('Eid', 'PERSON') are tagged as PERSON: "Rami" is the beginning of an NE chunk and "Eid" is the inside. You can also see that any non-NE token is tagged with "O".


The idea of extracting a contiguous NE chunk is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:


def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk: # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print()
print(named_entities_str)
print()
print(named_entities_str_tag)

[out]:


[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]

['Rami Eid', 'Stony Brook University', 'NY']

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

But please note the limitation: if two NEs are contiguous, the result might be wrong. Nevertheless, I still can't think of an example where two NEs are contiguous without any "O" between them.




As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:


from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O": #O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": # Begin NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: # Inside NE
            bio_tagged_sent.append((token, "I-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag

    return bio_tagged_sent


def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
('in', 'O'), ('NY', 'LOCATION')]

ne_tree = stanfordNE2tree(ne_tagged_sent)

print(ne_tree)

[out]:


(S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))

Then:


ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree): # if subtree is an NE chunk, i.e. tag != "O"
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)

[out]:


[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

Answer by Kanishk Gandharv

This is not exactly what the topic author asked to print, but maybe it can be of some help:


listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]


def parser(n, tag):
    # return the word from the n-th (word, tag) tuple, skipping the tag itself
    for item in listx[n]:
        if item != tag:
            return item

name = parser(0, 'PERSON')
lname = parser(1, 'PERSON')
org1 = parser(5, 'ORGANIZATION')
org2 = parser(6, 'ORGANIZATION')
org3 = parser(7, 'ORGANIZATION')

print(name, lname)
print(org1, org2, org3)

The output would be something like this:


Rami Eid
Stony Brook University

Answer by Abhishek Bisht

Use the pycorenlp wrapper from Python and then use 'entitymentions' as a key to get the continuous chunk of a person or organization in a single string.

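A rough sketch of that approach, assuming a CoreNLP server is already running on localhost:9000 (the exact JSON layout can vary between CoreNLP versions):

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')  # assumes a running CoreNLP server
text = 'Rami Eid is studying at Stony Brook University in NY'
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,ner',
    'outputFormat': 'json'
})
for sentence in output['sentences']:
    # each mention is already a whole chunk, e.g. "Stony Brook University"
    for mention in sentence['entitymentions']:
        print(mention['ner'], mention['text'])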

Answer by Akash Tyagi

Try using the "enumerate" method.


When you apply NER to the list of words, once the (word, type) tuples are created, enumerate this list using enumerate(list). This assigns an index to every tuple in the list.


So later, when you extract PERSON/ORGANISATION/LOCATION entries from the list, each will have an index attached to it:


1   Hussein
2   Obama
3   II
6   James
7   Naismith
21   Naismith
19   Tony
20   Hinkle
0   Frank
1   Mahan
14   Naismith
0   Naismith
0   Mahan
0   Mahan
0   Naismith

Now, on the basis of consecutive indices, a single name can be filtered out (a sketch of this grouping follows the result below):


Hussein Obama II, James Naismith, Tony Hinkle, Frank Mahan

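A sketch of that grouping idea, using itertools.groupby on the index offsets (the tagged list below is a stand-in for your own (word, type) tuples):

from itertools import groupby

tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

# keep the index of every word of one type, e.g. PERSON
persons = [(i, w) for i, (w, t) in enumerate(tagged) if t == 'PERSON']

# consecutive indices share the same (index - position) offset, so grouping
# on that offset splits the list into runs of adjacent words
names = [" ".join(w for _, (_, w) in run)
         for _, run in groupby(enumerate(persons), lambda p: p[1][0] - p[0])]
print(names)  # ['Rami Eid']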

Answer by yunus

WARNING: Even if you get the model "all.3class.distsim.crf.ser.gz", please don't use it, because:


    1st reason:

For this model, the Stanford NLP people have openly apologized for its bad accuracy.


    2nd reason:

It has bad accuracy because it is case-sensitive.


    SOLUTION

Use the model called "english.all.3class.caseless.distsim.crf.ser.gz" instead.

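A minimal sketch of loading the caseless model; the paths below are assumptions (adjust them to your own install), the caseless models are distributed separately from the base stanford-ner package, and newer NLTK releases call the class StanfordNERTagger:

from nltk.tag.stanford import StanfordNERTagger

# paths are assumptions -- point them at your own Stanford NER install
st = StanfordNERTagger(
    '/usr/share/stanford-ner/classifiers/english.all.3class.caseless.distsim.crf.ser.gz',
    '/usr/share/stanford-ner/stanford-ner.jar')

print(st.tag('rami eid is studying at stony brook university in ny'.split()))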