Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/31836058/

Time: 2020-08-19 10:37:21  Source: igfitidea

NLTK Named Entity recognition to a Python list

python nlp nltk named-entity-recognition

Asked by Zlo

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."


nltk.ne_chunk(my_sent, binary=True)

But I can't figure out how to save these entities to a list. E.g.:

print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')

Thanks.

Accepted answer by alvas

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you would have to traverse the Tree object to get to the NEs.

Take a look at Named Entity Recognition with Regular Expression: NLTK

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>> 
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
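...     for i in chunked:
...         if isinstance(i, Tree):
...             # Adjacent NE subtrees are merged into a single entity.
...             current_chunk.append(" ".join(token for token, pos in i.leaves()))
...         elif current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             # Reset even when the entity was a duplicate.
...             current_chunk = []
...     # Flush an entity that ends the text.
...     if current_chunk:
...         named_entity = " ".join(current_chunk)
...         if named_entity not in continuous_chunk:
...             continuous_chunk.append(named_entity)
...     return continuous_chunk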
... 
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']

Answer by b3000

As you get a tree as a return value, I guess you want to pick those subtrees that are labeled with NE.

Here is a simple example to gather all those in a list:

import nltk

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True)  # POS tagging before chunking!

named_entities = []

for t in parse_tree.subtrees():
    if t.label() == 'NE':
        named_entities.append(t)
        # named_entities.append(list(t))  # if you want to save a list of tagged words instead of a tree

print(named_entities)

This gives:

[Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]

or as a list of lists:

[[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]
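
If plain strings are wanted rather than trees or tagged-word lists, the words of each entity can be joined. A minimal sketch, assuming the list-of-lists output shown above; the names tagged_entities and entity_strings are illustrative and not part of the original answer:

```python
# List-of-lists output copied from the answer above.
tagged_entities = [[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]

# Drop the POS tags and join the words of each entity with spaces.
entity_strings = [" ".join(word for word, tag in entity) for entity in tagged_entities]

print(entity_strings)  # ['WASHINGTON', 'New York']
```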

Also see: How to navigate a nltk.tree.Tree?

Answer by alexis

A Tree is a list. Chunks are subtrees; non-chunked words are plain (word, tag) tuples. So let's go down the list, extract the words from each chunk, and join them.

>>> import nltk
>>> chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)))
>>>
>>> [" ".join(w for w, t in elt) for elt in chunked if isinstance(elt, nltk.Tree)]
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
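
To see how the comprehension filters and joins without downloading any NLTK models, here is a sketch on a hand-built stand-in. The Chunk class is hypothetical and only mimics the properties used here (a chunk is a list of (word, tag) pairs with a label, while a non-chunked token is a plain tuple); real code would test against nltk.Tree instead:

```python
class Chunk(list):
    """Hypothetical stand-in for nltk.Tree: a labeled list of (word, tag) pairs."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


# Hand-written example shaped like ne_chunk output; not produced by NLTK.
chunked = [
    Chunk('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    ('police', 'NN'),
    ('officers', 'NNS'),
    ('in', 'IN'),
    Chunk('GPE', [('Brooklyn', 'NNP')]),
]

# Keep only the chunks and join the words inside each one.
entities = [" ".join(w for w, t in elt) for elt in chunked if isinstance(elt, Chunk)]
print(entities)  # ['New York', 'Brooklyn']
```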

Answer by imanzabet

You can also extract the label of each named entity in the text using this code:

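import nltk

# sentence is the text to analyze, e.g. my_sent from the question.
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only named-entity subtrees have a label; plain tokens are tuples.
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))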

Output:

GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn

You can see that Washington, New York and Brooklyn are labeled GPE, which means geo-political entity, and Loretta E. Lynch is a PERSON.
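
To keep the labels in a Python structure instead of printing them, the pairs can be grouped by label. A sketch, assuming the (label, text) pairs from the output above have already been collected; the names labeled and by_label are illustrative:

```python
from collections import defaultdict

# (label, entity text) pairs copied from the output above.
labeled = [
    ('GPE', 'WASHINGTON'),
    ('GPE', 'New York'),
    ('PERSON', 'Loretta E. Lynch'),
    ('GPE', 'Brooklyn'),
]

# Group entity texts under their label, preserving order of appearance.
by_label = defaultdict(list)
for label, text in labeled:
    by_label[label].append(text)

print(dict(by_label))
```

In real code the pairs would come from the loop above, appending (chunk.label(), text) rather than printing.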

Answer by elwhite

Use tree2conlltags from nltk.chunk. Also, ne_chunk needs POS tagging, which tags word tokens (and thus needs word_tokenize).

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags

sentence = "Mark and John are working at Google."
print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
"""[('Mark', 'NNP', 'B-PERSON'), 
    ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), 
    ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), 
    ('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), 
    ('.', '.', 'O')] """

This will give you a list of tuples: [(token, pos_tag, name_entity_tag)]. If this list is not exactly what you want, it is certainly easier to parse the list you want out of this list than out of an nltk tree.
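
As an illustration of parsing that list, the B-/I- prefixes can be used to regroup tokens into whole entities. This is a sketch, not part of the original answer: the function name iob_to_entities is hypothetical, and the input is the tuple list from the docstring above.

```python
def iob_to_entities(conll_tags):
    """Group (token, pos, iob) triples into (entity text, label) pairs."""
    entities = []
    words, label = [], None
    for token, _pos, iob in conll_tags:
        if iob.startswith('I-') and iob[2:] == label:
            # Continuation of the entity currently being built.
            words.append(token)
            continue
        # Anything else ends the current entity, if there is one.
        if words:
            entities.append((" ".join(words), label))
        if iob == 'O':
            words, label = [], None
        else:
            # 'B-X' (or a stray 'I-X') starts a new entity of type X.
            words, label = [token], iob[2:]
    if words:
        entities.append((" ".join(words), label))
    return entities


conll_tags = [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'),
              ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'),
              ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
              ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

print(iob_to_entities(conll_tags))
# [('Mark', 'PERSON'), ('John', 'PERSON'), ('Google', 'ORGANIZATION')]
```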

Code and details from this link; check it out for more information

You can also continue by only extracting the words, with the following function:

def wordextractor(tuple1):
    # Each item is a (token, pos_tag, name_entity_tag) tuple.
    # Keep the tokens whose entity tag is B-PERSON or I-PERSON.
    return [word for word, _pos, ne in tuple1 if ne in ('B-PERSON', 'I-PERSON')]

print(wordextractor(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))))

Edit: Added output docstring. Edit: Added output only for B-PERSON.

Answer by Nic Scozzaro

You may also consider using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')  # spaCy 3 removed the 'en' shortcut; the model must be installed first

doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')

print([ent for ent in doc.ents])

>>> [WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]

Answer by Akshay

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you would have to traverse the Tree object to get to the NEs. You can use a list comprehension to do the same.

import nltk   
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

word = nltk.word_tokenize(my_sent)
pos_tag = nltk.pos_tag(word)
chunk = nltk.ne_chunk(pos_tag)
NE = [" ".join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
print(NE)