NLTK Named Entity Recognition to a Python list

Question

Asked by Zlo

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."


nltk.ne_chunk(my_sent, binary=True)

But I can't figure out how to save these entities to a list. E.g.:

print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')

Thanks.

Answer 1

Accepted answer by alvas

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the NEs.

Take a look at Named Entity Recognition with Regular Expression: NLTK

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>> 
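>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         if isinstance(i, Tree):
...             # Adjacent NE subtrees are merged into a single entity
...             current_chunk.append(" ".join(token for token, pos in i.leaves()))
...         elif current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             current_chunk = []  # reset even when the entity was a duplicate
...     # Flush an entity that ends the text
...     if current_chunk:
...         named_entity = " ".join(current_chunk)
...         if named_entity not in continuous_chunk:
...             continuous_chunk.append(named_entity)
...     return continuous_chunk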
... 
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']

Answer 2

Answered by b3000

Since you get a tree as the return value, I guess you want to pick the subtrees that are labeled NE.

Here is a simple example to gather all those in a list:

import nltk

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True)  # POS tagging before chunking!

named_entities = []

for t in parse_tree.subtrees():
    if t.label() == 'NE':
        named_entities.append(t)
        # named_entities.append(list(t))  # if you want to save a list of tagged words instead of a tree

print(named_entities)

This gives:

[Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]

or as a list of lists:

[[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]

Also see: How to navigate a nltk.tree.Tree?

Answer 3

Answered by alexis

A Tree is a list. Chunks are subtrees; non-chunked words are plain (word, tag) tuples. So let's go down the list, extract the words from each chunk, and join them.

>>> chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)))
>>>
>>> [" ".join(w for w, t in elt) for elt in chunked if isinstance(elt, nltk.Tree)]
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']

Answer 4

Answered by imanzabet

You can also extract the label of each named entity in the text using this code:

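import nltk

# `sentence` is the text to analyze, e.g. my_sent from the question
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only NE subtrees have a label(); plain tokens are (word, tag) tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))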

Output:

GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn

You can see that Washington, New York and Brooklyn are labeled GPE, which means geo-political entity, and Loretta E. Lynch is a PERSON.
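If you want the labeled entities in a data structure rather than printed, one option is to collect them under their label. The following is only a sketch: the input pairs are copied by hand from the output above, and the helper name entities_by_label is made up for illustration, not part of NLTK.

```python
from collections import defaultdict

# (label, entity) pairs copied from the output shown above
labelled = [("GPE", "WASHINGTON"), ("GPE", "New York"),
            ("PERSON", "Loretta E. Lynch"), ("GPE", "Brooklyn")]


def entities_by_label(pairs):
    """Collect entity strings under their label, keeping first-seen order."""
    grouped = defaultdict(list)
    for label, entity in pairs:
        # Skip repeats so each entity is listed once per label
        if entity not in grouped[label]:
            grouped[label].append(entity)
    return dict(grouped)


print(entities_by_label(labelled))
```

In real use you would build the pairs inside the loop above, appending (chunk.label(), ' '.join(c[0] for c in chunk)) instead of printing.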

Answer 5

Answered by elwhite

Use tree2conlltags from nltk.chunk. Also, ne_chunk needs POS-tagged input, and pos_tag works on word tokens (hence word_tokenize).

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags

sentence = "Mark and John are working at Google."
print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
"""[('Mark', 'NNP', 'B-PERSON'), 
    ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), 
    ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), 
    ('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), 
    ('.', '.', 'O')] """

This gives you a list of tuples: [(token, pos_tag, name_entity_tag)]. If this list is not exactly what you want, it is certainly easier to parse the list you want out of it than out of an NLTK tree.
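As a rough illustration of parsing that list, the sketch below merges consecutive B-/I- tags into whole entities. The function name group_iob_entities is hypothetical (it is not an NLTK function), and the sample input is copied from the tree2conlltags output shown above, so NLTK is not needed to run it.

```python
def group_iob_entities(conll_tags):
    """Group (token, pos_tag, iob_tag) triples into (label, text) pairs."""
    entities = []
    current_tokens, current_label = [], None
    for token, _pos, iob in conll_tags:
        prefix, _, label = iob.partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one
        if current_tokens:
            entities.append((current_label, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # B- starts a new entity; a stray I- is treated as a start too
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    # Flush an entity that ends the sequence
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# Sample input copied from the tree2conlltags output shown above
tagged = [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'),
          ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
          ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]
print(group_iob_entities(tagged))
```

Keeping the label next to the text makes it easy to filter afterwards, for example to keep only the PERSON entries.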

Code and details from this link; check it out for more information

You can also continue by only extracting the words, with the following function:

def wordextractor(tuple1):
    # Split the (token, pos_tag, ne_tag) triples into parallel sequences
    words, pos_tags, ne_tags = zip(*tuple1)
    # Keep the words whose NE tag is B-PERSON or I-PERSON
    return [word for word, ne_tag in zip(words, ne_tags)
            if ne_tag in ('B-PERSON', 'I-PERSON')]

print(wordextractor(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))))

Edit: added the output docstring. Edit: added output for B-PERSON only.

Answer 6

Answered by Nic Scozzaro

You may also consider using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3; the model must be downloaded first

doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')

print([ent for ent in doc.ents])

>>> [WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]

Answer 7

Answered by Akshay

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the NEs. You can use a list comprehension to do this.

import nltk   
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

word = nltk.word_tokenize(my_sent)
pos_tag = nltk.pos_tag(word)
chunk = nltk.ne_chunk(pos_tag)
NE = [" ".join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
print(NE)
