Python: improving the extraction of human names with nltk

Question

Asked by emh

I am trying to extract human names from text.

Does anyone have a method that they would recommend?

This is what I tried (code is below): I am using nltk to find everything marked as a person and then generating a list of all the NNP parts of that person. I skip persons with only one NNP, which avoids grabbing a lone surname.

I am getting decent results but was wondering if there are better ways to go about solving this problem.

Code:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that's exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)  # split the full name into its parts
    print(parsed.last + ', ' + parsed.first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic isn't a human name in the context of this article is the hard (maybe impossible) part.

Answer 1

Accepted answer by troyane

I must agree with the suggestion that "make my code better" isn't well suited to this site, but I can point you to something you can dig into.

Take a look at Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v 2.0, but you must download some core files. Here is a script which can do all of that for you.

I wrote this script:

import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1]=='PERSON': print tag

and got reasonably good output:

('Francois', 'PERSON') ('R.', 'PERSON') ('Velde', 'PERSON') ('Richard', 'PERSON') ('Branson', 'PERSON') ('Virgin', 'PERSON') ('Galactic', 'PERSON') ('Bitcoin', 'PERSON') ('Bitcoin', 'PERSON') ('Paul', 'PERSON') ('Krugman', 'PERSON') ('Larry', 'PERSON') ('Summers', 'PERSON') ('Bitcoin', 'PERSON') ('Nick', 'PERSON') ('Colas', 'PERSON')

Hope this is helpful.

Answer 2

Answer by Viktor Vojnovski

You can try to resolve the names you find and check whether they appear in a database such as freebase.com. Get the data locally and query it (it's in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographical locations, etc. (which your snippet would catch) could then be discarded based on the Freebase data.
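
The discard step could look roughly like the sketch below. It does not call the Freebase API. It assumes the relevant entity names have already been exported into a local set, and both KNOWN_NON_PERSONS and discard_known_entities are hypothetical names used only for illustration.

```python
# Hypothetical local lookup table standing in for an export from a knowledge base.
KNOWN_NON_PERSONS = {"virgin galactic", "federal reserve", "convergex group"}

def discard_known_entities(candidates, known_non_persons=KNOWN_NON_PERSONS):
    """Keep only candidates that are not listed as a known organization or place."""
    return [name for name in candidates
            if name.strip().lower() not in known_non_persons]

candidates = ["Francois R. Velde", "Richard Branson", "Virgin Galactic", "Paul Krugman"]
filtered = discard_known_entities(candidates)
print(filtered)
```

Matching on lowercased full strings is the simplest possible comparison; real data would probably need normalization of punctuation and aliases as well.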

Answer 3

Answer by Curtis Mattoon

For anyone else looking, I found this article to be useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             if hasattr(chunk, 'node'):
...                 print chunk.node, ' '.join(c[0] for c in chunk.leaves())
...

Answer 4

Answer by C.Rider

This worked pretty well for me. I just had to change one line in order for it to run.

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):

needs to be

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):

There were imperfections in the output (for example it identified "Money Laundering" as a person), but with my data a name database may not be dependable.

Answer 5

Answer by Martin Thoma

The answer by @troyane didn't quite work for me, but it helped a lot with this one.

Prerequisites

Create a folder stanford-ner and download the following two files to it:

english.all.3class.distsim.crf.ser.gz
stanford-ner.jar (look for the download and extract the archive)

Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that's exactly what is happening to BTC prices."
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

Results

(u'Bitcoin', u'LOCATION')       # Wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON')         # Wrong
(u'Galactic', u'PERSON')       # Wrong
(u'Bitcoin', u'PERSON')        # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON')        # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')     
(u'Bitcoin', u'LOCATION')       # Wrong
(u'BTC', u'ORGANIZATION')       # Wrong
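
Since the tagger labels one token at a time, a follow-up step is needed to turn runs of tokens into full names. The sketch below is one possible way to do that with itertools.groupby. It assumes the input is a list of (token, label) pairs in the shape printed above and that tokens outside any entity carry the label "O". group_entities is a hypothetical helper, and the sample list is hand-written rather than real tagger output.

```python
from itertools import groupby

def group_entities(tagged_tokens, wanted=("PERSON",)):
    """Merge runs of adjacent tokens that share a wanted label into one string."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label in wanted:
            entities.append((" ".join(token for token, _ in run), label))
    return entities

# Hand-written sample in the (token, label) shape; "O" marks tokens
# that are assumed to lie outside any entity.
sample = [("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
          (",", "O"), ("senior", "O"), ("economist", "O"),
          ("Richard", "PERSON"), ("Branson", "PERSON"),
          ("announced", "O"), ("Chicago", "LOCATION")]
people = group_entities(sample)
print(people)
```

Two different people mentioned back to back with nothing between them would be merged into one entry, so this grouping is only a heuristic.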

Answer 6

Answer by Shivansh bhandari

I actually wanted to extract only the person names, so I decided to check every name in the output against WordNet (a large lexical database of English). More information on WordNet can be found here: http://www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet


person_list = []
person_names=person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that's exactly what is happening to BTC prices."
"""

names = get_human_names(text)
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers, all the names are correct; he is missing because the surname "Summers" is also a dictionary word.
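
One thing to watch in the script above: person_names and person_list refer to the same list object, so entries are removed from the list while it is being looped over, and Python can then skip the element that follows each removal. A minimal sketch of the same filtering idea that builds a new list instead is shown below. The dictionary check is passed in as a function so the example runs without WordNet. drop_dictionary_names, is_common_word and COMMON are hypothetical names, and in practice the check might be something like bool(wordnet.synsets(word)).

```python
def drop_dictionary_names(names, is_common_word):
    """Return the names in which no part is a dictionary word."""
    kept = []
    for full_name in names:
        parts = full_name.split()
        if not any(is_common_word(part) for part in parts):
            kept.append(full_name)
    return kept

# Hypothetical stand-in for a WordNet lookup, limited to a tiny word set.
COMMON = {"virgin", "economist", "summers"}

def is_common_word(word):
    return word.lower() in COMMON

names = ["Richard Branson", "Virgin Galactic",
         "Economist Paul Krugman", "Larry Summers"]
result = drop_dictionary_names(names, is_common_word)
print(result)
```

With a real dictionary lookup this filter is strict: any name containing a common word is dropped, which is the same trade-off the "Summers" case illustrates.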

Answer 7

Answer by Maxmoe

I would like to post a brute-force, greedy solution here to the problem raised by @Enthusiast: get the full name of a person if possible.

The capitalization of the first character in each name is used as a criterion for recognizing PERSON in spaCy. For example, 'jim hoffman' itself won't be recognized as a named entity, while 'Jim Hoffman' will be.

Therefore, if our task is simply to pick out persons from a script, we can first capitalize the first letter of each word and then pass the text to spaCy.

import spacy

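def capitalizeWords(text):
  # Upper-case the first letter of every word so that lowercase names
  # such as 'jim hoffman' are more likely to be tagged as PERSON.
  newText = ''

  for sentence in text.split('.'):
    newSentence = ''
    for word in sentence.split():
      newSentence += word[0].upper() + word[1:] + ' '
    newText += newSentence + '\n'

  return newText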

nlp = spacy.load('en_core_web_md')

doc = nlp(capitalizeWords(rawText))

#......

Note that this approach covers full names at the cost of more false positives.
