Python (NLTK) - more efficient way to extract noun phrases?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/49564176/



Tags: python-3.x, pandas, nlp, nltk, text-chunking

Asked by Silent-J

I've got a machine learning task involving a large amount of text data. I want to identify, and extract, noun-phrases in the training text so I can use them for feature construction later on in the pipeline. I've extracted the type of noun-phrases I wanted from text but I'm fairly new to NLTK, so I approached this problem in a way where I can break down each step in list comprehensions like you can see below.


But my real question is, am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?


import nltk
import pandas as pd

myData = pd.read_excel(r"\User\train_.xlsx")  # raw string so "\U" and "\t" aren't treated as escape sequences
texts = myData['message']

# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)

tokens = [nltk.word_tokenize(i) for i in texts]

tag_list = [nltk.pos_tag(w) for w in tokens]

phrases = [chunkr.parse(sublist) for sublist in tag_list]

leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label() == 'NP')] for tree in phrases]

Flatten the list of lists of lists of tuples that we've ended up with into just a list of lists of tuples


leaves = [tupls for sublists in leaves for tupls in sublists]

Join the extracted terms into one bigram


nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]  # join the first two words of each extracted chunk

Answered by alvas

Take a look at Why is my NLTK function slow when processing the DataFrame?, there's no need to iterate through all rows multiple times if you don't need intermediate steps.

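In other words, fold the tokenize → POS-tag → chunk steps into one function and run it once per row with apply, instead of materializing a separate list for every intermediate step. A minimal sketch of that idea, reusing the question's NP grammar (the helper name extract_noun_phrases is just illustrative, not part of the original answer):

from nltk import word_tokenize, pos_tag, RegexpParser

# Same grammar as in the question, compiled once.
chunker = RegexpParser(r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}")

def extract_noun_phrases(text):
    # One pass per row: tokenize, tag, chunk, then pull out the NP subtrees.
    tree = chunker.parse(pos_tag(word_tokenize(text)))
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]

# texts.apply(extract_noun_phrases) then replaces the four separate list comprehensions.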

With ne_chunk and the solution from a related answer:


[code]:


from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    # Tokenize, POS-tag and chunk the text in a single pass.
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            # Inside a chunk: collect its tokens as one string.
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            # The chunk has ended: flush what was collected so far.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks((sent)))

[out]:


0                   [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object
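
Note the chunk_func parameter: with the default ne_chunk, the call above extracts named entities rather than noun phrases, which is why only New York, Washington and Bruce Wayne come back. Passing the question's grammar-based chunker instead (below) returns the NP chunks.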

To use the custom RegexpParser:


from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk


df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})


df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))

[out]:


0                  [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
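
To keep the extracted phrases around for feature construction later in the pipeline, the result of apply can simply be stored as a new column (the column name noun_phrases is just an example):

df['noun_phrases'] = df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))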