How to apply pos_tag_sents() to a pandas dataframe efficiently

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41674573/


How to apply pos_tag_sents() to pandas dataframe efficiently

python, python-3.x, pandas, nltk, pos-tagger

Asked by mobcdi

In situations where you wish to POS-tag a column of text stored in a pandas dataframe, with one sentence per row, the majority of implementations on SO use the apply method:


dfData['POSTags'] = dfData['SourceText'].apply(
                 lambda row: pos_tag(word_tokenize(row)))

The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.


Does that apply to this example, and if so, would the code be as simple as changing pos_tag to pos_tag_sents, or does NLTK mean text sources of paragraphs?


As mentioned in the comments, pos_tag_sents() aims to avoid loading the perceptron tagger each time, but the issue is how to do this and still produce a column in a pandas dataframe.


Link to sample dataset (20k rows)


Accepted answer by alvas

Input


$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR


>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ... 


In Long:


First, you can extract the Text column to a list of strings:


texts = df['Text'].tolist()

Then you can apply the word_tokenize function:


map(word_tokenize, texts)
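One caveat worth noting (an addition, not part of the original answer): in Python 3, map() returns a lazy iterator that can only be consumed once, so the mapped result should be passed straight to pos_tag_sents, or materialized with list() if you need it more than once. A minimal sketch, using str.split as a stand-in tokenizer:

```python
# In Python 3, map() is lazy: it yields its items once and is then exhausted.
texts = ['a b', 'c d']
tokenized = map(str.split, texts)   # str.split stands in for word_tokenize
first_pass = list(tokenized)        # consumes the iterator: [['a', 'b'], ['c', 'd']]
second_pass = list(tokenized)       # nothing left: []
```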

Note that @Boud's suggestion is almost the same, using df.apply:


df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:


df['Text'].apply(word_tokenize).tolist()

Then you can use pos_tag_sents:


pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )

Then you add the column back to the DataFrame:


df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
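One detail to keep in mind here (an addition, not part of the original answer): assigning a plain Python list to a DataFrame column is positional, so the tagged output must contain exactly one element per row, in the same order as the Text column. A minimal sketch with a hand-written stand-in for the tagger output:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['a b', 'c d']})

# Stand-in for pos_tag_sents output: one list of (token, tag)
# tuples per row, in the same order as df['Text'].
tagged = [[('a', 'DT'), ('b', 'NN')], [('c', 'NN'), ('d', 'NN')]]

# List assignment is positional and requires len(tagged) == len(df);
# a length mismatch raises a ValueError.
df['POS'] = tagged
```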

Answer by Iulius Curt

By applying pos_tag on each row, the Perceptron model will be loaded each time (a costly operation, as it reads a pickle from disk).


If you instead gather all the rows and send them to pos_tag_sents (which takes a list(list(str))), the model is loaded once and used for all of them.


See the source.
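The cost difference can be illustrated without NLTK at all. Below is a toy sketch (DummyTagger is hypothetical, not NLTK's actual implementation) that counts how many times the "model" gets loaded under each calling pattern:

```python
class DummyTagger:
    """Toy stand-in for NLTK's PerceptronTagger; __init__ plays the
    role of the expensive unpickling of the model from disk."""
    loads = 0

    def __init__(self):
        DummyTagger.loads += 1  # pretend: read the pickled model

    def tag(self, tokens):
        return [(tok, 'NN') for tok in tokens]


rows = [['a', 'b'], ['c'], ['d', 'e']]

# Per-row style (like df.apply with pos_tag): one load per row.
per_row_tags = [DummyTagger().tag(row) for row in rows]
loads_per_row = DummyTagger.loads       # 3 loads for 3 rows

# Batched style (like pos_tag_sents): one load, reused for all rows.
DummyTagger.loads = 0
tagger = DummyTagger()
batched_tags = [tagger.tag(row) for row in rows]
loads_batched = DummyTagger.loads       # 1 load total
```

Same output either way, but the per-row pattern pays the loading cost once per row instead of once overall.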


Answer by Boud

Assign this to your new column instead:


dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())