Apply Porter stemmer to a Pandas column for each word

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43795310/

Tags: python, pandas, porter-stemmer

Asked by Sampath Rajapaksha

I have a pandas DataFrame called 'data_stem', and a column named 'TWEET_SENT_1' holds strings like the ones below (50 rows):

TWEET_SENT_1

the mack daddy of kiss cross

i liked that video body party

I want to apply the Porter stemmer to the 'TWEET_SENT_1' column (to every word in each row). I tried the code below, but it throws an error. Could you please help me get past this?

from nltk.stem import PorterStemmer, WordNetLemmatizer
porter_stemmer = PorterStemmer()
data_stem[' TWEET_SENT_1 '] = data_stem[' TWEET_SENT_1 '].apply(lambda x: [porter_stemmer.stem(y) for y in x])

Below is the error:

    ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-412-c16b1beddfb5> in <module>()
      1 from nltk.stem import PorterStemmer, WordNetLemmatizer
      2 porter_stemmer = PorterStemmer()
----> 3 data_stem[' TWEET_SENT_1 '] = data_stem[' TWEET_SENT_1 '].apply(lambda x: [porter_stemmer.stem(y) for y in x])

C:\Users\SampathR\Anaconda2\envs\dato-env\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2058             values = lib.map_infer(values, lib.Timestamp)
   2059 
-> 2060         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2061         if len(mapped) and isinstance(mapped[0], Series):
   2062             from pandas.core.frame import DataFrame

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:58435)()

<ipython-input-412-c16b1beddfb5> in <lambda>(x)
      1 from nltk.stem import PorterStemmer, WordNetLemmatizer
      2 porter_stemmer = PorterStemmer()
----> 3 data_stem[' TWEET_SENT_1 '] = data_stem[' TWEET_SENT_1 '].apply(lambda x: [porter_stemmer.stem(y) for y in x])

TypeError: 'NoneType' object is not iterable
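A quick aside (my addition, not from the original thread): iterating over a string just yields its characters, so this TypeError usually means at least one cell in the column is None rather than a string. A minimal sketch for spotting and dropping such rows, assuming the column name used in the answers below:

# Count missing (None/NaN) cells in the tweet column
print(data_stem['TWEET_SENT_1'].isna().sum())

# Drop rows with missing tweet text before applying the stemmer
data_stem = data_stem.dropna(subset=['TWEET_SENT_1'])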

Answered by Satyadev

What you need to do first is tokenize your sentences. Tokenizing means splitting a sentence into words based on the delimiters you have, so that you can drop things like punctuation, which is sometimes not really required. That depends on the use case, though: in sequence modeling, where you are trying to predict the next element of a sequence, a comma matters, but when you are just getting POS tags for words for analysis, it might not. Anyhow, here is how to do the tokenization:

# list(...) materializes the tokens; on Python 3, filter() alone returns a lazy iterator
data_stem['TWEET_TOKENIZED'] = data_stem['TWEET_SENT_1'].apply(lambda x: list(filter(None, x.split(" "))))
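As a side note (my addition, not part of the original answer), NLTK ships its own tokenizer that handles punctuation more carefully than str.split; a minimal sketch, assuming the 'punkt' model is available:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer model

data_stem['TWEET_TOKENIZED'] = data_stem['TWEET_SENT_1'].apply(word_tokenize)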

Apply your stemmer to the above tokenized column as follows:

data_stem['Tweet_stemmed'] = data_stem['TWEET_TOKENIZED'].apply(lambda x: [porter_stemmer.stem(y) for y in x])

Update: adding concatenation functionality

To get the tweet back into sentence format, do the following:

data_stem['tweet_stemmed_sentence'] = data_stem['Tweet_stemmed'].apply(lambda x: " ".join(x))
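To see the whole pipeline together (this example is my addition, not part of the original answer), here is a minimal self-contained run on the two sample tweets from the question:

import pandas as pd
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
data_stem = pd.DataFrame({'TWEET_SENT_1': ['the mack daddy of kiss cross',
                                           'i liked that video body party']})

data_stem['TWEET_TOKENIZED'] = data_stem['TWEET_SENT_1'].apply(lambda x: list(filter(None, x.split(" "))))
data_stem['Tweet_stemmed'] = data_stem['TWEET_TOKENIZED'].apply(lambda x: [porter_stemmer.stem(y) for y in x])
data_stem['tweet_stemmed_sentence'] = data_stem['Tweet_stemmed'].apply(lambda x: " ".join(x))

print(data_stem['tweet_stemmed_sentence'].tolist())
# The Porter stemmer typically maps 'liked' -> 'like' and 'body' -> 'bodi', so expect
# something like ['the mack daddi of kiss cross', 'i like that video bodi parti']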

Answered by Devaroop

Applying three separate operations to a series with millions of rows is very expensive. Instead, apply them all at once:

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

def stem_sentences(sentence):
    # Split on whitespace, stem each token, and reassemble the sentence
    tokens = sentence.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

data_stem['TWEET_SENT_1'] = data_stem['TWEET_SENT_1'].apply(stem_sentences)
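As a hedged variant (my addition, not part of the original answer), the same function can guard against missing values, which is the likely cause of the asker's TypeError:

def stem_sentences_safe(sentence):
    # Treat None/NaN cells as empty text instead of raising a TypeError
    if not isinstance(sentence, str):
        return ''
    return ' '.join(porter_stemmer.stem(token) for token in sentence.split())

data_stem['TWEET_SENT_1'] = data_stem['TWEET_SENT_1'].apply(stem_sentences_safe)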

(Note: This is just a modified version of the accepted answer)