Apply Porter stemmer to a Pandas column for each word

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43795310/

Tags: python, pandas, porter-stemmer

Asked by Sampath Rajapaksha

I have a pandas DataFrame called 'data_stem', and a column named 'TWEET_SENT_1' holds strings like the ones below (50 rows):

TWEET_SENT_1

the mack daddy of kiss cross

i liked that video body party

I want to apply the Porter stemmer to the 'TWEET_SENT_1' column (to every word in each row). I tried the code below, but it throws an error. Could you please help me get past this?

from nltk.stem import PorterStemmer, WordNetLemmatizer
porter_stemmer = PorterStemmer()
data_stem[' TWEET_SENT_1 '] = data_stem[' TWEET_SENT_1 '].apply(lambda x: [porter_stemmer.stem(y) for y in x])

Below is the error:

    ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-412-c16b1beddfb5> in <module>()
      1 from nltk.stem import PorterStemmer, WordNetLemmatizer
      2 porter_stemmer = PorterStemmer()
----> 3 data_stem[' TWEET_SENT_1 '] = data_stem[' TWEET_SENT_1 '].apply(lambda x: [porter_stemmer.stem(y) for y in x])

C:\Users\SampathR\Anaconda2\envs\dato-env\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2058             values = lib.map_infer(values, lib.Timestamp)
   2059 
-> 2060         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2061         if len(mapped) and isinstance(mapped[0], Series):
   2062             from pandas.core.frame import DataFrame

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:58435)()

<ipython-input-412-c16b1beddfb5> in <lambda>(x)
      1 from nltk.stem import PorterStemmer, WordNetLemmatizer
      2 porter_stemmer = PorterStemmer()
----> 3 data_stem[' TWEET_SENT_1 '] = data_stem[' TWEET_SENT_1 '].apply(lambda x: [porter_stemmer.stem(y) for y in x])

TypeError: 'NoneType' object is not iterable
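A quick aside (my addition, not from the original thread): iterating over a string just yields its characters, so this TypeError usually means at least one cell in the column is None rather than a string. A minimal sketch for spotting and dropping such rows, assuming the column name used in the answers below:

# Count missing (None/NaN) cells in the tweet column
print(data_stem['TWEET_SENT_1'].isna().sum())

# Drop rows with missing tweet text before applying the stemmer
data_stem = data_stem.dropna(subset=['TWEET_SENT_1'])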

Answered by Satyadev

What you need to do first is tokenize your sentences. Tokenizing means splitting a sentence into words based on the delimiters you have, so that you can drop things like punctuation, which is sometimes not really required. That depends on the use case, though: in sequence modeling, where you are trying to predict the next element of a sequence, a comma matters, but when you are just getting POS tags for words for analysis, it might not. Anyhow, here is how to do the tokenization:

# list(...) materializes the tokens; on Python 3, filter() alone returns a lazy iterator
data_stem['TWEET_TOKENIZED'] = data_stem['TWEET_SENT_1'].apply(lambda x: list(filter(None, x.split(" "))))
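As a side note (my addition, not part of the original answer), NLTK ships its own tokenizer that handles punctuation more carefully than str.split; a minimal sketch, assuming the 'punkt' model is available:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer model

data_stem['TWEET_TOKENIZED'] = data_stem['TWEET_SENT_1'].apply(word_tokenize)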

Apply your stemmer to the above tokenized column as follows:

data_stem['Tweet_stemmed'] = data_stem['TWEET_TOKENIZED'].apply(lambda x: [porter_stemmer.stem(y) for y in x])

Update: adding concatenation functionality

To get the tweet back into sentence format, do the following:

data_stem['tweet_stemmed_sentence'] = data_stem['Tweet_stemmed'].apply(lambda x: " ".join(x))
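To see the whole pipeline together (this example is my addition, not part of the original answer), here is a minimal self-contained run on the two sample tweets from the question:

import pandas as pd
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
data_stem = pd.DataFrame({'TWEET_SENT_1': ['the mack daddy of kiss cross',
                                           'i liked that video body party']})

data_stem['TWEET_TOKENIZED'] = data_stem['TWEET_SENT_1'].apply(lambda x: list(filter(None, x.split(" "))))
data_stem['Tweet_stemmed'] = data_stem['TWEET_TOKENIZED'].apply(lambda x: [porter_stemmer.stem(y) for y in x])
data_stem['tweet_stemmed_sentence'] = data_stem['Tweet_stemmed'].apply(lambda x: " ".join(x))

print(data_stem['tweet_stemmed_sentence'].tolist())
# The Porter stemmer typically maps 'liked' -> 'like' and 'body' -> 'bodi', so expect
# something like ['the mack daddi of kiss cross', 'i like that video bodi parti']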

Answered by Devaroop

Applying three separate operations to a series with millions of rows is very expensive. Instead, apply them all at once:

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

def stem_sentences(sentence):
    # Split on whitespace, stem each token, and reassemble the sentence
    tokens = sentence.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

data_stem['TWEET_SENT_1'] = data_stem['TWEET_SENT_1'].apply(stem_sentences)
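As a hedged variant (my addition, not part of the original answer), the same function can guard against missing values, which is the likely cause of the asker's TypeError:

def stem_sentences_safe(sentence):
    # Treat None/NaN cells as empty text instead of raising a TypeError
    if not isinstance(sentence, str):
        return ''
    return ' '.join(porter_stemmer.stem(token) for token in sentence.split())

data_stem['TWEET_SENT_1'] = data_stem['TWEET_SENT_1'].apply(stem_sentences_safe)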

(Note: This is just a modified version of the accepted answer)