Python: Removing stop words from a Pandas DataFrame

Question

Asked by I am not George

I want to remove the stop words from my column "tweet". How do I iterate over each row and each item?

import pandas as pd
from nltk.corpus import stopwords

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]
# Lowercase each tweet and split it into a list of words.
test["tweet"] = test["tweet"].str.lower().str.split()

# English stop word list from NLTK.
stop = stopwords.words('english')

Answer 1

Accepted answer by Liam Foley

Using List Comprehension

test['tweet'].apply(lambda x: [item for item in x if item not in stop])

Returns:

0               [love, car]
1           [view, amazing]
2    [feel, great, morning]
3        [excited, concert]
4            [best, friend]
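Note that apply() returns a new Series and leaves the column unchanged unless the result is assigned back. The following is a minimal sketch of that, not part of the original answer. It assumes a small hand-written stop set (stop_set is a stand-in so the example runs without the NLTK data) and uses a set rather than a list, since set membership tests are faster on long stop lists.

```python
import pandas as pd

# Hypothetical, hand-written stop set so the sketch runs without NLTK data;
# in practice this would be set(stopwords.words('english')).
stop_set = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

test = pd.DataFrame(
    [("I love this car", "positive"), ("He is my best friend", "positive")],
    columns=["tweet", "class"],
)
# Same preprocessing as in the question: lowercase, then split into words.
test["tweet"] = test["tweet"].str.lower().str.split()

# apply() returns a new Series, so assign it back to keep the result.
test["tweet"] = test["tweet"].apply(
    lambda words: [w for w in words if w not in stop_set]
)
print(test["tweet"].tolist())
```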

Answer 2

Answered by mok0

Check out pd.DataFrame.replace(); it might work for you:

In [42]: test.replace(to_replace='I', value="",regex=True)
Out[42]:
                              tweet     class
0                     love this car  positive
1              This view is amazing  positive
2           feel great this morning  positive
3   am so excited about the concert  positive
4              He is my best friend  positive

Edit: with regex=True, replace() matches the string anywhere, including inside longer words. For example, if rk were a stop word, it would be removed from work, which is usually not what you want.

Hence the use of a regex with word boundaries (\b) here:

for i in stop :
    test = test.replace(to_replace=r'\b%s\b'%i, value="",regex=True)
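The difference between substring matching and word-boundary matching can be seen on a single string using only the standard re module. This is an illustrative sketch: the stop list and sample text are made up, and re.escape is an extra precaution (not in the answer above) against stop words that contain regex metacharacters.

```python
import re

# Hypothetical stop words, chosen to show the substring problem.
stop = ["rk", "is"]
text = "this work is done"

# Naive substring replacement also damages "work" and "this".
naive = text
for word in stop:
    naive = naive.replace(word, "")

# One compiled pattern with word boundaries, so only whole words match.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, stop)))
# Remove the stop words, then collapse the leftover whitespace.
bounded = re.sub(r"\s+", " ", pattern.sub("", text)).strip()

print(repr(naive))    # 'th wo  done'
print(repr(bounded))  # 'this work done'
```

Compiling a single alternation pattern also avoids one pass over the data per stop word, which is what the loop above does.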

Answer 3

Answered by Keiku

We can import stopwords from nltk.corpus as shown below, then exclude the stop words with a Python list comprehension and pandas.Series.apply.

# Import pandas, and the stopwords corpus from nltk.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]

# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test)
# Out[40]:
#                                tweet     class tweet_without_stopwords
# 0                    I love this car  positive              I love car
# 1               This view is amazing  positive       This view amazing
# 2          I feel great this morning  positive    I feel great morning
# 3  I am so excited about the concert  positive       I excited concert
# 4               He is my best friend  positive          He best friend
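In the output above, "I", "This" and "He" survive because the comparison is case-sensitive and those capitalized forms are not in the stop list. If they should be removed as well, one option is to compare the lowercased word while keeping the original casing of the words that remain. The sketch below is an assumption-laden illustration: remove_stopwords is a hypothetical helper name and the stop set is a short hand-written stand-in for the NLTK list.

```python
def remove_stopwords(text, stop_set):
    """Drop each whitespace-separated word whose lowercase form is in stop_set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_set)

# Hypothetical short stop set, all lowercase, standing in for the NLTK list.
stop_set = {"i", "this", "is", "he", "my"}

first = remove_stopwords("I love this car", stop_set)
second = remove_stopwords("He is my best friend", stop_set)
print(first)   # love car
print(second)  # best friend
```

With a DataFrame, the same helper could be passed to apply, for example test['tweet'].apply(remove_stopwords, stop_set=stop_set).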

Stop words can also be removed with pandas.Series.str.replace.

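# Build one alternation pattern that matches any stop word as a whole word.
pat = r'\b(?:{})\b'.format('|'.join(stop))
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching.
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
# Collapse the runs of whitespace left behind by the removed words.
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)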
# Same results.
# 0              I love car
# 1       This view amazing
# 2    I feel great morning
# 3       I excited concert
# 4          He best friend

If you cannot import the stop words, you can download them as follows.

import nltk
nltk.download('stopwords')

Another option is to import text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and nltk stop word lists contain different numbers of words, so the results can differ depending on which list is used.
