Python: remove stop words from a pandas DataFrame
Original URL: http://stackoverflow.com/questions/29523254/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python remove stop words from pandas dataframe
Asked by I am not George
I want to remove the stop words from my column "tweet". How do I iterate over each row and each item?
import pandas as pd
from nltk.corpus import stopwords

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]
test = pd.DataFrame(pos_tweets)
test.columns = ["tweet", "class"]
# Lowercase each tweet and split it into a list of words.
test["tweet"] = test["tweet"].str.lower().str.split()
stop = stopwords.words('english')
Accepted answer by Liam Foley
Using a list comprehension:
test['tweet'].apply(lambda x: [item for item in x if item not in stop])
Returns:
0 [love, car]
1 [view, amazing]
2 [feel, great, morning]
3 [excited, concert]
4 [best, friend]
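Note that apply() returns a new Series and does not modify the column in place. A minimal sketch of keeping the result is shown here, assuming a small hand-written stop-word set (stop_set is a hypothetical stand-in for stopwords.words('english'), used only so the example runs without NLTK data). Membership tests against a set are also faster than against a list.

```python
import pandas as pd

# Hypothetical stand-in for stopwords.words('english'); not the real NLTK list.
stop_set = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

test = pd.DataFrame(
    [("I love this car", "positive"), ("He is my best friend", "positive")],
    columns=["tweet", "class"],
)
# Lowercase and split each tweet into a list of words, as in the question.
test["tweet"] = test["tweet"].str.lower().str.split()

# apply() returns a new Series, so assign it back to keep the filtered lists.
test["tweet"] = test["tweet"].apply(
    lambda words: [w for w in words if w not in stop_set]
)
print(test["tweet"].tolist())
```

With the real NLTK list, set(stop) could be built once before the apply() call.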
Answer by mok0
Check out pd.DataFrame.replace(); it might work for you:
In [42]: test.replace(to_replace='I', value="", regex=True)
Out[42]:
                              tweet     class
0                     love this car  positive
1              This view is amazing  positive
2           feel great this morning  positive
3   am so excited about the concert  positive
4              He is my best friend  positive
Edit: replace() matches the string anywhere, including inside other words (substrings). For example, it would remove rk from work if rk were a stop word, which is usually not what you want. Hence the use of a regex with word boundaries (\b) here:
for i in stop:
    test = test.replace(to_replace=r'\b%s\b' % i, value="", regex=True)
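The substring problem described above can be reproduced on a tiny example. This is only an illustrative sketch: the one-row DataFrame and the pretend stop word "rk" are assumptions made for the demonstration, not part of the original data.

```python
import pandas as pd

# Hypothetical one-row frame; "rk" plays the role of a stop word.
df = pd.DataFrame({"tweet": ["i work in cork"]})

# Without word boundaries, "rk" is removed from inside longer words.
naive = df.replace(to_replace="rk", value="", regex=True)

# With \b on both sides, only a standalone "rk" would match.
bounded = df.replace(to_replace=r"\brk\b", value="", regex=True)

print(naive["tweet"][0])    # i wo in co
print(bounded["tweet"][0])  # i work in cork
```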
Answer by Keiku
We can import stopwords from nltk.corpus as shown below. With that, we exclude the stop words using a Python list comprehension and pandas.Series.apply.
# Import pandas, and stopwords from nltk.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]
test = pd.DataFrame(pos_tweets)
test.columns = ["tweet", "class"]

# Exclude stopwords with a list comprehension and pandas.Series.apply.
# The tweets are not lowercased here, so capitalized words such as "I" are kept.
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print(test)
# Out[40]:
#                                tweet     class  tweet_without_stopwords
# 0                    I love this car  positive               I love car
# 1               This view is amazing  positive        This view amazing
# 2          I feel great this morning  positive     I feel great morning
# 3  I am so excited about the concert  positive        I excited concert
# 4               He is my best friend  positive           He best friend
Stop words can also be removed with pandas.Series.str.replace.
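# Build one alternation pattern that matches any stop word as a whole word.
pat = r'\b(?:{})\b'.format('|'.join(stop))
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default.
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
# Collapse the runs of whitespace left behind by the removed words.
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)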
# Same results.
# 0 I love car
# 1 This view amazing
# 2 I feel great morning
# 3 I excited concert
# 4 He best friend
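The results above still contain "I", "This" and "He" because the NLTK list is lowercase and the match is case-sensitive. If those should be removed as well, one possible variation is sketched here, assuming a short hand-written list (stop_demo is a hypothetical stand-in for the NLTK list) and using the case=False option of str.replace. re.escape is added as a precaution in case a stop word contains regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical stand-in for stopwords.words('english'); not the real NLTK list.
stop_demo = ["i", "this", "is", "am", "so", "about", "the", "he", "my"]
tweets = pd.Series(["I love this car", "This view is amazing"])

# One whole-word alternation pattern, with each word escaped.
pat = r"\b(?:{})\b".format("|".join(re.escape(w) for w in stop_demo))

cleaned = (
    tweets.str.replace(pat, "", case=False, regex=True)  # case-insensitive removal
    .str.replace(r"\s+", " ", regex=True)                # collapse repeated whitespace
    .str.strip()                                         # drop leading/trailing spaces
)
print(cleaned.tolist())
```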
If the stop words cannot be loaded because the corpus is missing, you can download it as follows.
import nltk
nltk.download('stopwords')
Another option is to import text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.
# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS
Note that the scikit-learn and NLTK stop-word lists contain different numbers of words.