Stopword removal with NLTK and Pandas
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/33245567/
Asked by slm
I have some issues with Pandas and NLTK. I am new to programming, so excuse me if I ask questions that might be easy to solve. I have a CSV file with 3 columns (Id, Title, Body) and about 15,000 rows.
My goal is to remove the stopwords from this CSV file. The lowercase and split operations work well, but I cannot find the mistake that keeps the stopwords from being removed. What am I missing?
import pandas as pd
from nltk.corpus import stopwords
pd.read_csv("test10in.csv", encoding="utf-8")
df = pd.read_csv("test10in.csv")
df.columns = ['Id','Title','Body']
df['Title'] = df['Title'].str.lower().str.split()
df['Body'] = df['Body'].str.lower().str.split()
stop = stopwords.words('english')
df['Title'].apply(lambda x: [item for item in x if item not in stop])
df['Body'].apply(lambda x: [item for item in x if item not in stop])
df.to_csv("test10out.csv")
Answered by AbtPst
You are expecting the replacement to happen in place, but apply returns a new Series and leaves the original column unchanged. Assign the result back to the column:
df['Title'] = df['Title'].apply(lambda x: [item for item in x if item not in stop])
df['Body'] = df['Body'].apply(lambda x: [item for item in x if item not in stop])
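To make the difference concrete, here is a minimal sketch that runs without the CSV file or the NLTK corpus. The inline DataFrame and the small hand-written stopword set are assumptions standing in for test10in.csv and stopwords.words('english'); they are not part of the original question.

```python
import pandas as pd

# Stand-in for stopwords.words('english') (assumption, so no corpus download is needed).
# A set makes the membership test fast.
stop = {"the", "is", "a", "of", "and"}

# Hypothetical sample data replacing test10in.csv.
df = pd.DataFrame({
    "Id": [1, 2],
    "Title": ["The Art of Pandas", "NLTK is a Toolkit"],
})

# Lowercase each title and split it into a list of tokens, as in the question.
df["Title"] = df["Title"].str.lower().str.split()

# Calling apply without assigning the result leaves the column untouched.
df["Title"].apply(lambda words: [w for w in words if w not in stop])
unchanged = df["Title"].tolist()

# Assigning the result back stores the filtered lists in the column.
df["Title"] = df["Title"].apply(lambda words: [w for w in words if w not in stop])
filtered = df["Title"].tolist()

print(unchanged)
print(filtered)
```

Converting the stopword list to a set is optional, but with about 15,000 rows it avoids scanning the whole list for every token.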
Answered by 176coding
df.replace(stop,regex=True,inplace=True)
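A regex-based approach of this kind works on the raw string columns rather than on the token lists produced by split. Note that the one-liner above passes no replacement value and leaves the patterns unanchored, so it can match inside longer words and its behavior may depend on the pandas version. The following is a hedged sketch of one way to do it, not necessarily what the answerer had in mind: it builds a single word-boundary pattern and applies it with Series.str.replace. The sample data and the short stopword list are assumptions standing in for the CSV file and stopwords.words('english').

```python
import re
import pandas as pd

# Stand-in for stopwords.words('english') (assumption).
stop = ["the", "is", "a", "of", "and"]

# Hypothetical sample data; the column holds plain strings, not token lists.
df = pd.DataFrame({"Title": ["The Art of Pandas", "NLTK is a Toolkit"]})

# \b anchors each alternative to whole words, so "a" does not match inside "pandas".
pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop) + r")\b"

df["Title"] = (
    df["Title"]
    .str.lower()
    .str.replace(pattern, "", regex=True)  # drop whole-word stopwords
    .str.split()                           # split on the leftover whitespace
    .str.join(" ")                         # rejoin with single spaces
)
cleaned = df["Title"].tolist()
print(cleaned)
```

Keeping the columns as strings also means the output CSV contains plain text instead of the string representation of Python lists.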