pandas: How to remove non-English words in Python?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/49510935/
How to remove Non English words in Python?
Asked by Aziz Bokhari
I am doing a sentiment analysis project in Python (using Natural Language Processing). I have already collected the data from Twitter and saved it as a CSV file. The file contains tweets, which are mostly about cryptocurrency. I cleaned the data, but there is one more thing to do before I apply sentiment analysis using classification algorithms. Here is the code for importing the libraries and reading the file:
# importing Libraries
from pandas import DataFrame, read_csv
import chardet
import matplotlib.pyplot as plt; plt.rcdefaults()
from matplotlib import rc
%matplotlib inline
import pandas as pd
plt.style.use('ggplot')
import numpy as np
import re
import warnings
#Visualisation
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from IPython.display import display
from mpl_toolkits.basemap import Basemap
from wordcloud import WordCloud, STOPWORDS
#nltk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
matplotlib.style.use('ggplot')
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore")
%matplotlib inline
## Reading the CSV file into a DataFrame named ltweet
ltweet=pd.read_csv(r"C:\Users\name\Documents\python assignment\litecoin1.csv",index_col = None, skipinitialspace = True)
print(ltweet)
I have already cleaned most of the data, so there is no need to include the code for that part. In my column there are tweets that contain mostly non-English text. I want to remove all of it (non-English text only). Here is some example output:
ltweet['Tweets'][0:3]
output:
0 the has published a book on understanding ?????????????
1 accepts litecoin gives % discount on all iphon...
2 days until litepay launches accept store and s...
3 ltc to usd price litecoin ltc cryptocurrency
Is there a way to remove non-English words from the data? Can anyone help me write the code for it? By the way, the code is based on pandas.
Answered by lenngro
There has been a similar question here.
You could try enchant (the pyenchant package):
import enchant
d = enchant.Dict("en_US")
word = "Bonjour"
d.check(word)
This will return False.
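Do this for every word in the text:
english_words = []
# Assumes text is a single string; iterating a str directly would yield characters, so split it into words first
for word in text.split():
    if d.check(word):
        english_words.append(word)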
Edit: Watch out for words that appear in multiple languages.
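To connect the per-word check back to the question's DataFrame, here is a minimal sketch of a helper that filters one tweet at a time. The names keep_english_words, sample_check and SAMPLE_VOCABULARY are hypothetical and are not part of the original answer. The checker is passed in as a function, so the example runs without pyenchant installed; the tiny vocabulary below only stands in for a real dictionary.

```python
def keep_english_words(text, is_english):
    """Return text with only the words for which is_english(word) is true."""
    # Non-string cells (for example NaN from read_csv) are treated as empty text.
    if not isinstance(text, str):
        return ""
    # split() never yields empty strings, so the checker only sees real tokens.
    return " ".join(word for word in text.split() if is_english(word))


# Hypothetical stand-in checker so the sketch runs without pyenchant;
# with pyenchant installed, pass enchant.Dict("en_US").check instead.
SAMPLE_VOCABULARY = {"accepts", "litecoin", "price", "to"}


def sample_check(word):
    return word.lower() in SAMPLE_VOCABULARY


cleaned = keep_english_words("litecoin price bonjour", sample_check)
print(cleaned)  # litecoin price
```

With the DataFrame from the question, this could be applied as ltweet['Tweets'].apply(lambda t: keep_english_words(t, d.check)), where d is the enchant dictionary created above. Domain terms such as "litecoin" or "ltc" may not be in a general English dictionary, so it may be worth allowing them explicitly before filtering.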