pandas: How do I remove non-English words in Python?

Question

Asked by Aziz Bokhari

I am doing a sentiment analysis project in Python (using natural language processing). I have already collected the data from Twitter and saved it as a CSV file. The file contains tweets, which are mostly about cryptocurrency. I have cleaned the data, but there is one more thing to do before I apply sentiment analysis using classification algorithms. Here is the code that imports the libraries and reads the file:

# Importing libraries
import re
import warnings

import chardet
import numpy as np
import pandas as pd
from pandas import DataFrame, read_csv

# Visualisation
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc
from IPython.display import display
from mpl_toolkits.basemap import Basemap
from wordcloud import WordCloud, STOPWORDS

# NLTK and scikit-learn
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

%matplotlib inline
plt.rcdefaults()
plt.style.use('ggplot')
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore")

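# Read the CSV file into a DataFrame named ltweet.
# The raw string (r"...") keeps the backslashes in the Windows path from being read as escape sequences.
ltweet = pd.read_csv(r"C:\Users\name\Documents\python assignment\litecoin1.csv", index_col=None, skipinitialspace=True)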
print(ltweet)

I have already cleaned most of the data, so there is no need to include the code for that part. The Tweets column contains some tweets that are mostly in languages other than English. I want to remove all of them (the non-English text only). Here is an example of the output:

ltweet['Tweets'][0:4]

output:
0      the has published a book on understanding ?????????????
1      accepts litecoin gives % discount on all iphon...
2      days until litepay launches accept store and s...
3           ltc to usd price litecoin ltc cryptocurrency

Is there a way to remove non-English words from the data? Can anyone help me write the code for it? By the way, the code is based on pandas.

Answer 1

Answered by lenngro

There has been a similar question here.

You could try enchant (available in Python through the pyenchant package):

import enchant
d = enchant.Dict("en_US")
word = "Bonjour"
d.check(word)

This returns False, because "Bonjour" is not in the en_US dictionary.

Do this for every word in the text:

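# text is a string, for example a single tweet
english_words = []
for word in text.split():  # split on whitespace; iterating over the string itself would yield characters
    if d.check(word):
        english_words.append(word)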
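Since the question works with a pandas column, the same per-word check can be applied row by row. The following is only a sketch under assumptions: `keep_english_words`, `clean_column`, `is_sample_word`, and `SAMPLE_VOCAB` are made-up names, and the tiny vocabulary stands in for a real dictionary so that the example runs without enchant installed. Passing `d.check` from the snippet above as `check` should give the dictionary-based behaviour.

```python
import re

# Hypothetical stand-in vocabulary, used only so the sketch runs on its own.
# In practice, pass enchant.Dict("en_US").check as `check` instead.
SAMPLE_VOCAB = {"accepts", "litecoin", "price", "to", "usd", "the", "book"}


def is_sample_word(word):
    """Return True if `word` is in the small sample vocabulary."""
    return word.lower() in SAMPLE_VOCAB


def keep_english_words(text, check=is_sample_word):
    """Return `text` with every alphabetic token that fails `check` removed."""
    # str() lets non-string cells (for example NaN) pass through without raising.
    tokens = re.findall(r"[A-Za-z]+", str(text))
    return " ".join(token for token in tokens if check(token))


def clean_column(series, check=is_sample_word):
    """Apply keep_english_words to each row of a pandas Series."""
    return series.apply(lambda cell: keep_english_words(cell, check))


# Possible usage with the DataFrame from the question (column name 'Tweets'):
# d = enchant.Dict("en_US")
# ltweet['Tweets'] = clean_column(ltweet['Tweets'], check=d.check)
```

Note that the regular expression also drops digits and punctuation, which may or may not be what you want for tweets.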

Edit: Watch out for words that appear in more than one language. The dictionary check accepts them as English even when they occur in non-English text.
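One way to soften the problem of words shared between languages might be to judge each tweet as a whole rather than word by word, and keep only the tweets in which most words pass the check. This is a sketch under assumptions, not part of the original answer: `english_ratio`, `is_mostly_english`, `sample_check`, `SAMPLE_VOCAB`, and the 0.5 threshold are all hypothetical, and the small vocabulary only stands in for a real dictionary such as `enchant.Dict("en_US")`.

```python
import re

# Hypothetical stand-in vocabulary, used only so the sketch runs on its own.
SAMPLE_VOCAB = {"to", "usd", "price", "litecoin", "accepts"}


def sample_check(word):
    """Return True if `word` is in the small sample vocabulary."""
    return word.lower() in SAMPLE_VOCAB


def english_ratio(text, check):
    """Return the fraction of alphabetic tokens in `text` that pass `check`."""
    tokens = re.findall(r"[A-Za-z]+", str(text))
    if not tokens:
        return 0.0  # nothing recognisable, so treat the text as non-English
    return sum(1 for token in tokens if check(token)) / len(tokens)


def is_mostly_english(text, check, threshold=0.5):
    """Return True if at least `threshold` of the tokens pass `check`."""
    return english_ratio(text, check) >= threshold


# Possible usage with the DataFrame from the question (column name 'Tweets'):
# d = enchant.Dict("en_US")
# mask = ltweet['Tweets'].apply(lambda tweet: is_mostly_english(tweet, d.check))
# ltweet = ltweet[mask]
```

The threshold is a tuning knob: a higher value removes more borderline tweets, but also tweets full of ticker symbols or slang that the dictionary does not know.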
