Original question: http://stackoverflow.com/questions/41290028/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Removing non-English words from text using Python
Asked by Andre Croucher
I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online for a way to do this in Python with a toolkit such as NLTK.
For example, given some text:
"Io andiamo to the beach with my amico."
I would like to be left with:
"to the beach with my"
Does anyone know how this could be done? Any help would be much appreciated.
Answered by DYZ
You can use the words corpus from NLTK:
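import nltk

# English vocabulary from the NLTK words corpus (requires nltk.download('words'))
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
# Keep tokens found in the vocabulary, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'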
Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
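One way to deal with borderline entries such as Io is to filter against a vocabulary you control. The following is a minimal sketch, not part of the original answer: keep_known_words and demo_vocabulary are hypothetical names, the tiny vocabulary is a stand-in for a real word list such as the NLTK words corpus, and the regular expression is assumed to mirror the pattern behind nltk.wordpunct_tokenize so that the example runs with the standard library only.

```python
import re

# Runs of word characters, or runs of punctuation
# (assumed to mirror the pattern used by nltk.wordpunct_tokenize).
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def keep_known_words(text, vocabulary, keep_punctuation=False):
    """Return text reduced to the tokens whose lowercase form is in vocabulary.

    Non-alphabetic tokens (punctuation, digits) are kept only when
    keep_punctuation is True.
    """
    kept = [
        token
        for token in TOKEN_RE.findall(text)
        if token.lower() in vocabulary
        or (keep_punctuation and not token.isalpha())
    ]
    return " ".join(kept)


# Hypothetical stand-in vocabulary; in practice, pass a real English word set.
demo_vocabulary = {"to", "the", "beach", "with", "my"}

print(keep_known_words("Io andiamo to the beach with my amico.", demo_vocabulary))
# to the beach with my
```

Because the vocabulary is an ordinary set, unwanted entries can be removed before filtering, for example by passing words - {"io"} when using the corpus from the answer above.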
Answered by gdmanandamohon
On Mac OS X, this code can still raise an exception, because NLTK does not download the words corpus automatically there. After importing nltk, download the corpus manually; otherwise the lookup will fail.
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
Now you can run the same code as in the previous answer.
sent = "Io andiamo to the beach with my amico."
sent = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())
The NLTK documentation does not say this, but I found an issue on GitHub and solved it this way, and it really works. If you do not pass the words argument to nltk.download, OS X can log you out, and this can happen again and again.
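To avoid downloading the corpus on every run, the download can be guarded by a check for an existing local copy. This is a sketch under assumptions rather than the answerer's code: load_english_words is a hypothetical helper name, and it relies on nltk.data.find raising LookupError when the corpora/words resource is missing.

```python
def load_english_words():
    """Return the NLTK words corpus as a set, downloading it only if it is missing."""
    # Imported here so that this module can be loaded even where NLTK is not installed.
    import nltk

    try:
        # Raises LookupError when the corpus is not available locally.
        nltk.data.find("corpora/words")
    except LookupError:
        # Name the corpus explicitly instead of calling nltk.download() with no argument.
        nltk.download("words", quiet=True)
    return set(nltk.corpus.words.words())
```

The returned set can then be used in place of words in the filtering expression above, for example words = load_english_words().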