Removing non-English words from text using Python

Question

Asked by Andre Croucher

I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether I can do this in Python with a toolkit like NLTK.

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my"

Does anyone know how this could be done? Any help would be much appreciated.

Answer 1

Answered by DYZ

You can use the words corpus from NLTK:

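import nltk

# Build a set for fast membership tests (requires the words corpus to be installed)
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
# Keep tokens that are in the corpus, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'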

Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
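
One way to soften this problem is to combine the dictionary check with an explicit exclusion list of words you know belong to the other language. The following is only a minimal sketch: keep_english is a hypothetical helper, the tiny vocabulary set is a hand-made stand-in for set(nltk.corpus.words.words()), and the regular expression is a simple word/punctuation splitter rather than NLTK's tokenizer. Unlike the snippet above, it drops punctuation tokens.

```python
import re


def keep_english(text, vocabulary, exclude=frozenset()):
    """Keep alphabetic tokens found in `vocabulary` but not in `exclude`.

    Both sets are expected to hold lowercase words. Punctuation is dropped.
    """
    # Split into runs of word characters and runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    kept = [
        t for t in tokens
        if t.isalpha() and t.lower() in vocabulary and t.lower() not in exclude
    ]
    return " ".join(kept)


# Hypothetical stand-in for set(nltk.corpus.words.words())
vocabulary = {"io", "to", "the", "beach", "with", "my"}
sent = "Io andiamo to the beach with my amico."

print(keep_english(sent, vocabulary, exclude={"io"}))
# to the beach with my
```

The exclusion list has to be maintained by hand, so this only helps when the overlap between the two languages is small and known in advance.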

Answer 2

Answered by gdmanandamohon

On macOS this code can still raise an exception, because importing the nltk library does not download the words corpus automatically. You have to download the corpus explicitly, otherwise you will hit the exception.

import nltk
nltk.download('words')  # fetch the words corpus explicitly
words = set(nltk.corpus.words.words())

Now you can run the same code as in the previous answer.

sent = "Io andiamo to the beach with my amico."
sent = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())

The NLTK documentation doesn't mention this, but I found an issue on GitHub, solved it this way, and it really works. If you don't pass the 'words' argument there, macOS can log you off, and this can happen again and again.
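
The explicit download can also be guarded so that it only runs when the corpus is missing. The sketch below assumes a hypothetical helper, ensure_corpus, which takes the lookup and download functions as arguments so that it can be tried without NLTK or network access; it relies on the lookup raising LookupError for a missing resource, which is how nltk.data.find behaves.

```python
def ensure_corpus(find, download, resource_path, package):
    """Download `package` only if `resource_path` is not already available.

    `find` must raise LookupError when the resource is missing.
    Returns True if a download was triggered, False otherwise.
    """
    try:
        find(resource_path)
    except LookupError:
        download(package)
        return True
    return False
```

With NLTK installed, the call would presumably look like ensure_corpus(nltk.data.find, nltk.download, "corpora/words", "words"), placed before the line that builds the words set.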
