How to check if a word is an English word with Python?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3788870/
How to check if a word is an English word with Python?
Asked by Barthelemy
I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?
Accepted answer by Katriel
For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.
There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
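For illustration, here is a sketch of how inflect could plug into the enchant check above to cover the question's properties -> property case. This is an added sketch, not part of the accepted answer; it assumes inflect's engine().singular_noun(), which returns the singular form of a plural noun and False otherwise:

import enchant
import inflect

d = enchant.Dict("en_US")
p = inflect.engine()

def is_english_word_or_plural(word):
    # Direct dictionary hit first.
    if d.check(word):
        return True
    # singular_noun() gives the singular of a plural noun,
    # or False if the word does not look like a plural.
    singular = p.singular_noun(word)
    return bool(singular) and d.check(singular)

print(is_english_word_or_plural("properties"))  # True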
Answered by kindall
Use a set to store the word list, because looking words up in a set is much faster:
with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.
As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
Answered by burkestar
For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request, get the results in JSON format, and parse them with Python's json module. If it's not an English word, you'll get no results.
As another idea, you could query Wiktionary's API.
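A rough sketch of the Wiktionary idea (an illustration added here, not part of the original answer; it assumes the stock MediaWiki query API, where missing pages are reported under the pseudo page-id "-1", and sends a descriptive User-Agent header, which Wikimedia's servers expect):

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    # Ask the MediaWiki API whether a page titled `word` exists.
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": word,
        "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wiktionary.org/w/api.php?" + params,
        headers={"User-Agent": "word-check-example/0.1"},
    )
    with urllib.request.urlopen(req) as response:
        data = json.load(response)
    # Missing pages show up under the pseudo page-id "-1".
    return "-1" not in data["query"]["pages"]

print(in_wiktionary("hello"))  # expected: True

Keep in mind that Wiktionary covers many languages, so a hit does not guarantee the word is English.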
Answered by Susheel Javadi
Using NLTK:
from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # not an English word
else:
    pass  # an English word
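Wrapped up as the is_english_word() function from the question (a straightforward adaptation of the snippet above, not something the answer itself shows):

from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')

def is_english_word(word):
    # WordNet recognises a word if it has at least one synset.
    return bool(wordnet.synsets(word))

print(is_english_word("property"))  # True
print(is_english_word("qwzjk"))     # False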
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
Answered by Sadik
It won't work well with WordNet, because WordNet does not contain all English words. Another NLTK-based possibility, without enchant, is NLTK's words corpus:
>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
Answered by Eb Abadi
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
from nltk.corpus import words as nltk_words

# Build the lookup table once, at module level; rebuilding it on every
# call would defeat the purpose. (A plain set(nltk_words.words()) would
# work just as well as dict.fromkeys here.)
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        dictionary[word]  # raises KeyError if the word is absent
        return True
    except KeyError:
        return False
Answered by grizmin
With pyEnchant.checker SpellChecker:
from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    # Heuristic: at least 3 words and no more than 4 spelling errors.
    return len(errors) <= 4 and len(quote.split()) >= 3
print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Answered by Young Yang
I find that there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadik, and use set(words.words()) to speed it up.
First:
# in a shell:
pip3 install nltk

# then in Python:
import nltk
nltk.download('words')
Then:
from nltk.corpus import words
setofwords = set(words.words())
print("hello" in setofwords)
> True
Answered by Linux4Life531
For All Linux/Unix Users
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from every programming language, which is why I thought you might want to know about it.
Now, for Python users specifically, the code below should assign the list words the value of every single word:
import re

with open("/usr/share/dict/words") as word_file:
    words = re.sub(r"[^\w]", " ", word_file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False
Hope this helps!!!

