How to check if a word is an English word with Python?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3788870/
How to check if a word is an English word with Python?
Asked by Barthelemy
I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?
Accepted answer by Katriel
For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.
There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
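For illustration, here is a sketch of how inflect could plug into the enchant check above to cover the question's properties -> property case. This is an added sketch, not part of the accepted answer; it assumes inflect's engine().singular_noun(), which returns the singular form of a plural noun and False otherwise:

import enchant
import inflect

d = enchant.Dict("en_US")
p = inflect.engine()

def is_english_word_or_plural(word):
    # Direct dictionary hit first.
    if d.check(word):
        return True
    # singular_noun() gives the singular of a plural noun,
    # or False if the word does not look like a plural.
    singular = p.singular_noun(word)
    return bool(singular) and d.check(singular)

print(is_english_word_or_plural("properties"))  # True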
Answered by kindall
Use a set to store the word list, because looking words up in a set is much faster:
with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.
As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
Answered by burkestar
For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request, get the results in JSON format, and parse them with Python's json module. If it's not an English word, you'll get no results.
As another idea, you could query Wiktionary's API.
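A rough sketch of the Wiktionary idea (an illustration added here, not part of the original answer; it assumes the stock MediaWiki query API, where missing pages are reported under the pseudo page-id "-1", and sends a descriptive User-Agent header, which Wikimedia's servers expect):

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    # Ask the MediaWiki API whether a page titled `word` exists.
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": word,
        "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wiktionary.org/w/api.php?" + params,
        headers={"User-Agent": "word-check-example/0.1"},
    )
    with urllib.request.urlopen(req) as response:
        data = json.load(response)
    # Missing pages show up under the pseudo page-id "-1".
    return "-1" not in data["query"]["pages"]

print(in_wiktionary("hello"))  # expected: True

Keep in mind that Wiktionary covers many languages, so a hit does not guarantee the word is English.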
Answered by Susheel Javadi
Using NLTK:
from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # not an English word
else:
    pass  # an English word
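Wrapped up as the is_english_word() function from the question (a straightforward adaptation of the snippet above, not something the answer itself shows):

from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')

def is_english_word(word):
    # WordNet recognises a word if it has at least one synset.
    return bool(wordnet.synsets(word))

print(is_english_word("property"))  # True
print(is_english_word("qwzjk"))     # False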
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
Answered by Sadik
It won't work well with WordNet, because WordNet does not contain all English words. Another NLTK-based possibility, without enchant, is NLTK's words corpus:
>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
Answered by Eb Abadi
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
from nltk.corpus import words as nltk_words

# Build the lookup table once, at module level; rebuilding it on every
# call would defeat the purpose. (A plain set(nltk_words.words()) would
# work just as well as dict.fromkeys here.)
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        dictionary[word]  # raises KeyError if the word is absent
        return True
    except KeyError:
        return False
Answered by grizmin
With pyEnchant.checker SpellChecker:
from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    # Heuristic: at least 3 words and no more than 4 spelling errors.
    return len(errors) <= 4 and len(quote.split()) >= 3
print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Answered by Young Yang
I find that there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadik, and use set(words.words()) to speed it up.
First:
# in a shell:
pip3 install nltk

# then in Python:
import nltk
nltk.download('words')
Then:
from nltk.corpus import words
setofwords = set(words.words())
print("hello" in setofwords)
> True
Answered by Linux4Life531
For All Linux/Unix Users
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from every programming language, which is why I thought you might want to know about it.
Now, for Python users specifically, the code below should assign the list words the value of every single word:
import re

with open("/usr/share/dict/words") as word_file:
    words = re.sub(r"[^\w]", " ", word_file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False
Hope this helps!!!

