Disclaimer: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/43377265/
Determine if text is in English?
Asked by ocean800
I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. However, is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether strings are in English or not. Is this something that is not offered as functionality in either NLTK or scikit-learn?
EDIT: I've seen questions like this and this, but both are for individual words... not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?
I'm using Python, so libraries in Python would be preferable, but I can switch languages if needed; I just thought that Python would be the best for this.
Answered by salehinejad
There is a library called langdetect. It is ported from Google's language-detection, available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
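For instance, here is a minimal sketch of filtering the question's document list with langdetect (detection is probabilistic under the hood, so DetectorFactory.seed is pinned for reproducible results):

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # detection is non-deterministic by default; pin it

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

# Keep only documents whose detected ISO 639-1 code is English.
english_docs = [doc for doc in docs if detect(doc) == "en"]
print(english_docs)  # the French sentence is filtered out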
Answered by Martin Thoma
You might be interested in my paper, The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
- CLD-2 is pretty good and extremely fast
- lang-detect is a tiny bit better, but much slower
- langid is good, but CLD-2 and lang-detect are much better
- NLTK's Textcat is neither efficient nor effective.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
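If you would rather call CLD-2 from Python than shell out to the lidtk CLI, the pycld2 bindings are one option; a minimal sketch, assuming pycld2 is installed and that its detect() return layout of (isReliable, bytesFound, details) applies:

import pycld2 as cld2

for doc in ["this is some text written in English", "Ce n'est pas en anglais"]:
    is_reliable, _, details = cld2.detect(doc)
    # details[0] is the top guess: (language name, language code, percent, score).
    print(doc, '->', details[0][1], '(reliable)' if is_reliable else '(unreliable)')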
Answered by grizmin
This is what I used some time ago. It works for texts of at least 3 words with at most 4 unrecognized words (the settings below). Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.
# Requires the pyenchant package: pip install pyenchant
from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]  # words not found in the en_US dictionary
    # English iff the text is long enough and has few enough spelling errors.
    return len(quote.split()) >= min_text_length and len(errors) <= max_error_count
print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Answered by Thom Ives
Pretrained fastText Model Worked Best for My Similar Needs
I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help in part 7 of Rabash's answer HERE.
After experimenting to find what worked best for my needs, which was making sure that 60,000+ text files were in English, I found that fasttext was an excellent tool.
With a little work, I had a tool that worked very quickly over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.
import fasttext


class English_Check:
    def __init__(self):
        # No need to train a model to detect languages. A very good
        # pretrained one (lid.176.ftz) already exists. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
        # fasttext doesn't like newline characters, but it can take
        # an array of lines from a file. The two list comprehensions
        # below just clean up the lines in fla.
        fla = [line.rstrip('\n').strip(' ') for line in fla]
        fla = [line for line in fla if len(line) > 0]

        for line in fla:  # Language-predict each line of the file.
            language_tuple = self.model.predict(line)
            # The next two lines simply get at the top language prediction
            # string AND the confidence value for that prediction.
            prediction = language_tuple[0][0].replace('__label__', '')
            value = language_tuple[1][0]

            # Each top language prediction for the lines in the file
            # becomes a unique key for the this_D dictionary. Every time
            # that language is found, add the confidence score to the
            # running tally for that language.
            if prediction not in this_D:
                this_D[prediction] = 0
            this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())

        # Calculate the relative confidence of the max confidence to all
        # confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D
                   if self.this_D[key] == max_value][0]

        # Only want to know whether this is English or not.
        return max_key == 'en'
Below is the application / instantiation and use of the above class for my needs.
file_list = []  # filled in by some tool to get my specific list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
Answered by alexis
If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
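To make the idea concrete, here is a minimal toy sketch of the trigram/cosine-similarity approach (my own illustration, not the linked recipe; the reference profiles below are tiny stand-ins that in practice should be built from large per-language corpora):

from collections import Counter
from math import sqrt

def trigram_profile(text):
    # Frequency count of all overlapping 3-character sequences.
    text = ' ' + text.lower() + ' '
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy reference profiles; real ones need much more text per language.
english_ref = trigram_profile("this is some text written in English and some more text")
french_ref = trigram_profile("ce n'est pas en anglais mais en francais bien sur")

sample = trigram_profile("Ce n'est pas en anglais")
print('en score:', cosine_similarity(sample, english_ref))
print('fr score:', cosine_similarity(sample, french_ref))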