
Disclaimer: this page is a translation of a popular StackOverflow thread, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/39142778/


Python: How to determine the language?

Tags: python, string, parsing

Asked by Rita

I want to get this:

Input text: "ру?сский язы?к"
Output text: "Russian" 

Input text: "中文"
Output text: "Chinese" 

Input text: "にほんご"
Output text: "Japanese" 

Input text: "????????????"
Output text: "Arabic"

How can I do it in Python? Thanks.

Accepted answer by dheiberg

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
# output: de
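
The two-letter code from detect() can then be mapped to a language name, as the question asks for. A minimal sketch (not part of this answer), assuming the pycountry package (pip install pycountry):

from langdetect import detect
import pycountry

code = detect("русский язык")  # e.g. 'ru'
language = pycountry.languages.get(alpha_2=code)
print(language.name if language else code)  # 'Russian'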

Answered by Rabash

  1. TextBlob. Requires the NLTK package; uses Google.

    from textblob import TextBlob
    b = TextBlob("bonjour")
    b.detect_language()
    

    pip install textblob

  2. Polyglot. Requires numpy and some arcane libraries; unlikely to get it to work on Windows. (For Windows, get appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) Able to detect texts with mixed languages.

    from polyglot.detect import Detector
    
    mixed_text = u"""
    China (simplified Chinese: 中国; traditional Chinese: 中國),
    officially the People's Republic of China (PRC), is a sovereign state
    located in East Asia.
    """
    for language in Detector(mixed_text).languages:
            print(language)
    
    # name: English     code: en       confidence:  87.0 read bytes:  1154
    # name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
    # name: un          code: un       confidence:   0.0 read bytes:     0
    

    pip install polyglot

    To install the dependencies, run: sudo apt-get install python-numpy libicu-dev

  3. chardet also has a feature of detecting languages if there are character bytes in the range (127-255]:

    >>> import chardet
    >>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
    {'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}
    

    pip install chardet

  4. langdetect requires large portions of text. It uses a non-deterministic approach under the hood, which means you can get different results for the same text sample. The docs say you have to use the following code to make it deterministic:

    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0
    detect('今一はお前さん')
    

    pip install langdetect

  5. guess_language can detect very short samples by using this spell checker with dictionaries.

    pip install guess_language-spirit

  6. langid provides both a module:

    import langid
    langid.classify("This is a test")
    # ('en', -54.41310358047485)
    

    and a command-line tool:

    $ langid < README.md
    

    pip install langid

  7. fastText is a text classifier that can be used to recognize 176 languages with a proper model for language classification. Download this model, then:

    import fasttext
    model = fasttext.load_model('lid.176.ftz')
    print(model.predict('الشمس تشرق', k=2))  # top 2 matching languages

    # (('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))
    

    pip install fasttext

  8. pyCLD3 is a neural network model for language identification. This package contains the inference code and a trained model.

    import cld3
    cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
    
    # LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
    

    pip install pycld3

Answered by Habib Karbasian

There is an issue with langdetect when it is used for parallelization: it fails. spacy_langdetect is a wrapper around it that you can use for that purpose. You can use the following snippet as well:

import spacy
from spacy_langdetect import LanguageDetector

nlp = spacy.load("en")  # spaCy 2.x shortcut model; install it with: python -m spacy download en
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)
text = "This is English text Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque. Je m'appelle Angélica Summer, j'ai 12 ans et je suis canadienne."
doc = nlp(text)
# document level language detection. Think of it like average language of document!
print(doc._.language['language'])
# sentence level language detection
for i, sent in enumerate(doc.sents):
    print(sent, sent._.language)

Answered by Salva Carrión

Depending on the case, you might be interested in using one of the following methods:

Method 0: Use an API or library

Usually, there are a few problems with these libraries: some are not accurate for small texts, some languages are missing, some are slow, require an internet connection, or are non-free... But generally speaking, they will suit most needs.

Method 1: Language models

A language model gives us the probability of a sequence of words. This is important because it allows us to robustly detect the language of a text, even when the text contains words in other languages (e.g.: "'Hola' means 'hello' in Spanish").

You can use N language models (one per language), to score your text. The detected language will be the language of the model that gave you the highest score.

If you want to build a simple language model for this, I'd go for 1-grams. To do this, you only need to count the number of times each word from a big text (e.g. Wikipedia Corpus in "X" language) has appeared.

Then, the probability of a word will be its frequency divided by the total number of words analyzed (sum of all frequencies).

the 23135851162
of  13151942776
and 12997637966
to  12136980858
a   9081174698
in  8469404971
for 5933321709
...

=> P("'Hola' means 'hello' in spanish") = P("hola") * P("means") * P("hello") * P("in") * P("spanish")

If the text to detect is quite big, I recommend sampling N random words and then use the sum of logarithms instead of multiplications to avoid floating-point precision problems.

P(s) = 0.03 * 0.01 * 0.014 = 0.0000042
P(s) = log10(0.03) + log10(0.01) + log10(0.014) = -5.376
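
To make Method 1 concrete, here is a minimal sketch (not from the original answer) with toy unigram counts; the numbers are hypothetical, and real models would be built by counting words in large per-language corpora such as Wikipedia dumps:

import math

# Toy unigram counts; the numbers are hypothetical and for illustration only.
models = {
    "english": {"the": 23135851162, "of": 13151942776, "in": 8469404971,
                "hello": 230000000, "means": 96000000},
    "spanish": {"de": 20000000000, "la": 12000000000, "en": 9000000000,
                "hola": 150000000, "significa": 80000000},
}

def log_score(text, model, smoothing=1):
    # Sum of log10 word probabilities under a unigram model, add-one smoothed
    # so that unseen words don't zero out the whole score.
    total = sum(model.values())
    return sum(math.log10((model.get(word, 0) + smoothing) / (total + smoothing))
               for word in text.lower().split())

def detect_language(text):
    # The detected language is the model that assigns the highest score.
    return max(models, key=lambda lang: log_score(text, models[lang]))

print(detect_language("hola means hello in spanish"))  # 'english' with these toy counts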

Method 2: Intersecting sets

An even simpler approach is to prepare N sets (one per language) with the top M most frequent words. Then intersect your text with each set. The set with the highest number of intersections will be your detected language.

spanish_set = {"de", "hola", "la", "casa",...}
english_set = {"of", "hello", "the", "house",...}
czech_set = {"z", "ahoj", "závěrky", "dům",...}
...

text_set = {"hola", "means", "hello", "in", "spanish"}

spanish_votes = len(text_set.intersection(spanish_set))  # 1
english_votes = len(text_set.intersection(english_set))  # 4
czech_votes = len(text_set.intersection(czech_set))  # 0
...

Method 3: Zip compression

This is more a curiosity than anything else, but here it goes... You can compress your text (e.g. LZ77) and then measure the zip-distance with respect to a reference compressed text (target language). Personally, I didn't like it because it's slower, less accurate and less descriptive than other methods. Nevertheless, there might be interesting applications for this method. To read more: Language Trees and Zipping
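
As a rough sketch of the idea (not from the original answer), the normalized compression distance can be computed with the standard library's zlib. The reference texts here are hypothetical placeholders; real references should be several kilobytes of representative text per language:

import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: smaller means more similar.
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical tiny reference texts; real references should be much longer.
references = {
    "english": b"the quick brown fox jumps over the lazy dog " * 50,
    "spanish": b"el veloz zorro marron salta sobre el perro perezoso " * 50,
}

sample = "el perro y el zorro".encode("utf-8")
print(min(references, key=lambda lang: ncd(sample, references[lang])))  # 'spanish', hopefully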

Answered by Kerbiter

You can try determining the Unicode group of the characters in the input string to point out the type of language (Cyrillic for Russian, for example), and then search for language-specific symbols in the text.
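
A minimal sketch of that idea (not part of the original answer), using the standard library's unicodedata to tally which script most alphabetic characters belong to. Note that this identifies the script, not the language: Cyrillic could be Russian, Ukrainian, Bulgarian, and so on.

import unicodedata
from collections import Counter

def guess_script(text):
    # Tally the first word of each alphabetic character's Unicode name,
    # e.g. 'CYRILLIC SMALL LETTER ER' -> 'CYRILLIC'.
    scripts = Counter(unicodedata.name(ch, "?").split()[0]
                      for ch in text if ch.isalpha())
    return scripts.most_common(1)[0][0] if scripts else None

print(guess_script("русский язык"))  # CYRILLIC
print(guess_script("にほんご"))      # HIRAGANA
print(guess_script("中文"))          # CJK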

Answered by Thom Ives

Pretrained fastText Model Worked Best for My Similar Needs

I arrived at your question with a very similar need. I found the most help in Rabash's answer for my specific needs.

After experimenting with his recommendations to find what worked best for my task (making sure that 60,000+ text files were in English), I found that fastText was an excellent tool for it.

With a little work, I had a tool that ran very fast over many files. It could easily be modified for something like your case, because fastText works over a list of lines easily.
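
As a hypothetical sketch of that kind of tool (the directory, file pattern, and model path below are assumptions for illustration, not the author's actual code):

import pathlib
import fasttext

model = fasttext.load_model("lid.176.ftz")  # pretrained language-ID model, downloaded separately

def file_is_english(path):
    # Predict the language of a file from its first non-empty line;
    # fastText's predict() expects a single line without newlines.
    for line in pathlib.Path(path).read_text(encoding="utf-8", errors="ignore").splitlines():
        if line.strip():
            labels, probs = model.predict(line.strip())
            return labels[0] == "__label__en"
    return False

english_files = [p for p in pathlib.Path("texts").glob("*.txt") if file_is_english(p)]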

My code with comments is among the answers on this post. I believe that you and others can easily modify this code for other specific needs.