Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/43916600/

Time: 2020-09-14 03:35:24  Source: igfitidea

Text language detection in Python

Tags: python, pandas, dataframe

Asked by Abrar

I am trying to detect the language of texts that may be written in an unknown number of languages. The following code gives me a different language for each text. NOTE: I shortened the reviews because posting them in full gave an error ("" are not allowed).

from langdetect import detect

# The first (Arabic) review survives only as "?" placeholders in this copy.
print(detect("???? ????? ?????? ??????? ???? ??? ????? ??????"))
print(detect("的马来西亚"))
print(detect("Vi havde 2 perfekte dage i Legoland Malaysia"))
print(detect("Wij hebben alleen gekozen voor het waterpark maar daar ben je vrijs snel doorheen. Super leuke glijbanen en overal ruimte om te zitten en te liggen. Misschien volgende keer een gecombineerd ticket kopen met ook toegang tot waterpark"))
print(detect("This is a park thats just ok, nothing great to write home about.  There is barely any shade, the weather is always really hot so they need to take this into consideration. The atractions are just meh. I would only go if you are a fan of lego, for the sculptures are nice."))

Here is the output

ar
zh-cn
da
nl
en

But with the following loop, every review gets 'en' as the result:

from langdetect import detect
import pandas as pd
df = pd.read_excel('data.xls') #
lang = []    
for r in df.Review:
    lang = detect(r)
    df['Languagereveiw'] = lang

The output is 'en' for all five rows.

I need guidance on what I am missing here.

Here is the sample data

Secondly, how can I get the full name of each language, e.g. 'English' for 'en'?

Answered by EdChum

In your loop, this line overwrites the entire column on every iteration, so every row ends up with the language detected for the last review:

df['Languagereveiw'] = lang

If you want to do this in a for loop, use items (iteritems in older pandas; it was removed in pandas 2.0):

for index, row in df['Review'].items():
    lang = detect(row)  # detect the language of each review
    df.loc[index, 'Languagereveiw'] = lang
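
The difference between the two loops can be checked on a tiny frame. This is only a sketch: fake_detect is a hypothetical stand-in for langdetect.detect so that the example runs without the library, and the column names Buggy and Fixed are made up for illustration.

```python
import pandas as pd


def fake_detect(text):
    # Hypothetical stand-in for langdetect.detect: keys on one known Danish word.
    return "da" if "dage" in text else "en"


df = pd.DataFrame({"Review": ["Vi havde 2 perfekte dage", "This is a park"]})

# Pattern from the question: a scalar assigned to a column is broadcast to
# every row, so each pass replaces the whole column and only the result for
# the last review survives.
for r in df["Review"]:
    df["Buggy"] = fake_detect(r)

# Per-row pattern: write each result to its own index label.
# Series.items is the current name of the iterator (iteritems before pandas 2.0).
for index, text in df["Review"].items():
    df.loc[index, "Fixed"] = fake_detect(text)

print(df[["Buggy", "Fixed"]])
```

The Buggy column holds the same value in both rows, while Fixed holds one value per review.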

However, you can drop the loop and simply do:

df['Languagereveiw'] = df['Review'].apply(detect)

This is shorthand for calling your function on every value in the column.
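
One caveat with apply on spreadsheet data is that empty cells or text without letters can make the detector raise an error. A minimal sketch of a guard, assuming a wrapper named safe_detect (hypothetical, not part of langdetect) and again using a hypothetical fake_detect stand-in so the example runs on its own:

```python
import pandas as pd


def safe_detect(text, detector):
    """Return detector(text), or None for missing, blank, or undetectable input."""
    if not isinstance(text, str) or not text.strip():
        return None
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException when it finds no usable
        # features; a broad except keeps this sketch library-agnostic.
        return None


def fake_detect(text):
    # Hypothetical stand-in for langdetect.detect.
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"


df = pd.DataFrame({"Review": ["Nice park", None, "   ", "12345"]})
df["Language"] = df["Review"].apply(lambda t: safe_detect(t, fake_detect))
print(df)
```

With langdetect installed, detect can be passed in place of fake_detect. Its results can vary between runs on short texts; the langdetect README suggests setting DetectorFactory.seed = 0 beforehand to make them repeatable.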

Regarding your second question about converting a language code to its full name ('en' to 'English'), look at polyglot.

It can detect the language and return both the language code and the full name.
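
If only the code-to-name step is needed, a plain lookup table may be enough. The sketch below is not polyglot's API: LANGUAGE_NAMES and language_name are hypothetical names, and the table covers only the five codes from the question's output.

```python
# Hypothetical lookup table covering only the codes seen in the question.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "zh": "Chinese",
    "da": "Danish",
    "nl": "Dutch",
    "en": "English",
}


def language_name(code):
    """Map a code such as 'en' or 'zh-cn' to a name; unknown codes come back unchanged."""
    base = code.split("-")[0].lower()  # drop a region suffix such as '-cn'
    return LANGUAGE_NAMES.get(base, code)


names = [language_name(c) for c in ["ar", "zh-cn", "da", "nl", "en"]]
print(names)
```

On a DataFrame, the same function could be applied to the detected column, for example df['Languagereveiw'].apply(language_name).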