Text language detection in Python (pandas)
Source: http://stackoverflow.com/questions/43916600/
Notice: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Abrar
I am trying to detect the language of texts that may be written in an unknown number of languages. The following code gives me a different language for each text. NOTE: I shortened the reviews because posting them in full gave the error "" are not allowed.
from langdetect import detect

print(detect("???? ????? ?????? ??????? ???? ??? ????? ??????"))
print(detect("的马来西亚"))
print(detect("Vi havde 2 perfekte dage i Legoland Malaysia"))
print(detect("Wij hebben alleen gekozen voor het waterpark maar daar ben je vrijs snel doorheen. Super leuke glijbanen en overal ruimte om te zitten en te liggen. Misschien volgende keer een gecombineerd ticket kopen met ook toegang tot waterpark"))
print(detect("This is a park thats just ok, nothing great to write home about. There is barely any shade, the weather is always really hot so they need to take this into consideration. The atractions are just meh. I would only go if you are a fan of lego, for the sculptures are nice."))
Here is the output:
ar
zh-cn
da
nl
en
But with the following loop, every review gets 'en' as the result:
from langdetect import detect
import pandas as pd
df = pd.read_excel('data.xls')
lang = []
for r in df.Review:
    lang = detect(r)
    df['Languagereveiw'] = lang
The output is 'en' for all five rows.
What am I missing? I need some guidance.
Here is the sample data
Secondly, how can I get the full name of a language, e.g. English for 'en'?
Answered by EdChum
In your loop you're overwriting the entire column on every iteration by doing this:
df['Languagereveiw'] = lang
If you want to do this in a for loop, iterate over the column with items() (named iteritems() in older pandas versions; iteritems() was removed in pandas 2.0):
for index, row in df['Review'].items():
    lang = detect(row)  # detect the language of this row's review
    df.loc[index, 'Languagereveiw'] = lang  # write the result to this row only
However, you can ditch the loop entirely and just do:
df['Languagereveiw'] = df['Review'].apply(detect)
apply calls your function on each value in the column and collects the results into a new column.
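To make the difference between the two patterns concrete, here is a minimal sketch. It assumes a hypothetical stand-in function, fake_detect, in place of langdetect's detect so that it runs without the library and gives a fixed result; the column names Buggy and Fixed are also made up for illustration.

```python
import pandas as pd


def fake_detect(text):
    # Hypothetical stand-in for langdetect.detect (an assumption for this
    # sketch): returns 'da' for the Danish sample and 'en' for anything else.
    return 'da' if text.startswith('Vi havde') else 'en'


df = pd.DataFrame({'Review': [
    'Vi havde 2 perfekte dage i Legoland Malaysia',
    'This is a park thats just ok',
]})

# Buggy pattern: assigning a scalar to a column broadcasts it to every row,
# so each pass overwrites the whole column and only the last result survives.
for r in df['Review']:
    df['Buggy'] = fake_detect(r)

# Fixed pattern: apply calls the function once per value and keeps each result.
df['Fixed'] = df['Review'].apply(fake_detect)

print(df[['Buggy', 'Fixed']])
```

With the real library, passing detect to apply instead of fake_detect should behave the same way, one result per row.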
Regarding your second question, about converting a language code to its full name ('en' to 'English'), look at polyglot. It can detect the language and gives you both the language code and the full language name.
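If installing another library is not an option, a plain lookup table may be enough for a small, known set of languages. The sketch below is not polyglot's API; LANGUAGE_NAMES and language_name are hypothetical names, and the table only covers the five codes that appear in the question's output.

```python
# Hand-written lookup table (an assumption for this sketch, not part of
# langdetect or polyglot); only the codes seen in the question are listed.
LANGUAGE_NAMES = {
    'ar': 'Arabic',
    'zh-cn': 'Chinese (Simplified)',
    'da': 'Danish',
    'nl': 'Dutch',
    'en': 'English',
}


def language_name(code):
    # Fall back to the code itself when it is not in the table.
    return LANGUAGE_NAMES.get(code, code)


codes = ['ar', 'zh-cn', 'da', 'nl', 'en']
names = [language_name(c) for c in codes]
print(names)
```

On a DataFrame, the same function could be applied to the detected-code column, for example with df['Languagereveiw'].map(language_name).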