Text language detection in pandas (Python)

Question

Asked by Abrar

I am trying to detect the language of texts that may be written in an unknown number of languages. The following calls give a different language for each text. Note: I shortened the reviews because posting the full text failed with an error ("" are not allowed).

from langdetect import detect

# Arabic review (its characters are shown as '?' here)
print(detect("???? ????? ?????? ??????? ???? ??? ????? ??????"))
print(detect("的马来西亚"))
print(detect("Vi havde 2 perfekte dage i Legoland Malaysia"))
print(detect("Wij hebben alleen gekozen voor het waterpark maar daar ben je vrijs snel doorheen. Super leuke glijbanen en overal ruimte om te zitten en te liggen. Misschien volgende keer een gecombineerd ticket kopen met ook toegang tot waterpark"))
print(detect("This is a park thats just ok, nothing great to write home about.  There is barely any shade, the weather is always really hot so they need to take this into consideration. The atractions are just meh. I would only go if you are a fan of lego, for the sculptures are nice."))

Here is the output

ar
zh-cn
da
nl
en

But with the following loop, every review gets 'en' as the result:

from langdetect import detect
import pandas as pd
df = pd.read_excel('data.xls') #
lang = []    
for r in df.Review:
    lang = detect(r)
    df['Languagereveiw'] = lang

The output is 'en' for all five rows.

What am I missing in the loop?

Here is the sample data

Secondly, how can I get the full name of each language, e.g. 'English' for 'en'?

Answer 1

Answered by EdChum

In your loop you're overwriting the entire column by doing this:

df['Languagereveiw'] = lang
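
To see why every row ends up with the same value, here is a minimal sketch with toy data. The column name `Label` and the `upper()` call are hypothetical stand-ins for the real column and for `detect`; the point is only that assigning a scalar to a column broadcasts it to all rows.

```python
import pandas as pd

# Toy data standing in for the question's spreadsheet (hypothetical values).
df = pd.DataFrame({"Review": ["first", "second", "third"]})

for r in df["Review"]:
    # Assigning a scalar to a column broadcasts it to every row,
    # so each pass replaces the whole column with the current value.
    df["Label"] = r.upper()

result = df["Label"].tolist()
print(result)  # every row holds the value computed in the last iteration
```

In the question's data the last review is the English one, which is why all five rows show 'en'.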

If you want to do this in a for loop, use items() (named iteritems() in older pandas; it was removed in pandas 2.0):

for index, text in df['Review'].items():
    lang = detect(text)  # detect the language of one review
    df.loc[index, 'Languagereveiw'] = lang  # write only this row

However, you can ditch the loop entirely and just do:

df['Languagereveiw'] = df['Review'].apply(detect)

apply calls detect once for each value in the column and collects the results into the new column, so no explicit loop is needed.
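
Spreadsheet columns often contain empty or non-text cells, which a detector may reject. The sketch below wraps the call so that such cells yield None instead of stopping the whole apply. The helper `detect_or_none` and the `toy_detector` stand-in are hypothetical names introduced here so the example runs without langdetect installed; with langdetect you would pass its `detect` function instead.

```python
import pandas as pd

def detect_or_none(text, detector):
    """Return detector(text), or None for blank or non-string cells and on failure."""
    if not isinstance(text, str) or not text.strip():
        return None
    try:
        return detector(text)
    except Exception:
        # Treat any detector failure as "unknown" rather than aborting.
        return None

def toy_detector(text):
    # Stand-in for a real detector: labels pure-ASCII text as 'en'.
    return "en" if text.isascii() else "other"

df = pd.DataFrame({"Review": ["This is fine", "的马来西亚", None, "   "]})
# Extra keyword arguments to apply are forwarded to the function.
df["Languagereveiw"] = df["Review"].apply(detect_or_none, detector=toy_detector)

labels = df["Languagereveiw"].tolist()
print(labels)
```

Catching every exception is a deliberately broad choice for a sketch; in real code it is usually better to catch only the error type your detector raises.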

Regarding your second question, about converting a language code to its full name ('en' to 'English'), look at polyglot. It provides facilities to detect the language and to get both the language code and the full name.
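
If installing another library is not an option, a hand-written lookup table is a simple alternative for a small, known set of codes. This is not polyglot's API, only a plain-dictionary sketch; `LANGUAGE_NAMES` and `language_name` are hypothetical names, and the table covers just the five codes from the question's output.

```python
# Hand-written lookup for the codes that appear in the question's output.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "zh-cn": "Chinese (Simplified)",
    "da": "Danish",
    "nl": "Dutch",
    "en": "English",
}

def language_name(code):
    """Return the full name for a code, or the code itself if it is not in the table."""
    return LANGUAGE_NAMES.get(code.lower(), code)

names = [language_name(c) for c in ["ar", "zh-cn", "da", "nl", "en", "xx"]]
print(names)
```

With a DataFrame, the same function can be applied to the code column, for example df['Languagereveiw'].map(language_name).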
