pandas UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/54536539/

Date: 2020-09-14 06:18:49  Source: igfitidea

python, python-3.x, pandas

Asked by Mohit Motwani

When I try to use:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_csv('sentiment_data.csv')

I get the error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

I don't understand what this error means or how I can fix it and export my data to CSV/Excel. I have referred to this question, but I didn't understand much of it, and it doesn't explain how to do this with pandas.

What does position 388 mean? What is the character '\ud83d'?

I get a different error position when I try to export to Excel:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_excel('sentiment_data_new.xlsx')

Error while exporting to Excel:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 261: surrogates not allowed

Why is the position different when it's the same encoding?

The other duplicate questions don't answer how to avoid this error with a pandas DataFrame.

Answered by BoarGules

This answer responds to a comment and is too long to put in a comment itself.

Emojis in Unicode lie outside the Basic Multilingual Plane (BMP). Surrogate pairs are a way to make these characters directly representable in UTF-16, as two 16-bit code units.
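
To illustrate the point above (a sketch added for clarity, not part of the original answer), the snippet below shows how one emoji splits into the two surrogates, and why UTF-8 rejects a surrogate on its own. The sample string "abc\ud83d" is made up for the demonstration. Reading the "position" in the error message as an index into the text being encoded is based on the `start` attribute of Python's `UnicodeEncodeError`; that the CSV and Excel writers encode different chunks of text, which would explain the two different positions in the question, is an assumption.

```python
# U+1F604 lies outside the BMP, so UTF-16 stores it as two 16-bit code units.
emoji = "\U0001f604"
utf16 = emoji.encode("utf-16-be")  # big-endian, no byte-order mark
units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
print(units)  # ['d83d', 'de04']

# Half of a pair is not a character in its own right, so UTF-8 refuses to encode it.
try:
    "abc\ud83d".encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared when the except block ends

print(error.reason)  # surrogates not allowed
print(error.start)   # 3, the index of the offending character in the string
```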

You can force surrogate pairs to be resolved into the corresponding codepoint outside the BMP like this:

>>> "\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16')
'\U0001f604'
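
Applied to a DataFrame, the same conversion might look like the sketch below. The helper name `fix_surrogates`, the sample data, and the column names are hypothetical, and the choice to replace unpaired surrogates with U+FFFD on decode is an assumption about what is acceptable for the data; this is one possible approach, not necessarily what the original answer intended.

```python
import pandas as pd


def fix_surrogates(value):
    """Merge UTF-16 surrogate pairs in a str into real code points.

    Non-string values (numbers, NaN, ...) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # 'surrogatepass' lets the surrogates through when encoding; decoding the
    # bytes as UTF-16 then joins each pair. 'replace' is meant to turn any
    # surrogate that has no partner into U+FFFD instead of raising.
    return value.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# Hypothetical sample data; dtype=object keeps the raw Python strings as they are.
df = pd.DataFrame({
    "description": pd.Series(["great \ud83d\ude04", "plain text"], dtype=object),
    "score": [1, 2],
})

for col in df.columns:
    df[col] = df[col].map(fix_surrogates)

df.to_csv("sentiment_data.csv")  # the text can now be encoded as UTF-8
```

Running the helper over every column keeps the example short; on a large frame it would be cheaper to restrict it to the text columns.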

But this solution may only get you so far.

A lot of software (for example IDLE) only supports the BMP, because it doesn't really use UTF-16 but its predecessor UCS-2, which is essentially UTF-16 without support for codepoints outside the BMP. In IDLE, print('\U0001f604') will just raise a UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f604' in position 0: Non-BMP character not supported in Tk
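
If the text has to pass through BMP-only software of this kind, one possible workaround (an assumption on my part, not something the original answer proposes) is to substitute every non-BMP character before handing the text over. The helper name `replace_non_bmp` and the default replacement character are hypothetical.

```python
import re

# Matches any code point above the BMP (U+10000 through U+10FFFF).
_NON_BMP = re.compile(r"[\U00010000-\U0010FFFF]")


def replace_non_bmp(text, replacement="\ufffd"):
    """Return text with every non-BMP character swapped for the replacement."""
    return _NON_BMP.sub(replacement, text)


cleaned = replace_non_bmp("ok \U0001f604")
print(ascii(cleaned))  # 'ok \ufffd'
```

This loses the emoji, so it only makes sense where displaying the text matters more than preserving it exactly.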
