pandas UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

Question

Asked by Mohit Motwani

When I try to use:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_csv('sentiment_data.csv')

I get the error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

I don't understand what this error means or how I can fix it and export my data to CSV/Excel. I have referred to this question, but I don't understand much of it, and it doesn't answer how to do this with pandas.

What does position 388 mean? What is the character '\ud83d'?

I get a different error position when I try to export to an excel:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_excel('sentiment_data_new.xlsx')

Error while exporting to excel:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 261: surrogates not allowed

Why is the position different when it's the same encoding?

The other duplicate questions don't answer how to avoid this error with a pandas DataFrame.

Answer 1

Answered by BoarGules

This answer responds to a comment and is too long to put in a comment itself.

Emojis in Unicode lie outside the Basic Multilingual Plane (BMP). Surrogate pairs are a way to make these glyphs directly representable in UTF-16 as two 16-bit code units.
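
As a rough illustration of how a surrogate pair maps to a single codepoint, and of why a lone surrogate such as '\ud83d' cannot be written as UTF-8, here is a minimal sketch. The helper name `combine_surrogates` is made up for this example. The `start` attribute of the exception is the index of the offending character within the string being encoded, which is the "position" quoted in the error message.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair \ud83d \ude04 corresponds to U+1F604.
print(hex(combine_surrogates(0xD83D, 0xDE04)))  # 0x1f604

# A surrogate on its own is not a valid character, so UTF-8 refuses it.
try:
    "ab\ud83d".encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)  # surrogates not allowed
    print(exc.start)   # 2, the index of the surrogate in the string
```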

You can force surrogate pairs to be resolved into the corresponding codepoint outside the BMP like this:

>>> "\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16')
'\U0001f604'
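
The same round trip can be wrapped in a small helper and applied to each text value before exporting. This is only a sketch under assumptions: `fix_surrogates` is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a choice made here, not something the original data dictates.

```python
def fix_surrogates(value):
    """Merge UTF-16 surrogate pairs in a string into real codepoints.

    Non-string values (numbers, NaN, None) are returned unchanged.
    Unpaired surrogates cannot be merged, so the 'replace' error handler
    turns them into U+FFFD instead of raising.
    """
    if not isinstance(value, str):
        return value
    raw = value.encode("utf-16", "surrogatepass")
    return raw.decode("utf-16", "replace")


cleaned = fix_surrogates("nice \ud83d\ude04")
print(cleaned == "nice \U0001f604")  # True
# The cleaned text can now be encoded, which is what to_csv needs.
print(len(cleaned.encode("utf-8")))  # 9
```

With a DataFrame, one could map the helper over each text column before calling to_csv or to_excel, for example `df['text'] = df['text'].map(fix_surrogates)`, where the column name `text` is hypothetical.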

But this solution may only get you so far.

A lot of software (for example IDLE) only supports the BMP, because it doesn't really use UTF-16 but its predecessor UCS-2, which is essentially UTF-16 without support for codepoints outside the BMP. In IDLE, print('\U0001f604') will just raise a UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f604' in position 0: Non-BMP character not supported in Tk
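
If the output has to pass through BMP-only software, one possible workaround is to detect codepoints above U+FFFF and show them as escapes instead. The helper names below are hypothetical, and escaping is just one option among several (dropping or replacing the characters would also work).

```python
def has_non_bmp(text):
    """Return True if the string contains a codepoint outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)


def ascii_escape(text):
    """Render every non-ASCII character as a backslash escape."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


print(has_non_bmp("plain text"))        # False
print(has_non_bmp("smile \U0001f604"))  # True
print(ascii_escape("smile \U0001f604")) # smile \U0001f604
```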