Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/34409085/

Date: 2020-08-31 08:27:17  Source: igfitidea

Why does an emoji have two different UTF-8 encodings? How do I decode an emoji from UTF-8 using NSString on iOS?

Tags: ios, unicode, utf-8, nsstring, emoji

Asked by pinchwang

We have found that some emoji appear to have two UTF-8 encodings, for example:

emoji   code point   UTF-8                another byte sequence
😁      U+1F601      \xf0\x9f\x98\x81     \xed\xa0\xbd\xed\xb8\x81

But iOS cannot decode the second byte sequence, so I get an error when I decode the string from UTF-8.

ios code



In all the documents I found, only one UTF-8 encoding is listed for each emoji; I cannot find the other one anywhere.

The documents I referenced include:

emoji code link

whole utf-8 code link

But in a web tool, bianma, both byte sequences are converted into the emoji correctly.

input code

output



So, my questions are:

  1. Why are there two UTF-8 byte sequences for one emoji?

  2. Is there a document that lists both byte sequences?

  3. How do I correctly decode a string from UTF-8 using NSString on iOS?

Accepted answer by bobince

0xF0, 0x9F, 0x98, 0x81

Is the correct UTF-8 encoding for U+1F601.
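The four bytes can be derived by hand from the code point. The following is a minimal sketch in Python (chosen only for illustration, it is not iOS code), and the helper name utf8_encode_4byte is hypothetical. It applies the four-byte UTF-8 bit layout from RFC 3629 to U+1F601 and compares the result with Python's built-in encoder.

```python
def utf8_encode_4byte(code_point: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only handles supplementary code points")
    return bytes([
        0xF0 | (code_point >> 18),            # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),           # 10xxxxxx: lowest 6 bits
    ])

encoded = utf8_encode_4byte(0x1F601)
builtin = "\U0001F601".encode("utf-8")
print(" ".join(f"{b:02x}" for b in encoded))
```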

0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81

Is not a valid UTF-8 sequence (*). It really should be rejected; iOS is correct to do so.
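A strict decoder in another language shows the same behaviour. This sketch uses Python's built-in UTF-8 decoder as a stand-in for a conforming implementation; the helper name try_decode is hypothetical and returns None on failure purely for illustration.

```python
good = bytes([0xF0, 0x9F, 0x98, 0x81])
bad = bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81])

def try_decode(data: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError:
        return None

decoded_good = try_decode(good)   # the emoji U+1F601
decoded_bad = try_decode(bad)     # rejected: the bytes encode surrogates
```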

This is a bug in the bianma tool: the convertUtf8BytesToUnicodeCodePoints function is more lenient about what input it accepts than the algorithm specified in, e.g., RFC 3629.

This happens to return a working string only because the tool is written in JavaScript. Having decoded the above byte sequence to the bogus surrogate code point sequence U+D83D, U+DE01, it then converts that into a JavaScript string using a direct code-point-to-code-unit mapping, giving \uD83D\uDE01. As this is the correct way to encode U+1F601 in a UTF-16 string, it appears to have worked.
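The two steps can be reproduced outside JavaScript. This is a rough sketch, not the tool's actual code: Python's "surrogatepass" error handler stands in for the lenient decoder, and the surrogate-pair arithmetic shows why the two bogus code points line up with U+1F601 once they are treated as UTF-16 code units.

```python
bad = bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81])

# Lenient step: "surrogatepass" lets surrogate code points through
# instead of rejecting them as a strict UTF-8 decoder would.
lenient = bad.decode("utf-8", "surrogatepass")
code_points = [ord(ch) for ch in lenient]

# Interpreted as UTF-16 code units, the two values form a surrogate pair.
high, low = code_points
combined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(cp) for cp in code_points], hex(combined))
```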

(*: It is a valid CESU-8 sequence, but that encoding is just “bogus broken encoding for compatibility with badly-written historical tools” and should generally be avoided.)
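As a sketch of where such bytes can come from: CESU-8 amounts to taking the UTF-16 code units of a string and encoding each unit separately with the UTF-8 bit layout, so a surrogate pair becomes two three-byte sequences. The helper name cesu8_encode is hypothetical, and "surrogatepass" is used only to let the individual surrogate halves through Python's encoder.

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit of text on its own, as CESU-8 does."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for (unit,) in struct.iter_unpack(">H", utf16):
        # Surrogate halves are normally refused; surrogatepass allows them.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

cesu = cesu8_encode("\U0001F601")
print(" ".join(f"{b:02x}" for b in cesu))
```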

You should not usually encounter a sequence like this; it is typically not worth catering for unless you have a specific source of this kind of malformed data which you don't have the power to get fixed.

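If you are stuck with such a source, one possible workaround is to try strict UTF-8 first and fall back to a CESU-8 style repair only when that fails. The sketch below is an assumption about how that could look, not a recommended default; the function name decode_utf8_or_cesu8 is hypothetical. Unpaired surrogates are still rejected, because the final UTF-16 decode is strict.

```python
def decode_utf8_or_cesu8(data: bytes) -> str:
    """Decode strict UTF-8, falling back to repairing encoded surrogate pairs."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Let encoded surrogates through, then round-trip via UTF-16 so that
    # each high/low pair is joined into one supplementary code point.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")   # raises if a surrogate is unpaired

repaired = decode_utf8_or_cesu8(b"hi \xed\xa0\xbd\xed\xb8\x81")
untouched = decode_utf8_or_cesu8(b"hi \xf0\x9f\x98\x81")
```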

Answer by Polina

This worked for me in PHP to send a message with an emoji to a Telegram bot:

$message_text = " \xf0\x9f\x98\x81 ";