Why does an emoji have two different UTF-8 codes? How do I convert emoji from UTF-8 using NSString on iOS?
Original question: http://stackoverflow.com/questions/34409085/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors on StackOverflow.
Asked by pinchwang
We have found that some emoji appear to have two UTF-8 encodings, for example:
emoji   unicode   utf-8              another utf-8
😁      U+1F601   \xf0\x9f\x98\x81   \xed\xa0\xbd\xed\xb8\x81
But iOS can't decode the second form, so I get an error when I decode the string from UTF-8.
In all the documents I found, I can find only one UTF-8 encoding for each emoji; the other form does not appear anywhere.
The documents I referenced include:
But in the web tool bianma, both forms are converted into the emoji correctly.
So, my questions are:
Why are there two UTF-8 encodings for one emoji?
Is there a document that lists both encodings?
How do I correctly convert a string from UTF-8 using NSString on iOS?
Accepted answer by bobince
0xF0, 0x9F, 0x98, 0x81
is the correct UTF-8 encoding for U+1F601.
0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81
is not a valid UTF-8 sequence(*). It should really be rejected; iOS is correct to do so.
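As a quick cross-check outside iOS, here is a minimal Python sketch that feeds both byte sequences to a strict UTF-8 decoder. The names `VALID`, `SURROGATE_FORM`, and `is_valid_utf8` are made up for this illustration; the behaviour shown is that of Python 3's built-in `utf-8` codec, not of NSString.

```python
# The two byte sequences from the question.
VALID = bytes([0xF0, 0x9F, 0x98, 0x81])
SURROGATE_FORM = bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


decoded = VALID.decode("utf-8")
print(hex(ord(decoded)))              # code point of the single decoded character
print(is_valid_utf8(VALID))           # accepted by the strict decoder
print(is_valid_utf8(SURROGATE_FORM))  # rejected by the strict decoder
```

A strict decoder refusing the six-byte form is consistent with what iOS does here.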
This is a bug in the bianma tool: the convertUtf8BytesToUnicodeCodePoints function is more lenient about what input it accepts than the algorithm specified in, e.g., RFC 3629.
This happens to return a working string only because the tool is written in JavaScript. Having decoded the above byte sequence to the bogus surrogate code point sequence U+D83D, U+DE01, it then converts that into a JavaScript string using a direct code-point-to-code-unit mapping, giving \uD83D\uDE01. As this is the correct way to encode U+1F601 in a UTF-16 string, it appears to have worked.
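The arithmetic behind that accident can be sketched in a few lines of Python. This is only an illustration of what a lenient decoder would compute; `lenient_decode_3byte` and `combine_surrogates` are hypothetical helpers, not the actual code of the bianma tool.

```python
def lenient_decode_3byte(b0: int, b1: int, b2: int) -> int:
    """Decode a 3-byte UTF-8 pattern without rejecting surrogate values."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Each 3-byte group decodes to one surrogate value.
high = lenient_decode_3byte(0xED, 0xA0, 0xBD)
low = lenient_decode_3byte(0xED, 0xB8, 0x81)
print(hex(high), hex(low))

# Read as UTF-16 code units, the pair stands for the intended emoji.
print(hex(combine_surrogates(high, low)))
```

The two intermediate values are surrogates, which a conforming UTF-8 decoder must reject; they only make sense once they are reinterpreted as UTF-16 code units.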
(*: It is a valid CESU-8 sequence, but that encoding is just “bogus broken encoding for compatibility with badly-written historical tools” and should generally be avoided.)
You should not usually encounter a sequence like this; it is typically not worth catering for unless you have a specific source of this kind of malformed data that you don't have the power to get fixed.
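If you really are stuck with such a source, one possible workaround is to repair the data before handing it to a strict decoder. The following Python sketch is an assumption about how that could be done, not something from the answer; `decode_cesu8_tolerant` is a hypothetical helper that relies on Python's `surrogatepass` error handler.

```python
def decode_cesu8_tolerant(data: bytes) -> str:
    """Decode UTF-8, falling back to pairing up CESU-8 style surrogates."""
    try:
        # Well-formed UTF-8 takes the normal path.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Let surrogate code points through, then round-trip via UTF-16 so that
    # each high/low pair is merged into a single character. Input that is
    # broken in some other way still raises UnicodeDecodeError.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")


malformed = bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81])
repaired = decode_cesu8_tolerant(malformed)
print(repaired.encode("utf-8"))  # the well-formed UTF-8 bytes for the same emoji
```

Falling back only after strict decoding fails keeps well-formed input on the normal path, so the lenient behaviour is limited to data that would otherwise be rejected.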
Answer by Polina
This worked for me in PHP to send a message with an emoji to a Telegram bot:
$message_text = " \xf0\x9f\x98\x81 ";