Unicode characters from charcode in JavaScript for charcodes > 0xFFFF
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not the translator): StackOverflow
Original question: http://stackoverflow.com/questions/5446492/
Asked by leemes
I need to get a string / char from a unicode charcode and finally put it into a DOM TextNode to add into an HTML page using client side JavaScript.
Currently, I am doing:
String.fromCharCode(parseInt(charcode, 16));
where charcode is a hex string containing the charcode, e.g. "1D400". The Unicode character which should be returned is 𝐀 (U+1D400), but a ? is returned! Characters in the 16-bit range (0000 ... FFFF) are returned as expected.
Any explanation and / or proposals for correction?
Thanks in advance!
Answer by Anomie
String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:
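function fixedFromCharCode (codePt) {
    if (codePt > 0xFFFF) {
        // Supplementary code point: split into a high and a low surrogate.
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    } else {
        // BMP code point: a single 16-bit code unit is enough.
        return String.fromCharCode(codePt);
    }
}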
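To connect this back to the question's hex string input, here is a minimal sketch. The helper name stringFromHexCodePoint is hypothetical (it does not appear in the original answer), and it assumes String.fromCodePoint may or may not be available: that built-in exists in ES2015 and later engines, so the manual surrogate pair computation is kept as a fallback.

```javascript
// Hypothetical helper, not from the original answer: converts a hex
// code point string such as "1D400" into a JavaScript string.
function stringFromHexCodePoint(hex) {
    var codePt = parseInt(hex, 16);
    if (isNaN(codePt) || codePt < 0 || codePt > 0x10FFFF) {
        throw new RangeError("Invalid code point: " + hex);
    }
    // Use the built-in where the engine provides it (ES2015 and later).
    if (typeof String.fromCodePoint === "function") {
        return String.fromCodePoint(codePt);
    }
    // Otherwise build the surrogate pair by hand, as in the function above.
    if (codePt > 0xFFFF) {
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    }
    return String.fromCharCode(codePt);
}

// "1D400" becomes the two code units 0xD835 0xDC00.
var boldA = stringFromHexCodePoint("1D400");
console.log(boldA.length, boldA.charCodeAt(0).toString(16), boldA.charCodeAt(1).toString(16));
```

The resulting string can then be passed to document.createTextNode like any other string.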
Answer by Tim Down
The problem is that characters in JavaScript are (mostly) UCS-2 encoded, but a character outside the Basic Multilingual Plane can be represented in JavaScript as a UTF-16 surrogate pair.
The following function is adapted from Converting punycode with dash character to Unicode:
// Takes an array of Unicode code points and returns the encoded string.
function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        // Values 0xD800..0xDFFF are surrogates, not valid code points.
        if ( (value & 0xF800) === 0xD800 ) {
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            // Emit the high surrogate now; the low surrogate is pushed below.
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>> 10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}

alert( utf16Encode([0x1D400]) );
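For completeness, the inverse direction can be sketched in the same style. The function name utf16Decode and its handling of unpaired surrogates (they are passed through unchanged) are assumptions made for this illustration, not part of the original answer.

```javascript
// Sketch of the inverse operation: takes a string and returns an array
// of Unicode code points. Unpaired surrogates are kept as-is (an
// assumption for this sketch; a stricter version could throw instead).
function utf16Decode(input) {
    var output = [], i = 0, len = input.length, value, extra;
    while (i < len) {
        value = input.charCodeAt(i++);
        // A high surrogate (0xD800..0xDBFF) followed by another unit?
        if ((value & 0xFC00) === 0xD800 && i < len) {
            extra = input.charCodeAt(i);
            // Only combine when the next unit is a low surrogate.
            if ((extra & 0xFC00) === 0xDC00) {
                i++;
                value = ((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000;
            }
        }
        output.push(value);
    }
    return output;
}

console.log(utf16Decode("A\uD835\uDC00").map(function (cp) {
    return cp.toString(16);
}));
```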
Answer by Mike Samuel
Section 8.4 of the ECMAScript language spec says:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
So you need to encode supplemental code-points as pairs of UTF-16 code units.
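A small sketch of what this means in practice: a supplementary character occupies two elements of a string. The variable names here are illustrative only, and Array.from is assumed to be available (ES2015 and later), where it iterates a string by code point rather than by code unit.

```javascript
// U+1D400 written out as its two UTF-16 code units.
var s = "\uD835\uDC00";

var summary = {
    // length counts 16-bit code units, not characters.
    length: s.length,
    // charCodeAt exposes each individual code unit.
    units: [s.charCodeAt(0).toString(16), s.charCodeAt(1).toString(16)],
    // Iterating by code point sees a single character.
    codePointCount: Array.from(s).length
};

console.log(summary);
```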
The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
The following table shows the different representations of a few characters in comparison:
code points / UTF-16 code units
U+0041 / 0041
U+00DF / 00DF
U+6771 / 6771
U+10400 / D801 DC00
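The rows of this table can be reproduced with a short sketch. The function names utf16CodeUnits, hex4 and formatRow are hypothetical helpers introduced for this illustration; the arithmetic follows the surrogate ranges described in the quoted passage.

```javascript
// Returns the UTF-16 code units (as numbers) for one code point.
function utf16CodeUnits(codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
        throw new RangeError("Code point out of range");
    }
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
        throw new RangeError("Surrogate values are not valid code points");
    }
    if (codePoint <= 0xFFFF) {
        return [codePoint]; // one unit with the same value
    }
    var offset = codePoint - 0x10000;
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Upper-case hex, left-padded to at least four digits.
function hex4(n) {
    var s = n.toString(16).toUpperCase();
    while (s.length < 4) { s = "0" + s; }
    return s;
}

// Formats one table row, e.g. "U+10400 / D801 DC00".
function formatRow(codePoint) {
    return "U+" + hex4(codePoint) + " / " + utf16CodeUnits(codePoint).map(hex4).join(" ");
}

[0x0041, 0x00DF, 0x6771, 0x10400].forEach(function (cp) {
    console.log(formatRow(cp));
});
```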
Once you know the UTF-16 code units, you can create a string using the JavaScript function String.fromCharCode:
String.fromCharCode(0xd801, 0xdc00) === '𐐀'