JavaScript strings - UTF-16 vs UCS-2?

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/8715980/

Time: 2020-10-26 04:20:09  Source: igfitidea

javascript utf-16

Asked by patorjk

I've read in some places that JavaScript strings are UTF-16, and in other places they're UCS-2. I did some searching around to try to figure out the difference and found this:

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.

Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.

via: http://www.unicode.org/faq/utf_bom.html#utf16-11

So my question is: is it the fact that the JavaScript string object's methods and indexes act on 16-bit data values instead of characters that makes some people consider it UCS-2? And if so, would a JavaScript string object oriented around characters instead of 16-bit data chunks be considered UTF-16? Or is there something else I'm missing?
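
To make the premise concrete, here is a small sketch (the variable names are illustrative and not from the original post) showing that length, charAt, and charCodeAt count and return 16-bit code units, so a single supplementary character occupies two index positions:

```javascript
// U+1D306 (TETRAGRAM FOR CENTRE) lies outside the Basic Multilingual Plane,
// so UTF-16 stores it as the surrogate pair D834 DF06.
const tetragram = '\uD834\uDF06';

// length counts 16-bit code units, not characters.
const unitCount = tetragram.length;

// charCodeAt returns the individual code unit at each index.
const highUnit = tetragram.charCodeAt(0);
const lowUnit = tetragram.charCodeAt(1);

// charAt(0) yields a one-unit string holding only the high surrogate.
const firstChar = tetragram.charAt(0);

console.log(unitCount, highUnit.toString(16), lowUnit.toString(16), firstChar.length);
// prints: 2 d834 df06 1
```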

Edit: As requested, here are some sources saying JavaScript strings are UCS-2:

http://blog.mozilla.com/nnethercote/2011/07/01/faster-javascript-parsing/
http://terenceyim.wordpress.com/tag/ucs2/

EDIT: For anyone who may come across this, be sure to check out this link:

http://mathiasbynens.be/notes/javascript-encoding

Accepted answer by dgvid

JavaScript (strictly speaking, ECMAScript) predates Unicode 2.0, so in some cases you may find references to UCS-2 simply because that was correct at the time the reference was written. Can you point us to specific citations of JavaScript being "UCS-2"?

The specifications for ECMAScript versions 3 and 5, at least, both explicitly declare a String to be a collection of unsigned 16-bit integers and state that if those integer values are meant to represent textual data, then they are UTF-16 code units. See section 8.4 of the ECMAScript Language Specification.
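
As a rough illustration of the "collection of unsigned 16-bit integers" view, the sketch below copies a string into a Uint16Array and back. The helper names toCodeUnits and fromCodeUnits are hypothetical and exist only for this example; they are not part of the specification.

```javascript
// Copy a string's code units into a typed array of unsigned 16-bit integers.
function toCodeUnits(str) {
  const units = new Uint16Array(str.length);
  for (let i = 0; i < str.length; i++) {
    units[i] = str.charCodeAt(i);
  }
  return units;
}

// Rebuild a string from code units; no validation is applied.
function fromCodeUnits(units) {
  let out = '';
  for (const unit of units) {
    out += String.fromCharCode(unit);
  }
  return out;
}

const units = toCodeUnits('A\uD834\uDF06');
const roundTripped = fromCodeUnits(units);

// String.fromCharCode truncates its argument to 16 bits,
// so 0x10041 becomes 0x0041, the letter "A".
const truncated = String.fromCharCode(0x10041);
```

The round trip is lossless because nothing in either direction interprets the values as characters; they are treated purely as 16-bit numbers.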

EDIT: I'm no longer sure my answer is entirely correct. See the excellent article mentioned above, http://mathiasbynens.be/notes/javascript-encoding, which in essence says that while a JavaScript engine may use UTF-16 internally, and most do, the language itself effectively exposes those characters as if they were UCS-2.
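
For comparison, engines that implement ES2015 or later (an assumption; these features postdate this answer) also offer a code-point view alongside the code-unit view. A minimal sketch, with illustrative variable names:

```javascript
const text = 'a\uD834\uDF06b'; // "a", U+1D306, "b"

// Code-unit view: what length and numeric indexing expose.
const codeUnitCount = text.length;

// Code-point view: the ES2015 string iterator keeps surrogate pairs together,
// and codePointAt combines a pair into a single code point.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

console.log(codeUnitCount, codePoints.length);
// prints: 4 3
```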

Answer by Daniel Moses

It's UTF-16/UCS-2. It can handle surrogate pairs, but charAt/charCodeAt return a 16-bit code unit and not the Unicode code point. If you want to have it handle surrogate pairs, I suggest a quick read through this.
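
As a minimal sketch of handling surrogate pairs by hand, the function below applies the standard UTF-16 decoding arithmetic on top of charCodeAt. The name codePointAtIndex is hypothetical, and this is not necessarily the approach taken in the linked material.

```javascript
// Return the code point starting at index i, combining a surrogate pair if present.
function codePointAtIndex(str, i) {
  const hi = str.charCodeAt(i);
  // A high surrogate followed by a low surrogate encodes one supplementary code point.
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
    const lo = str.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // BMP character or unpaired surrogate, returned unchanged
}

console.log(codePointAtIndex('\uD834\uDF06', 0).toString(16));
// prints: 1d306
```

Note that calling it at index 1 of the same string returns the low surrogate 0xDF06 on its own, so a caller walking the string needs to advance by two positions after a pair.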

Answer by Daniel A. White

It's just a 16-bit value with no encoding specified in the ECMAScript standard.

See section 7.8.4 String Literals in this document: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
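
One practical consequence of strings being plain 16-bit values is that a string can hold an unpaired surrogate, which is not well-formed UTF-16. The sketch below uses a hypothetical helper, isWellFormedUtf16, written only for this example:

```javascript
// Hypothetical helper: true when every surrogate in str is part of a valid pair.
function isWellFormedUtf16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = str.charCodeAt(i + 1); // NaN when past the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate without a preceding high surrogate
    }
  }
  return true;
}

// A lone high surrogate is a perfectly legal JavaScript string of length 1.
const lone = '\uD834';
console.log(lone.length, isWellFormedUtf16(lone), isWellFormedUtf16('\uD834\uDF06'));
// prints: 1 false true
```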
