Default JavaScript Character Encoding?
Notice: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/11141136/
Asked by Nick
After some frantic Googling, I can't seem to find a conclusive answer to a simple question. I apologize if this question is answered somewhere, but if so I couldn't find it.
While writing an encryption method in JavaScript, I started wondering what character encoding my strings were using, and why.
So: what determines character encoding in JavaScript? Is it a standard? The browser? Is it determined by the header of the HTTP request? By the <META> tag of the HTML that encompasses it? By the server that feeds the page?
By my empirical testing (changing different settings, then using charCodeAt on a sufficiently strange character and seeing which encoding the value matches up with), it appears to always be UTF-8 or UTF-16, but I'm not sure why.
Thanks for the help!
Accepted answer by Pointy
Section 8.4 of ECMA-262:
The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
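To make the quoted definition concrete, here is a minimal sketch. It assumes an ES2015 or later engine (for `codePointAt` and the `\u{...}` escape) and a runtime that provides the `TextEncoder` global; the two sample characters are arbitrary picks and do not come from the answer.

```javascript
// U+20AC (euro sign) fits in a single 16-bit code unit.
const euro = '\u20AC';
const euroUnits = euro.length;            // number of 16-bit elements
const euroFirstUnit = euro.charCodeAt(0); // the UTF-16 code unit itself

// U+1F600 lies outside the Basic Multilingual Plane, so it needs a surrogate pair.
const face = '\u{1F600}';
const faceUnits = face.length;             // counts code units, not characters
const faceHigh = face.charCodeAt(0);       // high surrogate
const faceLow = face.charCodeAt(1);        // low surrogate
const faceCodePoint = face.codePointAt(0); // the full Unicode code point

// For contrast: the UTF-8 encoding of the euro sign is three bytes.
const euroUtf8 = Array.from(new TextEncoder().encode(euro));

console.log(euroUnits, euroFirstUnit.toString(16));
console.log(faceUnits, faceHigh.toString(16), faceLow.toString(16), faceCodePoint.toString(16));
console.log(euroUtf8.map((b) => b.toString(16)));
```

The value returned by `charCodeAt` matches the UTF-16 code unit rather than any UTF-8 byte, which is consistent with the "as though they were represented using UTF-16" wording in the quote.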
That wording is kind of weaselly; it seems to mean that everything that counts treats strings as if each element were a UTF-16 code unit, but at the same time nothing ensures that it'll all be valid.
Edit: to be clear, the intention is that strings consist of UTF-16 code units. In ES2015, the definition of "string value" includes this note:
A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
So a string is still a string even when it contains values that don't work as correct Unicode characters.
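A short sketch of that last point, assuming a runtime with the `TextEncoder` global. The lone surrogate below is an illustrative choice, not an example from the answer.

```javascript
// A lone high surrogate: a legal String value, but not well-formed UTF-16 text.
const lone = '\uD83D';

const isString = typeof lone === 'string';
const units = lone.length;

// UTF-8 cannot represent a lone surrogate; TextEncoder substitutes U+FFFD.
const utf8 = Array.from(new TextEncoder().encode(lone));

// encodeURIComponent rejects lone surrogates by throwing a URIError.
let uriError = null;
try {
  encodeURIComponent(lone);
} catch (err) {
  uriError = err;
}

console.log(isString, units, utf8, uriError instanceof URIError);
```

The language itself accepts the value without complaint; only the APIs that must produce well-formed Unicode output either replace it or fail.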
Answer by Jukka K. Korpela
There is no default character encoding for JavaScript as such. A JavaScript program is, as far as specifications are concerned, a sequence of abstract characters. When transmitted over a network, or just stored in a computer, the abstract characters must be encoded somehow, but the mechanisms for it are not controlled by the ECMAScript standard.
Section 6 of the ECMAScript standard uses UTF-16 as a reference encoding, but does not designate it as default. Using UTF-16 as reference is logically unnecessary (it would suffice to refer to Unicode numbers) but it was probably assumed to help people.
This issue should not be confused with the interpretation of string literals or strings in general. A literal like 'Φ' needs to be in some encoding, along with the rest of the program; this can be any encoding, but after the encoding has been resolved, the literal will be interpreted as an integer according to the Unicode number of the character.
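The distinction can be illustrated with a small sketch, assuming a runtime whose `TextDecoder` supports both the `utf-8` and `utf-16le` labels. The byte arrays stand in for the same source text saved in two different encodings.

```javascript
// The same character, Φ (U+03A6), stored as bytes in two different encodings.
const asUtf8 = Uint8Array.of(0xce, 0xa6);
const asUtf16le = Uint8Array.of(0xa6, 0x03);

// Once the encoding has been resolved, both decode to the same string...
const fromUtf8 = new TextDecoder('utf-8').decode(asUtf8);
const fromUtf16le = new TextDecoder('utf-16le').decode(asUtf16le);
const sameString = fromUtf8 === fromUtf16le;

// ...and that string holds the Unicode number of the character.
const codeUnit = fromUtf8.charCodeAt(0);

console.log(sameString, codeUnit.toString(16));
```

The bytes on disk or on the wire differ, but the value the program sees does not depend on which encoding carried it.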
When a JavaScript program is transmitted as such (as an “external JavaScript file”) over the Internet, RFC 4329, Scripting Media Types, applies. Clause 4 defines the mechanism: first, headers such as HTTP headers are checked, and a charset parameter there is trusted. (In practice, web servers usually don't specify such a parameter for JavaScript programs.) Second, BOM detection is applied. Failing that, UTF-8 is implied.
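The three steps can be sketched roughly as follows. `resolveScriptEncoding` is a hypothetical helper written for this illustration, not an API of any browser or library; the header parsing is deliberately simplistic, and the byte order marks are the standard Unicode ones.

```javascript
// Hypothetical helper: only an illustration of the order described above.
function resolveScriptEncoding(contentType, bytes) {
  // Step 1: a charset parameter in the Content-Type header is trusted.
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  if (match) {
    return match[1].toLowerCase();
  }

  // Step 2: look for a byte order mark. UTF-32 is checked first because
  // the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.
  const boms = [
    ['utf-32be', [0x00, 0x00, 0xfe, 0xff]],
    ['utf-32le', [0xff, 0xfe, 0x00, 0x00]],
    ['utf-8', [0xef, 0xbb, 0xbf]],
    ['utf-16be', [0xfe, 0xff]],
    ['utf-16le', [0xff, 0xfe]],
  ];
  for (const [name, mark] of boms) {
    if (mark.every((value, i) => bytes[i] === value)) {
      return name;
    }
  }

  // Step 3: no header parameter and no BOM, so UTF-8 is implied.
  return 'utf-8';
}

console.log(resolveScriptEncoding('application/javascript; charset=ISO-8859-1', [0xef, 0xbb, 0xbf]));
console.log(resolveScriptEncoding('application/javascript', [0xff, 0xfe, 0x61, 0x00]));
console.log(resolveScriptEncoding('application/javascript', [0x61]));
```

Note that in this sketch the header parameter takes priority even when the bytes carry a BOM that disagrees with it, which follows the order given in the paragraph above.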
The first part of the mechanism is somewhat ambiguous. It might be interpreted as relating to the charset parameter in an actual HTTP header only, or it might be extended to charset attributes in script elements.
If a JavaScript program is embedded in HTML, either via a script element or an event attribute, then its character encoding is of course the same as that of the HTML document. The section “Specifying the character encoding” of the HTML 4.01 spec defines the resolution mechanism, in this order: charset in the HTTP header, charset in meta, charset in a link that was followed to access the document, and finally heuristics (guesswork), which may involve many things; cf. the complex resolution mechanism in the HTML5 draft.
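The HTML 4.01 order listed above amounts to a simple priority chain. The sketch below is only an illustration: `resolveHtmlEncoding` and its property names are made up here, and the `guess` callback stands in for whatever heuristics a real browser applies.

```javascript
// Hypothetical helper; the property names are invented for this illustration.
function resolveHtmlEncoding({ httpCharset, metaCharset, linkCharset, guess }) {
  // Earlier sources take priority over later ones.
  const candidates = [httpCharset, metaCharset, linkCharset];
  const declared = candidates.find((value) => typeof value === 'string' && value !== '');
  // With no declaration at all, fall back to whatever the heuristics produce.
  return (declared || guess()).toLowerCase();
}

const guessLatin = () => 'windows-1252';
console.log(resolveHtmlEncoding({ httpCharset: 'UTF-8', metaCharset: 'ISO-8859-1', guess: guessLatin }));
console.log(resolveHtmlEncoding({ metaCharset: 'ISO-8859-1', guess: guessLatin }));
console.log(resolveHtmlEncoding({ guess: guessLatin }));
```

Whatever this chain yields for the document also applies to any script embedded in it, since the embedded script has no encoding of its own.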