Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/2219526/

How many bytes in a JavaScript string?

Tags: javascript, string, size, byte

Asked by Paul Biggar

I have a javascript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?

I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? Or does it depend on the JavaScript implementation, the page encoding, or maybe the content type?

Accepted answer by CMS

String values are not implementation dependent. According to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:

4.3.16 String Value

A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.

NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
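
To make the quoted definition concrete, here is a minimal sketch of how those 16-bit values show up through `length`, `charCodeAt` and `codePointAt`. The sample characters are arbitrary choices for illustration and are not part of the original answer.

```javascript
// Sketch: how the 16-bit values described in the spec show up in practice.
const ascii = 'A';                 // U+0041, one 16-bit unit
const skull = '\u2620';            // U+2620, one 16-bit unit
const tetragram = '\uD834\uDF06';  // U+1D306, stored as a surrogate pair (two units)

// length counts 16-bit units, not user-visible characters.
const unitCounts = [ascii.length, skull.length, tetragram.length];
console.log(unitCounts); // [ 1, 1, 2 ]

// charCodeAt reads one 16-bit unit; codePointAt combines a valid surrogate pair.
console.log(tetragram.charCodeAt(0).toString(16));  // d834
console.log(tetragram.codePointAt(0).toString(16)); // 1d306
```

So `length * 2` gives the size of the string when it is counted as 16-bit units, which is generally not the same number as its UTF-8 size on the wire.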

Answer by Lauri Oherd

This function will return the byte size of any UTF-8 string you pass to it.

function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}

Source

JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it's just an implementation detail that won't affect the language's characteristics.

The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.

Source

Answer by Offirmo

If you're using node.js, there is a simpler solution using buffers:

function getBinarySize(string) {
    return Buffer.byteLength(string, 'utf8');
}

There is an npm library for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly)

Answer by P Roitto

You can use a Blob to get the string size in bytes.

Examples:

console.info(
  // the sample characters are astral-plane emoji (4 bytes each in UTF-8)
  new Blob(['😂']).size,                             // 4
  new Blob(['👍']).size,                             // 4
  new Blob(['😂👍']).size,                           // 8
  new Blob(['👍😂']).size,                           // 8
  new Blob(['I\'m a string']).size,                  // 12

  // from Premasagar correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size,       // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);

Answer by Kinjeiro

Try this combination using the unescape JS function:

const byteAmount = unescape(encodeURIComponent(yourString)).length

Full encoding process example:

const s  = "1 a ф № @ ®"; //length is 11
const s2 = encodeURIComponent(s); //length is 41
const s3 = unescape(s2); //length is 15 [1-1,a-1,ф-2,№-3,@-1,®-2]
const s4 = escape(s3); //length is 39
const s5 = decodeURIComponent(s4); //length is 11

Answer by maerics

Note that if you're targeting node.js you can use Buffer.from(string).length:

var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)

Answer by Mac

UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
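
As a rough illustration of the 1 to 4 byte range, the sketch below measures one sample character from each size class with `TextEncoder`. The helper name `utf8Bytes` and the sample characters are assumptions made for this example only.

```javascript
// Sketch: UTF-8 byte length of single characters, measured with TextEncoder.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper, not part of the original answer.
function utf8Bytes(text) {
  return encoder.encode(text).length;
}

// One sample from each UTF-8 size class:
// U+0041 (ASCII), U+00E9 (Latin-1 range), U+2620 (rest of the BMP), U+1F602 (astral plane)
const samples = ['A', '\u00E9', '\u2620', '\u{1F602}'];
const sizes = samples.map(utf8Bytes);
console.log(sizes); // [ 1, 2, 3, 4 ]
```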

If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:

var getStringMemorySize = function( _string ) {
    "use strict";

    var codePoint
        , accum = 0
    ;

    for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
        // charCodeAt returns a single 16-bit code unit (0 to 0xFFFF),
        // so a surrogate pair is visited as two separate values
        codePoint = _string.charCodeAt( stringIndex );

        if( codePoint < 0x100 ) {
            accum += 1;
            continue;
        }

        if( codePoint < 0x10000 ) {
            accum += 2;
            continue;
        }

        // never reached with charCodeAt, which cannot return 0x10000 or more
        if( codePoint < 0x1000000 ) {
            accum += 3;
        } else {
            accum += 4;
        }
    }

    return accum * 2;
};

Examples:

getStringMemorySize( 'I'    );     //  2
getStringMemorySize( '❤'    );     //  4
getStringMemorySize( '𠀰'   );     //  8
getStringMemorySize( 'I❤𠀰' );     // 14

Answer by Hong Ly

These are 3 ways I use:

  1. TextEncoder()

    new TextEncoder().encode("myString").length

  2. Blob

    new Blob(["myString"]).size

  3. Buffer

    Buffer.byteLength("myString", 'utf8')
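
The three options above could be combined into one helper that uses whichever API the current environment provides. This is only a sketch: the function name `utf8ByteLength` and the order of the checks are assumptions for illustration, not something taken from the answer or from a library.

```javascript
// Hypothetical helper: picks whichever of the three APIs is available.
function utf8ByteLength(text) {
  if (typeof TextEncoder !== 'undefined') {
    // Modern browsers and Node.js
    return new TextEncoder().encode(text).length;
  }
  if (typeof Blob !== 'undefined') {
    // Browsers without TextEncoder
    return new Blob([text]).size;
  }
  if (typeof Buffer !== 'undefined') {
    // Node.js versions without a global TextEncoder
    return Buffer.byteLength(text, 'utf8');
  }
  throw new Error('No UTF-8 encoder available in this environment');
}

console.log(utf8ByteLength('myString')); // 8
console.log(utf8ByteLength('\u2620'));   // 3
```

All three branches count UTF-8 bytes, so the result should not depend on which one runs.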

Answer by whitneyland

The size of a JavaScript string is

  • Pre-ES6: 2 bytes per character
  • ES6 and later: 2 bytes per character, or 5 or more bytes per character

Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3 or 4 byte characters, it would violate 2 byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two byte characters used are valid UTF-16 characters. In other words, Pre-ES6 JavaScript strings support a subset of UTF-16 characters.

ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}

Practical notes

  • This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. An engine may also provide external UTF-16 support, but is not mandated to do so.

  • For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.

  • Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
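
For anyone who wants to check how a code point escape is counted at runtime, the sketch below compares the `\u{1D306}` escape mentioned above with the equivalent surrogate pair written out by hand. The variable names are illustrative, and the check says nothing about how a particular engine lays out strings in memory.

```javascript
// Sketch: the same string written with a code point escape and with a surrogate pair.
const viaEscape = '\u{1D306}';
const viaSurrogates = '\uD834\uDF06';

// Both literals produce the same string value.
console.log(viaEscape === viaSurrogates); // true

// Counted as 16-bit units: 2 units, i.e. 4 bytes when written out as UTF-16.
console.log(viaEscape.length); // 2

// Counted as UTF-8 bytes.
console.log(new TextEncoder().encode(viaEscape).length); // 4
```

In this check the escape only changes how the literal is spelled in source code; the resulting string measures the same either way.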

Answer by Premasagar

The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.

byteCount(String.fromCharCode(55555))
// URIError: URI malformed

This longer function should handle all strings:

function bytes (str) {
  var bytes=0, len=str.length, codePoint, next, i;

  for (i=0; i < len; i++) {
    codePoint = str.charCodeAt(i);

    // Lone surrogates cannot be passed to encodeURI
    if (codePoint >= 0xD800 && codePoint < 0xE000) {
      if (codePoint < 0xDC00 && i + 1 < len) {
        next = str.charCodeAt(i + 1);

        if (next >= 0xDC00 && next < 0xE000) {
          bytes += 4;
          i++;
          continue;
        }
      }
    }

    bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
  }

  return bytes;
}

E.g.

bytes(String.fromCharCode(55555))
// 3

It will correctly calculate the size for strings containing surrogate pairs:

bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)

The results can be compared with Node's built-in function Buffer.byteLength:

Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3

Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
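
In environments with ES2015 string iteration, the same counting rule can be written more compactly by looping over code points instead of code units. This is an alternative sketch, not the answer's own function; the name `utf8LengthByCodePoint` is made up for this example.

```javascript
// Hypothetical alternative: count UTF-8 bytes by iterating over code points.
function utf8LengthByCodePoint(str) {
  let total = 0;
  // for...of yields a valid surrogate pair as one item
  // and a lone surrogate as an item of its own.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) total += 1;
    else if (cp < 0x800) total += 2;
    else if (cp < 0x10000) total += 3; // includes lone surrogates, counted as 3 bytes
    else total += 4;
  }
  return total;
}

console.log(utf8LengthByCodePoint(String.fromCharCode(55555)));        // 3
console.log(utf8LengthByCodePoint(String.fromCharCode(55555, 57000))); // 4
console.log(utf8LengthByCodePoint("I'm a string"));                    // 12
```

Counting a lone surrogate as 3 bytes matches the results shown above for both `bytes` and `Buffer.byteLength`.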