String length in bytes in JavaScript

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/5515869/

javascript, unicode

Asked by Alexander Gladysh

In my JavaScript code I need to compose a message to the server in this format:

<size in bytes>CRLF
<data>CRLF

Example:

3
foo

The data may contain Unicode characters. I need to send them as UTF-8.

I'm looking for the most cross-browser way to calculate the length of the string in bytes in JavaScript.

I've tried this to compose my payload:

return unescape(encodeURIComponent(str)).length + "\n" + str + "\n"

But it does not give me accurate results in older browsers (or maybe the strings in those browsers are in UTF-16?).

Any clues?

Update:

Example: the length of the string ЭЭХ! Naïve? in UTF-8 is 15 bytes, but some browsers report 23 bytes instead.

Accepted answer by Mike Samuel

There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)

The rest of this answer is kept for historical reference, or for environments where the TextEncoder API is still unavailable.

If you know the character encoding, you can calculate it yourself though.

encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do:

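function lengthInUtf8Bytes(str) {
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  // A surrogate pair is two UTF-16 code units but has only one initial byte,
  // so count each pair once instead of twice.
  var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
  return str.length + (m ? m.length : 0) - (pairs ? pairs.length : 0);
}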

This should work because of the way UTF-8 encodes multi-byte sequences. The first byte of a sequence either has a high bit of zero (a single-byte sequence) or has a first hex digit of C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.

The table in Wikipedia makes it clearer:

Bits        Last code point Byte 1          Byte 2          Byte 3
  7         U+007F          0xxxxxxx
 11         U+07FF          110xxxxx        10xxxxxx
 16         U+FFFF          1110xxxx        10xxxxxx        10xxxxxx
...
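
To make the lead/continuation split in the table concrete, here is a minimal sketch that encodes a string and classifies each byte by its top bits. It assumes TextEncoder is available (modern browsers, Node.js), and the helper name classifyUtf8Bytes is made up for illustration.

```javascript
// Hypothetical helper: classify every UTF-8 byte of a string by its top bits.
function classifyUtf8Bytes(str) {
  var bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  var counts = { single: 0, lead: 0, continuation: 0 };
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      counts.single++;        // 0xxxxxxx: one-byte sequence
    } else if ((b & 0xC0) === 0x80) {
      counts.continuation++;  // 10xxxxxx: second or later byte of a sequence
    } else {
      counts.lead++;          // 11xxxxxx: first byte of a multi-byte sequence
    }
  }
  return counts;
}

console.log(classifyUtf8Bytes('ЭЭХ! Naïve?'));
// { single: 7, lead: 4, continuation: 4 }
```

For this string, single + lead equals the 11 UTF-16 code units reported by length, and adding the 4 continuation bytes gives the 15 bytes from the question.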

If instead you need the length in the page's encoding, you can use this trick:

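function lengthInPageEncoding(s) {
  // Let the browser percent-encode the string as part of a URL fragment.
  var a = document.createElement('A');
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Each %XX escape is three characters but one byte; match either hex case.
  var m = sEncoded.match(/%[0-9a-f]{2}/gi);
  return sEncoded.length - (m ? m.length * 2 : 0);
}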

Answer by Riccardo Galli

Years have passed, and nowadays you can do it natively:

(new TextEncoder().encode('foo')).length

Note that it's not supported yet by IE (or Edge) (you may use a polyfill for that).
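
Tying this back to the format in the question, a minimal sketch of composing the message might look like the following. The helper name composeMessage is hypothetical, and it assumes the resulting string is later transmitted as UTF-8 so that the size prefix matches what goes over the wire.

```javascript
// Hypothetical helper: frame a message as "<size in bytes>CRLF<data>CRLF".
function composeMessage(data) {
  // Count the UTF-8 bytes of the payload, not its UTF-16 code units.
  var size = new TextEncoder().encode(data).length;
  return size + '\r\n' + data + '\r\n';
}

console.log(JSON.stringify(composeMessage('foo')));   // "3\r\nfoo\r\n"
console.log(JSON.stringify(composeMessage('Naïve'))); // "6\r\nNaïve\r\n"
```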

MDN documentation

Standard specifications

Answer by lovasoa

Here is a much faster version, which uses neither regular expressions nor encodeURIComponent():

function byteLength(str) {
  // returns the length in UTF-8 bytes of a string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

Here is a performance comparison.

It just computes the UTF-8 length of each Unicode code point, working from the UTF-16 code units returned by charCodeAt() (based on Wikipedia's descriptions of UTF-8 and UTF-16 surrogate characters).

It follows RFC 3629 (where UTF-8 characters are at most 4 bytes long).
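
The same per-code-point idea can be written more directly where ES2015 is available. The sketch below is a variant, not the answer's code: it assumes for...of string iteration and codePointAt(), and the name byteLengthByCodePoint is made up for illustration.

```javascript
// Sketch: sum the UTF-8 length of each code point (1 to 4 bytes, per RFC 3629).
function byteLengthByCodePoint(str) {
  var total = 0;
  for (var ch of str) {                // string iteration yields whole code points
    var cp = ch.codePointAt(0);
    if (cp <= 0x7F) total += 1;
    else if (cp <= 0x7FF) total += 2;
    else if (cp <= 0xFFFF) total += 3; // an unpaired surrogate also lands here
    else total += 4;                   // U+10000 and above
  }
  return total;
}

console.log(byteLengthByCodePoint('foo'));          // 3
console.log(byteLengthByCodePoint('ЭЭХ! Naïve?'));  // 15
console.log(byteLengthByCodePoint('\uD83D\uDE00')); // 4 (one surrogate pair)
```

Iterating by code point removes the need for the manual surrogate skip, at the cost of requiring a newer engine than the charCodeAt() loop.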

Answer by simap

For simple UTF-8 encoding, with slightly better compatibility than TextEncoder, Blob does the trick. It won't work in very old browsers, though.

new Blob(["😀"]).size; // -> 4

Answer by Lauri Oherd

This function will return the byte size of any UTF-8 string you pass to it.

function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}

Source

Answer by Iván Pérez

Another very simple approach, using Buffer (only for NodeJS):

Buffer.byteLength(string, 'utf8')

Buffer.from(string).length

Answer by laurent

It took me a while to find a solution for React Native, so I'll put it here:

First install the buffer package:

npm install --save buffer

Then use the Node method:

const { Buffer } = require('buffer');
const length = Buffer.byteLength(string, 'utf-8');

Answer by Alexander Gladysh

Actually, I figured out what was wrong. For the code to work, the page <head> should have this tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Or, as suggested in comments, if the server sends an HTTP Content-Type header that specifies the charset, it should work as well.

Then results from different browsers are consistent.

Here is an example:

<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
  <title>mini string length test</title>
</head>
<body>

<script type="text/javascript">
document.write('<div style="font-size:100px">' 
    + (unescape(encodeURIComponent("ЭЭХ! Naïve?")).length) + '</div>'
  );
</script>
</body>
</html>

Note: I suspect that specifying any (accurate) encoding would fix the encoding problem. It is just a coincidence that I need UTF-8.

Answer by fuweichin

Here is an independent and efficient method to count UTF-8 bytes of a string.

//count UTF-8 bytes of a string
function byteLengthOf(s){
 //assuming the String is UCS-2(aka UTF-16) encoded
 var n=0;
 for(var i=0,l=s.length; i<l; i++){
  var hi=s.charCodeAt(i);
  if(hi<0x0080){ //[0x0000, 0x007F]
   n+=1;
  }else if(hi<0x0800){ //[0x0080, 0x07FF]
   n+=2;
  }else if(hi<0xD800){ //[0x0800, 0xD7FF]
   n+=3;
  }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
   var lo=s.charCodeAt(++i);
   if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
    n+=4;
   }else{
    throw new Error("UCS-2 String malformed");
   }
  }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
   throw new Error("UCS-2 String malformed");
  }else{ //[0xE000, 0xFFFF]
   n+=3;
  }
 }
 return n;
}

var s="\u0000\u007F\u07FF\uD7FF\uDBFF\uDFFF\uFFFF";
console.log("expect byteLengthOf(s) to be 14, actually it is %s.",byteLengthOf(s));

Note that the method throws an error if the input string is malformed, for example if it contains an unpaired surrogate.
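
Other approaches on this page treat malformed input differently, which matters when choosing one. The sketch below, assuming TextEncoder is available, shows how two built-ins handle a string containing an unpaired surrogate; the variable names are only for illustration.

```javascript
var lone = '\uD800'; // a lead surrogate with no trail surrogate after it

// TextEncoder substitutes U+FFFD (bytes EF BF BD) for the unpaired surrogate.
var replacedLength = new TextEncoder().encode(lone).length;
console.log(replacedLength); // 3

// encodeURIComponent rejects the same input with a URIError.
var threw = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true
```

So a TextEncoder-based count silently reports 3 bytes for such input, while the encodeURIComponent-based functions fail loudly, much like byteLengthOf above.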

Answer by Boaz - Reinstate Monica

In NodeJS, Buffer.byteLength is a method specifically for this purpose:

let strLengthInBytes = Buffer.byteLength(str); // str is UTF-8

Note that by default the method assumes the string is in UTF-8 encoding. If a different encoding is required, pass it as the second argument.
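
As a short illustration of that second argument, the sketch below compares a few encodings for the same string. It assumes a Node.js environment, where Buffer is a global.

```javascript
// Node.js only: Buffer is a global there.
var str = 'Naïve';

console.log(str.length);                        // 5 UTF-16 code units
console.log(Buffer.byteLength(str));            // 6 (UTF-8 is the default)
console.log(Buffer.byteLength(str, 'utf16le')); // 10
console.log(Buffer.byteLength(str, 'latin1'));  // 5
```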
