JavaScript: how to convert a string to a byte array

Disclaimer: This page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/6226189/


How to convert a String to Bytearray

javascript

Asked by shas

How can I convert a string into a byte array using JavaScript? The output should be equivalent to that of the C# code below.

UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes(AnyString);

UnicodeEncoding defaults to UTF-16 with little-endian byte order.

Edit: I need the byte array generated on the client side to match the one generated on the server side by the C# code above.

Accepted answer by BrunoLM

In C# running this

UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes("Hello");

Will create an array with

72,0,101,0,108,0,108,0,111,0

For a character whose code is greater than 255, it will look like this:

[image: byte array for a character with a code greater than 255]

If you want very similar behavior in JavaScript, you can do this (v2 is a slightly more robust solution, while the original version will only work for 0x00 ~ 0xff):

var str = "Hello竜";
var bytes = []; // char codes
var bytesv2 = []; // char codes

for (var i = 0; i < str.length; ++i) {
  var code = str.charCodeAt(i);
  
  bytes = bytes.concat([code]);
  
  bytesv2 = bytesv2.concat([code & 0xff, code / 256 >>> 0]);
}

// 72, 101, 108, 108, 111, 31452
console.log('bytes', bytes.join(', '));

// 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 220, 122
console.log('bytesv2', bytesv2.join(', '));
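
A typed-array variant produces the same UTF-16LE byte sequence as the C# sample. This is only a sketch, assuming a standard DataView-capable environment; strToUtf16LeBytes is a name chosen here for illustration:

function strToUtf16LeBytes(str) {
  var buf = new ArrayBuffer(str.length * 2);        // 2 bytes per UTF-16 code unit
  var view = new DataView(buf);
  for (var i = 0; i < str.length; i++) {
    view.setUint16(i * 2, str.charCodeAt(i), true); // true = little-endian
  }
  return Array.from(new Uint8Array(buf));           // plain array of byte values
}

console.log(strToUtf16LeBytes("Hello")); // 72, 0, 101, 0, 108, 0, 108, 0, 111, 0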

Answered by Jin

If you are looking for a solution that works in node.js, you can use this:

var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}

console.log(myBuffer);
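
On recent Node.js versions the same result can be written more compactly with Buffer.from, since the new Buffer(...) constructor used above is deprecated. A small sketch based on the answer above:

var str = 'Stack Overflow';
var myBuffer = Array.from(Buffer.from(str, 'utf16le')); // plain array of byte values

console.log(myBuffer); // [ 83, 0, 116, 0, 97, 0, ... ]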

Answered by hgoebl

I suppose C# and Java produce equal byte arrays. If you have non-ASCII characters, it's not enough to add an additional 0. My example contains a few special characters:

var str = "Hell ?  Ω ";
var bytes = [];
var charCode;

for (var i = 0; i < str.length; ++i)
{
    charCode = str.charCodeAt(i);
    bytes.push((charCode & 0xFF00) >> 8);
    bytes.push(charCode & 0xFF);
}

alert(bytes.join(' '));
// 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30

I don't know if C# places a BOM (Byte Order Mark), but when using UTF-16, Java's String.getBytes adds the following bytes: 254 255.

String s = "Hell ?  Ω ";
// now add a character outside the BMP (Basic Multilingual Plane)
// we take the violin-symbol (U+1D11E) MUSICAL SYMBOL G CLEF
s += new String(Character.toChars(0x1D11E));
// surrogate codepoints are: d834, dd1e, so one could also write "\ud834\udd1e"

byte[] bytes = s.getBytes("UTF-16");
for (byte aByte : bytes) {
    System.out.print((0xFF & aByte) + " ");
}
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
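
If the goal is to reproduce that Java output in JavaScript, the same BOM can simply be prepended to the array built by the loop above. A sketch reusing the bytes variable from this answer:

var bytesWithBom = [254, 255].concat(bytes); // 254 255 = UTF-16BE Byte Order Mark
alert(bytesWithBom.join(' '));
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30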

Edit:

Added a special character, (U+1D11E) MUSICAL SYMBOL G CLEF (outside the BMP), so it takes not 2 but 4 bytes in UTF-16.

Current JavaScript versions use "UCS-2" internally, so this symbol takes the space of 2 normal characters.

I'm not sure, but when using charCodeAt it seems we get exactly the surrogate code points also used in UTF-16, so non-BMP characters are handled correctly.

This problem is absolutely non-trivial. It might depend on the JavaScript version and engine used, so if you want a reliable solution you should have a look at an established encoding library.

Answered by code4j

The easiest way in 2018 should be TextEncoder, but the returned value is not a byte array, it is a Uint8Array. (And not all browsers support it.)

let utf8Encode = new TextEncoder();
utf8Encode.encode("eee")
> Uint8Array [ 101, 101, 101 ]
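
Note that TextEncoder always produces UTF-8, not the UTF-16 the question asks about. If a plain array of numbers is needed instead of a Uint8Array, the result can be converted, for example:

let utf8Encode = new TextEncoder();
let byteArray = Array.from(utf8Encode.encode("eee")); // plain Array instead of Uint8Array
console.log(byteArray); // [ 101, 101, 101 ]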

Answered by jchook

UTF-16 Byte Array

JavaScript encodes strings as UTF-16, just like C#'s UnicodeEncoding, so the byte arrays should match exactly: use charCodeAt() and split each returned UTF-16 code unit into 2 separate bytes, as in:

function strToUtf16Bytes(str) {
  const bytes = [];
  for (let ii = 0; ii < str.length; ii++) {
    const code = str.charCodeAt(ii); // 0x0000 - 0xFFFF
    bytes.push(code & 255, code >> 8); // low byte, high byte (little-endian)
  }
  return bytes;
}

For example:

strToUtf16Bytes('🌵');
// [ 60, 216, 53, 223 ]

However, if you want a UTF-8 byte array, you must transcode the bytes.

UTF-8 Byte Array

The solution feels somewhat non-trivial, but I used the code below in a high-traffic production environment with great success (original source).

Also, for the interested reader, I published my unicode helpers that help me work with string lengths reported by other languages such as PHP.

/**
 * Convert a string to a unicode byte array
 * @param {string} str
 * @return {Array} of bytes
 */
export function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) utf8.push(charCode);
    else if (charCode < 0x800) {
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      ii++;
      // Surrogate pair:
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}
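
For a quick sanity check (sample strings chosen here for illustration; the expected values follow directly from the UTF-8 encoding rules):

console.log(strToUtf8Bytes("€"));         // [ 226, 130, 172 ]       U+20AC, 3-byte sequence
console.log(strToUtf8Bytes("\u{1D11E}")); // [ 240, 157, 132, 158 ]  U+1D11E, 4-byte sequence via a surrogate pair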

Answered by SkySpiral7

Inspired by @hgoebl's answer. His code is for UTF-16 and I needed something for US-ASCII. So here's a more complete answer covering US-ASCII, UTF-16, and UTF-32.

/**@returns {Array} bytes of US-ASCII*/
function stringToAsciiByteArray(str)
{
    var bytes = [];
    for (var i = 0; i < str.length; ++i)
    {
        var charCode = str.charCodeAt(i);
        if (charCode > 0xFF)  // char > 1 byte since charCodeAt returns the UTF-16 value
        {
            throw new Error('Character ' + String.fromCharCode(charCode) + ' can\'t be represented by a US-ASCII byte.');
        }
        bytes.push(charCode);
    }
    return bytes;
}
/**@returns {Array} bytes of UTF-16 Big Endian without BOM*/
function stringToUtf16ByteArray(str)
{
    var bytes = [];
    //currently the function returns without BOM. Uncomment the next line to change that.
    //bytes.push(254, 255);  //Big Endian Byte Order Marks
    for (var i = 0; i < str.length; ++i)
    {
        var charCode = str.charCodeAt(i);
        //char > 2 bytes is impossible since charCodeAt can only return 2 bytes
        bytes.push((charCode & 0xFF00) >>> 8);  //high byte (might be 0)
        bytes.push(charCode & 0xFF);  //low byte
    }
    return bytes;
}
/**@returns {Array} bytes of UTF-32 Big Endian without BOM*/
function stringToUtf32ByteArray(str)
{
    var bytes = [];
    //currently the function returns without BOM. Uncomment the next line to change that.
    //bytes.push(0, 0, 254, 255);  //Big Endian Byte Order Marks
    for (var i = 0; i < str.length; ++i)
    {
        var charPoint = str.codePointAt(i);
        //char > 4 bytes is impossible since codePointAt can only return 4 bytes
        if (charPoint > 0xFFFF) ++i;  //this code point used 2 UTF-16 code units, so skip the low surrogate
        bytes.push((charPoint & 0xFF000000) >>> 24);
        bytes.push((charPoint & 0xFF0000) >>> 16);
        bytes.push((charPoint & 0xFF00) >>> 8);
        bytes.push(charPoint & 0xFF);
    }
    return bytes;
}
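
As a quick illustration (sample strings chosen here, not from the original answer; the values follow from the encodings):

console.log(stringToAsciiByteArray("A"));          // [ 65 ]
console.log(stringToUtf16ByteArray("A\u{1D11E}")); // [ 0, 65, 216, 52, 221, 30 ]
console.log(stringToUtf32ByteArray("A\u{1D11E}")); // [ 0, 0, 0, 65, 0, 1, 209, 30 ]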

UTF-8 isn't included because it's variable length and I would have to write the encoding myself. UTF-8 and UTF-16 are variable length; UTF-8, UTF-16, and UTF-32 each have a minimum number of bits, as their names indicate. If a UTF-32 character has a code point of 65, that means there are 3 leading zero bytes, whereas the same code in UTF-16 has only 1 leading zero byte. US-ASCII, on the other hand, is fixed-width 8 bits, which means it can be translated directly to bytes.

String.prototype.charCodeAt returns a maximum of 2 bytes and matches UTF-16 exactly. However, for UTF-32, String.prototype.codePointAt is needed, which is part of the ECMAScript 6 (Harmony) proposal. Because charCodeAt returns 2 bytes, which can represent more characters than US-ASCII can, the function stringToAsciiByteArray will throw in such cases instead of splitting the character in half and taking either or both bytes.

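The difference is easy to see on a character outside the BMP (a small sketch; any non-BMP character works):

var s = "\u{1D11E}";                        // MUSICAL SYMBOL G CLEF: one code point, two UTF-16 code units
console.log(s.length);                      // 2
console.log(s.charCodeAt(0).toString(16));  // "d834"  (high surrogate only)
console.log(s.codePointAt(0).toString(16)); // "1d11e" (full code point)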

Note that this answer is non-trivial because character encoding is non-trivial. What kind of byte array you want depends on what character encoding you want those bytes to represent.

JavaScript has the option of internally using either UTF-16 or UCS-2, but since it has methods that act as if it were UTF-16, I don't see why any browser would use UCS-2. Also see: https://mathiasbynens.be/notes/javascript-encoding

Yes I know the question is 4 years old but I needed this answer for myself.

Answered by mmdts

Since I cannot comment on the answer, I'd build on Jin Izzraeel's answer

var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}

console.log(myBuffer);

by saying that you could use this if you want to use a Node.js buffer in your browser.

https://github.com/feross/buffer
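
With that module (or a bundler that provides it automatically), usage is essentially the same as in Node.js. The import line below follows the feross/buffer README; treat it as an assumption if your setup differs:

var Buffer = require('buffer/').Buffer; // trailing slash selects the npm package, not Node's built-in module

var bytes = Array.from(Buffer.from('Stack Overflow', 'utf16le'));
console.log(bytes); // [ 83, 0, 116, 0, ... ]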

Therefore, Tom Stickel's objection does not hold, and the answer is indeed valid.

Answered by Fabio Maciel

String.prototype.encodeHex = function () {
    return this.split('').map(e => e.charCodeAt(0)); // array of UTF-16 code units
};

Array.prototype.decodeHex = function () {
    // defined on Array.prototype, since .map is needed on the array of codes
    return this.map(e => String.fromCharCode(e)).join('');
};
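
A small round-trip example of these helpers (despite the names, they produce decimal char codes, not hex; the sample string is chosen here for illustration):

var codes = "Hi!".encodeHex();    // [ 72, 105, 33 ]
var original = codes.decodeHex(); // "Hi!"
console.log(codes, original);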

Answered by Hasan A Yousef

I know the question is almost 4 years old, but this is what worked smoothly for me:

String.prototype.encodeHex = function () {
  var bytes = [];
  for (var i = 0; i < this.length; ++i) {
    bytes.push(this.charCodeAt(i));
  }
  return bytes;
};

Array.prototype.decodeHex = function () {    
  var str = [];
  var hex = this.toString().split(',');
  for (var i = 0; i < hex.length; i++) {
    str.push(String.fromCharCode(hex[i]));
  }
  return str.toString().replace(/,/g, "");
};

var str = "Hello World!";
var bytes = str.encodeHex();

alert('The Hexa Code is: '+bytes+' The original string is:  '+bytes.decodeHex());

or, if you want to work with strings only, and no Array, you can use:

String.prototype.encodeHex = function () {
  var bytes = [];
  for (var i = 0; i < this.length; ++i) {
    bytes.push(this.charCodeAt(i));
  }
  return bytes.toString();
};

String.prototype.decodeHex = function () {    
  var str = [];
  var hex = this.split(',');
  for (var i = 0; i < hex.length; i++) {
    str.push(String.fromCharCode(hex[i]));
  }
  return str.toString().replace(/,/g, "");
};

var str = "Hello World!";
var bytes = str.encodeHex();

alert('The Hexa Code is: '+bytes+' The original string is:  '+bytes.decodeHex());

Answered by Whosdr

The best solution I've come up with on the spot (though most likely crude) would be:

String.prototype.getBytes = function() {
    var bytes = [];
    for (var i = 0; i < this.length; i++) {
        var charCode = this.charCodeAt(i);
        // number of bytes needed for this code unit (at least 1, even for char code 0)
        var cLen = Math.max(1, Math.ceil(Math.log(charCode + 1) / Math.log(256)));
        for (var j = 0; j < cLen; j++) {
            bytes.push((charCode >> (j * 8)) & 0xFF); // low byte first
        }
    }
    return bytes;
}
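
For example (sample strings chosen here for illustration):

console.log("Hello".getBytes()); // [ 72, 101, 108, 108, 111 ]
console.log("€".getBytes());     // [ 172, 32 ]  (low byte first for the 2-byte code unit)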

Though I notice this question has been here for over a year.
