JavaScript: How to convert a string to a byte array
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/6226189/
How to convert a String to Bytearray
Asked by shas
How can I convert a string to a byte array using JavaScript? The output should be equivalent to the below C# code.
UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes(AnyString);
UnicodeEncoding is by default UTF-16 with little-endianness.
Edit: I have a requirement to match the byte array generated client-side with the one generated server-side using the above C# code.
Accepted answer by BrunoLM
In C#, running this
UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes("Hello");
will create an array with
72,0,101,0,108,0,108,0,111,0
For a character whose code is greater than 255, it will look like this:
If you want very similar behavior in JavaScript you can do this (v2 is a bit more robust solution, while the original version will only work for 0x00 ~ 0xff):
var str = "Hello竜";
var bytes = []; // char codes
var bytesv2 = []; // char codes

for (var i = 0; i < str.length; ++i) {
    var code = str.charCodeAt(i);
    bytes = bytes.concat([code]);
    bytesv2 = bytesv2.concat([code & 0xff, code / 256 >>> 0]);
}
// 72, 101, 108, 108, 111, 31452
console.log('bytes', bytes.join(', '));
// 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 220, 122
console.log('bytesv2', bytesv2.join(', '));
Answer by Jin
If you are looking for a solution that works in Node.js, you can use this:
var myBuffer = [];
var str = 'Stack Overflow';
var buffer = Buffer.from(str, 'utf16le'); // new Buffer(str, ...) is deprecated
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}
console.log(myBuffer);
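The copy loop above can also be collapsed into a one-liner with `Array.from`, which accepts any array-like object such as a Buffer (a sketch, using the same 'utf16le' encoding and assuming a Node.js runtime):

```javascript
// Buffer.from replaces the deprecated new Buffer(...) constructor;
// Array.from copies the Buffer's bytes into a plain JavaScript array.
const s = 'Stack Overflow';
const arr = Array.from(Buffer.from(s, 'utf16le'));
console.log(arr); // [83, 0, 116, 0, 97, 0, ...] -- UTF-16LE byte pairs
```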
Answer by hgoebl
I suppose C# and Java produce equal byte arrays. If you have non-ASCII characters, it's not enough to add an additional 0. My example contains a few special characters:
var str = "Hell ö € Ω 𝄞";
var bytes = [];
var charCode;

for (var i = 0; i < str.length; ++i)
{
    charCode = str.charCodeAt(i);
    bytes.push((charCode & 0xFF00) >> 8);
    bytes.push(charCode & 0xFF);
}
alert(bytes.join(' '));
// 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
I don't know whether C# writes a BOM (Byte Order Mark), but when using UTF-16, Java's String.getBytes adds the following bytes: 254 255.
String s = "Hell ö € Ω ";
// now add a character outside the BMP (Basic Multilingual Plane)
// we take the violin-symbol (U+1D11E) MUSICAL SYMBOL G CLEF
s += new String(Character.toChars(0x1D11E));
// surrogate codepoints are: d834, dd1e, so one could also write "\ud834\udd1e"
byte[] bytes = s.getBytes("UTF-16");
for (byte aByte : bytes) {
    System.out.print((0xFF & aByte) + " ");
}
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
Edit:
Added a special character (U+1D11E) MUSICAL SYMBOL G CLEF (outside the BMP, so it takes not just 2 bytes in UTF-16, but 4).
Current JavaScript versions use "UCS-2" internally, so this symbol takes the space of 2 normal characters.
I'm not sure, but when using charCodeAt it seems we get exactly the surrogate code points also used in UTF-16, so non-BMP characters are handled correctly.
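That observation is easy to verify: for a character outside the BMP, `length` reports two code units and `charCodeAt` returns the two surrogate halves (a quick sketch):

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written via its surrogate pair
const clef = '\uD834\uDD1E';
console.log(clef.length);                     // 2 -- two UTF-16 code units
console.log(clef.charCodeAt(0).toString(16)); // 'd834' -- high surrogate
console.log(clef.charCodeAt(1).toString(16)); // 'dd1e' -- low surrogate
```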
This problem is absolutely non-trivial. It might depend on the JavaScript version and engine used. So if you want reliable solutions, you should have a look at:
- https://github.com/koichik/node-codepoint/
- http://mathiasbynens.be/notes/javascript-escapes
- Mozilla Developer Network: charCodeAt
- BigEndian vs. LittleEndian
Answer by code4j
The easiest way in 2018 should be TextEncoder, but the returned value is not a byte array, it is a Uint8Array. (And not all browsers support it.)
let utf8Encode = new TextEncoder();
utf8Encode.encode("eee");
// Uint8Array [ 101, 101, 101 ]
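If a plain array is needed rather than a Uint8Array, `Array.from` converts it; `TextDecoder` handles the reverse direction (a sketch, assuming a runtime that provides both globals):

```javascript
const encoder = new TextEncoder(); // always encodes as UTF-8
const bytes = Array.from(encoder.encode('eee')); // plain Array of numbers
console.log(bytes); // [101, 101, 101]

// Round-trip the bytes back to a string
const decoder = new TextDecoder('utf-8');
console.log(decoder.decode(Uint8Array.from(bytes))); // 'eee'
```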
Answer by jchook
UTF-16 Byte Array
JavaScript encodes strings as UTF-16, just like C#'s UnicodeEncoding, so the byte arrays should match exactly using charCodeAt() and splitting each returned byte pair into 2 separate bytes, as in:
function strToUtf16Bytes(str) {
  const bytes = [];
  for (let ii = 0; ii < str.length; ii++) {
    const code = str.charCodeAt(ii); // x00-xFFFF
    bytes.push(code & 255, code >> 8); // low, high
  }
  return bytes;
}
For example:
strToUtf16Bytes('🌵');
// [ 60, 216, 53, 223 ]
However, if you want to get a UTF-8 byte array, you must transcode the bytes.
UTF-8 Byte Array
The solution feels somewhat non-trivial, but I used the code below in a high-traffic production environment with great success (original source).
Also, for the interested reader, I published my unicode helpers that help me work with string lengths reported by other languages such as PHP.
/**
 * Convert a string to a UTF-8 byte array
 * @param {string} str
 * @return {Array} of bytes
 */
export function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) utf8.push(charCode);
    else if (charCode < 0x800) {
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      ii++;
      // Surrogate pair:
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}
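A hand-rolled transcoder like the one above can be sanity-checked against the platform's `TextEncoder`, which also emits UTF-8; for instance, the non-BMP clef symbol from earlier answers becomes four bytes (a sketch, assuming `TextEncoder` is available):

```javascript
// U+1D11E (MUSICAL SYMBOL G CLEF) -> 4-byte UTF-8 sequence
const bytes = Array.from(new TextEncoder().encode('\u{1D11E}'));
console.log(bytes); // [240, 157, 132, 158]
```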
Answer by SkySpiral7
Inspired by @hgoebl's answer. His code is for UTF-16 and I needed something for US-ASCII. So here's a more complete answer covering US-ASCII, UTF-16, and UTF-32.
/**@returns {Array} bytes of US-ASCII*/
function stringToAsciiByteArray(str)
{
    var bytes = [];
    for (var i = 0; i < str.length; ++i)
    {
        var charCode = str.charCodeAt(i);
        if (charCode > 0xFF)  // char > 1 byte since charCodeAt returns the UTF-16 value
        {
            throw new Error('Character ' + String.fromCharCode(charCode) + ' can\'t be represented by a US-ASCII byte.');
        }
        bytes.push(charCode);
    }
    return bytes;
}
/**@returns {Array} bytes of UTF-16 Big Endian without BOM*/
function stringToUtf16ByteArray(str)
{
    var bytes = [];
    //currently the function returns without BOM. Uncomment the next line to change that.
    //bytes.push(254, 255);  //Big Endian Byte Order Marks
    for (var i = 0; i < str.length; ++i)
    {
        var charCode = str.charCodeAt(i);
        //char > 2 bytes is impossible since charCodeAt can only return 2 bytes
        bytes.push((charCode & 0xFF00) >>> 8);  //high byte (might be 0)
        bytes.push(charCode & 0xFF);  //low byte
    }
    return bytes;
}
/**@returns {Array} bytes of UTF-32 Big Endian without BOM*/
function stringToUtf32ByteArray(str)
{
    var bytes = [];
    //currently the function returns without BOM. Uncomment the next line to change that.
    //bytes.push(0, 0, 254, 255);  //Big Endian Byte Order Marks
    for (var i = 0; i < str.length; )
    {
        var charPoint = str.codePointAt(i);
        //char > 4 bytes is impossible since codePointAt can only return 4 bytes
        bytes.push((charPoint & 0xFF000000) >>> 24);
        bytes.push((charPoint & 0xFF0000) >>> 16);
        bytes.push((charPoint & 0xFF00) >>> 8);
        bytes.push(charPoint & 0xFF);
        i += charPoint > 0xFFFF ? 2 : 1;  //step over both surrogates of a non-BMP character
    }
    return bytes;
}
UTF-8 is variable length and isn't included because I would have to write the encoding myself. UTF-8 and UTF-16 are variable length. UTF-8, UTF-16, and UTF-32 have a minimum number of bits, as their names indicate. If a UTF-32 character has a code point of 65, that means there are 3 leading zero bytes, but the same code in UTF-16 has only 1 leading zero byte. US-ASCII, on the other hand, is fixed width (one byte per character), which means it can be translated directly to bytes.
String.prototype.charCodeAt returns a maximum of 2 bytes and matches UTF-16 exactly. However, for UTF-32, String.prototype.codePointAt is needed, which is part of the ECMAScript 6 (Harmony) proposal. Because charCodeAt returns 2 bytes, which is more possible characters than US-ASCII can represent, the function stringToAsciiByteArray will throw in such cases instead of splitting the character in half and taking either or both bytes.
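The difference between the two methods shows up on non-BMP characters (a sketch; codePointAt needs an ES6-capable runtime):

```javascript
const s = '\u{1D11E}'; // U+1D11E, outside the BMP
console.log(s.charCodeAt(0));  // 55348 (0xD834) -- only the high surrogate
console.log(s.codePointAt(0)); // 119070 (0x1D11E) -- the full code point
```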
Note that this answer is non-trivial because character encoding is non-trivial. What kind of byte array you want depends on what character encoding you want those bytes to represent.
JavaScript has the option of internally using either UTF-16 or UCS-2, but since it has methods that act like UTF-16, I don't see why any browser would use UCS-2. Also see: https://mathiasbynens.be/notes/javascript-encoding
Yes I know the question is 4 years old but I needed this answer for myself.
Answer by mmdts
Since I cannot comment on the answer, I'd build on Jin Izzraeel's answer
var myBuffer = [];
var str = 'Stack Overflow';
var buffer = Buffer.from(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}
console.log(myBuffer);
by saying that you could use this if you want to use a Node.js buffer in your browser.
https://github.com/feross/buffer
Therefore, Tom Stickel's objection is not valid, and the answer is indeed a valid answer.
Answer by Fabio Maciel
String.prototype.encodeHex = function () {
    return this.split('').map(e => e.charCodeAt(0));
};
// encodeHex returns an array, so decodeHex belongs on Array.prototype
// (putting it on String.prototype would fail, since strings have no map method).
Array.prototype.decodeHex = function () {
    return this.map(e => String.fromCharCode(e)).join('');
};
Answer by Hasan A Yousef
I know the question is almost 4 years old, but this is what worked smoothly with me:
String.prototype.encodeHex = function () {
    var bytes = [];
    for (var i = 0; i < this.length; ++i) {
        bytes.push(this.charCodeAt(i));
    }
    return bytes;
};

Array.prototype.decodeHex = function () {
    var str = [];
    var hex = this.toString().split(',');
    for (var i = 0; i < hex.length; i++) {
        str.push(String.fromCharCode(hex[i]));
    }
    return str.toString().replace(/,/g, "");
};

var str = "Hello World!";
var bytes = str.encodeHex();
alert('The Hexa Code is: '+bytes+' The original string is: '+bytes.decodeHex());
or, if you want to work with strings only, and no Array, you can use:
String.prototype.encodeHex = function () {
    var bytes = [];
    for (var i = 0; i < this.length; ++i) {
        bytes.push(this.charCodeAt(i));
    }
    return bytes.toString();
};

String.prototype.decodeHex = function () {
    var str = [];
    var hex = this.split(',');
    for (var i = 0; i < hex.length; i++) {
        str.push(String.fromCharCode(hex[i]));
    }
    return str.toString().replace(/,/g, "");
};

var str = "Hello World!";
var bytes = str.encodeHex();
alert('The Hexa Code is: '+bytes+' The original string is: '+bytes.decodeHex());
Answer by Whosdr
The best solution I've come up with on the spot (though most likely crude) would be:
String.prototype.getBytes = function() {
    var bytes = [];
    for (var i = 0; i < this.length; i++) {
        var charCode = this.charCodeAt(i);
        var cLen = Math.max(1, Math.ceil(Math.log(charCode + 1) / Math.log(256)));
        for (var j = 0; j < cLen; j++) {
            // right-shift (not left-shift) to extract each byte, low byte first
            bytes.push((charCode >> (j * 8)) & 0xFF);
        }
    }
    return bytes;
}
Though I notice this question has been here for over a year.