JavaScript: How to convert a UTF-8 string to a byte array?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18729405/

Time: 2020-10-27 12:59:53  Source: igfitidea

How to convert UTF8 string to byte array?

javascript utf-8

Asked by don kaka

The .charCodeAt function returns the Unicode code of a character, but I would like to get the byte array instead. I know that if the char code is over 127, the character is stored in two or more bytes.

var arr=[];
for(var i=0; i<str.length; i++) {
    arr.push(str.charCodeAt(i))
}

Answered by Joni

The logic of encoding Unicode in UTF-8 is basically:

  • Up to 4 bytes per character can be used. The fewest bytes possible are used.
  • Characters up to U+007F are encoded with a single byte.
  • For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The rest of the bits of the first byte can be used to encode bits of the character.
  • The continuation bytes begin with 10, and the other 6 bits encode bits of the character.
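The rules above can be sketched for a single code point. This is a minimal illustration rather than part of the original answer: the helper name encodeCodePoint is hypothetical, and it assumes the input is a valid Unicode code point (0 to 0x10FFFF, not a surrogate).

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is in 0..0x10FFFF and is not a surrogate (0xD800..0xDFFF).
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    // 0xxxxxxx: single byte
    return [cp];
  }
  if (cp < 0x800) {
    // 110xxxxx 10xxxxxx: two leading 1 bits, two bytes
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  }
  if (cp < 0x10000) {
    // 1110xxxx 10xxxxxx 10xxxxxx: three leading 1 bits, three bytes
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four leading 1 bits, four bytes
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f)
  ];
}

console.log(encodeCodePoint(0x41));    // 1 byte:  65
console.log(encodeCodePoint(0x20ac));  // 3 bytes: 226, 130, 172 (E2 82 AC)
console.log(encodeCodePoint(0x1f4a9)); // 4 bytes: 240, 159, 146, 169
```

Each branch picks the shortest form that can hold the code point, which matches the "fewest bytes possible" rule.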

Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18), 
                      0x80 | ((charcode>>12) & 0x3f), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}
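The surrogate-pair branch may be easier to follow with the arithmetic isolated. The sketch below is only an illustration of how UTF-16 splits and recombines a code point above 0xFFFF; the helper names toSurrogates and fromSurrogates are hypothetical and do not appear in the answer.

```javascript
// Hypothetical helpers illustrating the UTF-16 surrogate arithmetic.

// Split a code point in 0x10000..0x10FFFF into [high, low] surrogates.
function toSurrogates(cp) {
  var v = cp - 0x10000;            // 20-bit value in 0x0..0xFFFFF
  return [
    0xd800 | (v >> 10),            // high surrogate carries the top 10 bits
    0xdc00 | (v & 0x3ff)           // low surrogate carries the bottom 10 bits
  ];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogates(high, low) {
  return 0x10000 + (((high & 0x3ff) << 10) | (low & 0x3ff));
}

var pair = toSurrogates(0x1f4a9);
console.log(pair[0].toString(16), pair[1].toString(16));    // d83d dca9
console.log(fromSurrogates(pair[0], pair[1]).toString(16)); // 1f4a9

// The same two code units are what charCodeAt reports for the string.
console.log("\u{1f4a9}".charCodeAt(0) === pair[0]);         // true
console.log("\u{1f4a9}".charCodeAt(1) === pair[1]);         // true
```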

Answered by Jonathan Lonowski

JavaScript Strings are stored in UTF-16. To get UTF-8, you'll have to convert the String yourself.

One way is to combine encodeURIComponent(), which outputs the UTF-8 bytes URL-encoded, with unescape, as mentioned on ecmanaut.

var utf8 = unescape(encodeURIComponent(str));

var arr = [];
for (var i = 0; i < utf8.length; i++) {
    arr.push(utf8.charCodeAt(i));
}
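To see why this works: encodeURIComponent() emits each non-ASCII UTF-8 byte as a %XX escape, so the bytes can also be read straight out of its result. The sketch below does that without unescape (which is deprecated, though still widely implemented). The function name utf8BytesViaPercentEncoding is hypothetical; note that encodeURIComponent() throws a URIError for strings containing unpaired surrogates.

```javascript
// Hypothetical helper: read UTF-8 bytes out of the percent-encoded form.
function utf8BytesViaPercentEncoding(str) {
  var encoded = encodeURIComponent(str); // e.g. "€" becomes "%E2%82%AC"
  var bytes = [];
  for (var i = 0; i < encoded.length; i++) {
    if (encoded.charAt(i) === "%") {
      // "%XX" is one byte written as two hex digits
      bytes.push(parseInt(encoded.substr(i + 1, 2), 16));
      i += 2;
    } else {
      // Characters left unescaped are always ASCII, so one byte each
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(encodeURIComponent("\u20ac"));          // %E2%82%AC
console.log(utf8BytesViaPercentEncoding("\u20ac")); // 226, 130, 172
console.log(utf8BytesViaPercentEncoding("a%b"));    // 97, 37, 98
```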

Answered by bryc

The new Encoding API seems to let you both encode and decode UTF-8 easily (using typed arrays):

var encoded = new TextEncoder().encode("Γεια σου κόσμε"); // TextEncoder always produces UTF-8
var decoded = new TextDecoder("utf-8").decode(encoded);

console.log(encoded, decoded);

Browser support isn't too bad, and there's a polyfill that should work in IE11 and older versions of Edge.

TextDecoder supports many other encodings, too (TextEncoder, per the current Encoding Standard, only produces UTF-8). I used it to decode Japanese text (Shift-JIS) with this:

new TextDecoder("shift-jis").decode(new Uint8Array(textbuffer))
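Since the question asks for a plain array of numbers rather than a typed array, the encoder's result can be copied with Array.from. This is a small sketch, assuming an environment that provides TextEncoder and TextDecoder as globals (current browsers and recent Node.js); the function name utf8Bytes is hypothetical.

```javascript
// Hypothetical helper: UTF-8 bytes of a string as a plain Array of numbers.
function utf8Bytes(str) {
  // encode() returns a Uint8Array; Array.from copies it into a normal Array.
  return Array.from(new TextEncoder().encode(str));
}

var bytes = utf8Bytes("\u20ac");
console.log(bytes); // 226, 130, 172

// Round trip: wrap the numbers in a Uint8Array again before decoding.
var text = new TextDecoder("utf-8").decode(new Uint8Array(bytes));
console.log(text === "\u20ac"); // true
```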

Answered by optevo

The Google Closure library has functions to convert strings to and from UTF-8 byte arrays. If you don't want to use the whole library, you can copy the functions from here. For completeness, the code to convert a string to a UTF-8 byte array is:

goog.crypt.stringToUtf8ByteArray = function(str) {
  // TODO(user): Use native implementations if/when available
  var out = [], p = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 128) {
      out[p++] = c;
    } else if (c < 2048) {
      out[p++] = (c >> 6) | 192;
      out[p++] = (c & 63) | 128;
    } else if (
        ((c & 0xFC00) == 0xD800) && (i + 1) < str.length &&
        ((str.charCodeAt(i + 1) & 0xFC00) == 0xDC00)) {
      // Surrogate Pair
      c = 0x10000 + ((c & 0x03FF) << 10) + (str.charCodeAt(++i) & 0x03FF);
      out[p++] = (c >> 18) | 240;
      out[p++] = ((c >> 12) & 63) | 128;
      out[p++] = ((c >> 6) & 63) | 128;
      out[p++] = (c & 63) | 128;
    } else {
      out[p++] = (c >> 12) | 224;
      out[p++] = ((c >> 6) & 63) | 128;
      out[p++] = (c & 63) | 128;
    }
  }
  return out;
};

Answered by Rainer Rillke

Assuming the question is about a DOMString as input and the goal is to get an array that, when interpreted as a string (e.g. written to a file on disk), would be UTF-8 encoded:

Now that nearly all modern browsers support Typed Arrays, it would be a shame if this approach were not listed:

  • According to the W3C, software supporting the File API should accept DOMStrings in its Blob constructor (see also: String encoding when constructing a Blob).
  • Blobs can be converted to an ArrayBuffer using the .readAsArrayBuffer() function of a FileReader.
  • Using a DataView, or constructing a Typed Array with the buffer read by the FileReader, one can access every single byte of the ArrayBuffer.

Example:

// Create a Blob with a Euro sign (U+20AC)
var b = new Blob(['€']);
var fr = new FileReader();

fr.onload = function() {
    var ua = new Uint8Array(fr.result);
    // This will log "3|226|130|172"
    //                  E2  82  AC
    // In UTF-16, it would be only 2 bytes long
    console.log(
        fr.result.byteLength + '|' + 
        ua[0]  + '|' + 
        ua[1] + '|' + 
        ua[2] + ''
    );
};
fr.readAsArrayBuffer(b);

Play with that on JSFiddle. I haven't benchmarked this yet, but I can imagine it being efficient for large DOMStrings as input.

Answered by Martin Wantke

You can save a string as raw bytes by using FileReader.

Save the string in a Blob and call readAsArrayBuffer(). The onload event then yields an ArrayBuffer, which can be converted to a Uint8Array. Unfortunately, this call is asynchronous.

This little function will help you:

function stringToBytes(str)
{
    let reader = new FileReader();
    let done = () => {};

    reader.onload = event =>
    {
        done(new Uint8Array(event.target.result), str);
    };
    reader.readAsArrayBuffer(new Blob([str], { type: "application/octet-stream" }));

    return { done: callback => { done = callback; } };
}

Call it like this:

stringToBytes("\u{1f4a9}").done(bytes =>
{
    console.log(bytes);
});

Output: [240, 159, 146, 169]

Explanation:

JavaScript uses UTF-16 and surrogate pairs to store Unicode characters in memory. To save Unicode characters in a raw binary byte stream, an encoding is necessary. In most cases, UTF-8 is used for this. Without an encoding you can't save Unicode characters, only ASCII up to 0x7f.

The bytes come out as UTF-8 because the Blob constructor encodes string parts as UTF-8; FileReader.readAsArrayBuffer() then returns those bytes unchanged.
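The callback wrapper above can also be written with a Promise. The following is a sketch only, assuming an environment where Blob.prototype.arrayBuffer() is available (current browsers and recent Node.js versions); the function name stringToBytesAsync is hypothetical and it swaps FileReader for that newer method.

```javascript
// Hypothetical helper: resolve with the UTF-8 bytes of a string.
// Uses Blob.prototype.arrayBuffer() instead of FileReader.
function stringToBytesAsync(str) {
  // The Blob constructor stores the string as UTF-8; arrayBuffer()
  // returns a Promise for those bytes.
  return new Blob([str]).arrayBuffer().then(function (buffer) {
    return new Uint8Array(buffer);
  });
}

stringToBytesAsync("\u{1f4a9}").then(function (bytes) {
  console.log(Array.from(bytes)); // 240, 159, 146, 169
});
```

The call is still asynchronous, but the result can now be chained with then() or awaited inside an async function.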

Answered by jk7

I was using Joni's solution and it worked fine, but this one is much shorter.

This was inspired by the atobUTF16() function of Solution #3 of Mozilla's Base64 Unicode discussion.

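function convertStringToUTF8ByteArray(str) {
    // Caveat: this copies UTF-16 code units, not UTF-8 bytes. The result is valid
    // UTF-8 only when every character is ASCII (code < 0x80); a Uint8Array keeps
    // just the low 8 bits of larger code units.
    let binaryArray = new Uint8Array(str.length);
    Array.prototype.forEach.call(binaryArray, function (el, idx, arr) { arr[idx] = str.charCodeAt(idx); });
    return binaryArray;
}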
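One thing to watch with the function above: a Uint8Array element keeps only the low 8 bits of the value assigned to it, so any code unit above 0xFF is truncated rather than UTF-8 encoded. The short sketch below illustrates that behavior and compares it with TextEncoder; it assumes TextEncoder is available as a global, and the variable names are only for illustration.

```javascript
// The Euro sign is a single UTF-16 code unit, 0x20AC (8364).
var codeUnit = "\u20ac".charCodeAt(0);

// Storing it in a Uint8Array keeps only the low byte: 0x20AC & 0xFF = 0xAC.
var truncated = new Uint8Array([codeUnit]);
console.log(truncated[0]); // 172

// The real UTF-8 encoding is three bytes long.
var utf8 = Array.from(new TextEncoder().encode("\u20ac"));
console.log(utf8); // 226, 130, 172

// For ASCII-only input both approaches agree.
var ascii = new Uint8Array(["A".charCodeAt(0)]);
console.log(ascii[0] === new TextEncoder().encode("A")[0]); // true
```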

Answered by Yordan Nedelchev

As there is no pure byte type in JavaScript, we can represent a byte array as an array of numbers, where each number represents a byte and thus has an integer value between 0 and 255 inclusive.

Here is a simple function that converts a JavaScript string into an Array of numbers containing the UTF-8 encoding of the string:

function toUtf8(str) {
    var value = [];
    var destIndex = 0;
    for (var index = 0; index < str.length; index++) {
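        // codePointAt combines a surrogate pair into one code point;
        // charCodeAt would return each UTF-16 code unit separately.
        var code = str.codePointAt(index);
        if (code > 0xFFFF) index++; // skip the low surrogate just consumed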
        if (code <= 0x7F) {
            value[destIndex++] = code;
        } else if (code <= 0x7FF) {
            value[destIndex++] = ((code >> 6 ) & 0x1F) | 0xC0;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0xFFFF) {
            value[destIndex++] = ((code >> 12) & 0x0F) | 0xE0;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0x1FFFFF) {
            value[destIndex++] = ((code >> 18) & 0x07) | 0xF0;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0x03FFFFFF) {
            value[destIndex++] = ((code >> 24) & 0x03) | 0xF8;
            value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0x7FFFFFFF) {
            value[destIndex++] = ((code >> 30) & 0x01) | 0xFC;
            value[destIndex++] = ((code >> 24) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else {
            throw new Error("Unsupported Unicode character \"" 
                + str.charAt(index) + "\" with code " + code + " (binary: " 
                + toBinary(code) + ") at index " + index
                + ". Cannot represent it as UTF-8 byte sequence.");
        }
    }
    return value;
}

function toBinary(byteValue) {
    if (byteValue < 0) {
        byteValue = byteValue & 0x00FF;
    }
    var str = byteValue.toString(2);
    var len = str.length;
    var prefix = "";
    for (var i = len; i < 8; i++) {
        prefix += "0";
    }
    return prefix + str;
}