Java String.getBytes("UTF8") JavaScript analog

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/12518830/

Date: 2020-10-26 16:25:30  Source: igfitidea

Tags: java, javascript, string, utf-8, byte

Asked by ivkremer

Related: "Bytes to string and backward"

The functions written there work properly, that is, pack(unpack("string")) yields "string". But I would like to get the same result that "string".getBytes("UTF8") gives in Java.

The question is: how can I write a JavaScript function that provides the same functionality as Java's getBytes("UTF8")?

For Latin strings, unpack(str) from the article mentioned above provides the same result as getBytes("UTF8"), except that it adds 0 at odd positions. But with non-Latin strings it seems to work completely differently. Is there a way to work with string data in JavaScript like Java does?

Accepted answer by Joni

You can use this function (gist):

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        else {
            // Surrogate code unit (half of a character above U+FFFF):
            // let's keep things simple and only handle chars up to U+FFFF...
            utf8.push(0xef, 0xbf, 0xbd); // U+FFFD "replacement character"
        }
    }
    return utf8;
}

Example of use:

>>> toUTF8Array("中€")
[228, 184, 173, 226, 130, 172]

If you want negative numbers for values over 127, like Java's byte-to-int conversion does, you have to tweak the constants and use

            utf8.push(0xffffffc0 | (charcode >> 6), 
                      0xffffff80 | (charcode & 0x3f));

and

            utf8.push(0xffffffe0 | (charcode >> 12), 
                      0xffffff80 | ((charcode>>6) & 0x3f), 
                      0xffffff80 | (charcode & 0x3f));
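
Rather than editing the constants, the unsigned values can also be converted after the fact. The following is only a minimal sketch; toSignedBytes is a hypothetical helper name that does not appear in the original answer, and it assumes the input is an array of integers in the range 0..255:

```javascript
// Hypothetical helper (not from the original answer): map unsigned byte
// values (0..255) to the signed range (-128..127) used by Java's byte type.
function toSignedBytes(bytes) {
    // Shifting left and then arithmetic-right by 24 bits sign-extends the low byte.
    return bytes.map(function (b) { return (b << 24) >> 24; });
}

// UTF-8 bytes of "中" (U+4E2D) as unsigned values
console.log(toSignedBytes([228, 184, 173])); // [-28, -72, -83]
```

If a typed array is acceptable, Int8Array.from(bytes) should yield the same values.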

Answer by bobince

You don't need to write a full-on UTF-8 encoder; there is a much easier JS idiom to convert a Unicode string into a string of bytes representing UTF-8 code units:

unescape(encodeURIComponent(str))

(This works because the odd encoding used by escape/unescape uses %xx hex sequences to represent ISO-8859-1 characters with that code, instead of UTF-8 as used by URI-component escaping. Similarly, decodeURIComponent(escape(bytes)) goes in the other direction.)

So if you want an Array out it would be:

function toUTF8Array(str) {
    var utf8= unescape(encodeURIComponent(str));
    var arr= new Array(utf8.length);
    for (var i= 0; i<utf8.length; i++)
        arr[i]= utf8.charCodeAt(i);
    return arr;
}
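
For the opposite direction mentioned above, a minimal sketch follows. fromUTF8Array is a hypothetical name, and the sketch assumes that the input is a well-formed UTF-8 byte array and that the legacy escape function is available (it is in browsers and Node.js, although it is deprecated):

```javascript
// Hypothetical inverse of toUTF8Array: UTF-8 byte values -> JavaScript string.
function fromUTF8Array(arr) {
    var bytes = "";
    for (var i = 0; i < arr.length; i++) {
        // Build a "byte string": one character (U+0000..U+00FF) per byte.
        // The mask also accepts Java-style signed values such as -28.
        bytes += String.fromCharCode(arr[i] & 0xff);
    }
    // escape() writes the non-ASCII bytes as %xx sequences, which
    // decodeURIComponent() then interprets as UTF-8.
    return decodeURIComponent(escape(bytes));
}

console.log(fromUTF8Array([228, 184, 173, 226, 130, 172])); // "中€"
```

Malformed input makes decodeURIComponent throw a URIError, so callers that handle untrusted bytes would need a try/catch.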

Answer by Kevin Hakanson

TextEncoder is part of the Encoding Living Standard and, according to the Encoding API entry from the Chromium Dashboard, it shipped in Firefox and will ship in Chrome 38. There is also a text-encoding polyfill available for other browsers.

The JavaScript code sample below returns a Uint8Array filled with the values you expect.

(new TextEncoder()).encode("string") 
// [115, 116, 114, 105, 110, 103]

A more interesting example, which better shows UTF-8 encoding at work, replaces the "in" in "string" with "îñ":

(new TextEncoder()).encode("strîñg")
[115, 116, 114, 195, 174, 195, 177, 103]
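
If a plain Array is wanted instead of a Uint8Array, or the bytes need to be turned back into a string, something like the following may work. This is a minimal sketch that assumes TextEncoder and TextDecoder are available as globals (as in current browsers and Node.js); the variable names are illustrative only:

```javascript
// "str\u00EE\u00F1g" is "strîñg" written with escapes to avoid source-encoding issues.
var encoded = new TextEncoder().encode("str\u00EE\u00F1g"); // Uint8Array of UTF-8 bytes

// Copy the typed array into an ordinary Array of numbers.
var plain = Array.from(encoded);
console.log(plain); // [115, 116, 114, 195, 174, 195, 177, 103]

// TextDecoder converts the bytes back into a string.
var decoded = new TextDecoder("utf-8").decode(encoded);
console.log(decoded === "str\u00EE\u00F1g"); // true
```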

Answer by HelloSam

The following functions also handle code points above U+FFFF.

Because JavaScript strings are UTF-16, a character above the BMP is represented by two "characters" (a surrogate pair) in a string, and charCodeAt returns the individual surrogate code units. fixedCharCodeAt handles this.

function encodeTextToUtf8(text) {
    var bin = [];
    for (var i = 0; i < text.length; i++) {
        var v = fixedCharCodeAt(text, i);
        if (v === false) continue;
        encodeCharCodeToUtf8(v, bin);
    }
    return bin;
}

function encodeCharCodeToUtf8(codePt, bin) {
    if (codePt <= 0x7F) {
        bin.push(codePt);
    } else if (codePt <= 0x7FF) {
        bin.push(192 | (codePt >> 6), 128 | (codePt & 63));
    } else if (codePt <= 0xFFFF) {
        bin.push(224 | (codePt >> 12),
            128 | ((codePt >> 6) & 63),
            128 | (codePt & 63));
    } else if (codePt <= 0x1FFFFF) {
        bin.push(240 | (codePt >> 18),
            128 | ((codePt >> 12) & 63), 
            128 | ((codePt >> 6) & 63),
            128 | (codePt & 63));
    }
}

function fixedCharCodeAt (str, idx) {  
    // ex. fixedCharCodeAt ('\uD800\uDC00', 0); // 65536  
    // ex. fixedCharCodeAt ('\uD800\uDC00', 1); // false (low surrogate, already handled)
    idx = idx || 0;  
    var code = str.charCodeAt(idx);  
    var hi, low;  
    if (0xD800 <= code && code <= 0xDBFF) { // High surrogate (could change last hex to 0xDB7F to treat high private surrogates as single characters)  
        hi = code;  
        low = str.charCodeAt(idx+1);  
        if (isNaN(low)) {  
            throw new Error('Invalid surrogate pair at position ' + idx);
        }  
        return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;  
    }  
    if (0xDC00 <= code && code <= 0xDFFF) { // Low surrogate  
        // We return false to allow loops to skip this iteration since should have already handled high surrogate above in the previous iteration  
        return false;  
        /*hi = str.charCodeAt(idx-1); 
          low = code; 
          return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;*/  
    }  
    return code;  
}
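
In engines that support ES2015, for...of iteration and String.prototype.codePointAt already walk a string by code point, so the surrogate handling may not need to be written by hand. The following is a minimal sketch of the same encoder under that assumption. encodeTextToUtf8Modern is a hypothetical name, and unlike the code above it replaces lone surrogates with U+FFFD rather than skipping them or throwing:

```javascript
// Hypothetical ES2015 variant of encodeTextToUtf8 (not from the original answer).
function encodeTextToUtf8Modern(text) {
    var bin = [];
    // for...of iterates by code point, so a surrogate pair arrives as one string.
    for (var ch of text) {
        var cp = ch.codePointAt(0);
        // A lone surrogate cannot be encoded; substitute U+FFFD (an assumption of this sketch).
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        if (cp <= 0x7F) {
            bin.push(cp);
        } else if (cp <= 0x7FF) {
            bin.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {
            bin.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
        } else {
            bin.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
        }
    }
    return bin;
}

// U+1F600 is stored as the surrogate pair \uD83D\uDE00
console.log(encodeTextToUtf8Modern("\uD83D\uDE00")); // [240, 159, 152, 128]
```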