JavaScript: extract a substring by UTF-8 byte positions

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/11200451/

Time: 2020-10-26 12:22:21  Source: igfitidea

Extract substring by utf-8 byte positions

Tags: javascript, string, utf-8, character-encoding, utf-16

Asked by tofutim

I have a string and start and length with which to extract a substring. Both positions (start and length) are based on the byte offsets in the original UTF8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use "substring". The UTF8 string contains several multi-byte characters. Is there a hyper-efficient way of doing this? (I don't need to decode the bytes...)

Example: var orig = '你好吗?'

The s,e might be 3,3 to extract the second character (好). I'm looking for

var result = orig.substringBytes(3,3);

Help!

Update #1: In C/C++ I would just cast it to a byte array, but I am not sure whether there is an equivalent in JavaScript. BTW, yes, we could parse it into a byte array and parse it back to a string, but it seems that there should be a quick way to cut it at the right place. Imagine that 'orig' is 1000000 characters, and s = 6 bytes and l = 3 bytes.

Update #2: Thanks to zerkms' helpful re-direction, I ended up with the following, which does NOT work right: it works for multi-byte characters but is messed up for single-byte ones.

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; i < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8;   // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

Update #3: I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same for UTF-8 and UTF-16, but the number of bytes taken up depends on the encoding! So this is not the right way to do this.
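
The observation in Update #3 can be made concrete. As a minimal sketch (the helper names utf8LengthOfCodePoint and utf8ByteLength are hypothetical and not part of the question), the UTF-8 length of a character follows from the range its code point falls in, regardless of how many 16-bit units JavaScript uses for it:

```javascript
// Hypothetical helper: number of bytes one Unicode code point occupies in UTF-8.
function utf8LengthOfCodePoint(cp) {
    if (cp < 0x80) return 1;      // U+0000..U+007F
    if (cp < 0x800) return 2;     // U+0080..U+07FF
    if (cp < 0x10000) return 3;   // U+0800..U+FFFF
    return 4;                     // U+10000..U+10FFFF
}

// Hypothetical helper: UTF-8 byte length of a whole string.
function utf8ByteLength(str) {
    var total = 0;
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp > 0xFFFF) i++;     // a surrogate pair uses two string indices
        total += utf8LengthOfCodePoint(cp);
    }
    return total;
}

console.log(utf8LengthOfCodePoint('好'.codePointAt(0))); // 3
console.log(utf8ByteLength('abc你好吗'));                 // 12
```

This assumes an ES2015 environment, since it relies on String.prototype.codePointAt.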

Answered by Kaii

I had a fun time fiddling with this. Hope this helps.

Because Javascript does not allow direct byte access on a string, the only way to find the start position is a forward scan.



Quoting the question's Update #3: "I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The codepoint is the same for UTF8 and UTF16, but the number of bytes taken up on encoding depends on the encoding!!! So this is not the right way to do this."

That is not correct. Actually, there is no UTF-8 string in JavaScript. According to the ECMAScript (ECMA-262) specification, all string values, regardless of the input encoding, are sequences of 16-bit unsigned integers ("[sequence of] 16-bit unsigned integers"), normally interpreted as UTF-16.

Considering this, the 8 bit shift is correct (but unnecessary).

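What is wrong is the assumption that your character is stored as a 3-byte sequence...
In fact, every element of a JS (ECMA-262) string is 16 bits (2 bytes) long; a character outside the Basic Multilingual Plane occupies two such elements (a surrogate pair).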
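
To illustrate the difference between the 16-bit units a string holds and the bytes its UTF-8 encoding would take, here is a small sketch. The function name countUnits is hypothetical, and it assumes an environment where TextEncoder is available as a global (modern browsers and recent Node.js versions):

```javascript
// Hypothetical helper: compare UTF-16 code units with UTF-8 bytes for a string.
function countUnits(str) {
    return {
        codeUnits: str.length,                          // 16-bit units in the JS string
        utf8Bytes: new TextEncoder().encode(str).length // bytes once encoded as UTF-8
    };
}

console.log(countUnits('你好吗'));        // { codeUnits: 3, utf8Bytes: 9 }
console.log(countUnits('\uD83D\uDE00')); // { codeUnits: 2, utf8Bytes: 4 } (U+1F600)
```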

This can be worked around by converting the multibyte characters to utf-8 manually, as shown in the code below.



See the details explained in my example code:

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
            "ü" is 2 byte, 
       and  "你" is 3 byte.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need a encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared locally to avoid implicit globals

    // scan string forward to find index of first character
    // (convert start position in bytes to start position in characters)
    // NOTE: surrogate pairs (characters outside the BMP) are not handled here;
    // encodeURIComponent() throws a URIError on a lone surrogate

    for (bytePos = 0; bytePos < startInBytes && startInChars < str.length; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    // stop at the end of the string so that str[n] is never undefined
    for (n = startInChars; startInChars <= end && n < str.length; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗?';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
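
As a variation on the forward scan above, the byte length of each character can be derived from its code point instead of calling encodeURIComponent per character. This is only a sketch under stated assumptions: the name substrByUtf8Bytes is hypothetical, it relies on ES2015's codePointAt, and it returns the characters whose first byte lies in the requested byte range (so a range that starts mid-character skips that character). It stops scanning as soon as the end of the range is reached, which matters for a long string with a small offset:

```javascript
// Hypothetical sketch: forward scan using code point ranges, including surrogate pairs.
function substrByUtf8Bytes(str, startInBytes, lengthInBytes) {
    var endInBytes = startInBytes + lengthInBytes;
    var bytePos = 0;   // UTF-8 offset of the character at index i
    var startIx = -1;  // string index of the first selected character
    var i = 0;

    while (i < str.length && bytePos < endInBytes) {
        if (startIx < 0 && bytePos >= startInBytes) startIx = i;
        var cp = str.codePointAt(i);
        bytePos += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        i += cp > 0xFFFF ? 2 : 1; // a surrogate pair takes two string indices
    }
    return startIx < 0 ? '' : str.slice(startIx, i);
}

console.log(substrByUtf8Bytes('abc你好吗', 3, 3)); // "你"
console.log(substrByUtf8Bytes('abc你好吗', 6, 6)); // "好吗"
```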

Answered by sunzhuoshi

@Kaii's answer is almost correct, but there is a bug in it: it fails to handle characters whose Unicode code points are in the range 128 to 255. Here is the revised version (just change 256 to 128):

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
            "ü" is 2 byte, 
       and  "你" is 3 byte.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need a encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared locally to avoid implicit globals

    // scan string forward to find index of first character
    // (convert start position in bytes to start position in characters)

    for (bytePos = 0; bytePos < startInBytes && startInChars < str.length; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    // stop at the end of the string so that str[n] is never undefined
    for (n = startInChars; startInChars <= end && n < str.length; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

// '?' (U+FF1F) is 3 bytes in UTF-8; '©' (U+00A9) lies in the range 128-255 and is 2 bytes
var orig = 'abc你好吗?©';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
alert('res: ' + substr_utf8_bytes(orig, 15, 2)); // alerts: "©"

By the way, this is a bug fix, and it SHOULD be useful for anyone who has the same problem. Why did the reviewers reject my edit suggestion for changing "too much" or being "too minor"? @Adam Eberlin @Kjuly @Jasonw

Answered by tofutim

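// Node.js only: Buffer is not available in browsers without a polyfill
function substrBytes(str, start, length)
{
    // Buffer.from replaces the deprecated new Buffer(str) constructor
    var buf = Buffer.from(str, 'utf8');
    return buf.subarray(start, start + length).toString('utf8');
}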

AYB
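
Outside Node.js, a similar encode-slice-decode approach can be sketched with the standard TextEncoder and TextDecoder globals. The function name substrBytesEncoded is hypothetical. Note that, like the Buffer version, this encodes the whole string first, and a range that cuts through the middle of a multi-byte character decodes to the replacement character U+FFFD rather than to a whole character:

```javascript
// Hypothetical sketch: encode to UTF-8 bytes, slice by byte offsets, decode back.
function substrBytesEncoded(str, start, length) {
    var bytes = new TextEncoder().encode(str);           // Uint8Array of UTF-8 bytes
    var slice = bytes.subarray(start, start + length);   // view, no copy
    return new TextDecoder('utf-8').decode(slice);
}

console.log(substrBytesEncoded('abc你好吗', 3, 3)); // "你"
console.log(substrBytesEncoded('abc你好吗', 6, 6)); // "好吗"
```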

Answered by Hüseyin BABAL

For IE users, the code in the answer above will output undefined, because older versions of IE do not support str[n]; in other words, you cannot index a string like an array. You need to replace str[n] with str.charAt(n). The code should be:

function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

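    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared locally to avoid implicit globals

    // convert the start position in bytes to a start position in characters
    for (bytePos = 0; bytePos < startInBytes && startInChars < str.length; startInChars++) {
        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str.charAt(startInChars)).length;
    }

    end = startInChars + lengthInBytes - 1;

    // collect characters until the requested byte count is used up
    for (n = startInChars; startInChars <= end && n < str.length; n++) {
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str.charAt(n)).length;

        resultStr += str.charAt(n);
    }

    return resultStr;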
}

Answered by May Weather VN

Maybe use this to count bytes, with a usage example. It counts the character 你 as 2 bytes, instead of the 3 bytes given by @Kaii's function:

jQuery.byteLength = function(target) {
    try {
        var i = 0;
        var length = 0;
        var count = 0;
        var character = '';
        // castString and showErrorDetail are custom helpers, not part of jQuery itself
        target = jQuery.castString(target);
        length = target.length;
        //
        for (i = 0; i < length; i++) {
            // take one character and convert it to its Unicode code unit
            character = target.charCodeAt(i);
            //
            // half-width characters in Unicode: 0x0 - 0x80, 0xf8f0, 0xff61 - 0xff9f, 0xf8f1 -
            // 0xf8f3
            if ((character >= 0x0 && character < 0x81)
                    || (character == 0xf8f0)
                    || (character > 0xff60 && character < 0xffa0)
                    || (character > 0xf8f0 && character < 0xf8f4)) {
                // 1-byte character
                count += 1;
            } else {
                // 2-byte character
                count += 2;
            }
        }
        //
        return (count);
    } catch (e) {
        jQuery.showErrorDetail(e, 'byteLength');
        return (0);
    }
};

for (var j = 1, len = value.length; j <= len; j++) {
    var slice = value.slice(0, j);
    var slength = $.byteLength(slice);
    if ( slength == 106 ) {
        $(this).val(slice);
        break;
    }
}

Answered by Houshang.Karami

System.ArraySegment (a .NET type) is useful, but you need to construct it from an input array, an offset, and an indexer.
