JavaScript strings outside of the BMP

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3744721/

Time: 2020-10-25 02:06:39. Source: igfitidea

Tags: javascript, unicode, utf-16, surrogate-pairs, astral-plane

Asked by Delan Azabani

BMP being Basic Multilingual Plane

According to JavaScript: the Good Parts:

JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.

This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.

Further investigation confirms this:

> String.fromCharCode(0x20001);

The fromCharCode method seems to use only the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001.
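
The truncation can be reproduced directly. The following is a minimal sketch, assuming any ES5-or-later engine; the variable names are illustrative only. It also shows that the same code point can still be stored as two 16-bit code units, which is what the answers below build on.

```javascript
// String.fromCharCode converts each argument to a 16-bit unsigned integer,
// so only the low 16 bits of 0x20001 survive.
var truncated = String.fromCharCode(0x20001);
console.log(truncated === "\u0001"); // true
console.log(truncated.length);       // 1

// U+20001 expressed as a UTF-16 surrogate pair (two code units).
var viaSurrogates = String.fromCharCode(0xD840, 0xDC01);
console.log(viaSurrogates.length);                      // 2
console.log(viaSurrogates.charCodeAt(0).toString(16));  // "d840"
console.log(viaSurrogates.charCodeAt(1).toString(16));  // "dc01"
```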

Question: is it at all possible to handle post-BMP characters in JavaScript?

2011-07-31: slide twelve from Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly covers issues related to this quite well.

Accepted answer by bobince

Depends what you mean by 'support'. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.

But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, slice, etc.) deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.
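
As a rough illustration of that code-unit behavior, here is a small sketch using U+20001 written as an explicit surrogate pair. The variable names are made up for the example.

```javascript
// "a", then U+20001 as a surrogate pair, then "b": 3 code points, 4 code units.
var s = "a\uD840\uDC01b";

console.log(s.length);           // 4
console.log(s.split("").length); // 4: the pair is split into two lone surrogates

// slice cuts between the two halves of the pair without complaint.
var half = s.slice(0, 2);
console.log(half.charCodeAt(1).toString(16)); // "d840"

// A lone surrogate is an accepted, though ill-formed, string value.
var lone = "\uDC01";
console.log(lone.length);        // 1
```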

If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

String.prototype.getCodePointLength= function() {
    return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};

String.fromCodePoint= function() {
    var chars= Array.prototype.slice.call(arguments);
    for (var i= chars.length; i-->0;) {
        var n = chars[i]-0x10000;
        if (n>=0)
            chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
    }
    return String.fromCharCode.apply(null, chars);
};
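
The arithmetic inside fromCodePoint above can be isolated into two small helpers. This is only a sketch: toSurrogatePair and fromSurrogatePair are hypothetical names, not part of the answer or of any standard API, and both assume valid input (a code point above 0xFFFF, or a well-formed pair).

```javascript
// Hypothetical helper: split a supplementary code point into two code units.
function toSurrogatePair(codePoint) {
    var n = codePoint - 0x10000;    // 20-bit offset past the BMP
    return [0xD800 + (n >> 10),     // high surrogate carries the top 10 bits
            0xDC00 + (n & 0x3FF)];  // low surrogate carries the bottom 10 bits
}

// Hypothetical helper: the inverse mapping.
function fromSurrogatePair(high, low) {
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

var pair = toSurrogatePair(0x20001);
console.log(pair[0].toString(16), pair[1].toString(16));       // d840 dc01
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16)); // 20001
```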

Answer by ecellingsworth

I came to the same conclusion as bobince. If you want to work with strings containing Unicode characters outside of the BMP, you have to reimplement JavaScript's String methods. This is because JavaScript counts each 16-bit code unit as one character, while symbols outside of the BMP need two code units to be represented. You therefore run into a case where some symbols count as two characters and some count as only one.

I've reimplemented the following methods to treat each Unicode code point as a single character: .length, .charCodeAt, .fromCharCode, .charAt, .indexOf, .lastIndexOf, .slice, and .split.

You can check it out on jsfiddle: http://jsfiddle.net/Y89Du/

Here's the code without comments. I tested it, but it may still have errors. Comments are welcome.

if (!String.prototype.ucLength) {
    String.prototype.ucLength = function() {
        // this solution was taken from 
        // http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp
        return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
    };
}

// Named ucCodePointAt rather than codePointAt: the native
// String.prototype.codePointAt takes a code unit index, whereas this
// function takes a code point index, so the two must not be confused.
if (!String.prototype.ucCodePointAt) {
    String.prototype.ucCodePointAt = function (ucPos) {
        if (isNaN(ucPos)){
            ucPos = 0;
        }
        var str = String(this);
        var codePoint = null;
        var pairFound = false;
        var ucIndex = -1;
        var i = 0;
        while (i < str.length){
            ucIndex += 1;
            var code = str.charCodeAt(i);
            var next = str.charCodeAt(i + 1);
            pairFound = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF);
            if (ucIndex == ucPos){
                codePoint = pairFound ? ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000 : code;
                break;
            } else{
                i += pairFound ? 2 : 1;
            }
        }
        // null when ucPos is past the end of the string
        return codePoint;
    };
}

if (!String.fromCodePoint) {
    String.fromCodePoint = function () {
        var strChars = [], codePoint, offset, codeValues, i;
        for (i = 0; i < arguments.length; ++i) {
            codePoint = arguments[i];
            offset = codePoint - 0x10000;
            if (codePoint > 0xFFFF){
                codeValues = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
            } else{
                codeValues = [codePoint];
            }
            strChars.push(String.fromCharCode.apply(null, codeValues));
        }
        return strChars.join("");
    };
}

if (!String.prototype.ucCharAt) {
    String.prototype.ucCharAt = function (ucIndex) {
        var str = String(this);
        var codePoint = str.ucCodePointAt(ucIndex);
        if (codePoint === null){
            // out of range: return "" as charAt does
            return "";
        }
        var ucChar = String.fromCodePoint(codePoint);
        return ucChar;
    };
}

if (!String.prototype.ucIndexOf) {
    String.prototype.ucIndexOf = function (searchStr, ucStart) {
        if (isNaN(ucStart)){
            ucStart = 0;
        }
        if (ucStart < 0){
            ucStart = 0;
        }
        var str = String(this);
        var strUCLength = str.ucLength();
        searchStr = String(searchStr);
        var ucSearchLength = searchStr.ucLength();
        var i = ucStart;
        while (i < strUCLength){
            var ucSlice = str.ucSlice(i,i+ucSearchLength);
            if (ucSlice == searchStr){
                return i;
            }
            i++;
        }
        return -1;
    };
}

if (!String.prototype.ucLastIndexOf) {
    String.prototype.ucLastIndexOf = function (searchStr, ucStart) {
        var str = String(this);
        var strUCLength = str.ucLength();
        if (isNaN(ucStart)){
            ucStart = strUCLength - 1;
        }
        if (ucStart >= strUCLength){
            ucStart = strUCLength - 1;
        }
        searchStr = String(searchStr);
        var ucSearchLength = searchStr.ucLength();
        var i = ucStart;
        while (i >= 0){
            var ucSlice = str.ucSlice(i,i+ucSearchLength);
            if (ucSlice == searchStr){
                return i;
            }
            i--;
        }
        return -1;
    };
}

if (!String.prototype.ucSlice) {
    String.prototype.ucSlice = function (ucStart, ucStop) {
        var str = String(this);
        var strUCLength = str.ucLength();
        if (isNaN(ucStart)){
            ucStart = 0;
        }
        if (ucStart < 0){
            ucStart = strUCLength + ucStart;
            if (ucStart < 0){ ucStart = 0;}
        }
        if (typeof(ucStop) == 'undefined'){
            // default to the end of the string; the stop index is exclusive
            ucStop = strUCLength;
        }
        if (ucStop < 0){
            ucStop = strUCLength + ucStop;
            if (ucStop < 0){ ucStop = 0;}
        }
        if (ucStop > strUCLength){
            // clamp so that nothing is read past the end of the string
            ucStop = strUCLength;
        }
        var ucChars = [];
        var i = ucStart;
        while (i < ucStop){
            ucChars.push(str.ucCharAt(i));
            i++;
        }
        return ucChars.join("");
    };
}

if (!String.prototype.ucSplit) {
    String.prototype.ucSplit = function (delimiter, limit) {
        var str = String(this);
        var strUCLength = str.ucLength();
        var ucChars = [];
        if (delimiter == ''){
            for (var i = 0; i < strUCLength; i++){
                ucChars.push(str.ucCharAt(i));
            }
            // apply the limit only when one was given:
            // slice(0, 0 + undefined) is slice(0, NaN), which returns []
            if (typeof(limit) != 'undefined'){
                ucChars = ucChars.slice(0, limit);
            }
        } else{
            ucChars = str.split(delimiter, limit);
        }
        return ucChars;
    };
}

Answer by Michael Allan

More recent JavaScript engines have String.fromCodePoint.

const ideograph = String.fromCodePoint( 0x20001 ); // outside the BMP

They also have a code-point iterator, which lets you count the code points in a string.

function countCodePoints( str )
{
    const i = str[Symbol.iterator]();
    let count = 0;
    while( !i.next().done ) ++count;
    return count;
}

console.log( ideograph.length ); // gives '2'
console.log( countCodePoints(ideograph) ); // '1'
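
The same iterator also backs the spread syntax, so a shorter variant is possible. This is a sketch assuming an ES2015-or-later engine; the variable names are illustrative.

```javascript
const text = "a" + String.fromCodePoint(0x20001) + "b";

// Spreading a string walks it with the code-point iterator.
const codePoints = [...text];
console.log(text.length);       // 4 (code units)
console.log(codePoints.length); // 3 (code points)
console.log(codePoints[2]);     // "b"

// The native codePointAt takes a code unit index; index 1 is the start of the pair.
console.log(text.codePointAt(1).toString(16)); // "20001"
```

Spreading allocates an array of all code points, so for very long strings the counting loop above may be the lighter choice.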

Answer by Simon Hi

Using a for (c of this) loop, one can perform various computations on a string that contains non-BMP characters. For instance, to compute the string length, and to get the nth character of the string:

String.prototype.magicLength = function()
{
    var c, k;
    k = 0;
    for (c of this) // iterate each char of this
    {
        k++;
    }
    return k;
}

String.prototype.magicCharAt = function(n)
{
    var c, k;
    k = 0;
    for (c of this) // iterate each char of this
    {
        if (k == n) return c + "";
        k++;
    }
    return "";
}
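
If extending String.prototype is undesirable, the same loops can be written as standalone functions. The sketch below assumes an ES2015-or-later engine; codePointLength and codePointCharAt are hypothetical names chosen for this example.

```javascript
// Hypothetical standalone equivalents; String.prototype is left untouched.
function codePointLength(str) {
    let count = 0;
    for (const ch of str) count++; // for...of yields whole code points
    return count;
}

function codePointCharAt(str, n) {
    let k = 0;
    for (const ch of str) {
        if (k === n) return ch;
        k++;
    }
    return ""; // out of range, mirroring magicCharAt
}

const sample = "x\u{20001}y";
console.log(codePointLength(sample));           // 3
console.log(codePointCharAt(sample, 1).length); // 2: one code point, two code units
console.log(codePointCharAt(sample, 2));        // "y"
```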

Answer by Jukka K. Korpela

Yes, you can. Although support for non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, the document encoding must be properly declared, and for most practical purposes you would need to use the UTF-8 encoding. Moreover, you need an editor that can handle UTF-8, and you need some input method(s); see e.g. my Full Unicode Input utility.

Using suitable tools and settings, you can write var foo = ''.

The non-BMP characters will be internally represented as surrogate pairs, so each non-BMP character counts as 2 in the string length.
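
When the source encoding cannot be relied on, an ES2015-or-later engine also accepts a code point escape, which keeps the source pure ASCII. The sketch below uses U+20001 from the question as an arbitrary example and confirms the length of 2 described above.

```javascript
// \u{...} names the code point directly; the engine stores it as a surrogate pair.
const foo = "\u{20001}";
console.log(foo.length);                      // 2
console.log(foo === "\uD840\uDC01");          // true
console.log(foo.codePointAt(0).toString(16)); // "20001"
```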
