JavaScript: Count bytes in textarea using JavaScript

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/2848462/

Date: 2020-08-23 02:09:30  Source: igfitidea

Count bytes in textarea using javascript

javascript, utf-8

Asked by mcintyre321

I need to count how many bytes the contents of a textarea take up when encoded as UTF-8, using JavaScript. Any idea how I would do this?

Thanks!

Accepted answer by derflocki

Edit: as didier-l has pointed out, this function does not count surrogate characters correctly.

broofa's answer should count surrogates properly, see https://stackoverflow.com/a/12206089/274483.

I have tested the two versions proposed here as well as a naive implementation:

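function getUTF8Length(string) {
    var utf8length = 0;
    for (var n = 0; n < string.length; n++) {
        // charCodeAt returns one UTF-16 code unit (0..0xFFFF)
        var c = string.charCodeAt(n);
        if (c < 128) {
            utf8length++;          // ASCII: 1 byte
        }
        else if (c < 2048) {
            utf8length += 2;       // U+0080..U+07FF: 2 bytes
        }
        else {
            // Everything else counts as 3 bytes, so each half of a
            // surrogate pair adds 3 (6 in total instead of the correct 4)
            utf8length += 3;
        }
    }
    return utf8length;
}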

The result: my version is slightly faster in Firefox and significantly faster in Chrome (~30x) than the other versions posted here.
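To make the surrogate caveat concrete, here is a minimal sketch of the same per-code-unit counting applied to one string inside the BMP and one outside it. The name `naiveUtf8Length` and the sample strings are made up for this illustration.

```javascript
// Same counting rule as the naive implementation: 1, 2 or 3 bytes
// per UTF-16 code unit, with no special handling of surrogates.
function naiveUtf8Length(str) {
  var total = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    total += c < 128 ? 1 : c < 2048 ? 2 : 3;
  }
  return total;
}

var bmpOnly = "a\u00e9\u20ac";  // "a", "é", "€": 1 + 2 + 3 bytes in UTF-8
var astral = "\ud83d\ude00";    // U+1F600, stored as a surrogate pair

console.log(naiveUtf8Length(bmpOnly)); // 6, which is correct
console.log(naiveUtf8Length(astral));  // 6, although UTF-8 needs only 4 bytes
```

So the function is fine as long as the text stays inside the BMP, and overcounts by 2 bytes for every character outside it.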

Answer by Tgr

encodeURIComponent(text).replace(/%[A-F\d]{2}/g, 'U').length

Answer by broofa

Combining various answers, the following method should be fast and accurate, and avoids issues with invalid surrogate pairs that can cause errors in encodeURIComponent():

function getUTF8Length(s) {
  var len = 0;
  for (var i = 0; i < s.length; i++) {
    var code = s.charCodeAt(i); // one UTF-16 code unit, 0..0xffff
    if (code <= 0x7f) {
      len += 1;
    } else if (code <= 0x7ff) {
      len += 2;
    } else if (code >= 0xd800 && code <= 0xdfff) {
      // Surrogate pair: These take 4 bytes in UTF-8 and 2 code units in UTF-16
      // (Assume next char is the other [valid] half and just skip it)
      len += 4; i++;
    } else {
      // Rest of the BMP, up to and including 0xffff: 3 bytes
      len += 3;
    }
  }
  return len;
}
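One way to sanity-check a hand-written counter like this is to compare it against the standard `TextEncoder` API, which produces the actual UTF-8 bytes. This is only a sketch: it assumes `TextEncoder` is available (current browsers and recent Node.js), and the helper name `utf8LengthViaEncoder` and the sample strings are invented for the example.

```javascript
// Assumption: TextEncoder exists as a global in this environment.
function utf8LengthViaEncoder(s) {
  // encode() returns a Uint8Array holding the UTF-8 bytes of s
  return new TextEncoder().encode(s).length;
}

var samples = ["abc", "\u00e9", "\u20ac", "\ud83d\ude00", "foo\ud83d\ude00bar"];
samples.forEach(function (s) {
  // Print each sample (escaped as JSON) next to its UTF-8 byte length
  console.log(JSON.stringify(s), utf8LengthViaEncoder(s));
});
```

The two approaches can disagree on malformed input: `TextEncoder` replaces a lone surrogate with U+FFFD (3 bytes), while the loop above assumes every surrogate starts a valid pair.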

Answer by frank_neff

If you have non-BMP characters in your string, it's a little more complicated...

Because JavaScript uses UTF-16 encoding and a "character" is a 2-byte (16-bit) unit, characters that do not fit into 16 bits are split across two units and are not counted as a single character:

    <script type="text/javascript">
        // "foo" followed by one non-BMP character (U+1F600, chosen here as an example)
        var nonBmpString = "foo\uD83D\uDE00";
        console.log( nonBmpString.length );
        // will output 5
    </script>

A non-BMP character such as this one is 4 bytes (32 bits) long in UTF-8. JavaScript interprets it as 2 characters because, in JS, a character is a 16-bit block.
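The difference between 16-bit units and whole characters can be seen directly in engines that support ES2015. The following is a small sketch under that assumption; the sample character U+1F600 and the variable names are picked for this example only.

```javascript
// "foo" plus one non-BMP character (U+1F600, an arbitrary example)
var sample = "foo\ud83d\ude00";

var codeUnits = sample.length;              // counts 16-bit code units
var codePoints = Array.from(sample).length; // string iteration yields whole code points
var lastCodePoint = sample.codePointAt(3);  // reads the full pair starting at index 3

console.log(codeUnits);                  // 5
console.log(codePoints);                 // 4
console.log(lastCodePoint.toString(16)); // "1f600"
```

Neither number is the UTF-8 byte length, though. That still has to be derived from the code points.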

So, to correctly get the byte size of a mixed string, we have to write our own function, fixedCharCodeAt():

    function fixedCharCodeAt(str, idx) {
        idx = idx || 0;
        var code = str.charCodeAt(idx);
        var hi, low;
        if (0xD800 <= code && code <= 0xDBFF) { // High surrogate (could change last hex to 0xDB7F to treat high private surrogates as single characters)
            hi = code;
            low = str.charCodeAt(idx + 1);
            if (isNaN(low)) {
                throw 'Not a valid character or memory error!';
            }
            return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
        }
        if (0xDC00 <= code && code <= 0xDFFF) { // Low surrogate
            // We return false to allow loops to skip this iteration since should have already handled high surrogate above in the previous iteration
            return false;
            /*hi = str.charCodeAt(idx-1);
            low = code;
            return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;*/
        }
        return code;
    }

Now we can count the bytes...

    function countUtf8(str) {
        var result = 0;
        for (var n = 0; n < str.length; n++) {
            var charCode = fixedCharCodeAt(str, n);
            if (typeof charCode === "number") {
                if (charCode < 128) {
                    result = result + 1;
                } else if (charCode < 2048) {
                    result = result + 2;
                } else if (charCode < 65536) {
                    result = result + 3;
                } else if (charCode < 2097152) {
                    result = result + 4;
                } else if (charCode < 67108864) {
                    result = result + 5;
                } else {
                    result = result + 6;
                }
            }
        }
        return result;
    }
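In engines with ES2015 support, the same idea can be sketched without a custom `fixedCharCodeAt`, because `for...of` over a string already yields whole code points. This is an alternative sketch rather than the method above, and `countUtf8ByCodePoint` is a name invented for the example.

```javascript
// Sketch: count UTF-8 bytes by iterating code points (assumes ES2015).
function countUtf8ByCodePoint(str) {
  var bytes = 0;
  for (var ch of str) {          // one step per code point, not per code unit
    var cp = ch.codePointAt(0);
    if (cp < 0x80) {
      bytes += 1;                // U+0000..U+007F
    } else if (cp < 0x800) {
      bytes += 2;                // U+0080..U+07FF
    } else if (cp < 0x10000) {
      bytes += 3;                // rest of the BMP, including lone surrogates
    } else {
      bytes += 4;                // U+10000..U+10FFFF
    }
  }
  return bytes;
}

console.log(countUtf8ByCodePoint("foo\ud83d\ude00")); // 7
```

Code points stop at U+10FFFF, so four bytes is the maximum here. A lone surrogate is counted as 3 bytes, the same size as the U+FFFD replacement character an encoder would typically emit for it.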

By the way, you should not use the encodeURI method, because it's a native browser function ;)

Cheers

frankneff.ch / @frank_neff

Answer by Ryan Wu

Add a byte-length counting function to String.prototype:

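String.prototype.Blength = function() {
    // Match every character above 0xFF; each one counts as 2 bytes.
    // Note: this matches a double-byte encoding, not UTF-8, where
    // U+0080..U+07FF take 2 bytes and most CJK characters take 3.
    var arr = this.match(/[^\x00-\xff]/g);
    return arr == null ? this.length : this.length + arr.length;
}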

Then you can call .Blength() to get the size.

Answer by qbolec

How about simply:

unescape(encodeURIComponent(utf8text)).length

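The trick is that encodeURIComponent works on characters, percent-escaping each byte of their UTF-8 encoding, while unescape works on bytes, turning every %XX escape back into a single character. The resulting string therefore has one character per UTF-8 byte.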
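The two steps can be looked at separately. Below is a minimal sketch with "€" (U+20AC) as the sample input; the `safeByteLength` wrapper and its `-1` return value are hypothetical additions for this illustration, there because `encodeURIComponent` throws a `URIError` on a lone surrogate.

```javascript
var text = "\u20ac"; // "€", U+20AC

// Step 1: each UTF-8 byte of the character becomes a %XX escape
var escaped = encodeURIComponent(text);
console.log(escaped);           // "%E2%82%AC"

// Step 2: each %XX escape becomes one character, so length == byte count
var byteString = unescape(escaped);
console.log(byteString.length); // 3

// Hypothetical wrapper: report malformed input instead of throwing
function safeByteLength(s) {
  try {
    return unescape(encodeURIComponent(s)).length;
  } catch (e) {
    return -1; // sentinel chosen for this sketch only
  }
}
```

Note that `unescape` is a legacy function, so this is best treated as a compact trick rather than something to rely on long term.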

Answer by Juan Correa

I have been asking myself the same thing. This is the best answer I have stumbled upon:

http://www.inter-locale.com/demos/countBytes.html

Here is the code snippet:

<script type="text/javascript">
 function checkLength() {
    var countMe = document.getElementById("someText").value;
    // encodeURI writes every byte it escapes as "%XX"
    var escapedStr = encodeURI(countMe);
    var count;
    if (escapedStr.indexOf("%") != -1) {
        count = escapedStr.split("%").length - 1;  // number of "%XX" escapes
        if (count == 0) count++;  //perverse case; can't happen with real UTF-8
        var tmp = escapedStr.length - (count * 3); // characters left unescaped
        count = count + tmp;
    } else {
        count = escapedStr.length;
    }
    alert(escapedStr + ": size is " + count);
 }
</script>

The link also contains a live example to play with. encodeURI(STRING) is the building block here, but also look at encodeURIComponent(STRING) (as already pointed out in a previous answer) to see which one fits your needs.

Regards

Answer by Lauri Oherd

encodeURI(text).split(/%..|./).length - 1

Answer by user3211372

Try the following:

function b(c) {
     var n = 0;
     // i and p are declared locally so they do not leak as globals
     for (var i = 0; i < c.length; i++) {
           var p = c.charCodeAt(i);
           if (p < 128) {
                 n++;
           } else if (p < 2048) {
                 n += 2;
           } else {
                 n += 3;
           }
     }
     return n;
}

Answer by Mehdi Mashayekhi

Just set the meta charset to UTF-8 and it's OK!

<meta charset="UTF-8">
<meta http-equiv="content-type" content="text/html;charset=utf-8">

And the JS:

if($mytext.length > 10){
 // its okkk :)
}