PHP strlen() and UTF-8 encoding
Original question: http://stackoverflow.com/questions/11034058/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
strlen() and UTF-8 encoding
Asked by Jon Lyles
Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested in strlen(), not other functions.
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and its answer (4) come from a ZCE mock test I bought on eBay.
Accepted answer by bames53
The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two).
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
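As a rough illustration of that byte count, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escapes) and builds the string from code points so that the result does not depend on how the source file is saved. The decomposed variant is only one example of an alternative representation with a different length.

```php
<?php
// The six-character string from the question, as UTF-8 bytes.
$utf8 = '$1' . "\u{00EF}\u{00BF}\u{00BD}" . '2'; // $1ï¿½2

// strlen() counts bytes: ï, ¿ and ½ take two bytes each, the rest one each.
echo bin2hex($utf8), "\n"; // 2431c3afc2bfc2bd32
echo strlen($utf8), "\n";  // 9

// Same visible text, but ï written as "i" plus a combining diaeresis (U+0308).
$decomposed = '$1' . "i\u{0308}\u{00BF}\u{00BD}" . '2';
echo strlen($decomposed), "\n"; // 10
```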
However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that is also legal UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO 8859-1 encoding of the three characters "ï¿½".
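The byte-level coincidence described above can be checked with a short sketch. It assumes the mbstring extension is loaded and PHP 7 or later; the variable names are illustrative only.

```php
<?php
// The six-character string, as UTF-8.
$sixChars = '$1' . "\u{00EF}\u{00BF}\u{00BD}" . '2'; // $1ï¿½2

// Store it as ISO 8859-1: every character becomes exactly one byte.
$latin1Bytes = mb_convert_encoding($sixChars, 'ISO-8859-1', 'UTF-8');

// The four-character string with the replacement character, as UTF-8.
$fourChars = '$1' . "\u{FFFD}" . '2'; // $1�2

var_dump(strlen($latin1Bytes));                     // int(6)
var_dump($latin1Bytes === $fourChars);              // bool(true)
var_dump(mb_check_encoding($latin1Bytes, 'UTF-8')); // bool(true)
var_dump(mb_strlen($latin1Bytes, 'UTF-8'));         // int(4)
```

Note that strlen() still reports 6 for these bytes; only a character-aware count such as mb_strlen() gives 4.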
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation: first by a UTF-8 decoder applied to non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
Answer by Anton
How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen, it's possible to configure PHP on your web server by setting the mbstring.func_overload directive to 2, so that calls to strlen in your scripts are automatically replaced with mb_strlen. Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.
Answer by Haim Evgi
You need to use the multibyte string function mb_strlen(), like:
mb_strlen($string, 'UTF-8');
Answer by Joni
It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence ï¿½ is what you get when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result as latin1. This character is used as a replacement for byte sequences that don't encode any character, for example when reading text from a file. What has happened is likely this:
- The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character).
- The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text as $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
- Then some third program read the file as latin1 and showed $1ï¿½2.
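The three steps above can be sketched in PHP. This is only a simulation under assumptions: it requires the mbstring extension, it uses the UTF-8 to UTF-8 conversion of mb_convert_encoding() with mb_substitute_character() as a stand-in for "a program that used UTF-8", and the variable names are made up for illustration.

```php
<?php
// Step 1: the original text as latin1 bytes (0xA2 is the cent sign in ISO 8859-1).
$latin1File = '$1' . "\xA2" . '2';
echo strlen($latin1File), "\n"; // 4

// Step 2: decode the bytes as UTF-8. The lone 0xA2 byte is not valid UTF-8,
// so it is substituted with U+FFFD, and the text is written back as UTF-8.
mb_substitute_character(0xFFFD);
$rewritten = mb_convert_encoding($latin1File, 'UTF-8', 'UTF-8');
echo bin2hex($rewritten), "\n"; // 2431efbfbd32

// Step 3: a third program reads those bytes as latin1 (converted to UTF-8
// here so the result can be displayed and counted).
$displayed = mb_convert_encoding($rewritten, 'UTF-8', 'ISO-8859-1');
echo mb_strlen($displayed, 'UTF-8'), "\n"; // 6
echo strlen($displayed), "\n";             // 9
```

Under this reconstruction, the original four-byte latin1 string is the one for which strlen() returns 4.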
Answer by goat
No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF-8 encoding needs at least 1 byte per character.
We have established that:
- there are 4 bytes
- a character is represented by no less than 1 byte
...yet we have 6 characters, which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g., the web browser) is using to interpret the string. It could use some uncommon encoding scheme in which a character is represented by fewer than 8 bits. If that were the case, then 4 bytes could display as 6 characters. So the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
Answer by Madara's Ghost
Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.
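To make the variable width concrete, here is a small sketch comparing strlen() and mb_strlen() on single characters. It assumes the mbstring extension and PHP 7.2 or later (for mb_ord()); the sample characters are arbitrary picks and are not from the question.

```php
<?php
// One sample character for each UTF-8 width: a, ï, €, 😀
$samples = ["a", "\u{00EF}", "\u{20AC}", "\u{1F600}"];

$byteLengths = [];
$charLengths = [];
foreach ($samples as $char) {
    $byteLengths[] = strlen($char);             // bytes
    $charLengths[] = mb_strlen($char, 'UTF-8'); // characters
    printf(
        "U+%04X: %d byte(s), %d character\n",
        mb_ord($char, 'UTF-8'),
        strlen($char),
        mb_strlen($char, 'UTF-8')
    );
}
// U+0061: 1 byte(s), 1 character
// U+00EF: 2 byte(s), 1 character
// U+20AC: 3 byte(s), 1 character
// U+1F600: 4 byte(s), 1 character
```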

