Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/6723562/

Time: 2020-08-26 01:04:41  Source: igfitidea

How to detect a malformed UTF-8 string in PHP?

Tags: php, encoding, utf-8, iconv

Asked by rsk82

The iconv function sometimes gives me an error:

Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before passing the data to iconv?

Answered by hakre

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

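You can make use of the UTF-8 validity check that has been available in preg_match [PHP Manual] since PHP 4.3.5 (the u modifier). If an invalid string is given, the match fails with no additional information: the original answer reports a return value of 0, and current PHP versions return false.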

$isUTF8 = (preg_match('//u', $string) === 1); // true only for valid UTF-8
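
As a small sketch of how this check might be wrapped, the helper below treats anything other than a return value of 1 as invalid, which covers both 0 and false. The function name isValidUtf8() is hypothetical and not part of the answer; the sample strings are illustrative only.

```php
<?php
// Hypothetical helper (the name is illustrative): true only if $string is valid UTF-8.
function isValidUtf8(string $string): bool
{
    // The empty pattern matches any subject. With the /u modifier PCRE validates
    // the subject as UTF-8 first, so the call fails on malformed input.
    return preg_match('//u', $string) === 1;
}

var_dump(isValidUtf8("caf\xC3\xA9")); // well-formed two-byte sequence
var_dump(isValidUtf8("caf\xE9"));     // lone Latin-1 byte, malformed as UTF-8

// After a failed match, preg_last_error() reports the reason.
preg_match('//u', "caf\xE9");
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Comparing against 1 rather than relying on truthiness keeps the check independent of whether a failed match is reported as 0 or as false.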

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.

Additionally, iconv [PHP Manual] allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can use @ to suppress the notice and check the length of the returned string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
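
To tie this back to the question (checking the data before handing it to iconv), here is a minimal sketch that validates first and converts only when the input is well-formed. The function name convertIfValidUtf8() and the default target encoding are assumptions made for illustration; the validity test reuses the preg_match check shown earlier in this answer.

```php
<?php
// Hypothetical helper: convert UTF-8 input to $to only if the input is valid UTF-8.
// Returns null for malformed input, so iconv() is never called on it.
function convertIfValidUtf8(string $string, string $to = 'ISO-8859-1'): ?string
{
    // Validate first, using the /u check from earlier in this answer.
    if (preg_match('//u', $string) !== 1) {
        return null; // malformed UTF-8: skip iconv() and avoid the notice
    }

    // iconv() can still fail (and emit a notice) if a character has no
    // equivalent in the target encoding; that case is reported as null too.
    $converted = iconv('UTF-8', $to, $string);

    return $converted === false ? null : $converted;
}

var_dump(convertIfValidUtf8("caf\xC3\xA9")); // valid input is converted
var_dump(convertIfValidUtf8("caf\xE9"));     // malformed input yields null
```

Returning null instead of a partially converted string leaves the decision about malformed input (reject, log, or repair) to the caller.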

Check the examples on the iconv manual page as well.

You have not shared the source code that triggers the notice. You should add it if you want a more concrete suggestion.

Answered by jishi

The specification of which characters are invalid in UTF-8 is pretty clear. You will probably want to strip those out before trying to parse the data. They shouldn't be there, so if you can avoid them even before generating the XML, that would be even better.

See here for a reference:

http://www.w3.org/TR/xml/#charsets

That isn't a complete list; many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.

However, iconv might have built-in support for this:

http://www.zeitoun.net/articles/clear-invalid-utf8/start

Answered by Robin

You could try using mb_detect_encoding to detect whether you've got a different character set (other than UTF-8), then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
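
One way this suggestion could be sketched is shown below, assuming the mbstring extension is loaded. The function name normalizeToUtf8() and the fallback candidate list are hypothetical; which legacy encodings are plausible depends entirely on where your data comes from.

```php
<?php
// Hypothetical helper: return $string as UTF-8, converting from a guessed legacy
// encoding when it is not already valid UTF-8.
// $fallbacks is an assumed candidate list; adjust it to the data you expect.
function normalizeToUtf8(string $string, array $fallbacks = ['ISO-8859-1']): ?string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already valid UTF-8, nothing to do
    }

    // Strict detection among the fallback candidates only.
    $detected = mb_detect_encoding($string, $fallbacks, true);
    if ($detected === false) {
        return null; // none of the candidates fits
    }

    $converted = mb_convert_encoding($string, 'UTF-8', $detected);

    return $converted === false ? null : $converted;
}

var_dump(normalizeToUtf8("caf\xC3\xA9")); // already UTF-8, returned unchanged
var_dump(normalizeToUtf8("caf\xE9"));     // treated as ISO-8859-1 and converted
```

Detection is only a guess: every byte string is valid ISO-8859-1, so with that candidate the helper never rejects anything, and text in another single-byte encoding would be converted incorrectly without any error.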

Answered by nobody

Put an @ in front of iconv() to suppress the notice, and append //IGNORE after UTF-8 in the source encoding identifier to ignore invalid characters:

@iconv( 'UTF-8//IGNORE', $destinationEncoding, $yourString );