How to detect a malformed UTF-8 string in PHP?
Original source: http://stackoverflow.com/questions/6723562/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by rsk82
iconv function sometimes gives me an error:
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in [...]
Is there a way to detect illegal characters in a UTF-8 string before passing the data to iconv?
Answered by hakre
First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
You can make use of the UTF-8 validity check that is available in preg_match since PHP 4.3.5. It returns 1 only for a valid string; if an invalid string is given, it returns 0 or false (with no additional information), so compare the result against 1:
$isUTF8 = (preg_match('//u', $string) === 1);
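As a minimal sketch of the preg_match check, the one-liner can be wrapped in a helper. The function name is_valid_utf8_pcre is hypothetical and not part of the original answer; the sample byte strings are illustrative only.

```php
<?php
// Hypothetical helper: true only when $string is well-formed UTF-8.
// The empty pattern with the /u modifier matches any valid UTF-8 subject,
// so a result other than 1 means PCRE rejected the bytes.
function is_valid_utf8_pcre(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8_pcre("h\xC3\xA9llo")); // valid two-byte sequence: bool(true)
var_dump(is_valid_utf8_pcre("h\xC3llo"));     // lead byte without continuation: bool(false)
```

Comparing strictly against 1 keeps the check correct whether the failure is reported as 0 or as false.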
Another possibility is mb_check_encoding:
$validUTF8 = mb_check_encoding($string, 'UTF-8');
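The "incomplete multibyte character" notice from the question typically appears when a string has been cut in the middle of a character. The sketch below reproduces that situation with a byte-based substr and shows mb_check_encoding rejecting the result. It assumes the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// The euro sign is three bytes in UTF-8: E2 82 AC.
$euro = "\xE2\x82\xAC";

// substr() counts bytes, so this keeps only the first two bytes
// and leaves an incomplete multibyte character behind.
$truncated = substr($euro, 0, 2);

$fullIsValid      = mb_check_encoding($euro, 'UTF-8');      // true
$truncatedIsValid = mb_check_encoding($truncated, 'UTF-8'); // false

var_dump($fullIsValid, $truncatedIsValid);
```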
Another function you can use is mb_detect_encoding:
$validUTF8 = mb_detect_encoding($string, 'UTF-8', true) !== false;
It's important to set the strict parameter to true.
Additionally, iconv allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notice; this behavior cannot be changed.)
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
You can suppress the notice with @ and compare the length of the returned string with the length of the original:
strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
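The length comparison can be sketched as a small helper. The function name is_valid_utf8_iconv is hypothetical, and the explicit check for a false return value is an addition here, on the assumption that some iconv implementations fail outright instead of dropping the offending bytes. It requires the iconv extension.

```php
<?php
// Hypothetical helper: round-trips the string through iconv with //IGNORE.
// If nothing was dropped and the conversion did not fail, the input was valid.
function is_valid_utf8_iconv(string $string): bool
{
    // @ suppresses the notice that iconv raises on bad input.
    $cleaned = @iconv('UTF-8', 'UTF-8//IGNORE', $string);

    // Assumption: treat a failed conversion (false) as invalid input.
    if ($cleaned === false) {
        return false;
    }

    // Any dropped byte makes the cleaned string shorter than the original.
    return strlen($cleaned) === strlen($string);
}

var_dump(is_valid_utf8_iconv("caf\xC3\xA9")); // well-formed input
var_dump(is_valid_utf8_iconv("caf\xC3"));     // truncated sequence
```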
Check the examples on the iconv manual page as well.
You have not shared the source code that triggers the notice. You should add it if you want a more concrete suggestion.
Answered by jishi
The specification of which characters are invalid in UTF-8 is pretty clear. You probably want to strip those out before trying to parse it. They shouldn't be there, so if you can avoid them even before generating the XML, that would be even better.
See here for a reference:
http://www.w3.org/TR/xml/#charsets
That isn't a complete list. Many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.
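One way to strip such characters is a regular expression built from the Char production in the XML 1.0 specification linked above, which allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The following is only a sketch: the function name strip_invalid_xml_chars is hypothetical, and throwing on malformed UTF-8 is a design choice made here, not something the answer prescribes.

```php
<?php
// Hypothetical helper: removes every code point that the XML 1.0 Char
// production does not allow. The input must already be valid UTF-8.
function strip_invalid_xml_chars(string $string): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $result = preg_replace($pattern, '', $string);

    // preg_replace() returns null on error, e.g. when the subject
    // is not well-formed UTF-8 and the /u modifier is in effect.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $result;
}

// NUL and vertical tab are not allowed in XML 1.0 and get removed.
var_dump(strip_invalid_xml_chars("a\x00b\x0Bc"));
```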
However, iconv might have builtin support for this:
Answered by Robin
You could try using mb_detect_encoding to detect whether you've got a character set other than UTF-8, then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
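The detect-then-convert idea might look like the sketch below. The function name to_utf8 and the candidate list (UTF-8, then ISO-8859-1) are assumptions for illustration; the right candidates depend on where your data comes from, and detection is a guess that can pick the wrong encoding for ambiguous input. It requires the mbstring extension.

```php
<?php
// Hypothetical helper: guesses the encoding from a short candidate list
// and converts to UTF-8 only when something else was detected.
function to_utf8(string $string, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode (third argument true) rejects encodings in which
    // the bytes are not valid.
    $detected = mb_detect_encoding($string, $candidates, true);

    if ($detected === false) {
        throw new InvalidArgumentException('Could not detect the encoding.');
    }

    return $detected === 'UTF-8'
        ? $string
        : mb_convert_encoding($string, 'UTF-8', $detected);
}

// "caf\xE9" is "café" in ISO-8859-1; a lone E9 byte is not valid UTF-8.
var_dump(bin2hex(to_utf8("caf\xE9")));
```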
Answered by nobody
Put an @ in front of iconv() to suppress the notice, and append //IGNORE after UTF-8 in the source encoding ID to ignore invalid characters:
@iconv( 'UTF-8//IGNORE', $destinationEncoding, $yourString );