Original question: http://stackoverflow.com/questions/3715264/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to handle user input of invalid UTF-8 characters?
Asked by philfreo
I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my webapp uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode(), and invalid data overall seems like a bad thing to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back."
- How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
- How do you present the error in a helpful way to the user?
- How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
- For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
EDIT: I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP". I'd like advice from people with experience in real-world situations how they've handled this.
EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.
Answer by Alix Axel
The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way, and crappy form submission bots are a good example.
What I usually do is ignore bad chars, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv(), you also have the option to transliterate bad chars.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users, I'd probably do this globally instead of per received value. Something like this would probably do just fine:
function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
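The array_map() call above only covers flat input; a request such as ?tags[]=... puts a nested array in $_GET. A minimal sketch of a recursive variant follows, assuming the mbstring extension is available. The name utf8_clean_recursive() and the $changed flag are illustrative and not part of the original answer, and array keys are left untouched.

```php
<?php
// Hypothetical helper: strips invalid UTF-8 bytes from every string in a
// possibly nested array and reports through $changed whether anything was altered.
function utf8_clean_recursive(array $input, &$changed = false)
{
    array_walk_recursive($input, function (&$value) use (&$changed) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            // Converting UTF-8 to UTF-8 makes mbstring handle the invalid bytes;
            // the 'none' setting drops them instead of inserting a placeholder.
            $previous = mb_substitute_character();
            mb_substitute_character('none');
            $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
            mb_substitute_character($previous); // restore the global setting
            $changed = true;
        }
    });

    return $input;
}
```

If $changed comes back true, the same error message as above can be shown once for the whole request.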
You may also want to normalize new lines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
Code to convert from UTF-8 to Unicode codepoints:
function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
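On newer PHP versions the same lookup can probably be done without iconv() and unpack(). The sketch below assumes PHP 7.2+ with mbstring, where mb_ord() exists; codepoint_label() is an illustrative name, not something from the answer.

```php
<?php
// Hypothetical alternative to Codepoint(): formats the first character of
// $char as U+XXXX, or returns null when mb_ord() cannot decode it.
function codepoint_label($char)
{
    $ord = mb_ord($char, 'UTF-8'); // code point as int, or false on failure

    return $ord === false ? null : sprintf('U+%04X', $ord);
}
```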
Probably faster than any other alternative, though I haven't tested it extensively.
Example:
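// The special characters were lost in this copy; the escapes below (PHP 7+)
// are reconstructed from the expected output.
$string = "\u{FFFE}hello world\u{FFFD}";
// U+FFFEhello worldU+FFFD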
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}
Is this what you were looking for?
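For the U+FFFD request in the question, two built-in routes may be worth trying; neither is benchmarked here. The sketch assumes the mbstring extension and PHP 7.2+ (for JSON_INVALID_UTF8_SUBSTITUTE). The function names are illustrative and do not come from the answer.

```php
<?php
// Hypothetical helper: replaces every undecodable byte sequence with U+FFFD.
function utf8_substitute($str)
{
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);             // substitute with U+FFFD
    $fixed = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);          // restore the global setting

    return $fixed;
}

// Hypothetical helper: lets json_encode() do the substitution itself,
// so stored data can stay as-is and is only repaired on output.
function json_encode_lenient($value)
{
    return json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);
}
```

The first function changes the data itself; the second only affects the JSON output, which may suit existing database rows that should not be rewritten.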
Answer by Archimedix
Receiving invalid characters from your web app might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You might also want to look at similar questions on StackOverflow for pointers on how to handle invalid characters, e.g. those in the column to the right. However, I think that signaling an error to the user is better than trying to clean up those invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.
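One way to signal an error rather than clean silently is to collect the names of the offending fields and re-render the form with a message. This is only a sketch under that assumption: find_invalid_utf8_fields() is a hypothetical helper, and it relies on mb_check_encoding() from the mbstring extension.

```php
<?php
// Hypothetical helper: returns the names of all fields (including nested
// ones, written as parent[child]) whose value is not valid UTF-8.
function find_invalid_utf8_fields(array $input, $prefix = '')
{
    $bad = array();

    foreach ($input as $key => $value) {
        $name = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';

        if (is_array($value)) {
            $bad = array_merge($bad, find_invalid_utf8_fields($value, $name));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $name;
        }
    }

    return $bad;
}
```

A caller could run this on $_POST and, when the result is not empty, redisplay the form with the listed fields highlighted instead of saving anything.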
Answer by Nev Stokes
I put together a fairly simple class to check whether input is UTF-8 and to run it through utf8_encode() as needed:
class utf8
{
    /**
     * Recursively converts every value that is not valid UTF-8.
     *
     * @param array $data
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key => $val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val);
            } elseif (is_string($val) && !self::check($val)) {
                // utf8_encode() assumes the input is ISO-8859-1
                $data[$key] = utf8_encode($val);
            }
        }
        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     *
     * RFC3629
     *
     * @param string $string The string to be tested
     * @return bool
     *
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
        // preg_match() returns 1, 0 or false, so compare against 1 to get a bool
        return 1 === preg_match('%^(?:
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$%xs',
        $string);
    }
}

// For example
$data = utf8::encode($_POST);
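A shorter validity check is sometimes used in place of the long pattern: with the /u modifier, PCRE validates the subject string before matching. The sketch below relies on that behaviour; is_valid_utf8() is an illustrative name. Unlike check() above, it does not reject ASCII control characters other than tab, LF and CR.

```php
<?php
// Hypothetical helper: true when $string is well-formed UTF-8.
function is_valid_utf8($string)
{
    // The empty pattern always matches, so the result is 1 for valid input.
    // For invalid UTF-8 the /u modifier makes preg_match() return false.
    return preg_match('//u', $string) === 1;
}
```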
Answer by philfreo
For completeness to this question (not necessarily the best answer)...
function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}
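mb_detect_encoding() can return false when it cannot decide, which would then be passed on to mb_convert_encoding(). A variant that avoids guessing is sketched below: it leaves valid UTF-8 alone and converts everything else from a single fallback encoding. The choice of Windows-1252 is purely an assumption about where legacy data came from, and to_utf8() is an illustrative name.

```php
<?php
// Hypothetical helper: $fallback is an assumption about the legacy encoding.
function to_utf8($s, $fallback = 'Windows-1252')
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }

    return mb_convert_encoding($s, 'UTF-8', $fallback);
}
```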
Answer by Geekster
I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down. Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters. The data you store in your database then is data triggered by the user, but not actually user-supplied data.
EDIT #4: Replacing bad character with entity: �
EDIT #3: Updated: Sept 22 2010 @ 1:32pm. Reason: the returned string is now UTF-8, plus I used the test file you provided as proof.
<?php
// build alphabet
// optionally you can remove characters from this array
$alpha[]= chr(0); // null
$alpha[]= chr(9); // tab
$alpha[]= chr(10); // new line
$alpha[]= chr(11); // vertical tab
$alpha[]= chr(13); // carriage return
for ($i = 32; $i <= 126; $i++) {
$alpha[]= chr($i);
}
/* remove comment to check ascii ordinals */
// /*
// foreach ($alpha as $key=>$val){
// print ord($val);
// print '<br/>';
// }
// print '<hr/>';
//*/
//
// //test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv '.chr(160).chr(127).chr(126);
//
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
//
// //test case #2
//
// $str = ''.'??????';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
//
// $str = '?';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10),file($file));
$string = teststr($alpha,$testfile);
print $string;
print '<hr/>';
function teststr(&$alpha, &$str){
    $strlen = strlen($str);
    $newstr = chr(0); //null
    $x = 0;
    if($strlen >= 2){
        for ($i = 0; $i < $strlen; $i++) {
            $x++;
            if(in_array($str[$i],$alpha)){
                // passed
                $newstr .= $str[$i];
            }else{
                // failed
                print 'Found out of scope character. (ASCII: '.ord($str[$i]).')';
                print '<br/>';
                $newstr .= '�';
            }
        }
    }elseif($strlen <= 0){
        // failed to qualify for test
        print 'Non-existent.';
    }elseif($strlen === 1){
        $x++;
        if(in_array($str,$alpha)){
            // passed
            $newstr = $str;
        }else{
            // failed
            print 'Total character failed to qualify.';
            $newstr = '�';
        }
    }else{
        print 'Non-existent (scope).';
    }
    if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8"){
        // skip
    }else{
        $newstr = utf8_encode($newstr);
    }
    // test encoding:
    if(mb_detect_encoding($newstr, "UTF-8")=="UTF-8"){
        print 'UTF-8 :D<br/>';
    }else{
        print 'ENCODED: '.mb_detect_encoding($newstr, "UTF-8").'<br/>';
    }
    return $newstr.' (scope: '.$x.', '.$strlen.')';
}
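The byte-by-byte walk over the alphabet can probably be condensed into a single byte-level regular expression. The sketch below uses the same alphabet as the $alpha array (NUL, tab, LF, vertical tab, CR and printable ASCII); keep_alphabet() and the '?' placeholder are illustrative choices, not the answer's code.

```php
<?php
// Hypothetical helper: replaces every byte outside the allowed alphabet.
function keep_alphabet($str, $replacement = '?')
{
    // No /u modifier, so the pattern works on raw bytes like the loop does;
    // each byte of a multibyte character is therefore replaced separately.
    return preg_replace('/[^\x00\x09\x0A\x0B\x0D\x20-\x7E]/', $replacement, $str);
}
```

Note that this approach, like the original, rejects all non-ASCII text, including valid UTF-8.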
Answer by Otar
There is a multibyte extension for PHP, check it out: http://www.php.net/manual/en/book.mbstring.php
You should try the mb_check_encoding() function.
Good luck!
Answer by Elzo Valugi
How about stripping all chars outside your given subset? At least in some parts of my application I would not allow chars outside the [a-zA-Z] and [0-9] sets, for example in usernames. You can build a filter function that silently strips all chars outside this range, or one that returns an error if it detects them and pushes the decision to the user.
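Both variants of such a filter might look like the sketch below, one stripping silently and one only reporting. The function names are illustrative, and the allowed set is the ASCII letters and digits mentioned above.

```php
<?php
// Hypothetical filter: silently drops every byte that is not an ASCII letter or digit.
function strip_to_alnum($s)
{
    return preg_replace('/[^A-Za-z0-9]/', '', $s);
}

// Hypothetical validator: true only for non-empty, purely alphanumeric input,
// so the caller can show an error instead of changing the value.
function is_alnum_only($s)
{
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[A-Za-z0-9]+$/D', $s) === 1;
}
```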
Answer by yfeldblum
Try doing what Rails does to force all browsers to always post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
- To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
- To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
- To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is IE and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ which can only be from the Unicode charset (and, in this example, not the Korean charset).
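The answer does not say what the server does with the hidden field. One possible use, offered only as an assumption, is to compare it against the expected bytes as a cheap signal that the submission really arrived as UTF-8. form_was_utf8() is a hypothetical name; the field name utf8 matches the form above.

```php
<?php
// Hypothetical check: true when the hidden "utf8" field arrived intact.
function form_was_utf8(array $post)
{
    // U+2713 CHECK MARK encoded as UTF-8 is the byte sequence E2 9C 93.
    return isset($post['utf8']) && $post['utf8'] === "\xE2\x9C\x93";
}
```

A handler could call form_was_utf8($_POST) first and fall back to an error message or a conversion step when it returns false.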
Answer by Mr. Nobody
Set UTF-8 as the character set for all headers output by your PHP code
In every PHP output header, specify UTF-8 as the encoding:
header('Content-Type: text/html; charset=utf-8');