PHP: Remove non-UTF-8 characters from a string
Note: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/1401317/
Remove non-utf8 characters from string
Asked by Dan Sosedoff
I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).
What is the best way to remove them? A regular expression or something else?
Accepted answer by Markus Jarderot
Using a regex approach:
$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one or more times
  )
| .                                 # anything else
/x
END;
// Replace every match with group 1: valid runs are kept, stray bytes become "".
$text = preg_replace($regex, '$1', $text);
It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
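A minimal usage sketch of the pattern above, applied to the bytes from the question. The function name strip_invalid_utf8 is an assumption made for illustration, and the pattern is written as a plain single-quoted string instead of a nowdoc; the regex itself is the one from the answer.

```php
<?php
// Hypothetical helper name; it only wraps the regex shown in the answer.
function strip_invalid_utf8(string $text): string
{
    $regex = '/
      (
        (?: [\x00-\x7F]                 # single-byte sequences
        |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences
        |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences
        |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequences
        ){1,100}                        # one or more times
      )
    | .                                 # anything else
    /x';
    // "$1" keeps what group 1 captured; for a stray byte group 1 is empty.
    return preg_replace($regex, '$1', $text) ?? '';
}

$dirty = "\x97\x61\x6C\x6F";               // 0x97 followed by "alo"
$clean = strip_invalid_utf8($dirty);
echo bin2hex($clean), "\n";                // 616c6f
var_dump(preg_match('//u', $clean) === 1); // true when $clean is valid UTF-8
```

The empty pattern with the /u modifier is used here only as a quick validity check, since preg_match refuses subjects that are not valid UTF-8 in that mode.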
It is possible to repair the string by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.
$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one or more times
  )
| ( [\x80-\xBF] )                   # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                   # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    }
    elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
    }
    else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
    }
}
$text = preg_replace_callback($regex, "utf8replacer", $text);
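The byte arithmetic in the callback can be checked in isolation. The sketch below uses a hypothetical helper, latin1_byte_to_utf8 (not part of the answer), that applies the same two rules to a single byte. It treats the stray byte as an ISO-8859-1 character, which is an assumption about where the bad bytes came from.

```php
<?php
// Hypothetical helper: re-encode one stray byte as a two-byte UTF-8 sequence,
// using the same rules as the callback in the answer.
function latin1_byte_to_utf8(string $byte): string
{
    $code = ord($byte);
    if ($code < 0x80) {
        return $byte;                    // ASCII is already valid UTF-8
    }
    if ($code < 0xC0) {
        return "\xC2" . $byte;           // 10xxxxxx -> 11000010 10xxxxxx
    }
    return "\xC3" . chr($code - 64);     // 11xxxxxx -> 11000011 10xxxxxx
}

echo bin2hex(latin1_byte_to_utf8("\xE9")), "\n"; // c3a9 (U+00E9)
echo bin2hex(latin1_byte_to_utf8("\x97")), "\n"; // c297 (U+0097)
```

Note that 0x97 maps to the control character U+0097 under this rule; if the data is really Windows-1252, that byte was meant as a dash, which is one source of the "strange symbols" the answer warns about.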
EDIT:
!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".
x != "" seems the best one to use in this case.
I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.
Answer by Sebastián Grignoli
If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing those encodings in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($mixed_string);
$latin1_string = Encoding::toLatin1($mixed_string);
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("F??d??ration Camerounaise de Football");
echo Encoding::fixUTF8("F???d???ration Camerounaise de Football");
echo Encoding::fixUTF8("F?????d?????ration Camerounaise de Football");
echo Encoding::fixUTF8("F???dération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Answer by Frosty Z
You can use mbstring:
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
...will remove invalid characters.
See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
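One hedged caveat: with mbstring's default settings, invalid bytes are typically replaced by a substitute character (a question mark) rather than dropped. A minimal sketch, assuming the mbstring extension is loaded, that sets the substitute character to "none" for the duration of the call so the bytes are removed. The wrapper name drop_invalid_utf8 is made up for this example.

```php
<?php
// Hypothetical wrapper around mb_convert_encoding(); requires ext-mbstring.
function drop_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid bytes, do not substitute
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the caller's setting
    return $clean;
}

$clean = drop_invalid_utf8("\x97alo");
echo bin2hex($clean), "\n";
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Restoring the previous setting keeps the change local, since mb_substitute_character() affects every later mbstring conversion in the request.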
Answer by David D
This function removes all non-ASCII characters; it is useful, but it does not solve the question.
This is my function that always works, regardless of encoding:
function remove_bs($Str) {
    $StrArr = str_split($Str); $NewStr = '';
    foreach ($StrArr as $Char) {
        $CharNo = ord($Char);
        if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £
        if ($CharNo > 31 && $CharNo < 127) {
            $NewStr .= $Char;
        }
    }
    return $NewStr;
}
How it works:
echo remove_bs('Hello ?how? ?are you??'); // Hello how are you?
Answer by Znarkus
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);
This is what I am using. Seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/
Answer by technoarya
Try this:
$string = iconv("UTF-8","UTF-8//IGNORE",$string);
According to the iconv manual, the function will take the first parameter as the input charset, second parameter as the output charset, and the third as the actual input string.
If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters in the input string that can't be represented by the output charset. In effect, this filters the input string.
Answer by HTML5 developer
The text may contain non-UTF-8 characters. Try doing this first:
$nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');
You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php
Answer by masakielastic
UConverter can be used since PHP 5.5. UConverter is the better choice if you use the intl extension and don't use mbstring.
function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}
function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}
htmlspecialchars can be used to replace invalid byte sequences since PHP 5.4 (with ENT_SUBSTITUTE, each one becomes U+FFFD). htmlspecialchars is better than preg_match at handling large byte strings, and it is more accurate. Many incorrect implementations based on regular expressions can be found.
function replace_invalid_byte_sequence3($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
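To make the effect concrete, the sketch below runs the htmlspecialchars round trip on a string with one stray byte. With ENT_SUBSTITUTE the invalid byte is replaced by U+FFFD (bytes EF BF BD) rather than removed. The function name and the sample input are made up for illustration.

```php
<?php
// Same round trip as replace_invalid_byte_sequence3, under a hypothetical name.
function replace_invalid_bytes(string $str): string
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

// "a", one stray byte 0x97, "b"
echo bin2hex(replace_invalid_bytes("a\x97b")), "\n"; // 61efbfbd62
```

If the bytes should disappear instead of turning into replacement characters, a str_replace of "\xEF\xBF\xBD" afterwards is one option, at the cost of also removing any U+FFFD that was legitimately in the input.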
Answer by mumin
I have made a function that deletes invalid UTF-8 characters from a string. I'm using it to clean the descriptions of 27,000 products before generating the XML export file.
// Method excerpt: place it inside a class, or drop "public" to use it standalone.
public function stripInvalidXml($value) {
    $ret = "";
    if ($value === null || $value === "") {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        // ord() returns a single byte (0-255), so the tests above 0xFF never match:
        // in effect this drops control bytes other than tab, LF and CR.
        $current = ord($value[$i]);
        if (($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        }
    }
    return $ret;
}
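Because ord() only ever sees one byte, the range tests against 0xD7FF and above cannot match on code points. A hedged sketch of a code-point-aware variant, assuming the mbstring extension is available: it first drops invalid UTF-8, then uses a /u regex with the same ranges (the XML 1.0 character ranges). The function name strip_invalid_xml_chars is hypothetical, and this is not the answer author's code.

```php
<?php
// Hypothetical code-point-aware variant; requires ext-mbstring.
function strip_invalid_xml_chars(string $value): string
{
    // Step 1: drop byte sequences that are not valid UTF-8.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $utf8 = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    // Step 2: drop code points outside the XML 1.0 character ranges.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    return preg_replace($pattern, '', $utf8) ?? '';
}

// NUL and the stray 0x97 are removed; "é" (C3 A9) is kept.
echo bin2hex(strip_invalid_xml_chars("a\x00b\x97c\xC3\xA9")), "\n";
```

Doing the UTF-8 cleanup first matters because preg_replace with the /u modifier returns null when the subject is not valid UTF-8.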
Answer by clarkk
Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.
If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.
This method will:
- Remove all invalid UTF-8 multibyte chars with mb_convert_encoding
- Remove all non-printable chars like \r, \x00 (NULL byte) and other control chars with preg_replace
Method:
function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}
[:print:] matches all printable chars; the pattern keeps those and \n newlines and strips everything else.
In the ASCII table, the printable chars range from 32 to 126, but newline \n is one of the control chars, which range from 0 to 31, so we have to add newline to the regex: /[^[:print:]\n]/u
You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped.
function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}
$arr = [
    'Danish chars' => 'Hello from Denmark with æøå',
    'Non-printable chars' => "\x7FHello with invalid chars\r \x00"
];
foreach($arr as $k => $v){
    echo "$k:\n---------\n";
    $len = strlen($v);
    echo "$v\n(".$len.")\n";
    $strip = utf8_decode(utf8_filter(utf8_encode($v)));
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";
    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}


