PHP: remove non-UTF-8 characters from a string

Question

Asked by Dan Sosedoff

I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The bytes look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression or something else?

Answer 1

Accepted answer by Markus Jarderot

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
    ){1,100}                        # ...1 to 100 times per match
  )
| .                                 # anything else
/x
END;
// Replace each match with group 1, so valid runs are kept and stray bytes are dropped.
$text = preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
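
As a minimal sketch of the stripping variant, the pattern can be wrapped in a helper and tried on the bytes from the question. The function name strip_invalid_utf8_regex is hypothetical, and the mb_check_encoding() calls assume the mbstring extension is loaded. Note that the pattern only checks the lead-byte and continuation-byte structure, so an overlong form such as \xC0\x80 passes through unchanged.

```php
<?php
// Hypothetical helper wrapping the answer's pattern; not part of any library.
function strip_invalid_utf8_regex(string $text): string
{
    $regex = <<<'END'
/
  (
    (?: [\x00-\x7F]
    |   [\xC0-\xDF][\x80-\xBF]
    |   [\xE0-\xEF][\x80-\xBF]{2}
    |   [\xF0-\xF7][\x80-\xBF]{3}
    ){1,100}
  )
| .
/x
END;
    // '$1' keeps the captured valid run; a stray byte matched by "." is dropped.
    return (string) preg_replace($regex, '$1', $text);
}

$clean = strip_invalid_utf8_regex("\x97alo");
echo bin2hex($clean), "\n";                   // 616c6f
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)

// Structurally well-formed but overlong: the pattern lets it through,
// while mb_check_encoding() rejects it.
$overlong = "\xC0\x80";
var_dump(strip_invalid_utf8_regex($overlong) === $overlong); // bool(true)
var_dump(mb_check_encoding($overlong, 'UTF-8'));             // bool(false)
```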

It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
    ){1,100}                      # ...1 to 100 times per match
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
  if ($captures[1] != "") {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif ($captures[2] != "") {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "\xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx.
    return "\xC3".chr(ord($captures[3])-64);
  }
}
// preg_replace_callback() returns the new string; it does not modify $text in place.
$text = preg_replace_callback($regex, "utf8replacer", $text);
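
The two encoding branches in the callback are the same mapping as converting a single ISO-8859-1 byte to UTF-8. A small sketch of just that arithmetic, with a hypothetical helper name that is not part of the answer:

```php
<?php
// Hypothetical helper: re-encode one byte as if it were ISO-8859-1.
function latin1_byte_to_utf8(string $byte): string
{
    $code = ord($byte);
    if ($code < 0x80) {
        return $byte;                 // ASCII is already valid UTF-8
    }
    if ($code < 0xC0) {
        return "\xC2" . $byte;        // 10xxxxxx -> 11000010 10xxxxxx
    }
    return "\xC3" . chr($code - 64);  // 11xxxxxx -> 11000011 10xxxxxx
}

echo bin2hex(latin1_byte_to_utf8("\x97")), "\n"; // c297
echo bin2hex(latin1_byte_to_utf8("\xE9")), "\n"; // c3a9
```

This is why the repaired text can show odd symbols: a stray byte is assumed to be Latin-1, whether or not that was its real origin.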

EDIT:

!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".

x != "" seems to be the best one to use in this case.
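
The difference matters when a captured run happens to be the single character "0". A quick check of the three comparisons, written here as a standalone sketch:

```php
<?php
var_dump(!empty("0"));  // bool(false): "0" counts as empty
var_dump("0" != "");    // bool(true)
var_dump("0" !== "");   // bool(true)
var_dump(null != "");   // bool(false): null loosely equals ""
var_dump(null !== "");  // bool(true): strict comparison also checks the type
```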

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

Answer 2

Answer by Sebastián Grignoli

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
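
The double-encoding effect can be reproduced in a few lines. utf8_encode() converts from ISO-8859-1 to UTF-8 and is deprecated as of PHP 8.2, so this sketch uses mb_convert_encoding() as a stand-in for the same conversion (assuming the mbstring extension is loaded):

```php
<?php
$utf8 = "\xC3\xA9"; // "é" encoded as UTF-8

// Treating those two bytes as ISO-8859-1 and converting again double-encodes them.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c2a9

// Converting back once undoes the damage.
$fixed = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($fixed === $utf8); // bool(true)
```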

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a data feed that was all messed up, mixing those encodings in the same string.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which will fix any UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("F??d??ration Camerounaise de Football");
echo Encoding::fixUTF8("F???d???ration Camerounaise de Football");
echo Encoding::fixUTF8("F?????d?????ration Camerounaise de Football");
echo Encoding::fixUTF8("F???dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8

Answer 3

Answer by Frosty Z

You can use mbstring:

$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

...will remove or replace invalid byte sequences, depending on the mbstring substitute character (by default they are replaced with "?").

See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
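
A minimal sketch of the mbstring approach that drops invalid bytes instead of substituting them, by temporarily setting the substitute character to 'none'. The helper name is hypothetical, and the mbstring extension is assumed to be loaded:

```php
<?php
// Hypothetical helper; restores the previous substitute character afterwards.
function strip_invalid_utf8_mb(string $text): string
{
    $previous = mb_substitute_character();   // current setting (code point or keyword)
    mb_substitute_character('none');         // drop invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);
    return $clean;
}

echo bin2hex(strip_invalid_utf8_mb("\x97alo")), "\n"; // 616c6f
```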

Answer 4

Answer by David D

This function removes all non-ASCII characters. It's useful, but it does not solve the question.
This is my function that always works, regardless of encoding:

function remove_bs($Str) {  
  $StrArr = str_split($Str); $NewStr = '';
  foreach ($StrArr as $Char) {    
    $CharNo = ord($Char);
    if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £ (byte 0xA3 in Latin-1)
    if ($CharNo > 31 && $CharNo < 127) {
      $NewStr .= $Char;    
    }
  }  
  return $NewStr;
}

How it works:

echo remove_bs('Hello ?how? ?are you??'); // Hello how are you?

Answer 5

Answer by Znarkus

$text = iconv("UTF-8", "UTF-8//IGNORE", $text);

This is what I am using. Seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/

Answer 6

Answer by technoarya

Try this:

$string = iconv("UTF-8","UTF-8//IGNORE",$string);

According to the iconv manual, the function takes the input charset as its first parameter, the output charset as the second, and the actual input string as the third.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters in the input string that can't be represented in the output charset. In effect, this filters the input string.
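
iconv() returns false on failure, and how //IGNORE behaves depends on the underlying iconv library, so it may be worth checking the return value rather than assigning it blindly. A hedged sketch with a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: returns null when iconv() reports a failure.
function strip_invalid_utf8_iconv(string $text): ?string
{
    // "@" silences the notice or warning iconv() may raise on bad input.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $text);
    return $clean === false ? null : $clean;
}

$result = strip_invalid_utf8_iconv("\x97alo");
if ($result === null) {
    echo "iconv could not filter the string on this platform\n";
} else {
    echo bin2hex($result), "\n";
}
```

Returning null keeps the failure visible to the caller, who can then fall back to one of the other approaches.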

Answer 7

Answer by HTML5 developer

The text may contain non-UTF-8 characters. Try this first:

$nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');

You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php

Answer 8

Answer by masakielastic

UConverter can be used since PHP 5.5. UConverter is the better choice if you use the intl extension and don't use mbstring.

function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

htmlspecialchars can be used to replace invalid byte sequences since PHP 5.4. htmlspecialchars is better than preg_match for handling large inputs and for accuracy. Many incorrect implementations based on regular expressions can be found.

function replace_invalid_byte_sequence3($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
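
A short sketch of what the htmlspecialchars round trip does to the bytes from the question. With ENT_SUBSTITUTE, an invalid sequence is replaced by U+FFFD (the Unicode replacement character) rather than removed:

```php
<?php
$dirty = "\x97alo";

// Encode with ENT_SUBSTITUTE, then decode again to undo the entity escaping.
$clean = htmlspecialchars_decode(htmlspecialchars($dirty, ENT_SUBSTITUTE, 'UTF-8'));

echo bin2hex($clean), "\n"; // efbfbd616c6f (U+FFFD followed by "alo")
```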

Answer 9

Answer by mumin

I have made a function that deletes invalid UTF-8 characters from a string. I'm using it to clean the descriptions of 27000 products before generating the XML export file.

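public function stripInvalidXml($value) {
    $ret = "";
    // Compare explicitly: empty("0") is true and would wrongly discard "0".
    if ($value === null || $value === "") {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        // $value{$i} was removed in PHP 8; use square brackets.
        // ord() returns one byte (0-255), so the ranges above 0xFF below never
        // match: per byte this keeps tab, LF, CR and everything from 0x20 up.
        $current = ord($value[$i]);
        if (($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        }
    }
    return $ret;
}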
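
The ranges in the condition are the ones allowed by the Char production of XML 1.0, but they describe code points, not bytes. A sketch of a code-point-aware variant using a /u pattern follows; the function name is hypothetical, and it assumes the input is already valid UTF-8, since preg_replace() returns null otherwise.

```php
<?php
// Hypothetical helper: removes code points outside the XML 1.0 Char ranges.
// Returns null if $value is not valid UTF-8 (preg_replace() fails with /u).
function strip_invalid_xml_chars(string $value): ?string
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
}

// NUL and vertical tab are removed; tab and multibyte characters are kept.
echo bin2hex(strip_invalid_xml_chars("a\x00b\x0Bc\td")), "\n"; // 6162630964
var_dump(strip_invalid_xml_chars("\x97alo"));                  // NULL
```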

Answer 10

Answer by clarkk

Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.

If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.

This method will:

Remove all invalid UTF-8 multibyte chars with mb_convert_encoding
Remove all non-printable chars like \r, \x00 (NULL byte) and other control chars with preg_replace

Method:

function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

[:print:] matches all printable chars; the negated class [^[:print:]\n] keeps those and \n newlines and strips everything else.

In the ASCII table, the printable chars range from 32 to 126, but newline \n is one of the control chars, which range from 0 to 31, so we have to add the newline to the regex /[^[:print:]\n]/u.

You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped.

function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

$arr = [
    'Danish chars'          => 'Hello from Denmark with æøå',
    'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
];

foreach($arr as $k => $v){
    echo "$k:\n---------\n";

    $len = strlen($v);
    echo "$v\n(".$len.")\n";

    $strip = utf8_decode(utf8_filter(utf8_encode($v)));
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";

    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}

https://www.tehplayground.com/q5sJ3FOddhv1atpR
