Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/12837682/

Time: 2020-08-25 04:19:40  Source: igfitidea

non-breaking utf-8 0xc2a0 space and preg_replace strange behaviour

php regex

Asked by DamirR

In my string I have a UTF-8 non-breaking space (0xc2a0) and I want to replace it with something else.

When I use

$str=preg_replace('~\xc2\xa0~', 'X', $str);

it works OK.

But when I use

$str=preg_replace('~\x{C2A0}~siu', 'W', $str);

the non-breaking space is not found (and so not replaced).

Why? What is wrong with the second regexp?

The format \x{C2A0} is correct, and I also used the u flag.

Answered by Newbo.O

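Actually, the documentation about escape sequences in PHP is wrong. When you use the \xc2\xa0 syntax, the pattern searches for the two bytes of the UTF-8 encoded character. But with the \x{c2a0} syntax and the u flag, the value in braces is treated as a Unicode code point (here U+C2A0), which is then converted to its own UTF-8 encoded character.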

A non-breaking space is U+00A0 (Unicode) but is encoded as C2 A0 in UTF-8. So if you try the pattern ~\x{00a0}~siu, it will work as expected.
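
To make the difference concrete, here is a minimal sketch (the variable names are only for illustration, and it assumes PHP 7 or later for the "\u{...}" string escape). It runs the byte pattern, the correct code-point pattern, and the mistaken \x{c2a0} pattern against the same string:

```php
<?php
// NBSP as raw UTF-8 bytes (U+00A0 encodes to 0xC2 0xA0).
$nbsp = "\xc2\xa0";
$str  = "a{$nbsp}b";

// Byte-level pattern, no u flag: matches the two bytes C2 A0.
$byBytes = preg_replace('~\xc2\xa0~', 'X', $str);

// Code-point pattern with the u flag: \x{00a0} means U+00A0.
$byCodePoint = preg_replace('~\x{00a0}~u', 'X', $str);

// \x{c2a0} with the u flag means U+C2A0, a different character,
// so nothing in $str matches and the string comes back unchanged.
$wrongCodePoint = preg_replace('~\x{c2a0}~u', 'X', $str);

echo $byBytes, "\n";                 // aXb
echo $byCodePoint, "\n";             // aXb
echo bin2hex($wrongCodePoint), "\n"; // 61c2a062 (NBSP still there)
echo bin2hex("\u{C2A0}"), "\n";      // ec8aa0 (UTF-8 bytes of U+C2A0)
```

The last line shows why the question's second pattern finds nothing: with the u flag, \x{c2a0} looks for the bytes EC 8A A0, not C2 A0.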

Answered by hugsbrugs

I've aggregated the previous answers so people can just copy/paste the following code and choose their favorite method:

$some_text_with_non_breaking_spaces = "some\xc2\xa0text\xc2\xa0with 2 non breaking spaces at the beginning";
echo 'Qty non-breaking space : ' . substr_count($some_text_with_non_breaking_spaces, "\xc2\xa0") . '<br>';
echo $some_text_with_non_breaking_spaces . '<br>';

# Method 1 : regular expression
$clean_text = preg_replace('~\x{00a0}~siu', ' ', $some_text_with_non_breaking_spaces);

# Method 2 : convert to hex -> replace -> convert back to binary
$clean_text = hex2bin(str_replace('c2a0', '20', bin2hex($some_text_with_non_breaking_spaces)));

# Method 3 : my favorite
$clean_text = str_replace("\xc2\xa0", " ", $some_text_with_non_breaking_spaces);

echo 'Qty non-breaking space : ' . substr_count($clean_text, "\xc2\xa0"). '<br>';
echo $clean_text . '<br>';

Answered by DThought

The two patterns do different things in my opinion: the first, \xc2\xa0, will replace TWO characters (bytes), \xc2 and \xa0, with the replacement.

In UTF-8 encoding, this byte pair happens to be the encoding of the code point U+00A0.

Does \x{00A0} work? This should be the code-point representation of \xc2\xa0.

Answered by Pali

The variant ~\x{c2a0}~siu did not work for me.

The variant \x{00A0} works. I have not tried the second option and here is the result:

I tried to convert it to hex and replace the no-break space 0xC2 0xA0 (c2a0) with a space 0x20 (20).

Code:

$hex = bin2hex($item);
$_item = str_replace('c2a0', '20', $hex);
$item = hex2bin($_item);
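
One caveat with the hex round trip, offered as an observation rather than as part of the answer above: str_replace on the hex string does not know where byte boundaries are, so 'c2a0' can match across two unrelated bytes. A small sketch with a hypothetical ASCII input that contains no non-breaking space at all:

```php
<?php
// Hypothetical input: plain ASCII, no non-breaking space anywhere.
$text = "call*\n";                 // bytes: 63 61 6c 6c 2a 0a

// Hex round trip: "63616c6c2a0a" contains "c2a0" at an odd offset,
// straddling the bytes 6c 2a 0a, so the replace corrupts the text.
$hex    = bin2hex($text);
$viaHex = hex2bin(str_replace('c2a0', '20', $hex));

// Byte-level replacement only matches the real two-byte sequence.
$viaBytes = str_replace("\xc2\xa0", ' ', $text);

var_dump($viaHex === $text);       // bool(false): now "calb\n"
var_dump($viaBytes === $text);     // bool(true)
```

For this reason the plain str_replace("\xc2\xa0", ...) form is probably the safer choice when the input is arbitrary text.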

Answered by EllisGL

/\x{00A0}/, /\xC2\xA0/ and the hex2bin-str_replace-bin2hex approach both worked and didn't work for me: if I printed the result to the screen it was all good, but if I tried to save it to a file, the file would be blank!

I ended up using iconv('UTF-8', 'ISO-8859-1//IGNORE', $str);
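
Note that this call does not remove the non-breaking space; it re-encodes the whole string. A minimal sketch, assuming the iconv extension is available and that the input contains only characters that exist in ISO-8859-1 (how //IGNORE treats other characters depends on the underlying iconv library, so that case is left out here):

```php
<?php
// UTF-8 input: "a", NBSP (C2 A0), "b".
$str = "a\xc2\xa0b";

// Same call as in the answer: re-encode UTF-8 as ISO-8859-1.
$latin1 = iconv('UTF-8', 'ISO-8859-1//IGNORE', $str);

// In ISO-8859-1 the NBSP is the single byte A0, so it is still present.
echo bin2hex($latin1), "\n";       // 61a062

// It can then be replaced as one byte (the result is Latin-1, not UTF-8).
$clean = str_replace("\xa0", ' ', $latin1);
echo $clean, "\n";                 // a b
```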