preg_match and UTF-8 in PHP
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1725227/
Asked by JW.
I'm trying to search a UTF8-encoded string using preg_match.
preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];
This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems that the subject is not being treated as a UTF-8-encoded string, even though I'm passing the "u" modifier in the regular expression.
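The discrepancy is easy to see by comparing the byte-oriented and character-oriented views of the same string. The following is a minimal sketch, assuming the mbstring extension is loaded and that mbstring.func_overload is not in effect (otherwise strlen() would itself count characters). The variable names are illustrative only.

```php
<?php
// "\xC2\xA1" is the two-byte UTF-8 encoding of "¡" (U+00A1).
$str = "\xC2\xA1Hola!";

$byteLength = strlen($str);             // counts bytes
$charLength = mb_strlen($str, 'UTF-8'); // counts UTF-8 characters

preg_match('/H/u', $str, $matches, PREG_OFFSET_CAPTURE);
$byteOffset = $matches[0][1];           // offset as reported by preg_match

// prints "bytes: 7, characters: 6, offset of H: 2"
echo "bytes: $byteLength, characters: $charLength, offset of H: $byteOffset\n";
```

The reported offset of 2 is the number of bytes before "H", not the number of characters.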
I have the following settings in my php.ini, and other UTF8 functions are working:
mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off
Any ideas?
Accepted answer by user187291
Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391
The 'u' switch only matters to PCRE; PHP itself is unaware of it.
From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I don't say "correct").
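One consequence of this is that the reported offset can be passed straight to byte-oriented functions such as substr() and strlen(), whereas mb_substr() expects a character offset. A small sketch of the difference, again assuming mbstring.func_overload is disabled:

```php
<?php
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $matches, PREG_OFFSET_CAPTURE);
[$text, $byteOffset] = $matches[0];

// substr() works on bytes, so the byte offset lands exactly on the match.
$viaSubstr = substr($str, $byteOffset, strlen($text));

// mb_substr() works on characters, so the same number points one character too far.
$viaMbSubstr = mb_substr($str, $byteOffset, mb_strlen($text, 'UTF-8'), 'UTF-8');

echo $viaSubstr, "\n";   // prints "H"
echo $viaMbSubstr, "\n"; // prints "o"
```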
Answer by Gumbo
Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.
You can use mb_strlen to get the length in UTF-8 characters rather than bytes:
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
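The same conversion can be wrapped in a small helper. This is only a sketch: the function name byteOffsetToCharOffset is made up for illustration, and the encoding is passed explicitly so the result does not depend on mb_internal_encoding(). It assumes the subject is valid UTF-8 and that mbstring.func_overload is off, since an overloaded substr() would no longer cut by bytes.

```php
<?php
/**
 * Hypothetical helper: converts a byte offset into a UTF-8 character offset.
 */
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // Count the characters contained in the bytes that precede the offset.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $matches, PREG_OFFSET_CAPTURE);
$charOffset = byteOffsetToCharOffset($str, $matches[0][1]);

echo $charOffset, "\n"; // prints 1
```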
Answer by Natxet
Try adding (*UTF8) at the start of the regex, inside the delimiters:
preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
Magic, thanks to a comment in http://www.php.net/manual/es/function.preg-match.php#95828
Answer by Guy Fawkes
Excuse me for necroposting, but maybe somebody will find this useful: the code below can replace both preg_match and preg_match_all, and returns the correct matches with the correct offsets for UTF-8-encoded strings.
mb_internal_encoding('UTF-8');
/**
 * Returns array of matches in same format as preg_match or preg_match_all
 * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
 * @param string $pattern  The pattern to search for, as a string.
 * @param string $subject  The input string.
 * @param int    $offset   The place from which to start the search (in bytes).
 * @return array
 */
function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
{
    $matchInfo = array();
    $method    = 'preg_match';
    $flag      = PREG_OFFSET_CAPTURE;
    if ($matchAll) {
        $method .= '_all';
    }
    $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
    $result = array();
    if ($n !== 0 && !empty($matchInfo)) {
        if (!$matchAll) {
            $matchInfo = array($matchInfo);
        }
        foreach ($matchInfo as $matches) {
            $positions = array();
            foreach ($matches as $match) {
                $matchedText   = $match[0];
                $matchedLength = $match[1];
                $positions[]   = array(
                    $matchedText,
                    mb_strlen(mb_strcut($subject, 0, $matchedLength))
                );
            }
            $result[] = $positions;
        }
        if (!$matchAll) {
            $result = $result[0];
        }
    }
    return $result;
}
$s1 = 'Попробуем русскую строку для теста';
$s2 = 'Try english string for test';
var_dump(pregMatchCapture(true, '/обу/', $s1));
var_dump(pregMatchCapture(false, '/обу/', $s1));
var_dump(pregMatchCapture(true, '/lish/', $s2));
var_dump(pregMatchCapture(false, '/lish/', $s2));
Output of my example:
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "обу"
      [1]=>
      int(4)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(6) "обу"
    [1]=>
    int(4)
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(4) "lish"
      [1]=>
      int(7)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(4) "lish"
    [1]=>
    int(7)
  }
}
Answer by bronek89
I wrote a small class to convert the byte offsets returned by preg_match into proper UTF-8 character offsets:
final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);
        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);
            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}
You can use it like this:
// "ą" and "ć" are assumed examples of multi-byte characters.
$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);
preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);
foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}
Answer by velcrow
If all you want to do is find the multi-byte-safe position of H, try mb_strpos():
mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";
Output:
¡Hola!
1
H
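For comparison, the byte-oriented strpos() reports the same position that preg_match does. A brief sketch, assuming mbstring.func_overload is not enabled (with it enabled, strpos() would behave like mb_strpos()):

```php
<?php
$str = "\xC2\xA1Hola!";

$bytePos = strpos($str, 'H');                // position in bytes
$charPos = mb_strpos($str, 'H', 0, 'UTF-8'); // position in UTF-8 characters

echo "strpos: $bytePos, mb_strpos: $charPos\n"; // prints "strpos: 2, mb_strpos: 1"
```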

