preg_match and UTF-8 in PHP

Question

Asked by JW.

I'm trying to search a UTF-8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems that the subject is not being treated as a UTF-8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF-8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

Answer 1

Accepted answer by user187291

Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

The 'u' switch only has meaning for PCRE; PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I'm not saying "correct").
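The byte-versus-character distinction described above can be seen directly with the string from the question. This is a minimal sketch, assuming the mbstring extension is loaded and that mbstring.func_overload is not active (otherwise strlen and strpos would themselves count characters):

```php
<?php
// "¡Hola!" encoded as UTF-8: the "¡" takes two bytes (0xC2 0xA1).
$str = "\xC2\xA1Hola!";

$byteLength = strlen($str);                    // counts bytes
$charLength = mb_strlen($str, 'UTF-8');        // counts characters
$bytePos    = strpos($str, 'H');               // byte offset of "H"
$charPos    = mb_strpos($str, 'H', 0, 'UTF-8'); // character offset of "H"

echo $byteLength, "\n"; // 7
echo $charLength, "\n"; // 6
echo $bytePos, "\n";    // 2
echo $charPos, "\n";    // 1
```

The offset that preg_match reports with PREG_OFFSET_CAPTURE corresponds to the strpos result here, not the mb_strpos one.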

Answer 2

Answer by Gumbo

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
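To make it clearer what the u modifier does and does not change, here is a small sketch. The helper name byteOffsetToCharOffset is hypothetical and not part of the answer; it uses mb_strcut, which slices by bytes, on the assumption that this is safer than substr when mbstring.func_overload might be enabled, as in the question's php.ini.

```php
<?php
// Hypothetical helper (name is illustrative): converts a byte offset
// reported by preg_match into a UTF-8 character offset.
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // mb_strcut() takes its start and length in bytes.
    return mb_strlen(mb_strcut($subject, 0, $byteOffset, 'UTF-8'), 'UTF-8');
}

$str = "\xC2\xA1Hola!"; // "¡Hola!"

// Without /u, "." consumes one byte; with /u, it consumes one character.
preg_match('/^./', $str, $byteMode);
preg_match('/^./u', $str, $utf8Mode);
echo strlen($byteMode[0]), ' ', strlen($utf8Mode[0]), "\n"; // 1 2

// The captured offset is still in bytes, even with /u.
preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
echo $m[0][1], "\n";                               // 2
echo byteOffsetToCharOffset($str, $m[0][1]), "\n"; // 1
```

So the modifier affects how the pattern consumes the subject, while the reported position has to be converted separately.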

Answer 3

Answer by Natxet

Try adding (*UTF8) at the start of the regex:

preg_match('/(*UTF8)H/', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment at http://www.php.net/manual/es/function.preg-match.php#95828

Answer 4

Answer by Guy Fawkes

Excuse the necroposting, but maybe somebody will find this useful: the code below can work as a replacement for both preg_match and preg_match_all, and it returns the correct matches with correct character offsets for UTF-8-encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText   = $match[0];
                    $matchedLength = $match[1];
                    $positions[]   = array(
                        $matchedText,
                        mb_strlen(mb_strcut($subject, 0, $matchedLength))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

Answer 5

Answer by bronek89

I wrote a small class to convert the byte offsets returned by preg_match into proper UTF-8 character offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'a? ba? d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(ba?)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J
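The same lookup-table idea can be sketched as a standalone function with a worked example. The function name buildByteToCharMap and the sample string are illustrative assumptions, not taken from the answer, and mb_str_split requires PHP 7.4 or later. Splitting the string once avoids calling mb_substr for every character.

```php
<?php
// Hypothetical function (name is illustrative): returns an array where
// index = byte offset and value = character offset of the character
// that owns that byte.
function buildByteToCharMap(string $content): array
{
    $map = [];
    foreach (mb_str_split($content, 1, 'UTF-8') as $charIndex => $char) {
        // A multi-byte character contributes one entry per byte.
        for ($i = 0, $n = strlen($char); $i < $n; $i++) {
            $map[] = $charIndex;
        }
    }
    return $map;
}

$content = "na\u{00EF}ve caf\u{00E9}"; // "naïve café"; "ï" and "é" are 2 bytes each
$map = buildByteToCharMap($content);

preg_match('/caf./u', $content, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

echo $byteOffset, "\n";       // 7 (bytes)
echo $map[$byteOffset], "\n"; // 6 (characters)
```

Note that a map built this way has no entry for an offset equal to the byte length of the string, so an empty match at the very end would need separate handling.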

Answer 6

Answer by velcrow

If all you want to do is find the multibyte-safe position of H, try mb_strpos():

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H

Answer 7

Answer by Danon

You might want to look at the T-Regx library.

pattern('/Hola/u')->match("\xC2\xA1Hola!")->first(function (Match $match)
{
    echo $match->offset();     // characters
    echo $match->byteOffset(); // bytes
});

Here $match->offset() is the UTF-8-safe offset.
