UTF-8 in PHP regular expressions

Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/6407983/

Time: 2020-08-26 00:17:10  Source: igfitidea

Tags: php, regex, utf-8

Asked by Gasper

I need help with regular expressions. My string contains Unicode characters, and the code below doesn't work.

The first four characters must be digits, followed by a comma and then any alphabetic characters or whitespace. I have read that adding /u at the end of the regular expression should help, but it didn't work for me.

My code works with non-unicode characters

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+', $post);

Thanks for your answers!

Answered by stema

Updated answer:
This is now tested and working

$post = '9999, škofja loka';
echo preg_match('/^\d{4},[\s\p{L}]+$/u', $post);

\w will not work, because it does not contain all Unicode letters and also contains [0-9_] in addition to the letters.
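
The extra characters accepted by \w can be checked with a small sketch. The helper names is_letter and is_word_char are hypothetical, introduced only for illustration, and the input is assumed to be UTF-8 encoded.

```php
<?php
// Hypothetical helpers for illustration; not part of PHP or of the answer above.

// True if $ch is exactly one Unicode letter (\p{L}); needs the u modifier.
function is_letter(string $ch): bool
{
    return preg_match('/\A\p{L}\z/u', $ch) === 1;
}

// True if $ch is exactly one "word" character (\w).
function is_word_char(string $ch): bool
{
    return preg_match('/\A\w\z/u', $ch) === 1;
}

// Print both classifications for a letter, a digit and the underscore.
foreach (['š', 'a', '7', '_'] as $ch) {
    printf(
        "%s letter=%s word=%s\n",
        $ch,
        var_export(is_letter($ch), true),
        var_export(is_word_char($ch), true)
    );
}
```

Digits and the underscore satisfy \w but not \p{L}, which is why \p{L} is the stricter choice when only letters should be accepted.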

The u modifier is also important: it activates Unicode mode.

If there can be letters or whitespace after the comma, then you should put both into the same character class. In your regex, 0 or more whitespace characters are allowed after the comma, and after that only letters.
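
The effect of putting whitespace and letters into one class can be sketched as follows. To make the difference visible, both patterns here use \p{L} and the end anchor $, which are not in the asker's original regex; the variable names are illustrative only.

```php
<?php
$post = '9999,škofja loka';

// Whitespace is allowed only directly after the comma, then letters only.
$separate = '/^\d{4},\s*\p{L}+$/u';

// Letters and whitespace share one class, so they may be mixed freely.
$combined = '/^\d{4},[\s\p{L}]+$/u';

var_dump(preg_match($separate, $post)); // int(0): the space before "loka" is not a letter
var_dump(preg_match($combined, $post)); // int(1)
```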

See http://www.regular-expressions.info/php.html for PHP regex details.

\p{L} (Unicode letter) is explained here.

The end-of-string anchor $ is also important: it ensures that the complete string is really verified. Otherwise the pattern will, for example, match only up to the first whitespace and ignore the rest.
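
A minimal sketch of what the end anchor changes, using the pattern from this answer with and without $. The invalid sample string and the variable names are assumptions made for illustration.

```php
<?php
$anchored   = '/^\d{4},[\s\p{L}]+$/u';
$unanchored = '/^\d{4},[\s\p{L}]+/u';

// Trailing digits should cause the input to be rejected.
$invalid = '9999, škofja 123';

var_dump(preg_match($anchored, $invalid));   // int(0)
var_dump(preg_match($unanchored, $invalid)); // int(1): only a prefix was checked

// $matches[0] shows how much the unanchored pattern actually consumed.
preg_match($unanchored, $invalid, $matches);
var_dump($matches[0]); // "9999, škofja " (the "123" is silently ignored)
```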

Answered by jmz

[a-zA-Z] will match only letters in the ranges a-z and A-Z. You have non-US-ASCII letters, so your regex won't match, regardless of the /u modifier. You need to use the word character escape sequence (\w).

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[\w]+/u', $post);
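
The claim that [a-zA-Z] fails with or without /u can be checked with a short sketch. It uses the asker's pattern with the closing delimiter added; the ASCII sample string and the variable names are assumptions for illustration.

```php
<?php
$ascii   = '9999,skofja loka';
$unicode = '9999,škofja loka';

// The asker's pattern, once in byte mode and once in UTF-8 mode.
$byteMode = '/^[0-9]{4},[\s]*[a-zA-Z]+/';
$utf8Mode = '/^[0-9]{4},[\s]*[a-zA-Z]+/u';

var_dump(preg_match($byteMode, $ascii));   // int(1)
var_dump(preg_match($byteMode, $unicode)); // int(0)
var_dump(preg_match($utf8Mode, $unicode)); // int(0): š is still outside a-z
```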

Answered by Sodved

The problem is your regular expression. You are explicitly saying that you will only accept a b c ... z A B C ... Z. š is not in the a-z set. Remember, š is as different from s as any other character.

So if you really just want a sequence of letters, then you need to test for the unicode properties. e.g.

echo preg_match('/^[0-9]{4},[\s]*\p{L}+/u', $post);

That should work because \p{L} matches any Unicode character that is considered a letter, not just A through Z.

Answered by searlea

Add a u, and remember the trailing slash:

echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+/u', $post);

Edited:

echo preg_match('/^\d{4},(?:\s|\w)+/u', $post);