PHP regex: \w - "_" + "-" in UTF-8

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2062169/

Time: 2020-08-25 04:52:05  Source: igfitidea

RegEx: \w - "_" + "-" in UTF-8

Tags: php, regex, unicode, utf-8, pcre

Asked by Alix Axel

I need a regular expression that matches UTF-8 letters and digits and the dash sign (-), but doesn't match underscores (_). I tried these silly attempts without success:

  • ([\w-^_])+
  • ([\w^_]-?)+
  • (\w[^_]-?)+

The \w is shorthand for [A-Za-z0-9_], but it also matches UTF-8 chars if I have the u modifier set.

Can anyone help me out with this one?

Answered by gha.st

Try this:

(?:[\w\-](?<!_))+

It does a simple match on anything that \w matches (or a dash), followed by a zero-width lookbehind that ensures the character just matched is not an underscore.
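A minimal PHP sketch of the lookbehind pattern above, assuming the input is valid UTF-8 and the whole string should be validated. The helper name isWordOrDashNoUnderscore and the \A...\z anchors are illustrative additions, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when $s
// is non-empty and every character is a word character or a dash, with
// underscores rejected by the lookbehind.
function isWordOrDashNoUnderscore(string $s): bool
{
    // \A and \z anchor the match to the entire string.
    // The u modifier treats pattern and subject as UTF-8, so \w also
    // covers non-ASCII letters and digits.
    return preg_match('/\A(?:[\w\-](?<!_))+\z/u', $s) === 1;
}

var_dump(isWordOrDashNoUnderscore('héllo-wörld')); // bool(true)
var_dump(isWordOrDashNoUnderscore('snake_case'));  // bool(false)
```

The lookbehind runs after each character is consumed, so an underscore is first accepted by \w and then rejected, which makes the overall match fail.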

Otherwise you could pick this one:

(?:[^_\W]|-)+

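which is a more set-based approach (note the uppercase W): the negated class [^_\W] excludes the underscore and every non-word character, leaving the word characters minus the underscore.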
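As a rough illustration of the set-based pattern, the following PHP sketch pulls every matching run out of a string. The helper name extractTokens is hypothetical, and using preg_match_all for extraction is an assumption about how the pattern might be applied.

```php
<?php
// Hypothetical helper (not from the original answer): returns every maximal
// run of characters that are word characters other than "_", or dashes.
function extractTokens(string $s): array
{
    // [^_\W] = not an underscore and not a non-word character.
    // The u modifier makes \W Unicode-aware, so accented letters count
    // as word characters.
    preg_match_all('/(?:[^_\W]|-)+/u', $s, $matches);
    return $matches[0];
}

print_r(extractTokens('foo_bar-baz qux'));
// Array ( [0] => foo [1] => bar-baz [2] => qux )
```

Underscores and spaces act as separators here, while dashes stay inside a token.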

OK, I had a lot of fun with Unicode in PHP's flavor of PCRE :D Peekaboo says there is a simple solution available:

[\p{L}\p{N}\-]+

\p{L} matches any Unicode character that qualifies as a Letter (note: not a word character, thus no underscores), while \p{N} matches anything that looks like a number (including Roman numerals and more exotic things).
\- is just an escaped dash. Although not strictly necessary, I tend to make it a point to escape dashes in character classes. Note that there are dozens of different dashes in Unicode, which gives rise to the following version:

[\p{L}\p{N}\p{Pd}]+

Where "Pd" is Punctuation Dash, which includes, but is not limited to, our minus-dash-thingy. (Note: again, no underscore here.)
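To compare the two property-based patterns side by side, here is a small PHP sketch. The function name isLettersDigitsDashes and its $anyDash flag are hypothetical; the \u{...} escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer): validates that $s is
// non-empty and contains only Unicode letters, numbers, and dashes.
// With $anyDash = false only the ASCII hyphen-minus is accepted;
// with $anyDash = true any character in the Pd category is accepted.
function isLettersDigitsDashes(string $s, bool $anyDash = false): bool
{
    $pattern = $anyDash
        ? '/\A[\p{L}\p{N}\p{Pd}]+\z/u'
        : '/\A[\p{L}\p{N}\-]+\z/u';
    return preg_match($pattern, $s) === 1;
}

$enDashRange = "2020\u{2013}2021"; // U+2013 EN DASH belongs to Pd

var_dump(isLettersDigitsDashes('Ünïcödé-42'));      // bool(true)
var_dump(isLettersDigitsDashes('snake_case'));      // bool(false)
var_dump(isLettersDigitsDashes($enDashRange));      // bool(false)
var_dump(isLettersDigitsDashes($enDashRange, true)); // bool(true)
```

The underscore is classified as connector punctuation (Pc), so neither variant accepts it.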

Answered by Jiri Klouda

I am not sure which language you use, but in Perl you can simply write [[:alnum:]-]+ when the correct locale is set.