JavaScript regex for names with special characters (Unicode)

Question

Asked by Kristoffer la Cour

Okay, I have read about regex all day now and still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too.

I basically need a regex that checks that the name consists of at least two words and that it does not contain numbers or special characters like !"#¤%&/()=..., while the words may still contain characters like æ, é and so on.

An example of an accepted name would be: "John Elkjærd" or "André Svenson"
A non-accepted name would be: "Hans", "H4nn3Andersen" or "Martin Henriksen!"

If it matters, I use the JavaScript .match() function client side and want to use PHP's preg_replace() server side, only "in negative" (removing non-matching characters).

Any help would be much appreciated.

Update:
Okay, thanks to Alix Axel's answer I have the important part down, the server-side one.

But as the page from LightWing's answer suggests, I'm unable to find anything about Unicode support in JavaScript, so I ended up with half a solution for the client side, just checking for at least two words and a minimum of 5 characters, like this:

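var minWords = 2; // the name must contain at least two words
var words = name.match(/\S+/g) || []; // match() returns null when nothing matches
if (words.length >= minWords && name.length >= 5) {
  //valid
}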

An alternative would be to specify all the Unicode characters explicitly, as suggested in shifty's answer. I might end up doing something like that along with the solution above, but it is a bit impractical.

Answer 1

Answered by Alix Axel

Try the following regular expression:

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0)
{
    // valid
}

You should read it like this:

^   # start of subject
    (?:     # match this:
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s          # any kind of space
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s?         # any kind of space (optional: 0 or 1 times)
    )+      # one or more times
$   # end of subject

I honestly don't know how to port this to JavaScript; I'm not even sure JavaScript supports Unicode properties. But in PHP's PCRE this seems to work flawlessly on IDEOne.com:

$names = array
(
    'Alix',
    'André Svenson',
    'H4nn3 Andersen',
    'Hans',
    'John Elkjærd',
    'Kristoffer la Cour',
    'Marco d\'Almeida',
    'Martin Henriksen!',
);

foreach ($names as $name)
{
    echo sprintf('%s is %s' . "\n", $name, (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) ? 'valid' : 'invalid');
}

I'm sorry I can't help you regarding the Javascript part but probably someone here will.

Validates:

John Elkjærd
André Svenson
Marco d'Almeida
Kristoffer la Cour

Invalidates:

Hans
H4nn3 Andersen
Martin Henriksen!

To replace invalid characters, though I'm not sure why you need this, you just need to change it slightly:

$name = preg_replace('~[^\p{L}\p{Mn}\p{Pd}\'\x{2019}\s]~u', '', $name);

Examples:

H4nn3 Andersen ->Hnn Andersen
Martin Henriksen! ->Martin Henriksen
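
For the client side, a rough JavaScript counterpart is possible in engines that implement ES2018 Unicode property escapes (\p{...} together with the u flag). The sketch below is a port made under that assumption and is not part of the original answer: the names isValidName and stripInvalidChars are invented for illustration, and \x{2019} is written as \u2019 because JavaScript does not use the \x{...} form.

```javascript
// Hypothetical helpers: a port of the PHP patterns above, assuming ES2018 support.
const NAME_PATTERN = /^(?:[\p{L}\p{Mn}\p{Pd}'\u2019]+\s[\p{L}\p{Mn}\p{Pd}'\u2019]+\s?)+$/u;
const INVALID_CHARS = /[^\p{L}\p{Mn}\p{Pd}'\u2019\s]/gu;

// Returns true when the whole string consists of two or more allowed words.
function isValidName(name) {
  return NAME_PATTERN.test(name);
}

// Removes every character that is not a letter, mark, dash, quote or whitespace.
function stripInvalidChars(name) {
  return name.replace(INVALID_CHARS, "");
}

console.log(isValidName("André Svenson"));        // true
console.log(isValidName("Hans"));                 // false
console.log(stripInvalidChars("H4nn3 Andersen")); // "Hnn Andersen"
```

Without the u flag, \p{L} is not treated as a property escape in JavaScript, so the flag is required here just as it is in PHP.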

Note that you always need to use the u modifier.

Answer 2

Answered by JacquesB

Regarding JavaScript it is trickier, since JavaScript regex syntax historically did not support Unicode character properties (they only arrived with ES2018's \p{...} escapes under the u flag). A pragmatic solution that also works in older engines would be to match letters like this:

[a-zA-Z\xC0-\uFFFF]

This allows letters in all languages and excludes numbers and all the special (non-letter) characters commonly found on keyboards. It is imperfect because it also allows unicode special symbols which are not letters, e.g. emoticons, snowman and so on. However, since these symbols are typically not available on keyboards I don't think they will be entered by accident. So depending on your requirements it may be an acceptable solution.
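
As a rough illustration, the character range above could be dropped into a two-word check like the following. The function name looksLikeName and the surrounding pattern are assumptions made for this sketch, not something taken from the answer.

```javascript
// Letters approximated as ASCII letters plus everything from U+00C0 to U+FFFF.
const WORD = "[a-zA-Z\\xC0-\\uFFFF]+";
// Two or more such words separated by whitespace (hypothetical pattern).
const TWO_WORDS = new RegExp("^" + WORD + "(?:\\s+" + WORD + ")+$");

function looksLikeName(name) {
  return TWO_WORDS.test(name.trim());
}

console.log(looksLikeName("John Elkjærd"));      // true
console.log(looksLikeName("Martin Henriksen!")); // false
console.log(looksLikeName("☃ ☃"));               // true: the snowman slips through
```

The last call shows the imperfection described above: any symbol inside the range is treated as a letter.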

Answer 3

Answered by Seth V

Here's an optimization over the fantastic answer by @Alix above. It removes the need to define the character class twice, and allows for easier definition of any number of required words.

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+(?:$|\s+)){2,}$

It can be broken down as follows:

^         # start
  (?:       # non-capturing group
    [         # match a:
      \p{L}     # Unicode letter, or
      \p{Mn}    # Unicode accents, or
      \p{Pd}    # Unicode hyphens, or
      \'        # single quote, or
      \x{2019}  # single quote (alternative)
    ]+        # one or more times
    (?:       # non-capturing group
      $         # either end-of-string
    |         # or
      \s+       # one or more spaces
    )         # end of group
  ){2,}     # two or more times
$         # end-of-string

Essentially, it says to find a word as defined by the character class, followed by either one or more spaces or the end of the string. The {2,} at the end requires a minimum of two words for a match to succeed. This ensures the OP's "Hans" example will not match.

Lastly, since I found this question while looking for a similar solution for Ruby, here is the regular expression as it can be used in Ruby 1.9+:

\A(?:[\p{L}\p{Mn}\p{Pd}\'\u2019]+(?:\Z|\s+)){2,}\Z

The primary changes are using \A and \Z for beginning and end of string (instead of line) and Ruby's Unicode character notation.
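
A JavaScript counterpart of the same idea might look like this, assuming an engine with ES2018 Unicode property escapes. The helper name makeNameValidator and its minWords parameter are hypothetical. JavaScript has no \A or \Z, so plain ^ and $ (without the m flag) stand in for start and end of string.

```javascript
// Builds a validator that requires at least `minWords` words (hypothetical helper).
function makeNameValidator(minWords) {
  const wordChars = "[\\p{L}\\p{Mn}\\p{Pd}'\\u2019]+";
  // Each word must be followed by whitespace or by the end of the string.
  const pattern = new RegExp(
    "^(?:" + wordChars + "(?:$|\\s+)){" + minWords + ",}$",
    "u"
  );
  return (name) => pattern.test(name);
}

const atLeastTwoWords = makeNameValidator(2);
console.log(atLeastTwoWords("John Elkjærd")); // true
console.log(atLeastTwoWords("Hans"));         // false
```

Passing the word count as a parameter keeps the character class in one place, which is the point of the optimization described above.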

Answer 4

Answered by Saleh

See this page: Unicode Characters in Regular Expression

Answer 5

Answered by mjspier

You can add the allowed special chars to the regex.

Example:

[a-zA-Z???ü??ü?é]+

EDIT:

Not the best solution, but this would match if there are at least two words.

[a-zA-Z???ü??ü?é]+\s[a-zA-Z???ü??ü?é]+

Answer 6

Answered by ashein

When checking your input string you could

trim() it to remove leading/trailing whitespace
match against [^\w\s] to detect non-word/non-whitespace characters
match against \s+ to get the number of word separators, which equals the number of words - 1.

However I'm not sure that the \w shorthand includes accented characters, but it should fall into "word characters" category.
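
In JavaScript specifically, \w only covers [A-Za-z0-9_], so accented letters are reported as non-word characters while digits pass. A small sketch of the three steps makes this visible. The function name inspectName and the shape of the returned object are invented for illustration.

```javascript
// Applies the trim / [^\w\s] / \s+ steps and reports what each one finds.
function inspectName(input) {
  const name = input.trim();
  // True if any character is neither a \w character nor whitespace.
  const hasNonWordChar = /[^\w\s]/.test(name);
  // match() returns null when nothing matches, so fall back to an empty array.
  const separators = name.match(/\s+/g) || [];
  // After trimming, N separators delimit N + 1 words.
  const wordCount = name === "" ? 0 : separators.length + 1;
  return { hasNonWordChar, wordCount };
}

console.log(inspectName("  Martin Henriksen! ")); // { hasNonWordChar: true, wordCount: 2 }
console.log(inspectName("André Svenson"));        // { hasNonWordChar: true, wordCount: 2 }
console.log(inspectName("H4nn3 Andersen"));       // { hasNonWordChar: false, wordCount: 2 }
```

So with plain \w this approach rejects "André" and accepts "H4nn3", which is the opposite of what the question asks for.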

Answer 7

Answered by manuel-84

This is the JS regex that I use for fancy names composed of at most 3 words (1 to 60 chars each), separated by a space, single quote or minus sign:

^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$
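
A hedged usage sketch follows; isFancyName is a hypothetical wrapper name. Note that, as written, the pattern also accepts a single word, so it would need an extra check to enforce the two-word minimum from the question.

```javascript
// Up to three words of 1-60 letters, each optionally followed by a space, quote or hyphen.
const FANCY_NAME = /^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$/;

function isFancyName(name) {
  return FANCY_NAME.test(name);
}

console.log(isFancyName("Marco d'Almeida"));    // true
console.log(isFancyName("Kristoffer la Cour")); // true
console.log(isFancyName("Hans"));               // true (single words pass too)
console.log(isFancyName("H4nn3 Andersen"));     // false
```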
