Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/11704182/

Date: 2020-10-26 14:04:00  Source: igfitidea

RegEx with extended Latin alphabet (ä ö ü è ß)

javascript, node.js, regex, utf-8

Asked by buschtoens

I want to do some basic String testing in Node.js. Assume I have a form where users enter their name and I wanna check if it's just rubbish or a real name.

Happily (or sadly for my check) I get users from all around the world, which means that their names contain non-English characters, like ä ö ü ß é. I used to use /[A-Za-z -]{2,}/ but this doesn't match names like "Jan Buschtöns".

Do I have to manually add every possible non-English but Latin character to my RegEx for it to work? I don't want a 100+ character long RegEx like /[A-Za-z -äöüÄÖÜßéÉèÈêÊ...]{2,}/.

Accepted answer by Ωmega

Check http://www.regular-expressions.info/unicode.html and http://xregexp.com/plugins/

You would need to use \p{L} to match any letter character if you want to include Unicode.

Speaking of Unicode, the alternative to \w is then [\p{L}\p{N}_].
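As a rough illustration, here is how \p{L} might be used for the name check from the question. This is only a sketch under stated assumptions: it relies on an engine that implements ES2018 Unicode property escapes (recent Node.js releases do, with the `u` flag), which is newer than the answers on this page. The names `unicodeName`, `unicodeWord` and `isUnicodeName`, and the ^ and $ anchors, are illustrative additions, not part of the original answer.

```javascript
// Sketch only: requires an engine with ES2018 Unicode property escapes.
// The `u` flag is mandatory; without it, \p is not treated as a property escape.
// The ^ and $ anchors are an added assumption so the whole input must match.
const unicodeName = /^[\p{L} -]{2,}$/u;

// Unicode-aware stand-in for \w, following the class suggested above.
const unicodeWord = /^[\p{L}\p{N}_]+$/u;

// Hypothetical helper: true if the input consists only of letters, spaces and hyphens.
function isUnicodeName(input) {
  return unicodeName.test(input);
}

console.log(isUnicodeName("Jan Buschtöns")); // true
console.log(isUnicodeName("12345"));         // false
console.log(unicodeWord.test("Größe_42"));   // true
```

Input typed with combining marks (decomposed form) may need `input.normalize("NFC")` first, since \p{L} does not cover combining marks on their own.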

Answer by Daniel Cassidy

The answer depends on exactly what you want to do.

As you have noticed, [A-Za-z] only matches Latin letters without diacritics.

If you only care about German diacritics and the ß ligature, then you can just replace that part with [A-Za-zäöüÄÖÜß], e.g.:

/[A-Za-zäöüÄÖÜß -]{2,}/

But that probably isn't what you want to do. You probably want to match Latin letters with any diacritics, not just those used in German. Or perhaps you want to match any letters from any alphabet, not just Latin.
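To see where a German-only class falls short, consider the following sketch. The name `germanOnlyName`, the sample names, and the ^ and $ anchors are assumptions added for illustration; the anchors make the whole input match rather than any two-character run inside it.

```javascript
// German umlauts and ß only; anchored (an added assumption) so the
// entire string has to consist of allowed characters.
const germanOnlyName = /^[A-Za-zäöüÄÖÜß -]{2,}$/;

console.log(germanOnlyName.test("Jan Buschtöns")); // true
console.log(germanOnlyName.test("José García"));   // false: é and í are not listed
console.log(germanOnlyName.test("Łukasz"));        // false: Ł is not listed
```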

Other regular expression dialects have character classes to help you with problems like this, but unfortunately JavaScript's regular expression dialect has very few character classes and none of them help you here.

(In case you don't know, a “character class” is an expression that matches any character that is a member of a predefined group of characters. For example, \w is a character class that matches any ASCII letter, digit, or underscore, and . is a character class that matches any character except line terminators.)
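A small sketch of this behaviour, using the hypothetical names `asciiWord` and `anyChar`: \w stops at the ASCII boundary, while . accepts almost anything (it skips line terminators unless the `s` flag is set).

```javascript
// \w covers only [A-Za-z0-9_], so a single umlaut makes the test fail.
const asciiWord = /^\w+$/;
// . matches any single character other than a line terminator.
const anyChar = /^.$/;

console.log(asciiWord.test("Buschtoens")); // true
console.log(asciiWord.test("Buschtöns"));  // false
console.log(anyChar.test("ö"));            // true
console.log(anyChar.test("\n"));           // false
```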

This means that you have to list out every range of UTF-16 code units that corresponds to a character that you want to match.

A quick and dirty solution might be to say [a-zA-Z\u0080-\uFFFF], or in full:

/[A-Za-z\u0080-\uFFFF -]{2,}/

This will match any letter in the ASCII range, but will also match any character at all that is outside the ASCII range. This includes all possible alphabetic characters with or without diacritics in any script. However, it also includes a lot of characters that are not letters. Non-letters in the ASCII range are excluded, but non-letters outside the ASCII range are included.
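The trade-off can be seen in a short sketch. The name `quickName`, the sample inputs, and the ^ and $ anchors are assumptions added for illustration.

```javascript
// Quick and dirty: ASCII letters plus every code unit from U+0080 to U+FFFF.
// Anchored (an added assumption) so the whole input has to match.
const quickName = /^[A-Za-z\u0080-\uFFFF -]{2,}$/;

console.log(quickName.test("Jan Buschtöns")); // true
console.log(quickName.test("李小龙"));          // true: letters from another script
console.log(quickName.test("€€ →"));          // true: non-ASCII symbols slip through
console.log(quickName.test("R2D2"));          // false: ASCII digits are excluded
```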

The above might be good enough for your purposes, but if it isn't then you will have to figure out which character ranges you need and specify those explicitly.
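As one possible starting point, the sketch below lists explicit ranges for Latin letters only. The choice of ranges and the name `latinName` are assumptions for illustration: it covers the letters of the Latin-1 Supplement plus Latin Extended-A and Latin Extended-B, and skips the multiplication sign (U+00D7) and division sign (U+00F7). Other blocks, such as Latin Extended Additional (U+1E00 to U+1EFF), would have to be added if you need them.

```javascript
// Hypothetical selection of ranges; extend it for the scripts you need.
//   \u00C0-\u00D6  Latin-1 Supplement letters before the multiplication sign (U+00D7)
//   \u00D8-\u00F6  Latin-1 Supplement letters up to the division sign (U+00F7)
//   \u00F8-\u024F  rest of Latin-1 Supplement, Latin Extended-A and Extended-B
const latinName = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F -]{2,}$/;

console.log(latinName.test("Jan Buschtöns")); // true
console.log(latinName.test("Łukasz Żak"));    // true
console.log(latinName.test("3 × 4"));         // false: digits and × are excluded
console.log(latinName.test("李小龙"));          // false: outside the listed ranges
```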
