C#: Regular expression for validating names and surnames?
Original question: http://stackoverflow.com/questions/888838/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Regular expression for validating names and surnames?
Asked by Sklivvz
Although this seems like a trivial question, I am quite sure it is not :)
I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames from which I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English names, I think this would do:
^[a-z -']+$
However, I need to support also these cases:
- other punctuation symbols, as they might be used in different countries (no idea which, but maybe you do!)
- different Unicode letter sets (accented letters, Greek, Japanese, Chinese, and so on)
- no numbers, symbols, unnecessary punctuation, runes, etc.
- titles, middle initials and suffixes are not part of this data
- names are already separated from surnames
- we are prepared to force ultra-rare names to be simplified (there's a person named '@' in existence, but it doesn't make sense to allow that character everywhere; use pragmatism and good sense)
- note that many countries have laws about names, so there are standards to follow
Is there a standard way of validating these fields that I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?
I would be looking for something similar to the many "email address" regexes that you can find on google.
Accepted answer by Sklivvz
I'll try to give a proper answer myself:
The only punctuation marks that should be allowed in a name are the full stop, the apostrophe and the hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to allow spaces.
This adds up to the following regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
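var name = nameParam.Trim();
// Verbatim string (@"...") so that \p{L} reaches the regex engine unescaped;
// the apostrophe is part of the allowed set, as in the regex above.
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;");  // &apos; does not work in IE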
Can anyone think of a reason why a name should not pass this test, or of an XSS or SQL injection that could pass?
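As a variation on the snippet above, here is a minimal sketch that keeps the validated name in its raw form and encodes it only when it is written into HTML. The class and method names (`NameInput`, `Validate`, `ToHtml`) are hypothetical, and the use of `WebUtility.HtmlEncode` in place of the manual `Replace` is an assumption about how the output step might be done, not part of the answer.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class NameInput
{
    // Letters, combining marks, space, full stop, apostrophe and hyphen.
    private static readonly Regex Allowed =
        new Regex(@"^[\p{L}\p{M} .'\-]+$", RegexOptions.CultureInvariant);

    // Returns the trimmed name, or throws if it contains any other character.
    public static string Validate(string nameParam)
    {
        if (nameParam == null)
            throw new ArgumentNullException(nameof(nameParam));

        var name = nameParam.Trim();
        if (!Allowed.IsMatch(name))
            throw new ArgumentException("Name contains disallowed characters.", nameof(nameParam));

        return name;
    }

    // Encodes the raw name at the point where it is emitted into HTML.
    public static string ToHtml(string name) => WebUtility.HtmlEncode(name);
}
```

Storing the raw value and encoding per output context means the same stored name can also be used safely in non-HTML contexts, such as a parameterized SQL query or a plain-text e-mail.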
Complete tested solution:
using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            // Sample inputs: valid names in several scripts, followed by
            // strings that should be rejected.
            var names = new string[]{"Hello World",
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "?????????",
                "???? ?????",
                "?????????",
                "??????",
                "?",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam + " ");
                var name = nameParam.Trim();
                // Letters, combining marks, apostrophe, space, full stop, hyphen.
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                // Encode the apostrophe before output.
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}
Answer by kscott
I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex; it will not tell you that it is a valid name.
EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route.
http://tldp.org/HOWTO/Secure-Programs-HOWTO/cross-site-malicious-content.html
s/[\<\>\"\'\%\;\(\)\&\+]//g;
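The Perl substitution above translates fairly directly to C#. The following is a rough sketch of that translation; the class and method names (`MetacharacterFilter`, `Strip`) are hypothetical and chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class MetacharacterFilter
{
    // Same character class as the Perl substitution: < > " ' % ; ( ) & +
    // (in a verbatim string, "" stands for a single double quote).
    private static readonly Regex Unsafe = new Regex(@"[<>""'%;()&+]");

    // Removes every occurrence of the listed characters from the input.
    public static string Strip(string input) => Unsafe.Replace(input, string.Empty);
}
```

Note that this filter also removes the apostrophe, so a legitimate name such as "D'Addario" comes out as "DAddario".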
Answer by Chris Cudmore
I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even for which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you first determine the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
Answer by user9876
I would just allow everything (except an empty string) and assume the user knows what his name is.
There are 2 common cases:
1. You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
2. You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.
In case (1), you can allow all characters because you're checking against a paper document.
In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".
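The "allow everything except an empty string" rule needs very little code. A minimal sketch follows; the class and method names are hypothetical, and treating whitespace-only input as empty is an assumption that goes slightly beyond what the answer states.

```csharp
public static class PermissiveNameCheck
{
    // Accepts any name that contains at least one non-whitespace character.
    // Assumption: null and whitespace-only input count as "empty".
    public static bool IsAcceptable(string name) => !string.IsNullOrWhiteSpace(name);
}
```

With this check, "123 456" is accepted just like "Abc Def", which matches the reasoning in case (2).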
Answer by Gumbo
I don't think that's a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn't prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.
Answer by John Saunders
BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?
As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.
Answer by Trampas Kirk
It's a very difficult problem to validate something like a name due to all the corner cases possible.
Corner Cases
- Anything anything here
Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potentially strange (and legal) names is nearly infinite.
If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.
Answer by Timi
A very contentious subject that I seem to have stumbled upon here. However, sometimes it's nice to head dear Little Bobby Tables off at the pass and send little Robert to the headmaster's office along with his semicolons and SQL comment lines --.
This regex in VB.NET includes regular alphabetic characters and various circumflexed European characters. However, poor old James Mc'Tristan-Smythe the 3rd will have to input his pedigree as Jim the Third.
<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$">
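The character class in this validation expression contains some redundancy: `\p{L}` (any letter) already covers `A-Za-z`, `\p{Lu}` and `\p{Ll}`, and the apostrophe appears twice. The sketch below compares the original pattern with a shorter one that should give the same verdicts; the class and member names are hypothetical, and the shortened pattern is a suggestion rather than part of the answer.

```csharp
using System.Text.RegularExpressions;

public static class SurnamePatterns
{
    // Pattern as given in the validator above.
    public const string Original = @"^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$";

    // Same set of characters with the redundant parts removed:
    // apostrophe, hyphen, any letter, any space separator.
    public const string Simplified = @"^['\-\p{L}\p{Zs}]+$";

    // True when both patterns accept the input or both reject it.
    public static bool SameVerdict(string input) =>
        Regex.IsMatch(input, Original) == Regex.IsMatch(input, Simplified);
}
```

Both patterns accept "Mc'Tristan-Smythe" and reject "Smythe the 3rd", because digits are not in the allowed set.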
Answer by Paulo Carvalho
You could use the following regex to validate two names separated by a space:
^[A-Za-zà-ú]+ [A-Za-zà-ú]+$
or just use:
[[:lower:]] = [a-zà-ú]
[[:upper:]] = [A-Zà-ú]
[[:alpha:]] = [A-Za-zà-ú]
[[:alnum:]] = [A-Za-zà-ú0-9]
Answer by MT.
This somewhat helps:
^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$
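To see what this pattern accepts in practice, here is a small sketch that wraps it in a helper. The class and method names are hypothetical; the pattern itself is the one given above.

```csharp
using System.Text.RegularExpressions;

public static class AsciiNamePattern
{
    // One ASCII letter, an optional apostrophe, then one or more
    // ASCII letters, full stops, spaces or hyphens.
    private static readonly Regex Pattern =
        new Regex(@"^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$");

    public static bool Matches(string name) => Pattern.IsMatch(name);
}
```

The pattern accepts ASCII-only names such as "O'Neil", "John-Doe" and "P.A.M.", but it rejects names with accented or non-Latin letters such as "João", so it does not cover the international cases the question asks about.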