Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/364009/

Time: 2020-08-04 00:38:57  Source: igfitidea

C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Tags: c#, regex, unicode, astral-plane

Asked by Ben McNiel

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" ) 

Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.

Looking at the hex values for \U00010000 and \U0010FFFF, I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.

So I guess I really have just one question: why are the Unicode characters formed with \U split into two chars in the string?

Accepted answer by Jon Skeet

They're surrogate pairs. Look at the values: they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
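
As a minimal sketch of what happens to the pattern from the question (the helper names DumpCodeUnits and PatternThrows are made up for this illustration and are not part of .NET), the following prints the UTF-16 code units of both characters and checks whether the Regex constructor rejects the character class. Seen as code units, the class reads \uD800, then the range \uDC00-\uDBFF, then \uDFFF, and since 0xDC00 is greater than 0xDBFF the engine reports a reversed range.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The C# compiler turns each \Uxxxxxxxx escape above U+FFFF into two chars.
string first = "\U00010000";
string last = "\U0010FFFF";

Console.WriteLine(first.Length);          // 2
Console.WriteLine(DumpCodeUnits(first));  // D800 DC00
Console.WriteLine(DumpCodeUnits(last));   // DBFF DFFF

// The regex engine sees: [ D800 DC00 - DBFF DFFF ]
// i.e. the range DC00-DBFF, which is in reverse order.
Console.WriteLine(PatternThrows("[\U00010000-\U0010FFFF]"));

// Hypothetical helper: hex dump of every UTF-16 code unit in a string.
static string DumpCodeUnits(string s) =>
    string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

// Hypothetical helper: true if constructing the Regex throws ArgumentException.
static bool PatternThrows(string pattern)
{
    try
    {
        _ = new Regex(pattern);
        return false;
    }
    catch (ArgumentException)
    {
        return true;
    }
}
```

ArgumentException is caught here because that is the type named in the question; newer runtimes throw a more specific exception type that derives from it.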

Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

Answer by Ben McNiel

@Jon Skeet

So what you are telling me is that there is no way to use the Regex tools in .NET to match characters outside the 16-bit (BMP) range?

The full regex is:

^(\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]|[\U00010000-\U0010FFFF])+$

I am attempting to check whether a string contains only what the YAML specification defines as printable Unicode characters.

Answer by Andriy K

To work around this with the .NET regex engine, I use the following trick: "[\U00010000-\U0010FFFF]" is replaced with

[\uD800-\uDBFF][\uDC00-\uDFFF]

The idea is that, because .NET regexes operate on UTF-16 code units rather than code points, we give the engine the surrogate ranges as ordinary characters. It's also possible to specify narrower ranges by working out the edges. For example, [\U00011DEF-\U00013E07] is the same as

(?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])
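
The edge values in the narrower range can be checked with a small sketch like the one below. ToRegexEscapes and InRange are hypothetical helpers written for this illustration; char.ConvertFromUtf32 is the standard .NET call that produces the surrogate pair for a code point. The pattern is the one from the answer, wrapped in ^...$ so that it has to match exactly one code point.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: write a code point as \uXXXX escapes, one per UTF-16 code unit.
static string ToRegexEscapes(int codePoint) =>
    string.Concat(char.ConvertFromUtf32(codePoint).Select(c => $"\\u{(int)c:X4}"));

// Hypothetical helper: does the regex accept this single code point?
static bool InRange(Regex regex, int codePoint) =>
    regex.IsMatch(char.ConvertFromUtf32(codePoint));

// The two edges of [\U00011DEF-\U00013E07] expressed as surrogate pairs.
Console.WriteLine(ToRegexEscapes(0x11DEF)); // \uD807\uDDEF
Console.WriteLine(ToRegexEscapes(0x13E07)); // \uD80F\uDE07

// First high surrogate with a partial low range, the full middle block,
// then the last high surrogate with a partial low range.
var narrow = new Regex(
    @"^(?:\uD807[\uDDEF-\uDFFF]|[\uD808-\uD80E][\uDC00-\uDFFF]|\uD80F[\uDC00-\uDE07])$");

Console.WriteLine(InRange(narrow, 0x11DEE)); // False (one below the lower edge)
Console.WriteLine(InRange(narrow, 0x11DEF)); // True
Console.WriteLine(InRange(narrow, 0x12000)); // True
Console.WriteLine(InRange(narrow, 0x13E07)); // True
Console.WriteLine(InRange(narrow, 0x13E08)); // False (one above the upper edge)
```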

It's harder to read and work with, and it's not that flexible, but it still does the job as a workaround.
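
Applied to the full pattern from the question, the workaround might look like the sketch below. The only change is that the last alternative is written as a high surrogate followed by a low surrogate; the IsPrintable name is an assumption made for this example. The ranges are copied from the question as posted and are not checked against the YAML specification here.

```csharp
using System;
using System.Text.RegularExpressions;

// The question's pattern with [\U00010000-\U0010FFFF] replaced by a
// high-surrogate class followed by a low-surrogate class.
var printable = new Regex(
    @"^(\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])+$");

// Hypothetical helper wrapping the check.
bool IsPrintable(string s) => printable.IsMatch(s);

Console.WriteLine(IsPrintable("foo"));            // True
Console.WriteLine(IsPrintable("a\U0001F600b"));   // True  (astral character as a surrogate pair)
Console.WriteLine(IsPrintable("a\u0007b"));       // False (BEL is not in any range)
Console.WriteLine(IsPrintable("a\uD800b"));       // False (unpaired high surrogate)
```

Because the BMP alternatives stop at \uD7FF and resume at \uE000, a surrogate can only match as part of a complete pair, so unpaired surrogates are rejected.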
