bash grep valid domain regex

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/21172095/


grep valid domain regex

regex bash dns grep

Asked by Arka

I'm trying to write a regex for grep that matches only valid domains.

My version works pretty well, but it also matches the following invalid domain:

@subdom..dom.ext

Here is my regex:

echo "@dom.ext" | grep "^@[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+$"

I'm working with bash, so I escaped the special characters.

Samples that should match:

@subdom.dom.ext
@subsubdom.subdom.dom.ext
@subsub-dom.sub-dom.ext

Thanks for the help.

Answered by mklement0

A truly complete solution requires more work, but here's an approximation that may work well enough (note that an @ prefix is assumed and the input string is expected to start with it):

^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$

You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
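
As a minimal sketch of the grep -E route, the regex can be kept in a variable and wrapped in two small helpers. The names domain_re, filter_valid_domains, and is_valid_domain are made up for this illustration and are not part of the original answer; the code assumes one candidate name per line, since grep matches line by line.

```shell
# The regex from the answer, stored once so both helpers share it.
domain_re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# filter_valid_domains (hypothetical helper): reads candidates on stdin,
# one per line, and prints only the lines that match the regex.
filter_valid_domains() {
  grep -E "$domain_re"
}

# is_valid_domain (hypothetical helper): exit status 0 if the single
# argument matches, non-zero otherwise; prints nothing (-q).
is_valid_domain() {
  printf '%s\n' "$1" | grep -Eq "$domain_re"
}

# Demo: the first candidate is dropped, the other two are printed.
printf '%s\n' '@subdom..dom.ext' '@dom.ext' '@subsub-dom.sub-dom.ext' |
  filter_valid_domains
```

Keeping the pattern in a single-quoted variable avoids the backslash escaping that the double-quoted basic-regex version in the question needed.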

It makes the following assumptions, which differ from actual DNS name constraints, mostly by being more permissive:

  • Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.

  • There's no limit on the number of nested subdomains.

  • There's no limit on the length of any label (name component), and no limit on the overall length of the name (for actual limits, see here).

  • The TLD (last component) is composed of letters only and has a length of at least 2.

  • Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
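
The regex does not enforce the length limits mentioned in the list above. One way to add them is a second pass in the shell, sketched here under the assumption that the usual DNS limits apply (at most 63 characters per label and 253 characters for the whole name in its dotted text form). The helper name is_valid_domain_len is hypothetical.

```shell
domain_re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# is_valid_domain_len (hypothetical helper): syntax check via the regex,
# then length checks on the name without its leading @.
is_valid_domain_len() {
  # Step 1: same syntax check as the plain regex.
  printf '%s\n' "$1" | grep -Eq "$domain_re" || return 1

  vd_name=${1#@}                      # drop the @ prefix
  # Step 2: overall length (assumed limit: 253 characters).
  [ "${#vd_name}" -le 253 ] || return 1

  # Step 3: walk the labels left to right (assumed limit: 63 each).
  vd_rest=$vd_name
  while [ -n "$vd_rest" ]; do
    vd_label=${vd_rest%%.*}           # text up to the first dot
    [ "${#vd_label}" -le 63 ] || return 1
    case $vd_rest in
      *.*) vd_rest=${vd_rest#*.} ;;   # continue after the first dot
      *)   vd_rest= ;;                # that was the last label
    esac
  done
  return 0
}
```

Because the regex has already restricted the name to ASCII letters, digits, hyphens, and dots, counting characters with ${#...} is equivalent to counting bytes here; this would not hold for IDNs.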

Here's a quick test:

for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext @x.org; do
 # Print each candidate with its verdict; only @subdom..dom.ext should yield NO.
 [[ $d =~ \
    ^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$ \
 ]] && echo "$d: YES" || echo "$d: NO"
done


Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:

A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:

^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
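
What [[:alpha:]] and [[:alnum:]] match depends on the active locale, so the same regex can accept or reject the same name. A rough way to observe this is sketched below; matches_idn is a hypothetical helper, and en_US.UTF-8 is only an assumed locale name that may not be installed on a given system (locale -a lists what is available).

```shell
idn_re='^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$'

# matches_idn LOCALE NAME (hypothetical helper): runs grep under the
# given locale and reports via its exit status whether NAME matches.
matches_idn() {
  printf '%s\n' "$2" | LC_ALL=$1 grep -Eq "$idn_re"
}

# In the C locale only ASCII letters count as [[:alpha:]], so the name
# with a literal umlaut is rejected there. Under a UTF-8 locale the
# result depends on the platform's locale data.
for loc in C en_US.UTF-8; do
  if matches_idn "$loc" '@bücher.ch'; then
    echo "$loc: match"
  else
    echo "$loc: no match"
  fi
done
```

If the requested locale is missing, the C library typically falls back to the C locale, which would make the second iteration report no match as well.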

Caveats:

  • No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.

  • As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):

    • False positive: an invalid Punycode-encoded name such as ab--whatever
    • False positive: invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
    • False negative: emoji-based names such as .ws (xn--jr8h.ws)
    • False negative: ??????? is a valid TLD in the IANA root today, but will not match [[:alpha:]]{2,}$
    • ... and many more
  • Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.

  • I'm unclear on whether names in right-to-left writing scripts are properly matched.

  • For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
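
If Punycode-encoded names should at least be let through, one possible extension of the ASCII-only regex is an extra alternative for labels that start with xn--. This is only a sketch under that assumption: it checks the shape of such labels, does not decode them, and therefore cannot tell whether they are valid Punycode. The names ace_re and is_valid_domain_ace are hypothetical.

```shell
# Each label is either a plain ASCII label (as in the original regex) or
# an xn-- label made of letters, digits, and inner hyphens; the last
# component may be a letters-only TLD or an xn-- label.
ace_re='^@(([a-zA-Z](-?[a-zA-Z0-9])*|xn--[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)$'

# is_valid_domain_ace (hypothetical helper): exit status 0 on a match.
is_valid_domain_ace() {
  printf '%s\n' "$1" | grep -Eq "$ace_re"
}

# Demo with names taken from the text above.
for d in @xn--bcher-kva.ch @xn--jr8h.ws @ab--whatever.com @subdom..dom.ext; do
  if is_valid_domain_ace "$d"; then echo "$d: YES"; else echo "$d: NO"; fi
done
```

Confirming that an xn-- label decodes to an acceptable Unicode label would need an IDNA-aware tool or library, which is outside what a regex can do.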

Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.

Answered by Alfe

Use

grep '@[[:alpha:]][[:alnum:]\-]*\.[[:alpha:]][[:alnum:]\-]*\.[[:alpha:]][[:alnum:]\-]*$'

Answered by Arka

echo "@dom.ext" | grep -E "^@[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$"

This did the job.
