Does bash support word boundary regular expressions?
Original source: http://stackoverflow.com/questions/9792702/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by starfry
I am trying to match on the presence of a word in a list before adding that word again (to avoid duplicates). I am using bash 4.2.24 and am trying the below:
[[ $foo =~ \bmyword\b ]]
also
[[ $foo =~ \<myword\> ]]
However, neither seems to work. They are mentioned in the bash docs example: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.html.
I presume I am doing something wrong but I am not sure what.
Accepted answer by Eduardo Ivanec
Yes, all the listed regex extensions are supported but you'll have better luck putting the pattern in a variable before using it. Try this:
re='\bmyword\b'
[[ $foo =~ $re ]]
Digging around, I found this question, whose answers seem to explain why the behaviour changes when the regex is written inline as in your example.
Editor's note: The linked question does not explain the OP's problem; it merely explains how, starting with Bash version 3.2, regexes (or at least the special regex chars.) must by default be unquoted to be treated as such - which is exactly what the OP attempted.
However, the workarounds in this answer are effective.
You'll probably have to rewrite your tests so as to use a temporary variable for your regexes, or use the 3.1 compatibility mode:
shopt -s compat31
Answer by mklement0
tl;dr
To be safe, do not use a regex literal with =~. Instead, use:
- either: an auxiliary variable - see @Eduardo Ivanec's answer.
- or: a command substitution that outputs a string literal - see @ruakh's comment on @Eduardo Ivanec's answer.
- Note that both must be used unquoted as the =~ RHS.
Whether \b and \< / \> are supported at all depends on the host platform, not Bash:
- they DO work on Linux,
- but NOT on BSD-based platforms such as macOS; there, use [[:<:]] and [[:>:]] instead, which, in the context of an unquoted regex literal, must be escaped as [[:\<:]] and [[:\>:]]; the following works as expected, but only on BSD/macOS:
[[ ' myword ' =~ [[:\<:]]myword[[:\>:]] ]] && echo YES # OK
The problem wouldn't arise - on any platform - if you limited your regex to the constructs in the POSIX ERE (extended regular expression) specification.
- Unfortunately, POSIX EREs do not support word-boundary assertions, though you can emulate them - see the last section.
- As on macOS, no \-prefixed constructs are supported, so that handy character-class shortcuts such as \s and \w aren't available either.
- However, the up-side is that such ERE-compliant regexes are then portable (they work on both Linux and macOS, for instance).
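One way to cope with the platform split described above is to probe at runtime which word-boundary syntax the local regex library accepts. The following is only a minimal sketch: the names `probe`, `wb_start`, `wb_end` and `has_word` are made up for illustration, it assumes that one of the two syntaxes works on the host, and it assumes the word contains no regex metacharacters.

```bash
#!/usr/bin/env bash
# Probe which word-boundary syntax the platform's regex library accepts.
# Assumption: one of the two syntaxes works; POSIX guarantees neither.
probe='\<x\>'
if [[ 'x' =~ $probe ]]; then
  wb_start='\<'      wb_end='\>'        # GNU/Linux style
else
  wb_start='[[:<:]]' wb_end='[[:>:]]'   # BSD/macOS style
fi

# has_word LIST WORD: succeed if WORD occurs in LIST as a whole word.
# Hypothetical helper; WORD is assumed to contain no regex metacharacters.
has_word() {
  local re="${wb_start}${2}${wb_end}"
  [[ $1 =~ $re ]]   # $re must stay unquoted to be treated as a regex
}

foo='alpha myword beta'
has_word "$foo" myword && echo "myword is present"
has_word "$foo" word   || echo "word is not present as a whole word"
```

Because the regex always travels through a variable, the literal-RHS parsing problems discussed in this answer do not come into play.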
=~ is the rare case (the only case?) of a built-in Bash feature whose behavior is platform-dependent: it uses the regex libraries of the platform it is running on, resulting in different regex flavors on different platforms.
Thus, it is generally non-trivial and requires extra care to write portable code that uses the =~ operator.
Sticking with POSIX EREs is the only robust approach, which means that you have to work around their limitations - see bottom section.
If you want to know more, read on.
On Bash v3.2+ (unless the shopt option compat31 is set), the RHS (right-hand side operand) of the =~ operator must be unquoted in order to be recognized as a regex (if you quote the right operand, =~ matches it as a literal string instead).
More accurately, at least the special regex characters and sequences must be unquoted, so it's OK and useful to quote those substrings that should be taken literally; e.g., [[ '*' =~ ^'*' ]] matches, because ^ is unquoted and thus correctly recognized as the start-of-string anchor, whereas *, which is normally a special regex char, matches literally due to the quoting.
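The effect of quoting the RHS can be checked with a small experiment. This is a sketch that assumes Bash 3.2 or later with compat31 not set; the variable names are arbitrary.

```bash
#!/usr/bin/env bash
re='^a.c$'   # as a regex, '.' matches any single character

# Unquoted: $re is interpreted as a regex, so 'abc' matches.
[[ 'abc' =~ $re ]] && unquoted=match || unquoted=nomatch

# Quoted: "$re" is taken literally, so 'abc' no longer matches...
[[ 'abc' =~ "$re" ]] && quoted=match || quoted=nomatch

# ...but a string that contains the literal text ^a.c$ does.
[[ 'x^a.c$y' =~ "$re" ]] && literal=match || literal=nomatch

echo "unquoted=$unquoted quoted=$quoted literal=$literal"
```

Under those assumptions this prints unquoted=match quoted=nomatch literal=match.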
However, there appears to be a design limitation in (at least) bash 3.x that prevents use of \-prefixed regex constructs (e.g., \<, \>, \b, \s, \w, ...) in a literal =~ RHS; the limitation affects Linux, whereas BSD/macOS versions are not affected, due to fundamentally not supporting any \-prefixed regex constructs:
# Linux only:
# PROBLEM (see details further below):
# Seen by the regex engine as: <word>
# The shell eats the '\' before the regex engine sees them.
[[ ' word ' =~ \<word\> ]] && echo MATCHES # !! DOES NOT MATCH
# Causes syntax error, because the shell considers the < unquoted.
# If you used \\bword\\b, the regex engine would see that as-is.
[[ ' word ' =~ \\<word\\> ]] && echo MATCHES # !! BREAKS
# Using the usual quoting rules doesn't work either:
# Seen by the regex engine as: \\<word\\> instead of \<word\>
[[ ' word ' =~ \\\<word\\\> ]] && echo MATCHES # !! DOES NOT MATCH
# WORKAROUNDS
# Aux. variable.
re='\<word\>'; [[ ' word ' =~ $re ]] && echo MATCHES # OK
# Command substitution
[[ ' word ' =~ $(printf %s '\<word\>') ]] && echo MATCHES # OK
# Change option compat31, which then allows use of '...' as the RHS
# CAVEAT: Stays in effect until you reset it, may have other side effects.
# Using (...) around the command confines the effect to a subshell.
(shopt -s compat31; [[ ' word ' =~ '\<word\>' ]] && echo MATCHES) # OK
The problem:
Tip of the hat to Fólkvangr for his input.
A literal RHS of =~ is by design parsed differently than unquoted tokens as arguments, in an attempt to allow the user to focus on escaping characters just for the regex, without also having to worry about the usual shell escaping rules in unquoted tokens.
For instance,
[[ 'a[b' =~ a\[b ]] && echo MATCHES # OK
matches, because the \ is passed through to the regex engine (that is, the regex engine too sees literal a\[b), whereas if you used the same unquoted token as a regular argument, the usual shell expansions applied to unquoted tokens would "eat" the \, because it is interpreted as a shell escape character:
$ printf %s a\[b
a[b # '\' was removed by the shell.
However, in the context of =~, this exceptional passing through of \ is only applied before characters that are regex metacharacters by themselves, as defined by the ERE (extended regular expressions) POSIX specification (in order to escape them for the regex, so that they're treated as literals):
\ ^ $ [ { . ? * + ( ) |
Conversely, these regex metacharacters may exceptionally be used unquoted - and indeed must be left unquoted to have their special regex meaning - even though most of them normally require \-escaping in unquoted tokens to prevent the shell from interpreting them.
Yet, a subset of the shell metacharacters does still need escaping, for the shell's sake, so as not to break the syntax of the [[ ... ]] conditional:
& ; < > space
Since these characters aren't also regex metacharacters, there is no need to also support escaping them on the regex side, so that, for instance, the regex engine seeing \& in the RHS as just & works fine.
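The two kinds of escaping described above can be observed side by side. This is a small sketch that uses only POSIX ERE constructs, so it should behave the same on Linux and macOS; the variable names are arbitrary.

```bash
#!/usr/bin/env bash
# '\.' reaches the regex engine as-is, so it matches only a literal dot.
[[ 'a.b' =~ ^a\.b$ ]] && dot_literal=yes || dot_literal=no
[[ 'axb' =~ ^a\.b$ ]] && dot_any=yes     || dot_any=no

# '&' and the space are shell metacharacters, not regex metacharacters:
# the backslash only protects them from the shell and is then removed.
[[ 'a&b' =~ ^a\&b$ ]] && amp=yes   || amp=no
[[ 'a b' =~ ^a\ b$ ]] && space=yes || space=no

echo "dot_literal=$dot_literal dot_any=$dot_any amp=$amp space=$space"
```

This prints dot_literal=yes dot_any=no amp=yes space=yes: the backslash before the dot is a regex escape, while the backslashes before & and the space are shell escapes only.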
For any other character preceded by \, the shell removes the \ before sending the string to the regex engine (as it does during normal shell expansion), which is unfortunate, because then even characters that the shell doesn't consider special cannot be passed as \<char> to the regex engine, because the shell invariably passes them as just <char>.
E.g., \b is invariably seen as just b by the regex engine.
It is therefore currently impossible to use a (by definition non-POSIX) regex construct in the form \<char> (e.g., \<, \>, \b, \s, \w, \d, ...) in a literal, unquoted =~ RHS, because no form of escaping can ensure that these constructs are seen by the regex engine as such, after parsing by the shell:
Since neither <, >, nor b are regex metacharacters, the shell removes the \ from \<, \>, \b (as happens in regular shell expansion). Therefore, passing \<word\>, for instance, makes the regex engine see <word>, which is not the intent:
- [[ '<word>' =~ \<word\> ]] && echo YES matches, because the regex engine sees <word>.
- [[ 'boo' =~ ^\boo ]] && echo YES matches, because the regex engine sees ^boo.
Trying \\<word\\> breaks the command, because the shell treats each \\ as an escaped \, which means that metacharacter < is then considered unquoted, causing a syntax error:
- [[ ' word ' =~ \\<word\\> ]] && echo YES causes a syntax error.
- This wouldn't happen with \\b, but \\b is passed through (due to the \ preceding a regex metachar, \), which also doesn't work: [[ '\boo' =~ ^\\boo ]] && echo YES matches, because the regex engine sees ^\\boo, which matches literal \boo.
Trying \\\<word\\\> - which by normal shell expansion rules results in \<word\> (try printf %s \\\<word\\\>) - also doesn't work:
- What happens is that the shell eats the \ in \< (ditto for \b and other \-prefixed sequences), and then passes the preceding \\ through to the regex engine as-is (again, because \ is preserved before a regex metachar): [[ ' \<word\> ' =~ \\\<word\\\> ]] && echo YES matches, because the regex engine sees \\<word\\>, which matches literal \<word\>.
In short:
- Bash's parsing of =~ RHS literals was designed with single-character regex metacharacters in mind, and does not support multi-character constructs that start with \, such as \<.
  - Because POSIX EREs support no such constructs, =~ works as designed if you limit yourself to such regexes.
  - However, even within this constraint the design is somewhat awkward, due to the need to mix regex-related and shell-related \-escaping (quoting).
  - Fólkvangr found the official design rationale in the Bash FAQ, which, however, addresses neither said awkwardness nor the lack of support for (invariably non-POSIX) \<char> regex constructs; it does mention using an aux. variable as a workaround, although only with respect to making it easier to represent whitespace.
- All these parsing problems go away if the string that the regex engine should see is provided via a variable or via the output from a command substitution, as demonstrated above.
Optional reading: A portable emulation of word-boundary assertions with POSIX-compliant EREs (extended regular expressions):
- (^|[^[:alnum:]_]) instead of \< / [[:<:]]
- ([^[:alnum:]_]|$) instead of \> / [[:>:]]
Note: \b can't be emulated with a SINGLE expression - use the above in the appropriate places.
The potential caveat is that the above expressions will also capture the non-word character being matched, whereas true assertions such as \< / [[:<:]] and \> / [[:>:]] do not.
foo='myword'
[[ $foo =~ (^|[^[:alnum:]_])myword([^[:alnum:]_]|$) ]] && echo YES
The above matches, as expected.
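Applied to the original use case (adding a word to a list only if it is not already there), the portable emulation could be wrapped as follows. This is a sketch, not part of the original answer: `add_unique` is a hypothetical helper name, and the word is assumed to contain no regex metacharacters. The last two lines illustrate the capture caveat mentioned above.

```bash
#!/usr/bin/env bash
# add_unique LIST WORD: print LIST with WORD appended unless WORD is already
# present as a whole word. Hypothetical helper; WORD is assumed to contain
# no regex metacharacters.
add_unique() {
  local list=$1 word=$2
  local re="(^|[^[:alnum:]_])${word}([^[:alnum:]_]|\$)"
  if [[ $list =~ $re ]]; then
    printf '%s\n' "$list"
  else
    printf '%s\n' "${list:+$list }$word"
  fi
}

foo='alpha mywords'
foo=$(add_unique "$foo" myword)   # appended: 'mywords' is a different word
foo=$(add_unique "$foo" myword)   # unchanged: 'myword' is now present
echo "$foo"

# Caveat from above: the emulation consumes the neighbouring characters.
re='(^|[^[:alnum:]_])myword([^[:alnum:]_]|$)'
[[ 'a myword b' =~ $re ]] && echo "[${BASH_REMATCH[0]}]"
```

The first echo prints alpha mywords myword; the second prints [ myword ], showing that the overall match includes the surrounding spaces.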
Answer by Weidenrinde
Not exactly "\b", but for me more readable (and portable) than the other suggestions:
[[ $foo =~ (^| )myword($| ) ]]
Answer by Colin Fraizer
The accepted answer focuses on using auxiliary variables to deal with the syntax oddities of regular expressions in Bash's [[ ... ]] expressions. Very good info.
However, the real answer is:
\b, \< and \> do not work on OS X 10.11.5 (El Capitan) with bash version 4.3.42(1)-release (x86_64-apple-darwin15.0.0).
Instead, use [[:<:]] and [[:>:]].
Answer by Cole Tierney
I've used the following to match word boundaries on older systems. The key is to wrap $foo with spaces, since [^[:alpha:]] will not match words at the beginning or end of the list.
[[ " $foo " =~ [^[:alpha:]]myword[^[:alpha:]] ]]
Tweak the character class as needed based on the expected contents of myword, otherwise this may not be a good solution.
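Why the padding matters can be seen with a word that sits at the very start of the list. This is a small sketch; it stores the regex in a variable (unlike the literal used above) and the variable names are arbitrary.

```bash
#!/usr/bin/env bash
re='[^[:alpha:]]myword[^[:alpha:]]'
foo='myword other'          # the word sits at the very start of the list

# Without padding there is no character before 'myword' for the class to match.
[[ $foo =~ $re ]] && bare=match || bare=nomatch

# With a space on each side, the boundaries at both ends are covered.
[[ " $foo " =~ $re ]] && padded=match || padded=nomatch

echo "bare=$bare padded=$padded"
```

This prints bare=nomatch padded=match.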
Answer by Ben Flynn
Tangential to your question, but if you can use grep -E (or egrep, its effective but obsolescent alias) in your script:
if grep -q -E "\b${myword}\b" <<<"$foo"; then
I ended up using this after flailing with bash's =~.
Note that while regex constructs \<, \>, and \b are not POSIX-compliant, both the BSD (macOS) and GNU (Linux) implementations of grep -E support them, which makes this approach widely usable in practice.
Small caveat (not an issue in the case at hand): By not using =~, you lose the ability to inspect capturing subexpressions (capture groups) via ${BASH_REMATCH[@]} later.
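A complete version of the grep-based test might look like the sketch below. It assumes a grep whose -E mode understands \b (not required by POSIX, as noted above) and that myword contains no regex metacharacters; the sample values and the result variable are made up for illustration.

```bash
#!/usr/bin/env bash
# Assumption: grep -E understands \b here (not required by POSIX).
foo='alpha myword beta'
myword='myword'

# The here-string feeds $foo to grep; -q suppresses output, so only the
# exit status is used.
if grep -q -E "\b${myword}\b" <<<"$foo"; then
  result='MATCH'
else
  result='NO MATCH'
fi
echo "$result"
```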
Answer by Steven Penny
This worked for me
bar='\<myword\>'
[[ $foo =~ $bar ]]
Answer by Goetz Pfeiffer
You can use grep, which is more portable than bash's regexp, like this:
# Quote "$foo" so that whitespace and glob characters in it are preserved.
if echo "$foo" | grep -q '\<myword\>'; then
  echo "MATCH"
else
  echo "NO MATCH"
fi

