Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/10181836/

BASH regexp matching - including brackets in a bracketed list of characters to match against?

regex bash

Asked by DanielSmedegaardBuus

I'm trying to do a tiny bash script that'll clean up the file and folder names of downloaded episodes of some tv shows I like. They often look like "[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE", and I basically just want to strip out that speedcd advertising bit.

It's easy enough to remove www.Speed.Cd, spaces, and dashes using regexp matching in BASH, but for the life of me, I cannot figure out how to include the brackets in a list of characters to be matched against. [- [] doesn't work, neither does [- \[], [- \\[], [- \\\[], or any number of escape characters preceding the bracket I want to remove.

Here's what I've got so far:

[[ "$newfile" =~ ^(.*)([- \[]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[- \]]*)(.*)$ ]] &&
    newfile="${BASH_REMATCH[1]}${BASH_REMATCH[4]}"

But it breaks on the brackets.

Any ideas?

TIA, Daniel :)

EDIT: I should probably note that I'm using "shopt -s nocasematch" to ensure case insensitive matching, just in case you're wondering :)

EDIT 2: Thanks to all who contributed. I'm not 100% sure which answer was to be the "correct" one, as I had several problems with my statement. Actually, the most accurate answer was just a comment to my question posted by jw013, but I didn't get it at the time because I hadn't understood yet that spaces should be escaped. I've opted for aefxx's as that one basically says the same, but with explanations :) Would've liked to put a correct answer mark on ormaaj's answer, too, as he spotted more grave issues with my expression.

Anyway, the approach I was using above, trying to match and extract the parts to keep and leave behind the unwanted ones is really not very elegant, and won't catch all cases, not even something really simple like "Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]". I've instead rewritten it to match and extract just the unwanted parts and then do string replacement of those on the original string, like so (loop is in case there's multiple brandings):

# Remove common torrent site brandings, including surrounding spaces, brackets, etc.:
while [[ "$newfile" =~ ([[\ {\(-]*(www\.)?(torrentday\.com|torrenting\.com|spastikustv|speed\.cd|moviesp2p\.com|publichd\.org|publichd|scenetime\.com|kingdom-release)[]\ }\)-]*) ]]; do
    newfile=${newfile//"${BASH_REMATCH[1]}"/}
done

Answered by aefxx

OK, this is the first time I've heard of the =~ operator, but nevertheless here's what I found by trial and error:

if [[ $newfile =~ ^(.*)([-[:space:][]*(what|ever)[][:space:]-]*)(.*)$ ]]
                        ^^^^^^^^^^^^^            ^^^^^^^^^^^^^

Looks strange but actually does work (just tested it).

EDIT
Quote from the Linux man pages regex(7):

To include a literal ']' in the list, make it the first character (following a possible '^'). To include a literal '-', make it the first or last character, or the second endpoint of a range. To use a literal '-' as the first endpoint of a range, enclose it in "[." and ".]" to make it a collating element (see below). With the exception of these and some combinations using '[' (see next paragraphs), all other special characters, including '\', lose their special significance within a bracket expression.
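
As a minimal sketch of the rules quoted above (the function name only_brackets_and_dashes and the sample strings are made up for illustration), a bracket expression can hold ']', '[', a space, and '-' without any backslashes, as long as ']' comes first and '-' comes last:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the argument consists only of
# ']', '[', spaces, and '-' characters.
only_brackets_and_dashes() {
    # ']' is first and '-' is last, so both are taken literally.
    local re='^[][ -]+$'
    [[ $1 =~ $re ]]
}

for s in '[ - ]' ']]' 'a-b'; do
    if only_brackets_and_dashes "$s"; then
        printf '%s: match\n' "$s"
    else
        printf '%s: no match\n' "$s"
    fi
done
```

Keeping the pattern in a single-quoted variable means the shell never interprets the space or the brackets, so only the regex rules from the man page apply.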

Answered by ormaaj

Whenever you use a regex in a test expression, storing it in a variable is the most compatible approach across Bash versions, even if you do manage to dodge all the pitfalls of writing it directly in the test expression. http://mywiki.wooledge.org/BashPitfalls#if_.5B.5B_.24foo_.3D.2BAH4_.27some_RE.27_.5D.5D

Your current regex looks like you're trying to optionally match anything preceding the opening bracket. I'd guess you're actually trying to save for example 3 and 4 from something like this:

$ shopt -s nocasematch
$ newfile='[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
$ re='^.*[-[:space:][]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[][:space:]-]*(.*)$'
$ [[ $newfile =~ $re ]]
$ declare -p BASH_REMATCH
declare -ar BASH_REMATCH='([0]="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE" [1]="www.Speed.Cd" [2]="Some.Show.S07E14.720p.HDTV.X264-SOMEONE")'
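
The regex-in-a-variable approach can also be combined with the removal loop from the question's second edit. The following is only a sketch under assumptions: strip_branding is a hypothetical function name, the site list is shortened for illustration, and the body runs in a subshell so that nocasematch does not leak into the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the name with known site brandings removed.
# The body is a subshell ( ... ), so the shopt change stays local.
strip_branding() (
    name=$1
    # ']' first and '-' last inside each bracket expression, so they are literal.
    re='[][ (){}-]*(www\.)?(speed\.cd|torrenting\.com|moviesp2p\.com)[][ (){}-]*'
    shopt -s nocasematch
    # Repeat in case the name carries more than one branding.
    while [[ $name =~ $re ]]; do
        # The quoted expansion removes the matched text literally.
        name=${name//"${BASH_REMATCH[0]}"/}
    done
    printf '%s\n' "$name"
)

strip_branding '[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
strip_branding 'Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]'
```

Because the pattern always requires a site name, every match is non-empty and each pass shortens the string, so the loop terminates.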

Answered by Peter.O

The basic issue is quite simple, if not obvious.
A Bash regex is totally unprotected (from the shell), and cannot be protected by wrapping it in "double quotes", because quoted parts of the pattern are matched as literal text. This means that every literal space (and tab, etc.) in an inline pattern must be protected by a backslash \... end of story. The rest is just a case of getting your regex to suit your needs.

One other thing: use [\ [] and []\ ] to match [ and ] respectively, within the square-bracket range construct (in this case along with a space).

example:

newfile="[ ]"
[[ "$newfile" =~ ^[\ []\ []\ ]$ ]] &&
    echo YES ||
    echo NO
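
To make the quoting behaviour concrete, here is a small sketch, assuming Bash 3.2 or later with default compatibility settings. The function names matches_unquoted and matches_quoted are invented for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: $1 is the string, $2 is the pattern.
matches_unquoted() { [[ $1 =~ $2 ]]; }     # pattern is used as a regex
matches_quoted()   { [[ $1 =~ "$2" ]]; }   # pattern is matched as literal text

re='^a[ ]b$'

matches_unquoted 'a b' "$re" && echo 'unquoted: regex match'
matches_quoted 'a b' "$re" || echo 'quoted: no match (pattern taken literally)'

# Written inline, the space has to be escaped with a backslash.
[[ 'a b' =~ ^a\ b$ ]] && echo 'inline: escaped space works'
```

The single quotes in the assignment only protect the pattern while it is stored; what matters is whether the expansion on the right-hand side of =~ is quoted.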

Answered by jdi

You can try something like this (though you weren't 100% clear on which cases you are trying to filter):

newfile="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE"

if [[ $newfile =~ ^(.*)([^a-zA-Z0-9.]*\[.*\][^a-zA-Z0-9.]*)(.*)$ ]]; then 
    newfile="${BASH_REMATCH[1]}${BASH_REMATCH[3]}"
fi

echo "$newfile"
# Some.Show.S07E14.720p.HDTV.X264-SOMEONE

It's just stripping any characters that are neither alphanumeric nor a dot around the [], and anything within the [].
