Capturing Groups From a Grep RegEx
Source: http://stackoverflow.com/questions/1891797/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Isaac
I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:
files="*.jpg"
for f in $files
do
    echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
    name=$?
    echo $name
done
So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.
I'd like to use grep only, if possible. If not, please no Python or Perl, etc. sed or something like it would be fine. I'm new to shell and would like to attack this from the *nix purist angle.
Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?
Please explain what's going on, if you've got the time.
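The difference between an exit status and captured output is what the script above trips over. A minimal sketch follows; the sample filename and the variable names status and matched are made up for illustration:

```shell
f="123_abc_d4e5.jpg"

# $? holds the exit status of the last command, not what it printed.
echo "$f" | grep -qE '[0-9]+_[a-z]+_[0-9a-z]*'
status=$?    # 0 here, because the pattern matched

# $(...) captures what the command writes to standard output.
matched=$(echo "$f" | grep -oE '[0-9]+_[a-z]+_[0-9a-z]*')

echo "status=$status matched=$matched"
```

Note that grep -o reports the whole match, not the parenthesised group, which is why the answers reach for other tools.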
Answer by Paused until further notice.
If you're using Bash, you don't even have to use grep:
files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files    # unquoted in order to allow the glob to expand
do
    if [[ $f =~ $regex ]]
    then
        name="${BASH_REMATCH[1]}"
        echo "${name}.jpg"    # concatenate strings
        name="${name}.jpg"    # same thing stored in a variable
    else
        echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
    fi
done
It's better to put the regex in a variable. Some patterns won't work if included literally.
This uses =~, which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
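To see the array indexes in action, here is a small sketch that needs Bash. The second capture group around the digits and the sample filename are additions for illustration only, not part of the answer's regex:

```shell
regex='([0-9]+)_([a-z]+)_[0-9a-z]*'
f="123_abc_d4e5.jpg"

if [[ $f =~ $regex ]]; then
    full="${BASH_REMATCH[0]}"      # the whole matched text
    digits="${BASH_REMATCH[1]}"    # first capture group
    name="${BASH_REMATCH[2]}"      # second capture group
    echo "full=$full digits=$digits name=$name"
fi
```

The full match stops before ".jpg" because the dot is not in the final character class.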
You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:
123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz
To eliminate the second and fourth examples, make your regex like this:
^[0-9]+_([a-z]+)_[0-9a-z]*
which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:
^[0-9]+_([a-z]+)_[0-9a-z]*$
then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
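The effect of the anchors can be checked against the four example strings. This is a rough sketch that needs Bash; the helper function matching and the variable names are hypothetical:

```shell
examples="123_abc_d4e5 xyz123_abc_d4e5 123_abc_d4e5.xyz xyz123_abc_d4e5.xyz"

# Print the examples that the given regex accepts, separated by spaces.
matching() {
    local regex=$1 s out=""
    for s in $examples; do
        [[ $s =~ $regex ]] && out+="$s "
    done
    echo "${out% }"    # drop the trailing space
}

unanchored=$(matching '[0-9]+_([a-z]+)_[0-9a-z]*')
start_only=$(matching '^[0-9]+_([a-z]+)_[0-9a-z]*')
both_ends=$(matching '^[0-9]+_([a-z]+)_[0-9a-z]*$')

echo "unanchored: $unanchored"
echo "start only: $start_only"
echo "both ends:  $both_ends"
```

Each added anchor narrows the list: all four strings, then the first and third, then only the first.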
If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):
name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg
The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).
The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.
In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.
The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
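As a rough illustration of \K next to a fixed-length look-behind, the sketch below assumes a grep built with -P (PCRE) support and checks for it first. The uppercase sample filename and the variable names are made up:

```shell
f="123_ABC_d4e5.jpg"

if echo "probe" | grep -qP 'probe' 2>/dev/null; then
    # \K discards the variable-length prefix "123_" from the reported match.
    with_k=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)')
    # (?<=_) is a fixed-length look-behind: exactly one underscore.
    with_lookbehind=$(echo "$f" | grep -Po '(?i)(?<=_)[a-z]+(?=_)')
    echo "with_k=$with_k with_lookbehind=$with_lookbehind"
else
    echo "this grep has no -P support" >&2
fi
```

The look-behind version works here only because the text before the name can be pinned down to a single underscore; the digits in front of it have no fixed length, which is the case \K is meant for.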
Answer by RobM
This isn't really possible with pure grep, at least not generally.
But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this.)
Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:
echo "$name" | grep -oEi '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'
The first grep would remove any lines that didn't match your overall pattern, the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1).)
Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)
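A small sketch of the grep + cut combination from the aside, with a concrete pattern filled in for {pattern}. The pattern, the sample filenames and the variable names are assumptions for illustration:

```shell
f="123_abc_d4e5.jpg"

# grep acts as the filter, cut picks the second underscore-separated field.
name=$(echo "$f" | grep -Ei '^[0-9]+_[a-z]+_' | cut -d _ -f 2)
echo "$name"

# A filename that fails the filter produces no output at all.
other=$(echo "holiday.jpg" | grep -Ei '^[0-9]+_[a-z]+_' | cut -d _ -f 2)
echo "other is empty: ${other:-yes}"
```

The grep stage matters: cut alone would happily return a field from any line that contains underscores.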
Answer by John Sherwood
I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
to the following:
name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')
to get only the contents of the capturing group 1.
The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.
The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.
With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.
Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.
Answer by cobbal
Not possible in just grep, I believe.
For sed:
name=$(echo "$f" | sed -nE 's/^[0-9]+_([a-z]+)_[0-9a-z]*.*/\1/p')
I'll take a stab at the bonus though:
echo "$name.jpg"
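For the concatenation bonus, a few variations are sketched below. The variable names file and backup are made up, and the += form needs Bash rather than plain sh:

```shell
name="somename"

# Writing text directly after the expansion concatenates it.
file="${name}.jpg"

# Braces matter when the next character could be part of a variable name:
# "$name_old.jpg" would expand the unset variable name_old instead.
backup="${name}_old.jpg"

# Bash can also append in place.
name+=".jpg"

echo "$file $backup $name"
```

cat is not needed here: it concatenates the contents of files, not strings.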
回答by opsb
This is a solution that uses gawk. It's something I find I need to use often, so I created a function for it:
function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }
To use it, just do:
$ echo 'hello world' | regex1 'hello\s(.*)'
world
Answer by martin clayton
A suggestion for you: you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:
f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}
Then name will have the value abc.
See Apple developer docs, search forward for 'Parameter Expansion'.
Answer by ghostdog74
If you have bash, you can use extended globbing:
shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
    IFS="_"
    set -- $file
    echo "This is your captured output : $2"
done
or
ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
    IFS="_"
    set -- $file
    echo "This is your captured output : $2"
done
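One thing to watch with the loops above is that a plain IFS="_" assignment stays in effect for the rest of the shell. A possible variation, sketched here with Bash's here-string and made-up variable names, limits the separator change to a single read:

```shell
file="001_abc_0za.jpg"

# Prefixing read with IFS=_ changes the separator for this one command only.
# The last variable receives whatever remains of the line.
IFS=_ read -r number name rest <<< "$file"

echo "This is your captured output : $name"
```

After the read, word splitting elsewhere in the script behaves as it did before.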