Bash Regex Capture Groups
Original question: http://stackoverflow.com/questions/46396910/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by mhaken
I have a single string in this kind of format:
"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"
If I were writing a normal regex in JS, C#, etc., I'd do this:
(?:"(.+?)"|'(.+?)'|(\S+))
And iterate the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example, I'd end up with 3 items in an array as follows:
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>
I can't figure out how to replicate this functionality with grep, sed, or Bash regexes. I've tried things like:
echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"
The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiples, so I get captures like
"Mike
H<[email protected]>"
[email protected]
If I remove the lookahead/lookbehind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can add each string to the array individually, but I'm open to other options.
EDIT:
I think my input example may have been confusing; it's just one possible input. The real input could be double-quoted, single-quoted, or unquoted (without spaces) strings, in any order and in any quantity. The JavaScript/C# regex I provided is the real behavior I'm trying to achieve.
Accepted answer by mhaken
Here is what I got working, though it isn't as concise as I wanted the code to be:
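arr=()
while IFS= read -r line; do          # -r keeps backslashes literal
    line="${line//\"/}"              # remove every double quote
    arr+=("${line//\'/}")            # remove every single quote, then append
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")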
This gave me an array of the captured values and handled input in any order, whether wrapped in double quotes, single quotes, or unquoted (when it contained no space). The array elements also come out without the wrapping quotes. I appreciate all of the suggestions.
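One thing to keep in mind with the loop above: `${line//\"/}` removes every quote character, including quotes embedded in the middle of a value. If only the wrapping pair should go, something like the following might work. This is a minimal sketch; the helper name `strip_wrapping_quotes` and the sample tokens are made up for illustration, and the tokens stand in for the lines that `grep -o` would emit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove one pair of matching wrapping quotes, if present.
strip_wrapping_quotes() {
  local s=$1
  case $s in
    \"*\") s=${s:1:${#s}-2} ;;   # starts and ends with a double quote
    \'*\') s=${s:1:${#s}-2} ;;   # starts and ends with a single quote
  esac
  printf '%s\n' "$s"
}

# Stand-in tokens (assumed sample data, not from the question).
tokens=('"Mike H"' "'it's fine'" 'plain@example.com' 'say"hi"')

arr=()
for t in "${tokens[@]}"; do
  arr+=("$(strip_wrapping_quotes "$t")")
done

printf '[%s]\n' "${arr[@]}"
```

The embedded apostrophe in the second token and the inner quotes in the last one survive, because only a leading and trailing pair of the same quote type is stripped.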
Answer by dawg
You can use Perl:
$ email='"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 // $2}'
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>
Or in pure Bash, it gets kinda wordy:
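re='"([^"]+)"[[:space:]]*|([^[:space:]]+)[[:space:]]*'
while [[ $email =~ $re ]]; do
    printf '%s\n' "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"   # whichever group matched
    i=${#BASH_REMATCH}       # length of the whole match, trailing spaces included
    email=${email:i}         # drop the consumed prefix
done
# same output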
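The same idea can be stretched to cover single-quoted tokens as well and to fill an array instead of printing. The following is only a sketch under assumptions: the variable names `input`, `rest`, and `items` and the sample values are invented for illustration, and the regex is anchored with `^` so that each iteration consumes the string from the front.

```shell
#!/usr/bin/env bash
# Sample input (hypothetical values): double-quoted, bare, and single-quoted tokens.
input="\"Mike H\" bare@example.com 'Jo Ann'"

# Anchored ERE: optional leading whitespace, then one token in one of three forms.
re="^[[:space:]]*(\"([^\"]+)\"|'([^']+)'|([^[:space:]]+))"

items=()
rest=$input
while [[ $rest =~ $re ]]; do
  # Only one of groups 2, 3, 4 is set per match; concatenating them picks it.
  items+=("${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}")
  rest=${rest:${#BASH_REMATCH[0]}}   # drop what was just consumed
done

printf '%s\n' "${items[@]}"
```

Working on a copy (`rest`) leaves the original string intact, and each pass consumes at least one character, so the loop ends once only whitespace or nothing remains.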
Answer by JJoao
Your first expression is fine; just be careful with the quoting (use single quotes when backslashes are present). At the end, trim the " characters with sed.
$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>
Answer by RomanPerekhrest
gawk + bash solution (adding each item to an array):
email_str='"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"'
readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
'{ for(i=1;i<=NF;i++) print $i }' <<<$email_str)
Now all items are in email_arr.
Accessing the 2nd item:
echo "${email_arr[1]}"
[email protected]
Accessing the 3rd item:
echo "${email_arr[2]}"
Mike H<[email protected]>
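Bash arrays are zero-based, so the Nth item lives at index N-1. A small sketch of counting, indexing, and iterating follows; the array contents here are stand-in values, not the output of the awk command above.

```shell
#!/usr/bin/env bash
# Stand-in values (assumed); in practice readarray would fill this array.
email_arr=('first item' 'second item' 'third item')

count=${#email_arr[@]}             # number of elements
third=${email_arr[2]}              # 3rd item is at index 2
last=${email_arr[count-1]}         # subscripts are arithmetic expressions

echo "count: $count"
echo "3rd:   $third"
echo "last:  $last"

# "${!email_arr[@]}" expands to the list of indices.
for i in "${!email_arr[@]}"; do
  printf '%d: %s\n' "$i" "${email_arr[i]}"
done
```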
Answer by James Brown
Using GNU awk and FPAT to define fields by content:
$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" }   # define a field to be space-separated or in quotes
{
    for(i=1;i<=NF;i++) {                # iterate every field
        gsub(/^\"|\"$/,"",$i)           # remove leading and trailing quotes
        print $i                        # output
    }
}' file
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>
Answer by CWLiu
You may use sed to achieve that:
$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$email"
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>
Answer by P....
Using gawk, where you can set a multi-character RS:
awk -v RS='"|" ' 'NF' inputfile
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>
Answer by Rahul Verma
Modify your regex like this:
grep -oP '("?\s*)\K.*?(?=")' file
Output:
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>