Bash Regex Capture Groups

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/46396910/

Date: 2020-09-18 16:28:30  Source: igfitidea

regex bash grep pcre

Asked by mhaken

I have a single string in this kind of format:

"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"

If I were writing a normal regex in JS, C#, etc., I'd do this:

(?:"(.+?)"|'(.+?)'|(\S+))

And iterate the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example, I'd end up with 3 items in an array as follows:

Mike H<[email protected]>
[email protected] 
Mike H<[email protected]>

I can't figure out how to replicate this functionality with grep, sed, or bash regexes. I've tried things like:

echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"

The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiples, so I get captures like

"Mike
H<[email protected]>"
 [email protected] 

If I remove the lookahead/lookbehind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can add each string to the array individually, but I'm open to other options.

EDIT:

I think my input example may have been confusing; it's just one possible input. The real input could be double-quoted, single-quoted, or unquoted (without spaces) strings, in any order and in any quantity. The JavaScript/C# regex I provided is the real behavior I'm trying to achieve.

Accepted answer by mhaken

What I was able to do that worked, but wasn't as concise as I wanted the code to be:

arr=()
while IFS= read -r line; do
  line="${line//\"/}"      # remove double quotes
  arr+=("${line//\'/}")    # remove single quotes, then append to the array
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")

This gave me an array of the captured values and handled input in any order, whether each item was wrapped in double quotes, in single quotes, or not quoted at all (if it contained no space). The elements in the array also come out without the wrapping quotes. I appreciate all of the suggestions.
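One caveat with the loop above: the `${line//\"/}` and `${line//\'/}` expansions remove every quote character, including an apostrophe inside an unquoted token. A minimal sketch that strips only one matching pair of wrapping quotes follows. The helper name `strip_wrapping_quotes` and the sample values are hypothetical, not from the original post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove one pair of matching wrapping quotes, if present.
strip_wrapping_quotes() {
  local s=$1
  # The glob patterns require the same quote character at both ends.
  if [[ $s == \"*\" || $s == \'*\' ]] && (( ${#s} >= 2 )); then
    s=${s:1:${#s}-2}   # drop the first and last character
  fi
  printf '%s\n' "$s"
}

# Made-up sample values for illustration.
stripped_double=$(strip_wrapping_quotes '"Mike H<mike@example.com>"')
stripped_single=$(strip_wrapping_quotes "'it's quoted'")
stripped_plain=$(strip_wrapping_quotes "o'brien@example.com")
printf '%s\n' "$stripped_double" "$stripped_single" "$stripped_plain"
```

Inside the while loop, this would replace the two substitution lines with something like `arr+=("$(strip_wrapping_quotes "$line")")`, at the cost of one subshell per token.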

Answer by dawg

You can use Perl:

$ email='"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}'
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Or in pure Bash, it gets kinda wordy:

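re='\"([^\"]+)\"[[:space:]]*|([^[:space:]]+)[[:space:]]*'
while [[ $email =~ $re ]]; do
    echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"   # only one of the two groups is non-empty
    i=${#BASH_REMATCH}                            # length of the whole match
    email=${email:i}                              # drop the consumed prefix (assumes no leading whitespace)
done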
# same output
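The loop above covers double quotes only and prints rather than collects. A minimal pure-Bash sketch of the same `BASH_REMATCH` idea, extended to single quotes and an array, might look like the following. The pattern is anchored with `^` so each match starts at the front of the remaining text; the names `input`, `rest`, and `tokens` and the sample addresses are assumptions for illustration. An unterminated quote simply stops the loop in this sketch.

```shell
#!/usr/bin/env bash
# Sample input with hypothetical addresses (not from the original question).
input='"Mike H<mike@example.com>" bob@example.com '\''Ann B<ann@example.com>'\'''

# Keeping the pattern in a variable avoids quoting surprises inside [[ =~ ]].
# Group 2: double-quoted body, group 3: single-quoted body,
# group 4: bare word that does not start with a quote character.
re="^[[:space:]]*(\"([^\"]+)\"|'([^']+)'|([^[:space:]\"'][^[:space:]]*))"

tokens=()
rest=$input
while [[ $rest =~ $re ]]; do
  # Only one of the three inner groups is non-empty for any given match.
  tokens+=("${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}")
  rest=${rest:${#BASH_REMATCH[0]}}   # drop everything consumed so far
done

printf '%s\n' "${tokens[@]}"
```

Excluding quote characters from the first position of a bare word keeps the three alternatives from competing for the same token, which matters because Bash's regex matching follows POSIX leftmost-longest rules rather than first-alternative-wins.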

Answer by JJoao

Your first expression is fine; just be careful with the quoting (use single quotes when backslashes are present). At the end, trim the " characters with sed.

$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Answer by RomanPerekhrest

gawk + bash solution (adding each item to an array):

email_str='"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"'

readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
                         '{ for(i=1;i<=NF;i++) print $i }' <<<"$email_str")


Now all items are in email_arr.

Accessing the 2nd item:

echo "${email_arr[1]}"
[email protected]

Accessing the 3rd item:

echo "${email_arr[2]}"
Mike H<[email protected]>
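Bash arrays are zero-indexed, so the Nth item lives at index N-1. A short sketch of the common access patterns, using hypothetical sample values in place of the parsed items:

```shell
#!/usr/bin/env bash
# Hypothetical sample values standing in for the parsed items.
email_arr=('Mike H<mike@example.com>' 'bob@example.com' 'Ann B<ann@example.com>')

first=${email_arr[0]}                   # 1st item
third=${email_arr[2]}                   # 3rd item
last=${email_arr[${#email_arr[@]}-1]}   # last item, computed from the length
count=${#email_arr[@]}                  # number of items

# "${!email_arr[@]}" expands to the list of indices, here 0 1 2.
for i in "${!email_arr[@]}"; do
  printf '%d: %s\n' "$i" "${email_arr[i]}"
done
```

Quoting the expansions keeps items that contain spaces, such as the first and third, as single words.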

Answer by James Brown

Using GNU awk and FPAT to define fields by content:

$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" }  # define a field to be space-separated or in quotes
{
    for(i=1;i<=NF;i++) {               # iterate every field
        gsub(/^\"|\"$/,"",$i)          # remove leading and trailing quotes
        print $i                       # output
    }
}' file
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Answer by CWLiu

You can use sed to achieve that:

$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$EMAIL"
Mike H<[email protected]>
[email protected] 
Mike H<[email protected]>

Answer by P....

Using gawk, where RS can be set to a regular expression:

awk -v RS='"|" ' 'NF' inputfile
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Answer by Rahul Verma

Modify your regex like this:

grep -oP '("?\s*)\K.*?(?=")' file

Output:

Mike H<[email protected]>
[email protected]
Mike H<[email protected]>