Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/29642102/

Time: 2020-09-08 21:55:53  Source: igfitidea

How to make awk ignore the field delimiter inside double quotes?

Tags: bash, shell, awk

Asked by Deepak K M

I need to delete 2 columns in a comma-separated values file. Consider the following lines in the CSV file:

"[email protected],www.example.com",field2,field3,field4
"[email protected]",field2,field3,field4

Now, the result I want at the end:

"[email protected],www.example.com",field4
"[email protected]",field4

I used the following command:

awk 'BEGIN{FS=OFS=","}{print $1,$4}'

But the embedded comma inside the quotes is creating a problem. Following is the result I am getting:

"[email protected],field3
"[email protected]",field4

Now my question is: how do I make awk ignore the commas that are inside the double quotes?

Answer by Ed Morton

From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content):

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"[email protected],www.example.com",field4
"[email protected]",field4

and see "What's the most robust way to efficiently parse CSV using awk?" for more general parsing of CSVs that include newlines, etc. within fields.
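
FPAT is a GNU awk extension, so the command above needs gawk. Where only a POSIX awk is available, the same idea can be approximated by scanning each line character by character and treating commas inside double quotes as ordinary characters. The following is a minimal sketch under that assumption, not part of the original answer; the names csv_cols_1_4 and split_csv are hypothetical, and the address in the sample input is a placeholder.

```shell
# csv_cols_1_4 (hypothetical name): print columns 1 and 4 of simple CSV
# using only POSIX awk features. It does not handle newlines inside fields.
csv_cols_1_4() {
    awk '
    # split_csv: fill out[] with the fields of s; commas inside double
    # quotes do not end a field. Returns the number of fields found.
    function split_csv(s, out,    n, i, c, inq, cur) {
        split("", out)                      # clear fields left from the previous line
        n = 0; inq = 0; cur = ""
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (c == "\"") {                # toggle quoted state, keep the quote
                inq = !inq
                cur = cur c
            } else if (c == "," && !inq) {  # unquoted comma ends the field
                out[++n] = cur
                cur = ""
            } else {
                cur = cur c
            }
        }
        out[++n] = cur                      # the last field has no trailing comma
        return n
    }
    {
        split_csv($0, f)
        print f[1] "," f[4]
    }' "$@"
}

# Usage with placeholder data:
printf '%s\n' \
    '"user@example.com,www.example.com",field2,field3,field4' \
    '"user@example.com",field2,field3,field4' | csv_cols_1_4
```

Because a doubled quote ("") toggles the quoted state twice, escaped quotes inside a quoted field leave the state unchanged and the field stays intact.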

Answer by 4ae1e1

This is not a bash/awk solution, but I recommend CSVKit, which can be installed by pip install csvkit. It provides a collection of command line tools to work specifically with CSV, including csvcut, which does exactly what you ask for:

csvcut --columns=1,4 <<EOF
"[email protected],www.example.com",field2,field3,field4
"[email protected]",field2,field3,field4
EOF

Output:

"[email protected],www.example.com",field4
[email protected],field4

It strips the unnecessary quotes, which I suppose shouldn't be a problem.

Read the docs of CSVKit here on RTD. ThoughtBot has a nice little blog post introducing this tool, which is where I learnt about CSVKit.

Answer by John1024

In your sample input file, it is the first field, and only the first field, that is quoted. If this is true in general, then consider the following as a method for deleting the second and third columns:

$ awk -F, '{for (i=1;i<=NF;i++){printf "%s%s",(i>1)?",":"",$i; if ($i ~ /"$/)i=i+2};print""}' file
"[email protected],www.example.com",field4
"[email protected]",field4

As mentioned in the comments, awk does not natively understand quoted separators. This solution works around that by looking for the first field that ends with a quote. It then skips the two fields that follow.

The Details

  • for (i=1;i<=NF;i++)

    This starts a for loop over each field i.

  • printf "%s%s",(i>1)?",":"",$i

    This prints field i. If it is not the first field, the field is preceded by a comma.

  • if ($i ~ /"$/)i=i+2

    If the current field ends with a double quote, this increments the field counter by 2. This is how we skip over fields 2 and 3.

  • print""

    After we are done with the for loop, this prints a newline.
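
The steps above can also be written out as a commented multi-line script. This is only a reformatted sketch of the same one-liner wrapped in a shell function; the name drop_cols_after_quoted is hypothetical and the address in the sample input is a placeholder.

```shell
# drop_cols_after_quoted (hypothetical name): same logic as the one-liner,
# assuming only the first column is quoted.
drop_cols_after_quoted() {
    awk -F, '{
        for (i = 1; i <= NF; i++) {
            # print a comma before every field except the first
            printf "%s%s", ((i > 1) ? "," : ""), $i
            # a field ending in a double quote closes the quoted column,
            # so jump over the two fields that follow it
            if ($i ~ /"$/) i = i + 2
        }
        print ""    # finish the output line
    }' "$@"
}

# Usage with placeholder data:
printf '%s\n' \
    '"user@example.com,www.example.com",field2,field3,field4' \
    '"user@example.com",field2,field3,field4' | drop_cols_after_quoted
# prints:
# "user@example.com,www.example.com",field4
# "user@example.com",field4
```

Note that awk still splits the quoted column on its embedded comma; the loop simply prints those pieces back with commas between them until it sees the closing quote.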

Answer by John1024

This awk command should work regardless of where the quoted field is, and it handles escaped quotes as well.

awk '{while(match($0,/"[^"]+",|([^,]+(,|$))/,a)){
     $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
     print b[1] b[4];x=0}' file

Input

"[email protected],www.example.com",field2,field3,field4
"[email protected]",field2,field3,field4
field1,"[email protected],www.example.com",field3,field4

Output

"[email protected],www.example.com",field4
"[email protected]",field4
field1,field4

It even works on

field1,"field,2","but this field has ""escaped"\" quotes",field4

which the mighty FPAT variable fails on!

Explanation

while(match($0,/"[^"]+",|([^,]+(,|$))/,a))

Starts a while loop that continues as long as the match succeeds (i.e. there is a field).
match finds the first occurrence of the regex, which matches a single field, and stores it in array a.

$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]

Sets $0 to begin at the end of the matched field and adds the matched field to the corresponding array position in b.

print b[1] b[4];x=0}

Prints the fields you want from b and sets x back to zero for the next line.

Flaws

Will fail if a field contains both escaped quotes and a comma.

Edit

Updated to support empty fields:

awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
     $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
     print b[1] b[4];x=0}' file
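
The three-argument form of match() used above is a GNU awk extension. As a hedged sketch rather than the author's code, the same peel-off-one-field loop can be expressed with the two-argument match() plus RSTART and RLENGTH, which POSIX awk provides. The function name pick_cols_1_4 and the variable names are hypothetical, and the address in the sample input is a placeholder.

```shell
# pick_cols_1_4 (hypothetical name): portable variant of the match() loop.
pick_cols_1_4() {
    awk '{
        line = $0; n = 0
        split("", fld)      # clear fields left from the previous line
        # peel one field (with its trailing comma, if any) off the front
        while (match(line, /("[^"]+",|[^,]*,|[^,]+$)/)) {
            fld[++n] = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
        # fld[1] still carries its trailing comma, so no separator is added
        print fld[1] fld[4]
    }' "$@"
}

# Usage with placeholder data:
printf '%s\n' \
    '"user@example.com,www.example.com",field2,field3,field4' \
    'field1,"user@example.com,www.example.com",field3,field4' | pick_cols_1_4
# prints:
# "user@example.com,www.example.com",field4
# field1,field4
```

Working on a copy of the line instead of assigning to $0 avoids re-splitting the record on every iteration. The limitation listed under Flaws applies to this variant as well.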