Bash: Parse CSV with quotes, commas and newlines
Original question: http://stackoverflow.com/questions/36287982/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Jacob Horbulyk
Say I have the following CSV file:
id,message,time
123,"Sorry, This message
has commas and newlines",2016-03-28T20:26:39
456,"It makes the problem non-trivial",2016-03-28T20:26:41
I want to write a bash command that will return only the time column, i.e.:
time
2016-03-28T20:26:39
2016-03-28T20:26:41
What is the most straightforward way to do this? You can assume the availability of standard Unix utilities such as awk, gawk, cut, grep, etc.
Note the presence of the double quotes ("), which escape the , and newline characters and make trivial attempts such as
cut -d , -f 3 file.csv
futile.
Answered by hek2mgl
As chepner said, you are encouraged to use a programming language which is able to parse CSV.
Here is an example in Python:
import csv

# newline='' lets the csv module handle newlines inside quoted fields (Python 3)
with open('a.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, quotechar='"')
    for row in reader:
        print(row[-1])  # row[-1] gives the last column
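To try this without creating a.csv, the following sketch feeds the sample data from the question through an in-memory buffer and picks the column by its header name. The names SAMPLE and time_column are illustrative only, and csv.DictReader is swapped in for csv.reader so that the column does not have to be the last one.

```python
import csv
import io

# Sample data from the question, including the quoted comma and newline.
SAMPLE = (
    'id,message,time\n'
    '123,"Sorry, This message\n'
    'has commas and newlines",2016-03-28T20:26:39\n'
    '456,"It makes the problem non-trivial",2016-03-28T20:26:41\n'
)


def time_column(text):
    """Return the values of the 'time' column (header excluded)."""
    # newline='' keeps embedded newlines intact for the csv module.
    reader = csv.DictReader(io.StringIO(text, newline=''))
    return [row['time'] for row in reader]


if __name__ == '__main__':
    for value in time_column(SAMPLE):
        print(value)
```

Unlike the command in the question, this returns only the values; print the header separately if it is needed.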
Answered by SriniV
As said here
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file.csv \
  | awk -F, '{print $NF}'
To handle specifically those newlines that are in doubly-quoted strings and leave those alone that are outside them, using GNU awk (for RT):
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file
This works by splitting the file along " characters and removing newlines in every other block.
Output
time
2016-03-28T20:26:39
2016-03-28T20:26:41
Then use awk to split the columns and display the last column
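The idea of splitting on " characters and removing newlines in every other block can also be sketched in Python. This is only an illustration of the technique, not a full CSV parser: it assumes the wanted column is the last one and contains no commas, and the names SAMPLE and last_column are made up for this example.

```python
# Sample data from the question.
SAMPLE = (
    'id,message,time\n'
    '123,"Sorry, This message\n'
    'has commas and newlines",2016-03-28T20:26:39\n'
    '456,"It makes the problem non-trivial",2016-03-28T20:26:41\n'
)


def last_column(text):
    """Quote-splitting approach: remove newlines inside quotes, then split on commas."""
    blocks = text.split('"')
    # Blocks at odd indexes (1, 3, ...) lie between a pair of quotes.
    for i in range(1, len(blocks), 2):
        blocks[i] = blocks[i].replace('\n', '')
    joined = '"'.join(blocks)
    # Each record now sits on one line; take its last comma-separated field.
    return [line.split(',')[-1] for line in joined.splitlines() if line]


if __name__ == '__main__':
    print('\n'.join(last_column(SAMPLE)))
```

A doubled quote ("") inside a quoted field produces an empty block between two quoted blocks, so the odd/even alternation still holds.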
Answered by Aaron Digulla
CSV is a format which needs a proper parser (i.e. which can't be parsed with regular expressions alone). If you have Python installed, use the csv module instead of plain Bash.
If not, consider csvkit, which has a lot of powerful tools to process CSV files from the command line.
Answered by karakfa
Another awk alternative, using FS:
$ awk -F'"' '!(NF%2){getline remainder; $0=$0 OFS remainder}
             NR>1{sub(/,/,"",$NF); print $NF}' file
2016-03-28T20:26:39
2016-03-28T20:26:41
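One way to read the approach of using the quote character as the field separator, sketched in Python: a line with an odd number of quotes still has a quoted field open, so it is joined to the following line; after that, the text behind the last quote is the time column with a leading comma. The names SAMPLE and time_values are hypothetical, and the sketch assumes at most one embedded newline per field and a quoted field directly before the time column.

```python
# Sample data from the question.
SAMPLE = (
    'id,message,time\n'
    '123,"Sorry, This message\n'
    'has commas and newlines",2016-03-28T20:26:39\n'
    '456,"It makes the problem non-trivial",2016-03-28T20:26:41\n'
)


def time_values(text):
    """Join lines that leave a quote open, then take what follows the last quote."""
    lines = iter(text.splitlines())
    records = []
    for line in lines:
        # An odd number of quotes means a quoted field continues on the next line.
        if line.count('"') % 2 == 1:
            line = line + ' ' + next(lines, '')
        records.append(line)
    # Skip the header; drop the comma that follows the closing quote.
    return [rec.split('"')[-1].replace(',', '', 1) for rec in records[1:]]


if __name__ == '__main__':
    print('\n'.join(time_values(SAMPLE)))
```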
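Answered by Brian Chrisman
I ran into something similar when attempting to deal with lspci -m output, but the embedded newlines would need to be escaped first (though IFS=, should work here, since it abuses bash's quote evaluation). Here's an example:
f:13.3 "System peripheral" "Intel Corporation" "Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"
And the only reasonable way I can find to bring that into bash is along the lines of:
# echo 'f:13.3 "System peripheral" "Intel Corporation" "Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"' | { eval array=($(cat)); declare -p array; }
declare -a array='([0]="f:13.3" [1]="System peripheral" [2]="Intel Corporation" [3]="Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" [4]="-r01" [5]="Super Micro Computer Inc" [6]="Device 0838")'
#
Not a full answer, but might help!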
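For comparison, the same shell-style quote splitting can be done in Python with shlex.split, which tokenizes the line using shell quoting rules without evaluating it. This is a sketch on the lspci -m line quoted in this answer; the names LINE and fields are illustrative, and the approach is swapped in for the eval trick rather than taken from the answer.

```python
import shlex

# The lspci -m line from the answer, split across string literals for readability.
LINE = (
    'f:13.3 "System peripheral" "Intel Corporation" '
    '"Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - '
    'Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"'
)

# shlex.split honours the double quotes, so quoted fields keep their spaces.
fields = shlex.split(LINE)

if __name__ == '__main__':
    for index, field in enumerate(fields):
        print(index, field)
```

Because nothing is evaluated, untrusted input cannot run commands the way it could through eval.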