bash - Parsing a CSV file using gawk

Disclaimer: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/314384/

Date: 2020-09-17 20:33:03  Source: igfitidea

Parsing a CSV file using gawk

Tags: bash, csv, awk, gawk

Asked by MCS

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field with a comma inside will be treated as multiple fields.

Example using FS="," which does not work:

file contents:

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

bad output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------

Accepted answer by Jonathan Leffler

The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.

The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.

I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl which convert awk and sed scripts (respectively) into Perl.

Answer by BCoates

The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables FS and specifies fields by content instead of by separator.

Answer by D Bro

You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:

before:

gawk -f mypgoram.awk input.csv

after:

csvquote input.csv | gawk -f mypgoram.awk | csvquote -u

See https://github.com/dbro/csvquote for code and documentation.

Answer by ayaz

If permissible, I would use the Python csv module, paying special attention to the dialect used and formatting parameters required, to parse the CSV file you have.

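A minimal sketch of that approach, driven from the shell (the file name sample.csv is an assumption):

```shell
# Recreate the question's sample file (name sample.csv assumed here),
# then let Python's csv.reader do the parsing; it handles quoted fields,
# embedded commas, and doubled quotes natively. Dialect and formatting
# parameters are keyword arguments to csv.reader if the defaults don't fit.
printf '%s\n' 'one,two,"three, four",five' '"six, seven",eight,"nine"' > sample.csv

python3 - sample.csv <<'EOF'
import csv
import sys

with open(sys.argv[1], newline='') as fh:
    for row in csv.reader(fh):
        for i, field in enumerate(row, start=1):
            print(f"field #{i}: {field}")
        print("---------------------------")
EOF
```

Unlike the FPAT approach, the csv module strips the surrounding quotes, so the third field of the first record comes out as three, four rather than "three, four".
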
Answer by ayaz

csv2delim.awk

# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=r] input-file > output-file
#       -v delim    delimiter, defaults to tab
#       -v repl     replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}

{
    #if ($0 ~ repl) {
    #    print "Replacement character " repl " is on line " FNR ":" lineIn ";" > "/dev/stderr"
    #}

    if ($0 ~ delim) {
        print "Temp delimiter character " delim " is on line " FNR ":" lineIn ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])\"\"/, "\\1'", "g")
    # $0 = gensub(/\"\"([^,])/, "'\\1", "g")    # not needed, above covers all cases

    out = ""
    #for (i = 1; i <= length($0); i++)
    n = length($0)
    for (i = 1; i <= n; i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1    # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec   ond,"third"
"first" , "second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat

rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt

Answer by kbulgrien

{
    ColumnCount = 0
    $0 = $0 ","                           # Assures all fields end with comma
    while ($0)                            # Get fields by pattern, not by delimiter
    {
        match($0, / *"[^"]*" *,|[^,]*,/)  # Find a field with its delimiter suffix
        Field = substr($0, RSTART, RLENGTH)   # Get the located field with its delimiter
        gsub(/^ *"?|"? *,$/, "", Field)   # Strip delimiter text: comma/space/quote
        Column[++ColumnCount] = Field     # Save field without delimiter in an array
        $0 = substr($0, RLENGTH + 1)      # Remove processed text from the raw data
    }
}

Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0, mentioned in a previous answer.

Reference

Answer by Vijay Dev

I am not exactly sure whether this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none. Btw, awk allows regexes to be field separators. Check if that is useful.

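To illustrate the regex-FS remark: FS may be a regular expression, which helps with whitespace around delimiters, though it still cannot keep a quoted comma inside its field:

```shell
# " *, *" as FS swallows the blanks around each comma; this works in
# any POSIX awk, not just gawk.
echo 'alpha , beta , gamma' | awk 'BEGIN { FS = " *, *" } { print $2 }'
# prints: beta
```
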
Answer by Chris Koknat

Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness.
Alternately try the Text::CSV module.

perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv

Produces this output:

field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---

Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"

#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}

You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl.

Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.

If none of your versions of Perl have Text::CSV_XS installed, you'll need to:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS

Answer by MCS

Here's what I came up with. Any comments and/or better solutions would be appreciated.

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------\n"
  n = 0    # reset the field counter for the next record
}

The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.
