How to handle commas within a CSV file being read by a bash script

Question

Asked by chrisbunney

I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).

The content of the CSV file looks similar to this:

Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

I have some code that can separate the fields using the comma as delimiter, but some values actually contain commas, such as Adygeya, Republic. These values are surrounded by quotes to indicate the characters within should be treated as part of the field, but I don't know how to parse it to take this into account.

Currently I have this loop:

while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input

which produces this output for the sample data given above:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]

As you can see, the third entry is parsed incorrectly. I want it to output

[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]

Answer 1

Answered by Dimitre Radoulov

If you want to do it all in awk (GNU awk 4 is required for this script to work as intended):

awk '{ 
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" && 
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }   
 }' FPAT='([^,]+)|("[^"]+")' infile

Sample output:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{    
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]

With Perl:

perl -MText::ParseWords -lne'
 print join " ", map "[$_]", 
   parse_line(",",0, $_);
  ' infile

This should work with your awk version (based on this c.u.s. post; it removes the embedded commas too):

awk '{
 n = parse_csv($0, data)
 for (i = 0; ++i <= n;) {
    gsub(/,/, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
    }
  }
function parse_csv(str, array,   field, i) { 
  split( "", array )
  str = str ","
  while ( match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/) ) { 
    field = substr(str, 1, RLENGTH)
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)
    gsub(/""/, "\"", field)
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile
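
For readers who would rather stay in bash, the same quote-aware splitting can be sketched as a small character scanner. This is only an illustration under stated assumptions, not part of the answer above: split_csv_line and the CSV_FIELDS array are names made up here, each record is assumed to sit on a single line, and a doubled quote ("") inside a quoted field is treated as an escaped quote.

```shell
#!/usr/bin/env bash
# split_csv_line: hypothetical helper. Splits one CSV line into the global
# array CSV_FIELDS, treating commas inside double quotes as field content.
split_csv_line() {
    local line=$1 field="" ch next in_quotes=0 i
    CSV_FIELDS=()
    for ((i = 0; i < ${#line}; i++)); do
        ch=${line:i:1}
        next=${line:i+1:1}
        if ((in_quotes)); then
            if [[ $ch == '"' && $next == '"' ]]; then
                field+='"'          # "" inside quotes is an escaped quote
                i=$((i + 1))        # skip the second quote of the pair
            elif [[ $ch == '"' ]]; then
                in_quotes=0         # closing quote
            else
                field+=$ch          # ordinary character, commas included
            fi
        elif [[ $ch == '"' ]]; then
            in_quotes=1             # opening quote
        elif [[ $ch == ',' ]]; then
            CSV_FIELDS+=("$field")  # an unquoted comma ends the field
            field=""
        else
            field+=$ch
        fi
    done
    CSV_FIELDS+=("$field")          # the last field has no trailing comma
}

while IFS= read -r line; do
    split_csv_line "$line"
    printf '[%s] [%s] [%s] [%s] [%s]\n' "${CSV_FIELDS[@]}"
done <<'EOF'
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
EOF
# prints:
# [Piaui] [BR-PI] [20100] [BR] [Brazil]
# [Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
```

Scanning character by character is slow on large files, so for anything beyond a few thousand rows the awk or Perl versions are probably the better choice.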

Answer 2

Answered by jaypal singh

After looking at @Dimitre's solution over here, you can do something like this:

#!/usr/local/bin/gawk -f

BEGIN {
    FS = ","
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    for (i = 1; i <= NF; i++)
        printf("[%s] ", $i)
    print ""
}

Test:

[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

[jaypal:~/Temp] ./script.awk  filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation]

To remove the " characters, you can pipe the output to sed:

[jaypal:~/Temp] ./script.awk  filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]

Answer 3

Answered by chrisbunney

After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.

To that end, I've concocted a sed command that matches strings surrounded by double quotes that contain a comma. The command then removes the bits you don't want from the matched string. It does this by separating the regex into remembered sections.

This solution only works where the string contains a single comma between double quotes.

The unescaped regex is

(")(.*)(,)(.*)(")

The first, third, and fifth pairs of parentheses capture the opening double quote, comma, and closing double quote respectively.

The second and fourth pairs of parentheses capture the actual content of the field which we want to keep.

sed Command To Remove Comma:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\1\2\4\5/'

sed Command To Remove Comma and Double Quotes:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/'

Updated Code:

tmpFile="${input}Temp"
sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/' < "$input" > "$tmpFile"
while IFS=, read -r province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$tmpFile"
rm "$tmpFile"

Output:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
[Bío-Bío] [CL-BI] [20154] [CL] [Chile]
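
To make the behaviour of the sed substitution in this answer concrete, including the single-comma limitation, here is a small sketch. It assumes the replacement keeps only the second and fourth groups (\2\4), and strip_quoted_comma is a name invented for this example.

```shell
#!/usr/bin/env bash
# strip_quoted_comma: hypothetical filter. Groups 1, 3 and 5 are the opening
# quote, the comma and the closing quote; only groups 2 and 4 are kept.
strip_quoted_comma() {
    sed 's/\("\)\(.*\)\(,\)\(.*\)\("\)/\2\4/'
}

printf '%s\n' '"Adygeya, Republic",RU-AD,21250,RU,Russian Federation' | strip_quoted_comma
# prints: Adygeya Republic,RU-AD,21250,RU,Russian Federation

# Two commas inside the quotes: the greedy .* leaves the first one in place.
printf '%s\n' '"a, b, c",x' | strip_quoted_comma
# prints: a, b c,x
```

The second call shows why this only suits data with at most one comma per quoted field: the leftover comma would still split the field in the read loop.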

Answer 4

Answered by chrisbunney

Owing to the slightly outdated version of awk on my system and a personal preference to stick to a Bash script, I've arrived at a slightly different solution.

I've produced a utility script, based on this blog post, that parses the CSV file and replaces the delimiters with a delimiter of your choice so that the output can be captured and used to easily process the data. The script respects quoted strings and embedded commas, but it removes the double quotes it finds and doesn't work with escaped double quotes within fields.

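#!/bin/bash

input=$1
delimiter=$2

if [ -z "$input" ];
then
    echo "Input file must be passed as an argument!"
    exit 98
fi

if ! [ -f "$input" ] || ! [ -e "$input" ];
then
    echo "Input file '$input' doesn't exist!"
    exit 99
fi

if [ -z "$delimiter" ];
then
    echo "Delimiter character must be passed as an argument!"
    exit 98
fi

gawk '{
    c=0
    $0=$0","                                   # yes, cheating
    while($0) {
        delimiter=""
        if (c++ > 0) # Evaluate and then increment c
        {
            delimiter="'$delimiter'"
        }

        match($0,/ *"[^"]*" *,|[^,]*,/)
        s=substr($0,RSTART,RLENGTH)             # save what matched in s
        gsub(/^ *"?|"? *,$/,"",s)               # remove extra stuff
        printf "%s%s", delimiter, s
        $0=substr($0,RLENGTH+1)                 # "consume" what matched
    }
    printf ("\n")
}' "$input"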

Just posting it up in case someone else finds it useful.
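
As a rough sketch of the consuming side, the re-delimited output can be read with the same loop as in the question, just with a different IFS. The '|' delimiter and the function name print_records are assumptions made for this example, and the here-document stands in for the utility script's output.

```shell
#!/usr/bin/env bash
# print_records: hypothetical consumer for '|'-delimited records on stdin.
print_records() {
    local province provinceCode criteriaId countryCode country
    while IFS='|' read -r province provinceCode criteriaId countryCode country; do
        echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
    done
}

# Sample input as it would look after the commas between fields were replaced.
print_records <<'EOF'
Piaui|BR-PI|20100|BR|Brazil
Adygeya, Republic|RU-AD|21250|RU|Russian Federation
EOF
# prints:
# [Piaui] [BR-PI] [20100] [BR] [Brazil]
# [Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
```

The delimiter has to be a character that never occurs in the data, otherwise the embedded-delimiter problem simply moves from commas to that character.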

Answer 5

Answered by D Bro

If you can tolerate having the surrounding quotes persist in the output, you can use a small script I wrote called csvquote to enable awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:

csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u

See https://github.com/dbro/csvquote for the code and documentation.
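
The general wrap-and-unwrap idea can be sketched in plain bash: hide the commas that sit inside quotes behind a character the data does not contain, run the ordinary tool, then put the commas back. This is an illustration only and not csvquote's code; protect_commas, restore_commas and the choice of the ASCII unit separator are assumptions of the sketch, and newlines inside quoted fields are not handled here.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the wrap/unwrap steps; not csvquote itself.
US=$'\x1f'   # ASCII unit separator, assumed absent from the data

protect_commas() {
    local line ch out in_quotes i
    while IFS= read -r line; do
        out="" in_quotes=0
        for ((i = 0; i < ${#line}; i++)); do
            ch=${line:i:1}
            if [[ $ch == '"' ]]; then
                in_quotes=$((1 - in_quotes))   # toggle on every double quote
            elif [[ $ch == ',' ]] && ((in_quotes)); then
                ch=$US                          # hide a comma inside quotes
            fi
            out+=$ch
        done
        printf '%s\n' "$out"
    done
}

restore_commas() {
    tr "$US" ','   # turn the placeholder back into a comma
}

printf '%s\n' '"Adygeya, Republic",RU-AD,21250,RU,Russian Federation' |
    protect_commas | cut -d, -f1,2 | restore_commas
# prints: "Adygeya, Republic",RU-AD
```

As with the wrapped command in this answer, the surrounding quotes stay in the output.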

Answer 6

Answered by Sven L.

Using Dimitre's solution (thank you for that) I noticed that his program ignores empty fields.

Here is the fix:

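awk '{ 
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" && 
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }   
 }' FPAT='([^,]*)|("[^"]+")' infile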
