bash: Merge CSV files using join/awk/sed

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27600967/

Date: 2020-09-18 12:03:26  Source: igfitidea

Merge CSV files using join/awk/sed

Tags: bash, csv, join, awk, sed

Asked by Adrien

Could you please help me find the bash command that will join/merge the following CSV files, "template.csv + file1.csv + file2.csv + file3.csv + ... + fileX.csv", into "output.csv"?

For each line in template.csv, append the associated values (if they exist) listed in each fileX.csv, as shown below:

template.csv:

header
1
2
3
4
5
6
7
8
9

file1.csv:

header,value1
2,value12
3,value13
7,value17
8,value18
9,value19

file2.csv:

header,value2
1,value21
2,value22
3,value23
4,value24

file3.csv:

header,value3
2,value32
4,value34
6,value36
7,value37
8,value38

output.csv:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

My template file contains 35137 lines.
I have already written a bash script that does this merge (based on "do while" loops, etc.), but its performance is poor: it takes far too long to produce output.csv. I'm sure the same thing can be done with join, awk, ... but I don't see how.

IMPORTANT UPDATE

The first column of my real files contains a datetime, not a simple number, so the script must take into account the space between the date and the time. Sorry for the update!

The script should now be designed with the CSV files below as the example:

template.csv:

header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00

file1.csv:

header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19

file2.csv:

header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24

file3.csv:

header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38

output.csv:

header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Accepted answer by user43791

You could use multiple calls to join:

join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv

Or, more clearly:

alias myjoin='join -t , -a 1 -o auto'
myjoin template.csv file1.csv | myjoin - file2.csv | myjoin - file3.csv

Explanation:

  • -t , specifies the field separator (,)
  • -a 1 prints unpairable lines coming from the first file (this assumes that template.csv contains every key that appears in the other files)
  • -o auto controls the output format and is necessary to print the empty fields
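
Since the question asks for an arbitrary number of fileX.csv inputs, the same join call can be folded over a list of files instead of being written out as one long pipeline. The following is only a sketch under some assumptions: it relies on GNU join (-o auto is a GNU coreutils extension), the scratch directory and the output.tmp name are made up for the illustration, and the sample data is the one from the question.

```shell
#!/bin/bash
# Sketch: merge any number of value files into the template with GNU join.
export LC_ALL=C                      # byte-wise collation, so ISO datetimes compare as sorted

workdir=$(mktemp -d)                 # hypothetical scratch directory for the demo
cd "$workdir" || exit 1

cat > template.csv <<'EOF'
header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00
EOF
cat > file1.csv <<'EOF'
header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19
EOF
cat > file2.csv <<'EOF'
header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24
EOF
cat > file3.csv <<'EOF'
header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38
EOF

files=(file1.csv file2.csv file3.csv)   # extend this list with further fileX.csv names

cp template.csv output.csv              # start from the template ...
for f in "${files[@]}"; do              # ... and left-join one file per iteration
    join -t , -a 1 -o auto output.csv "$f" > output.tmp || exit 1
    mv output.tmp output.csv
done

cat output.csv
```

Each iteration rewrites output.csv once, so the cost grows with the number of files rather than with the number of template lines, which is where a line-by-line shell loop gets slow.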

Proof:

$ join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Note:

For this to work, the files MUST be sorted on the join field (the first column, i.e. the header column in your case). You can use the sort command if this is not the case.
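
If a file is not sorted, one way to sort it while keeping its header on the first line is to split off that line before sorting. This is a minimal sketch: the function name sort_body, the scratch directory, and the shuffled file1.unsorted.csv are made up for the illustration, and the rows are taken from file1.csv in the question.

```shell
#!/bin/bash
# Sketch: sort a CSV on its first column but leave the header line in place.
workdir=$(mktemp -d)                 # hypothetical scratch directory for the demo
cd "$workdir" || exit 1

# Hypothetical unsorted input: rows from file1.csv in shuffled order.
cat > file1.unsorted.csv <<'EOF'
header,value1
2000-01-01 01:30:00,value17
2000-01-01 00:15:00,value12
2000-01-01 02:00:00,value19
2000-01-01 00:30:00,value13
2000-01-01 01:45:00,value18
EOF

sort_body() {
    head -n 1 "$1"                                 # pass the header through untouched
    tail -n +2 "$1" | LC_ALL=C sort -t , -k 1,1    # sort the remaining lines on column 1
}

sort_body file1.unsorted.csv > file1.csv
cat file1.csv
```

LC_ALL=C is used here so that sort and join agree on the collation order; join should then be run under the same locale setting.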

Answer by Ed Morton

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR { key[++numRows] = $1 }
{ fld[$1,ARGIND] = $NF }
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", fld[key[rowNr],colNr], (colNr<ARGIND ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk template.csv file1.csv file2.csv file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

The above uses GNU awk for ARGIND; with other awks, just add a line that says FNR==1 { ++ARGIND }.
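
For reference, the portable variant could look like the sketch below. It is an assumption-laden illustration rather than the answerer's exact code: it counts files in its own variable (fileNr, a made-up name) instead of assigning to ARGIND, so the same script should behave identically under GNU awk and other awks. The scratch directory is likewise hypothetical; the data is the sample from the question.

```shell
#!/bin/bash
# Sketch: the awk merge with a hand-rolled file counter instead of gawk's ARGIND.
workdir=$(mktemp -d)                 # hypothetical scratch directory for the demo
cd "$workdir" || exit 1

printf '%s\n' header \
    '2000-01-01 00:00:00' '2000-01-01 00:15:00' '2000-01-01 00:30:00' \
    '2000-01-01 00:45:00' '2000-01-01 01:00:00' '2000-01-01 01:15:00' \
    '2000-01-01 01:30:00' '2000-01-01 01:45:00' '2000-01-01 02:00:00' > template.csv
printf '%s\n' 'header,value1' \
    '2000-01-01 00:15:00,value12' '2000-01-01 00:30:00,value13' \
    '2000-01-01 01:30:00,value17' '2000-01-01 01:45:00,value18' \
    '2000-01-01 02:00:00,value19' > file1.csv
printf '%s\n' 'header,value2' \
    '2000-01-01 00:00:00,value21' '2000-01-01 00:15:00,value22' \
    '2000-01-01 00:30:00,value23' '2000-01-01 00:45:00,value24' > file2.csv
printf '%s\n' 'header,value3' \
    '2000-01-01 00:15:00,value32' '2000-01-01 00:45:00,value34' \
    '2000-01-01 01:15:00,value36' '2000-01-01 01:30:00,value37' \
    '2000-01-01 01:45:00,value38' > file3.csv

awk '
BEGIN     { FS = OFS = "," }
FNR == 1  { ++fileNr }                      # a new input file starts: bump our own counter
NR == FNR { key[++numRows] = $1 }           # first file (template): remember key order
          { fld[$1, fileNr] = $NF }         # store the last field per (key, file) pair
END {
    for (rowNr = 1; rowNr <= numRows; rowNr++)
        for (colNr = 1; colNr <= fileNr; colNr++)
            printf "%s%s", fld[key[rowNr], colNr], (colNr < fileNr ? OFS : ORS)
}
' template.csv file1.csv file2.csv file3.csv > output.csv

cat output.csv
```

Because the template has a single column, $NF is the key itself for that file, which is why column 1 of the output is the key. The inputs do not need to be sorted for this approach; the row order comes from template.csv.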

Answer by Wintermute

This should work (see the comments for an explanation):

#!/bin/sh

awk -F, -v file=0 '
  FNR == 1 {                     # first line in the file
    if(file == 0) {              # if in first file (template.csv):
      header = $1                # init header
    } else {
      header = header "," $2     # else append field name
    }
    next                         # forward to next line.
  }
  file == 0 {                    # if in first file:
    key[FNR] = $1                # remember key
    next                         # next line.
  }
  {
    field[$1][file] = $2         # otherwise: remember field
  }
  ENDFILE {                      # at the end of a file:
    file = file + 1              # increase counter
  }
  END {                          # in the end, assemble and
    print header                 # print lines.
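    n = asort(key)               # sort keys; n is the number of keys
    for(k = 1; k <= n; ++k) {    # walk in sorted order (for-in order is unspecified)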
      line = ""
      for(i = 1; i < file; ++i) {
        line = line "," field[key[k]][i]
      }
      print key[k] line
    }
  }
  ' template.csv file1.csv file2.csv file3.csv

Answer by Jan Legner

I would go with this. It is surely not the fastest solution, but for your data it returns the correct result and the code is short:

#!/bin/bash
CONTENT=$(cat template.csv)
for line in $CONTENT; do
    TMP=$(echo $line)
    for file in file1.csv file2.csv file3.csv; do
        RESULT=$(grep "^$line," $file | cut -d',' -f2)
        TMP=$(echo $TMP,$RESULT)
    done
    echo $TMP
done

output:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

EDIT: my code was missing a comma (,), so it did not work properly for longer IDs.
EDIT 2: Well, it is not merely "not the fastest solution"; it is a really slow one.
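
Note that the for loop above splits the template on whitespace, so it suits the numeric keys but would cut the datetime keys from the question's update in two. A line-oriented variant might look like the sketch below. It is only an illustration: the variable names and the scratch directory are made up, the lookup uses awk with an exact comparison on the first column instead of grep, and it is still slow because it starts one process per key and per file.

```shell
#!/bin/bash
# Sketch: line-by-line merge that tolerates spaces in the key (still slow).
workdir=$(mktemp -d)                 # hypothetical scratch directory for the demo
cd "$workdir" || exit 1

printf '%s\n' header \
    '2000-01-01 00:00:00' '2000-01-01 00:15:00' '2000-01-01 00:30:00' \
    '2000-01-01 00:45:00' '2000-01-01 01:00:00' '2000-01-01 01:15:00' \
    '2000-01-01 01:30:00' '2000-01-01 01:45:00' '2000-01-01 02:00:00' > template.csv
printf '%s\n' 'header,value1' \
    '2000-01-01 00:15:00,value12' '2000-01-01 00:30:00,value13' \
    '2000-01-01 01:30:00,value17' '2000-01-01 01:45:00,value18' \
    '2000-01-01 02:00:00,value19' > file1.csv
printf '%s\n' 'header,value2' \
    '2000-01-01 00:00:00,value21' '2000-01-01 00:15:00,value22' \
    '2000-01-01 00:30:00,value23' '2000-01-01 00:45:00,value24' > file2.csv
printf '%s\n' 'header,value3' \
    '2000-01-01 00:15:00,value32' '2000-01-01 00:45:00,value34' \
    '2000-01-01 01:15:00,value36' '2000-01-01 01:30:00,value37' \
    '2000-01-01 01:45:00,value38' > file3.csv

# IFS= and -r keep each template line intact, including the space in the datetime.
while IFS= read -r key; do
    row=$key
    for file in file1.csv file2.csv file3.csv; do
        # Print column 2 of the first line whose column 1 equals the key exactly.
        value=$(awk -F, -v k="$key" '$1 == k { print $2; exit }' "$file")
        row="$row,$value"
    done
    printf '%s\n' "$row"
done < template.csv > output.csv

cat output.csv
```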
