linux awk comparing two csv files and creating a new file with a flag
Source: http://stackoverflow.com/questions/9528202/
Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Asked by Sanath
I have 2 CSV files that I need to compare, writing the differences to a new, formatted file. Samples are given below.
OLD file
DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
NEW file
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
Output file
I want to compare the old and new CSV files, find the changes that have taken effect in the new file, and set a FLAG to denote each change:
U - if the new file record is UPDATED
D - if a record existing in the old file is deleted in the new file
N - if a record existing in the new file is not available in the old file
The sample output file is this.
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N
I used the diff command, but it also repeats the UPDATED records, which is not what I want.
1,3c1,2
< DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
< DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
< DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
---
> DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
> DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
5a5
> DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
I also used an AWK one-liner to filter my records:
awk 'NR==FNR{A[$1];next}!($1 in A)' FS=: old.csv new.csv
The problem with this is that it doesn't give me the records that belong only to the OLD file, which is
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
I also started an AWK-driven bash script to achieve this, but couldn't find a good example to help.
myscript.awk
BEGIN {
    FS = ","   # input field separator
    OFS = ","  # output field separator
}
NR > 1 {
    # flag: N - new record, D - deleted, U - updated
    id = $1    # assumed field number (lost in the source)
    name = $2  # assumed field number (lost in the source)
    flag = "N"
    # This prints the columns in the new order. The commas tell Awk to use the character set in OFS
    print id, name, flag
}
>> awk -f myscript.awk old.csv new.csv > formatted.csv
Accepted answer by potong
This might work for you:
diff -W999 --side-by-side OLD NEW |
sed '/^[^\t]*\t\s*|\t\(.*\)/{s// U/;b};/^\([^\t]*\)\t*\s*<$/{s// D/;b};/^.*>\t\(.*\)/{s// N/;b};d'
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N
An awk solution along the same lines:
diff -W999 --side-by-side OLD NEW |
awk '/[|][\t]/{split($0,a,"[|][\t]");print a[2]" U"};/[\t] *<$/{split($0,a,"[\t]* *<$");print a[1]" D"};/>[\t]/{split($0,a,">[\t]");print a[2]" N"}'
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N
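For comparison, the same flags can probably be produced with awk alone, without going through diff. The following is only a sketch under one assumption that the question does not state: that the second field uniquely identifies a record in both files. The scratch directory and the `result` variable are illustrative names.

```shell
#!/bin/sh
# Sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
cat > "$dir/old.csv" <<'EOF'
DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
EOF
cat > "$dir/new.csv" <<'EOF'
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
EOF

# ASSUMPTION: field 2 is a unique record key in both files.
result=$(awk -F, '
    NR == FNR     { old[$2] = $0; next }        # first file: store each old record by key
    !($2 in old)  { print $0 " N"; next }       # key absent from the old file: new record
    old[$2] != $0 { print $0 " U" }             # same key, different content: updated
                  { delete old[$2] }            # key was seen in the new file, so not deleted
    END           { for (k in old) print old[k] " D" }  # keys never seen in the new file
' "$dir/old.csv" "$dir/new.csv" | sort -t, -k2,2)   # sort by key for a stable order

printf '%s\n' "$result"
rm -rf "$dir"
```

The final `sort` is there because awk does not guarantee the order of `for (k in old)`, and the deleted records are only known once the whole new file has been read.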
Answer by lnafziger
A good starting point would probably be:
diff -e OLD NEW
This outputs:
5a
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
.
1,3c
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
.
Meaning that it added a record after line 5 (5a) and changed the records on lines 1 through 3 (1,3c).
If you can't use this format as-is (using a standard format would be good), then you would need to write a script that converts it to the format you describe.
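One possible shape for such a conversion script is sketched below. It post-processes the default `diff` output (lines prefixed with `< ` and `> `) rather than the `-e` format, because the ed-script format does not include the removed lines and so cannot supply the D records. It also assumes, beyond what the question states, that the second CSV field is a unique record key; the array names `oldrec` and `newrec` are illustrative.

```shell
#!/bin/sh
# Sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
cat > "$dir/old.csv" <<'EOF'
DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
EOF
cat > "$dir/new.csv" <<'EOF'
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
EOF

# diff exits with status 1 when the files differ; treat that as success here.
# ASSUMPTION: field 2 of each CSV record is a unique key.
result=$({ diff "$dir/old.csv" "$dir/new.csv" || [ "$?" -eq 1 ]; } | awk '
    /^< / { rec = substr($0, 3); split(rec, f, ","); oldrec[f[2]] = rec }  # line only in OLD
    /^> / { rec = substr($0, 3); split(rec, f, ","); newrec[f[2]] = rec }  # line only in NEW
    END {
        # A key on both sides of the diff was updated; only in NEW means added.
        for (k in newrec) print newrec[k] ((k in oldrec) ? " U" : " N")
        # A key that appears only on the OLD side was deleted.
        for (k in oldrec) if (!(k in newrec)) print oldrec[k] " D"
    }
' | sort -t, -k2,2)   # awk array order is unspecified, so sort by key

printf '%s\n' "$result"
rm -rf "$dir"
```

Unchanged records never appear in the diff output, so they are dropped without any extra filtering.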