sort | uniq | xargs grep ... where lines contain spaces
Original question: http://stackoverflow.com/questions/612439/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): Stack Overflow.
Asked by Sukotto
I have a comma-delimited file, "myfile.csv", where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).
I'm using a bash shell via cygwin for WinXP
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows.
Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard command-line tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?
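The xargs problem mentioned in the question can be reproduced on a single made-up line. This is only an illustrative sketch; the variable names and the sample value are assumptions, not part of the original post.

```shell
# Hypothetical one-line sample standing in for the output of `uniq -d`.
dates='01/01/2005 00:22'

# By default xargs splits its input on blanks as well as newlines,
# so the date and the time arrive as two separate arguments.
split_args=$(printf '%s\n' "$dates" | xargs printf '[%s]\n')
echo "$split_args"
```

Each bracketed item in the output is one argument as xargs saw it, which is why grep never receives the full "date time" string.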
Answered by Andrew Barnett
- sort -k5,5 will do the sort on fields and avoid the cut;
- uniq -f 4 will ignore the first 4 fields for the uniq;
- Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one);
- but uniq will expect tab-delimited instead of csv, so tr ',' '\t' to fix that.
Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w 16 (to include time), or -w 10 (for just dates), to the uniq.
So:
tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
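To make the difference between -d, -D and -w concrete, here is a small sketch on made-up, already-sorted input. The sample rows are hypothetical, and the timestamp is deliberately placed at the start of each line so that -f is not needed. Both -D and -w are GNU uniq extensions, so this assumes GNU coreutils (as shipped with Cygwin).

```shell
# Hypothetical, already-sorted sample; the timestamp is the first 16 characters.
sample=$(printf '%s\n' \
    '01/01/2005 00:22,first' \
    '01/01/2005 00:22,second' \
    '01/01/2005 00:37,third')

# -w 16 compares only the first 16 characters (mm/dd/yyyy hh:mm).
# -d prints one line per repeated group.
one_each=$(printf '%s\n' "$sample" | uniq -w 16 -d)

# -D prints every line that belongs to a repeated group.
all_dups=$(printf '%s\n' "$sample" | uniq -w 16 -D)

echo "uniq -w 16 -d:"; echo "$one_each"
echo "uniq -w 16 -D:"; echo "$all_dups"
```

With -d only the "first" row is reported, while -D reports both rows that share the 00:22 timestamp.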
Answered by kmkaplan
The -z option of uniq needs the input to be NUL-separated. You can filter the output of cut through:
tr '\n' '\000'
to get NUL-separated rows. sort, uniq and xargs all have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
Edit: the position of tr in the pipe was wrong.
Answered by Andru Luvisi
You can tell xargs to use each line in its entirety as an argument with the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
Answered by Andru Luvisi
Try escaping the spaces with sed:
echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv
(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
Answered by porges
This is a good candidate for awk:
BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
- Set the field separator to ',' (CSV).
- Split the fifth field on the space and store the result in A.
- Append the line number to the list already stored for that date.
- Print out the line numbers for each date.
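The awk answer reports line numbers per date, and the time of day is dropped by the split. If the rows themselves are wanted, as the question asks, one possible variant is to read the file twice: tally each value of field 5 on the first pass, then print the rows whose value occurred more than once on the second. This is a sketch rather than part of the original answer; the helper name dup_rows and the sample rows are hypothetical, and it assumes no field contains an embedded comma.

```shell
# dup_rows FILE: print every row whose 5th comma-separated field
# (the full date/time stamp) occurs more than once in FILE.
# Hypothetical helper name; assumes no quoted commas inside fields.
dup_rows() {
    awk -F, '
        NR == FNR { count[$5]++; next }   # first pass: tally field 5
        count[$5] > 1                     # second pass: print repeated rows
    ' "$1" "$1"
}

# Made-up sample data for illustration only.
tmp=$(mktemp)
printf '%s\n' \
    'a,b,c,d,01/01/2005 00:22,x' \
    'e,f,g,h,01/01/2005 00:37,y' \
    'i,j,k,l,01/01/2005 00:22,z' > "$tmp"

result=$(dup_rows "$tmp")
echo "$result"
rm -f "$tmp"
```

Because the comparison is done on the whole field inside awk, the space in the timestamp never has to pass through xargs, and the rows come out in their original file order.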

