Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/612439/

Date: 2020-09-17 20:43:05  Source: igfitidea

sort | uniq | xargs grep ... where lines contain spaces

Tags: bash, command-line, scripting, cygwin

Asked by Sukotto

I have a comma-delimited file "myfile.csv" where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).

I'm using a bash shell via cygwin for WinXP

$ cut -d, -f 5 myfile.csv | sort | uniq -d 

correctly returns a list of the duplicate dates

01/01/2005 00:22
01/01/2005 00:37
[snip]    
02/29/2009 23:54

But I cannot figure out how to feed this to grep to give me all the rows. Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d, but for some reason, combining those flags causes uniq to (apparently) return nothing.
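To see the splitting problem in isolation, here is a minimal sketch. The sample value is made up for illustration and stands in for one line of the uniq -d output. By default xargs breaks its input on any whitespace, so a single date/time line turns into two separate arguments:

```shell
#!/usr/bin/env bash
# Hypothetical sample value standing in for one line of `uniq -d` output.
line='01/01/2005 00:22'

# Default xargs splits on blanks as well as newlines, so echo is run twice:
# once with the date and once with the time.
items=$(printf '%s\n' "$line" | xargs -n 1 echo)
printf '%s\n' "$items"
```

With the whole line split like this, grep would receive the date as the pattern and the time as a file name, which is not what is wanted.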

So, given that

 $ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

doesn't work... what can I do?

I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard command-line tools like sort, uniq, find, grep, cut, etc.

Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?

Answered by Andrew Barnett

  1. sort -k5,5 will do the sort on fields and avoid the cut (with tab-separated input, add -t$'\t' so the date and the time stay together in one key);
  2. uniq -f 4 will ignore the first 4 fields for the uniq;
  3. Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one);
  4. but uniq will expect blank-delimited (space or tab) fields instead of csv, so tr ',' '\t' to fix that.

Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w to the uniq to limit the comparison. The comparison starts at the tab in front of field 5, so that is -w 17 (to include time), or -w 11 (for just dates). This also assumes the first four fields are non-empty and contain no spaces, since uniq treats any run of blanks as a field separator.

So:

tr ',' '\t' < myfile.csv | sort -t$'\t' -k5,5 | uniq -f 4 -D -w 17
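The field-based approach can be tried out on a small file. This is only a sketch under stated assumptions: GNU sort and uniq (for -D and -w), made-up sample rows, and first four fields that are non-empty and free of spaces. It converts commas to tabs, sorts on the whole fifth field, and lets uniq compare 17 characters, because uniq counts the tab in front of field 5 along with the 16 characters of the date/time. The final tr, which turns the tabs back into commas, is an addition for readability.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real myfile.csv is not shown in the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/28/2009 23:54,w
5,a,b,c,01/01/2005 00:37,v
EOF

# Commas -> tabs, sort on the full 5th field, then print every line whose
# 5th field repeats. -w 17 = the tab before field 5 + 16 date/time characters.
dups=$(tr ',' '\t' < "$csv" \
  | sort -t$'\t' -k5,5 \
  | uniq -f 4 -D -w 17 \
  | tr '\t' ',')
printf '%s\n' "$dups"
rm -f "$csv"
```

Row 4 has a unique time stamp, so it should be the only row missing from the output.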

Answered by kmkaplan

The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:

tr '\n' '\000'

to get NUL-separated rows. sort, uniq and xargs all have options to handle that. Try something like:

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

Edit: the position of tr in the pipe was wrong.
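A small experiment may make the NUL-separation point concrete. This sketch assumes GNU sort, uniq and xargs, and uses made-up sample rows. It first runs uniq -d -z on ordinary newline-separated input, where the whole stream counts as a single record and so nothing is reported as a duplicate, and then runs the NUL-separated pipeline:

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real myfile.csv is not shown in the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/28/2009 23:54,w
5,a,b,c,01/01/2005 00:37,v
EOF

# Newline-separated input: uniq -z sees one big record, so -d prints nothing.
broken=$(cut -d, -f 5 "$csv" | sort | uniq -d -z | tr '\0' '\n')

# NUL-separated from the start: every tool in the pipe agrees on the records.
fixed=$(cut -d, -f 5 "$csv" | tr '\n' '\0' | sort -z | uniq -d -z \
  | xargs -0 -I {} grep '{}' "$csv")

printf 'without tr: [%s]\n' "$broken"
printf '%s\n' "$fixed"
rm -f "$csv"
```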

Answered by Andru Luvisi

You can tell xargs to use each line as an argument in its entirety using the -d option. Try:

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
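As a rough check of the -d approach, the following sketch runs it against made-up sample rows. It assumes GNU xargs, since -d is a GNU extension. Each duplicated date/time reaches grep as one pattern, spaces included:

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real myfile.csv is not shown in the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/28/2009 23:54,w
5,a,b,c,01/01/2005 00:37,v
EOF

# -d '\n' makes xargs treat each whole input line as one item.
rows=$(cut -d, -f 5 "$csv" | sort | uniq -d \
  | xargs -d '\n' -I '{}' grep '{}' "$csv")
printf '%s\n' "$rows"
rm -f "$csv"
```

grep reads the date as a regular expression here. Adding -F would make it a fixed string, which is probably safer if the values could ever contain characters such as dots.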

Answered by Andru Luvisi

Try escaping the spaces with sed:

echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv

(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
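The array idea mentioned in passing could look roughly like this. It is a sketch with made-up sample rows, not the answerer's own code, and the variable names are placeholders. Setting IFS to a newline makes the unquoted command substitution split on line boundaries only, so each date/time lands in one array element:

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real myfile.csv is not shown in the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/28/2009 23:54,w
5,a,b,c,01/01/2005 00:37,v
EOF

# Split the uniq -d output on newlines only; set -f keeps globbing out of it.
old_ifs=$IFS
IFS=$'\n'
set -f
dups=( $(cut -d, -f 5 "$csv" | sort | uniq -d) )
set +f
IFS=$old_ifs

# One grep per duplicated date/time; -F treats it as a fixed string.
rows=$(for d in "${dups[@]}"; do grep -F -- "$d" "$csv"; done)
printf '%s\n' "$rows"
rm -f "$csv"
```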

Answered by porges

This is a good candidate for awk:

BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }

  1. Set field separator to ',' (CSV).
  2. Split fifth field on the space, stick result in A.
  3. Concatenate the line number to the list of what we have already stored for that date.
  4. Print out the line numbers for each date.
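The steps above can be tried on a small file. This sketch uses made-up sample rows and indexes the split result with A[1], since awk's split fills the array starting at index 1. It groups by the date part only (the time is ignored) and lists every date, not just the repeated ones. The trailing sort is there only because awk's for-in order is unspecified.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real myfile.csv is not shown in the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,a,b,c,01/01/2005 00:22,x
2,a,b,c,01/01/2005 00:37,y
3,a,b,c,01/01/2005 00:22,z
4,a,b,c,02/28/2009 23:54,w
5,a,b,c,01/01/2005 00:37,v
EOF

# Collect, per date, the line numbers on which that date appears.
out=$(awk '
BEGIN { FS = "," }
{ split($5, A, " "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
' "$csv" | sort)
printf '%s\n' "$out"
rm -f "$csv"
```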