原文地址: http://stackoverflow.com/questions/2781491/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Counting unique values in a column with a shell script
Asked by Lilly Tooner
I have a tab delimited file with 5 columns and need to retrieve a count of just the number of unique lines from column 2. I would normally do this with Perl/Python but I am forced to use the shell for this one.
In the past I have successfully used the *nix uniq command piped to wc, but it looks like I am going to have to use awk here.
Any advice would be greatly appreciated. (I have asked a similar question previously about column checks using awk but this is a little different and I wanted to separate it so if someone in the future has this question this will be here)
Many many thanks!
Lilly
Answered by unwind
No need to use awk.
$ cut -f2 file.txt | sort | uniq | wc -l
should do it.
This uses the fact that tab is cut's default field separator, so we'll get just the content from column two this way. Then a pass through sort works as a pre-stage to uniq, which removes the duplicates. Finally we count the lines, which is the sought number.
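As a rough illustration of the pipeline described above, here is a minimal sketch run against a small made-up file. The temporary file and its contents (the row labels and fruit names) are assumptions for demonstration only and are not part of the original question.

```shell
# Hypothetical sample data: five tab-separated columns; column 2 holds the values of interest.
tmpfile=$(mktemp)
printf 'r1\tapple\tx\ty\tz\n'   > "$tmpfile"
printf 'r2\tbanana\tx\ty\tz\n' >> "$tmpfile"
printf 'r3\tapple\tx\ty\tz\n'  >> "$tmpfile"
printf 'r4\tcherry\tx\ty\tz\n' >> "$tmpfile"

# cut -f2 extracts column 2 (tab is the default delimiter),
# sort brings equal values next to each other,
# uniq drops adjacent repeats, and wc -l counts what is left.
unique_count=$(cut -f2 "$tmpfile" | sort | uniq | wc -l)
unique_count=$((unique_count))  # strip the padding some wc implementations add

echo "$unique_count"
rm -f "$tmpfile"
```

With this sample input, column 2 contains apple, banana, apple, cherry, so the script prints 3.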
Answered by martin clayton
I go for
$ cut -f2 file.txt | sort -u | wc -l
At least in some versions, uniq relies on the input data being sorted (it looks only at adjacent lines).
For example in the Solaris docs:
The uniq utility will read an input file comparing adjacent lines, and write one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines will not be written.
Repeated lines in the input will not be detected if they are not adjacent.
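To make the adjacency point concrete, the following small sketch compares uniq on unsorted input with sort -u. The three-line input is a made-up example, not data from the question.

```shell
# Hypothetical input in which the two "a" lines are not adjacent.
without_sort=$(printf 'a\nb\na\n' | uniq | wc -l)
with_sort=$(printf 'a\nb\na\n' | sort -u | wc -l)

# Normalize away the padding some wc implementations add.
without_sort=$((without_sort))
with_sort=$((with_sort))

echo "uniq alone: $without_sort, sort -u: $with_sort"
```

Here uniq alone reports 3 lines because the repeated "a" is never adjacent to its duplicate, while sort -u reports the expected 2.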
Answered by Vijay
awk '{if($0~/Not Running/)a++;else if($0~/Running/)b++}END{print a,b}' temp