Bash 脚本:计算文件中的唯一行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15984414/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Bash Script: count unique lines in file
提问by Wug
Situation:
情况:
I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:
我有一个大文件(数百万行),其中包含来自几个小时网络捕获的 IP 地址和端口,每行一个 ip/端口。行是这种格式:
ip.ad.dre.ss[:port]
Desired result:
想要的结果:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort which will be able to reduce it to lines of the format
我在记录期间收到的每个数据包都有一条记录,因此有很多重复的地址。我希望能通过某种 shell 脚本处理这个文件,把它归并成如下格式的行
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done; treat different ports as different addresses.
其中 count 是该特定地址(和端口)的出现次数。不需要做特殊处理,把不同的端口视为不同的地址即可。
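For illustration (the addresses below are made up; only the format matters), a hypothetical input of
10.0.0.1:80
10.0.0.1:80
10.0.0.2:443
should be reduced to
10.0.0.1:80 2
10.0.0.2:443 1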
So far, I'm using this command to scrape all of the ip addresses from the log file:
到目前为止,我正在使用此命令从日志文件中抓取所有 ip 地址:
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to filter out all of the IP addresses that were sent by my address (which I don't care about).
在此基础上,我可以用一个相当简单的正则表达式过滤掉由我自己的地址发出的所有 IP 地址(这些我并不关心)。
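A minimal sketch of that filtering step, assuming a hypothetical local address of 192.168.1.10 (substitute your own), could drop those lines before counting:
grep -v -E '^192\.168\.1\.10(:|$)' ips.txt > remote_ips.txt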
I can then use the following to extract the unique entries:
然后我可以使用以下内容来提取唯一条目:
sort -u ips.txt > intermediate.txt
I don't know how I can aggregate the line counts somehow with sort.
我不知道如何通过排序以某种方式汇总行数。
回答by Michael Hoffman
You can use the uniq command to get counts of sorted repeated lines:
您可以使用 uniq 命令获取已排序重复行的计数:
sort ips.txt | uniq -c
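With the hypothetical sample lines shown earlier, the output would look roughly like this (uniq -c prefixes each line with its occurrence count):
      2 10.0.0.1:80
      1 10.0.0.2:443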
To get the most frequent results at top (thanks to Peter Jaric):
要在顶部获得最频繁的结果(感谢 Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
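For readability, the same pipeline can be written with GNU sort's long options, spelling out what -bgr stands for (ignore leading blanks, general numeric comparison, reverse order):
sort ips.txt | uniq -c | sort --ignore-leading-blanks --general-numeric-sort --reverse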
回答by qwr
To count the total number of unique lines (i.e. not considering duplicate lines) we can use uniq or Awk with wc:
要计算唯一行的总数(即不考虑重复行),我们可以使用 uniq,或者用 Awk 配合 wc:
sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l
Awk's arrays are associative so it may run a little faster than sorting.
Awk 的数组是关联的,因此它的运行速度可能比排序快一点。
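As a reading aid (my annotation, not part of the original answer), the !seen[$0]++ idiom works like this:
# seen[$0]++ returns the previous count for this exact line and then increments it,
# so !seen[$0]++ is true only the first time a line appears; with no action given,
# awk's default action prints the line, i.e. each distinct line is printed exactly once.
awk '!seen[$0]++' ips.txt | wc -l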
Generating text file:
生成文本文件:
$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175
real 0m1.193s
user 0m0.701s
sys 0m0.388s
$ time awk '!seen[$0]++' random.txt | wc -l
31175
real 0m0.675s
user 0m0.108s
sys 0m0.171s

回答by Luca Mastrostefano
This is the fastest way to get the count of the repeated lines and have them nicely printed sorted by the least frequent to the most frequent:
这是统计重复行出现次数、并按从最不频繁到最频繁的顺序整齐打印出来的最快方法:
awk '{!seen[$0]++}END{for (i in seen) print seen[i], i}' ips.txt | sort -n
If you don't care about performance and you want something easier to remember, then simply run:
如果您不关心性能并且想要更容易记住的东西,那么只需运行:
sort ips.txt | uniq -c | sort -n
PS:
PS:
sort -n parses the field as a number, which is correct since we're sorting by the counts.
sort -n将字段解析为数字,这是正确的,因为我们使用计数进行排序。
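As a small follow-up note (mine, not from the answer): if you prefer the most frequent entries first, reverse the numeric sort:
sort ips.txt | uniq -c | sort -rn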