Bash 脚本:计算文件中的唯一行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15984414/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Bash Script: count unique lines in file
提问by Wug
Situation:
情况:
I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:
我有一个大文件(数百万行),其中包含来自几个小时网络捕获的 IP 地址和端口,每行一个 ip/端口。行是这种格式:
ip.ad.dre.ss[:port]
Desired result:
想要的结果:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort which will be able to reduce it to lines of the format
我在记录期间收到的每个数据包都有一条记录,因此有很多重复的地址。我希望能通过某种 shell 脚本处理这个文件,把它归并成如下格式的行
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done; treat different ports as different addresses.
其中 count 是该特定地址(和端口)的出现次数。不需要做特殊处理,把不同的端口视为不同的地址即可。
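For illustration (the addresses below are made up; only the format matters), a hypothetical input of
10.0.0.1:80
10.0.0.1:80
10.0.0.2:443
should be reduced to
10.0.0.1:80 2
10.0.0.2:443 1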
So far, I'm using this command to scrape all of the ip addresses from the log file:
到目前为止,我正在使用此命令从日志文件中抓取所有 ip 地址:
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to filter out all of the IP addresses that were sent by my address (which I don't care about).
在此基础上,我可以用一个相当简单的正则表达式过滤掉由我自己的地址发出的所有 IP 地址(这些我并不关心)。
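A minimal sketch of that filtering step, assuming a hypothetical local address of 192.168.1.10 (substitute your own), could drop those lines before counting:
grep -v -E '^192\.168\.1\.10(:|$)' ips.txt > remote_ips.txt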
I can then use the following to extract the unique entries:
然后我可以使用以下内容来提取唯一条目:
sort -u ips.txt > intermediate.txt
I don't know how I can aggregate the line counts somehow with sort.
我不知道如何通过排序以某种方式汇总行数。
回答by Michael Hoffman
You can use the uniq command to get counts of sorted repeated lines:
您可以使用 uniq 命令获取已排序重复行的计数:
sort ips.txt | uniq -c
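With the hypothetical sample lines shown earlier, the output would look roughly like this (uniq -c prefixes each line with its occurrence count):
      2 10.0.0.1:80
      1 10.0.0.2:443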
To get the most frequent results at top (thanks to Peter Jaric):
要在顶部获得最频繁的结果(感谢 Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
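For readability, the same pipeline can be written with GNU sort's long options, spelling out what -bgr stands for (ignore leading blanks, general numeric comparison, reverse order):
sort ips.txt | uniq -c | sort --ignore-leading-blanks --general-numeric-sort --reverse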
回答by qwr
To count the total number of unique lines (i.e. not considering duplicate lines) we can use uniq or Awk with wc:
要计算唯一行的总数(即不考虑重复行),我们可以使用 uniq,或者用 Awk 配合 wc:
sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l
Awk's arrays are associative so it may run a little faster than sorting.
Awk 的数组是关联的,因此它的运行速度可能比排序快一点。
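As a reading aid (my annotation, not part of the original answer), the !seen[$0]++ idiom works like this:
# seen[$0]++ returns the previous count for this exact line and then increments it,
# so !seen[$0]++ is true only the first time a line appears; with no action given,
# awk's default action prints the line, i.e. each distinct line is printed exactly once.
awk '!seen[$0]++' ips.txt | wc -l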
Generating text file:
生成文本文件:
$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175
real 0m1.193s
user 0m0.701s
sys 0m0.388s
$ time awk '!seen[$0]++' random.txt | wc -l
31175
real 0m0.675s
user 0m0.108s
sys 0m0.171s

回答by Luca Mastrostefano
This is the fastest way to get the count of the repeated lines and have them nicely printed sorted by the least frequent to the most frequent:
这是统计重复行出现次数、并按从最不频繁到最频繁的顺序整齐打印出来的最快方法:
awk '{!seen[$0]++}END{for (i in seen) print seen[i], i}' ips.txt | sort -n
If you don't care about performance and you want something easier to remember, then simply run:
如果您不关心性能并且想要更容易记住的东西,那么只需运行:
sort ips.txt | uniq -c | sort -n
PS:
PS:
sort -n parses the field as a number, which is correct since we're sorting by the counts.
sort -n将字段解析为数字,这是正确的,因为我们使用计数进行排序。
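As a small follow-up note (mine, not from the answer): if you prefer the most frequent entries first, reverse the numeric sort:
sort ips.txt | uniq -c | sort -rn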