Best way to simulate "group by" from bash?
Declaration: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link back to the original question, and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/380817/
Asked by Zizzencs
Suppose you have a file that contains IP addresses, one address per line:
10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1
You need a shell script that counts, for each IP address, how many times it appears in the file. For the previous input you need the following output:
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
One way to do this is:
cat ip_addresses | uniq | while read ip
do
  echo -n $ip" "
  grep -c $ip ip_addresses
done
However, it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved with perl or awk, but I'm interested in a better solution in bash, not in those languages.)
ADDITIONAL INFO:
Suppose that the source file is 5GB and the machine running the algorithm has 4GB of memory. So sort is not an efficient solution, and neither is reading the file more than once.
I liked the hashtable-like solution - can anybody provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why I would bother doing this in bash when it is way easier in e.g. perl. The reason is that on the machine where I had to do this, perl wasn't available to me. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don't blame the question, just ignore it if you don't like it. :-)
Answered by Joachim Sauer
sort ip_addresses | uniq -c
This will print the count first, but other than that it should be exactly what you want.
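If you want exactly the "address count" order asked for in the question, one way (an aside, not part of this answer) is to swap the two columns afterwards:
sort ip_addresses | uniq -c | awk '{print $2, $1}'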
Answered by Francois Wolmarans
The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
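A minimal sketch of that approach (the variable name and the message are placeholders):
results=$(sort -n ip_addresses | uniq -c)
while read -r count ip
do
  echo "address $ip appears $count times"
done <<< "$results"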
PS
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
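For example, with the sample input from the question, omitting sort splits the count for the non-adjacent 10.0.10.1 lines:
uniq -c ip_addresses
      2 10.0.10.1
      1 10.0.10.3
      1 10.0.10.2
      1 10.0.10.1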
Answered by Anonymous
For summing up multiple fields, based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):
cat file
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000
awk 'BEGIN { FS=OFS=SUBSEP="|" } { arr[$1,$2] += $3 + $4 } END { for (i in arr) print i, arr[i] }' file
US|A|3000
US|B|3000
US|C|3000
UK|1|9000
Answered by Diomidis Spinellis
The canonical solution is the one mentioned by another respondent:
sort | uniq -c
It is shorter and more concise than what can be written in Perl or awk.
You write that you don't want to use sort, because the data size is larger than the machine's main memory. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think of the original AT&T billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
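If memory really is the bottleneck, GNU coreutils sort also lets you cap its in-memory buffer and choose the scratch directory explicitly (a tuning aside, assuming GNU sort; the size and path are placeholders):
sort -S 1G -T /tmp ip_addresses | uniq -c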
Answered by zjor
cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2" "$1}'
This command would give you the desired output.
Answered by kairouan2020
Solution (group by, like in MySQL):
grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n
Result
3249 googleplus
4211 linkedin
5212 xing
7928 facebook
Answered by Vinko Vrsalovic
It seems that you have to either use a big amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear versions.
Among those versions, saua's solution is the best (and simplest):
sort -n ip_addresses.txt | uniq -c
I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html, but it's ugly as hell...
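With bash 4 or later, associative arrays make a linear, hash-based count much less painful; a minimal sketch (an aside, not part of the original answer):
declare -A count
while read -r ip
do
  count[$ip]=$(( ${count[$ip]:-0} + 1 ))
done < ip_addresses
for ip in "${!count[@]}"
do
  echo "$ip ${count[$ip]}"
done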
Answered by PolyThinker
You can probably use the file system itself as a hash table. Pseudo-code as follows:
for every entry in the ip address file; do
    let addr denote the ip address;
    if file "addr" does not exist; then
        create file "addr";
        write a number "0" in the file;
    else
        read the number from "addr";
        increase the number by 1 and write it back;
    fi
done
In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
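A runnable bash sketch of the append-and-measure variant described above (the counts directory and the input file name are assumptions):
mkdir -p counts
while read -r ip
do
  printf '.' >> "counts/$ip"
done < ip_addresses
for f in counts/*
do
  echo "${f#counts/} $(( $(wc -c < "$f") ))"
done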
Answered by Aron Curzon
Most of the other solutions count duplicates. If you really need to group key value pairs, try this:
Here is my example data:
find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
This will print the key value pairs grouped by the md5 checksum.
cat table.txt | awk '{print $1}' | sort | uniq | xargs -i grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
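If you only need the lines grouped (ordered) by key, rather than per-key counts, a lighter-weight alternative (an aside, not part of the original answer) is to sort on the first field:
sort -k1,1 table.txt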