Bash: list all words in a text file with their number of occurrences?

Question

Asked by HymanWM

Suppose I have a file text.txt as below:

she likes cats, and he likes cats too.

I'd like my result to look like:

she 1
likes 2
cats 2
and 1
he 1
too 1

If putting the space, comma, and period into it would make the script easier, that would be fine.

Is there a simple shell pipeline that could achieve this?

Answer 1

Here's a one-liner near and dear to my heart:

cat text.txt | sed 's|[,.]||g' | tr ' ' '\n' | sort | uniq -c

The sed strips punctuation (tune the regex to taste), the tr puts one word per line, and sort | uniq -c groups identical words and prefixes each with its count.

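The pipeline above prints the count before the word, while the question asks for the word first. A minimal sketch of one way to get that layout, assuming every run of non-alphabetic characters should count as a separator (the function name word_counts is made up for illustration):

```shell
# word_counts FILE: print "word count" pairs, sorted alphabetically.
# Assumption: a "word" is any run of alphabetic characters, so "I'd"
# would be split into "I" and "d".
word_counts() {
  # tr -c complements the set, -s squeezes repeated separators:
  # every run of non-letters becomes a single newline.
  tr -cs '[:alpha:]' '\n' < "$1" |
    sort |
    uniq -c |
    # uniq -c prints "count word"; swap the columns and skip the
    # empty line produced when the file starts with a non-letter.
    awk 'NF == 2 { print $2, $1 }'
}
```

Calling word_counts text.txt on the sample line should give and 1, cats 2, he 1, likes 2, she 1, too 1, one pair per line. The order is alphabetical rather than the order of first appearance, because uniq only merges adjacent lines and therefore needs sorted input.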

Answer 2

With GNU awk you can just specify the Record Separator (RS) to be any sequence of non-alphabetic characters:

$ gawk -v RS='[^[:alpha:]]+' '{sum[$0]++} END{for (word in sum) print word,sum[word]}' file
she 1
likes 2
and 1
too 1
he 1
cats 2

but that won't solve your problem of how to identify "words" in general.

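Note that the output above is not in the order the words appear in the file: awk does not define the traversal order of for (word in sum). If the first-appearance order from the question matters, one possible sketch is to remember each word the first time it is seen. It uses tr for splitting so that it does not depend on gawk's regex RS; the function name and the array names order and count are made up for illustration, and "word" still just means a run of alphabetic characters:

```shell
# word_counts_in_order FILE: print "word count" pairs in the order
# in which each word first appears in FILE.
word_counts_in_order() {
  # Split into one word per line (runs of non-letters become one newline),
  # then let awk count while recording first-seen order.
  # NF skips an empty leading line; order[] holds words by first sighting.
  tr -cs '[:alpha:]' '\n' < "$1" |
    awk 'NF {
           if (!($0 in count)) order[++n] = $0
           count[$0]++
         }
         END {
           for (i = 1; i <= n; i++) print order[i], count[order[i]]
         }'
}
```

On the sample line this should print she 1, likes 2, cats 2, and 1, he 1, too 1, one pair per line, which matches the layout asked for in the question.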