Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/15400638/

Time: 2020-09-18 04:51:08  Source: igfitidea

List all the words in a text file with occurrence counts?

Tags: bash, sed, awk

Asked by HymanWM

Suppose I have a file text.txt as below:

she likes cats, and he likes cats too.

I'd like my result to look like:

she 1
likes 2
cats 2
and 1
he 1
too 1

If including the space, comma (,) and period (.) characters as entries in the result would make the script easier, that would be fine.

Is there a simple shell pipeline that could achieve this?

Answered by phs

Here's a one-liner near and dear to my heart:

cat text.txt | sed 's|[,.]||g' | tr ' ' '\n' | sort | uniq -c

The sed strips punctuation (tune the regex to taste), and the tr puts the result one word per line. The sort then groups identical words so that uniq -c can count them.
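
The one-liner prints the count before the word, in alphabetical order. As a minimal sketch of the same idea, the variant below also drops the empty lines that repeated or leading spaces would produce, and swaps the columns to match the "word count" layout asked for in the question. The helper name word_counts is hypothetical, not part of the original answer.

```shell
#!/usr/bin/env bash
# word_counts is a hypothetical helper name, not part of the original answer.
# It reads text on stdin and prints "word count" lines, most frequent first.
word_counts() {
  tr -d ',.' |              # strip the punctuation (tune the set to taste)
    tr -s ' ' '\n' |        # one word per line, squeezing repeated separators
    sed '/^$/d' |           # drop an empty line left by leading spaces
    sort |                  # group identical words so uniq can count them
    uniq -c |               # prefix each distinct word with its count
    sort -k1,1nr -k2,2 |    # highest count first, ties in alphabetical order
    awk '{print $2, $1}'    # swap columns: word first, then count
}

# Sample input from the question.
printf 'she likes cats, and he likes cats too.\n' | word_counts
```

The order differs from the question's sample (which lists words by first appearance), but each word and its count appear exactly once.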

Answered by Ed Morton

With GNU awk you can just specify the Record Separator (RS) to be any sequence of non-alphabetic characters:

$ gawk -v RS='[^[:alpha:]]+' '{sum[$0]++} END{for (word in sum) print word,sum[word]}' file
she 1
likes 2
and 1
too 1
he 1
cats 2

but that won't solve your problem of how to identify "words" in general.

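
What counts as a "word" depends on the text. As one illustrative assumption (not from the original answer), the sketch below treats a word as a run of letters that may contain internal apostrophes, and folds case so that "She" and "she" are counted together. The function name count_words is hypothetical.

```shell
#!/usr/bin/env bash
# count_words is a hypothetical helper; the word definition used here is an
# assumption chosen for illustration: letters, optionally joined by apostrophes.
count_words() {
  tr '[:upper:]' '[:lower:]' |                  # fold case: "She" and "she" match
    grep -oE "[[:alpha:]]+('[[:alpha:]]+)*" |   # emit each matching word on its own line
    sort |                                      # group identical words
    uniq -c |                                   # count each group
    awk '{print $2, $1}'                        # print "word count"
}

printf "She likes cats, and she doesn't like dogs.\n" | count_words
```

With this definition "doesn't" stays a single word, whereas splitting on every non-alphabetic character would yield "doesn" and "t". Hyphenated words, digits, and non-ASCII letters would need their own decisions.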