bash: How to recursively count the number of words in a directory

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/35559648/

Time: 2020-09-18 14:16:54  Source: igfitidea

How can I count the number of words in a directory recursively?

Tags: bash, vim, count, grep, word

Asked by Alistair Colling

I'm trying to calculate the number of words written in a project. There are a few levels of folders and lots of text files within them.

Can anyone help me find out a quick way to do this?

bash or vim would be good!

Thanks

Answered by karakfa

Use find to scan the directory tree, and wc will do the rest:

$ find path -type f | xargs wc -w | tail -1

The last line gives the total.
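
One caveat with the pipeline above: plain xargs splits its input on whitespace, so file names containing spaces break it, and with a very large number of files xargs may run wc more than once, in which case tail -1 shows only the last batch's total. Below is a minimal sketch of a variant that avoids both problems. It assumes a find and xargs that support -print0 and -0 (the GNU and BSD versions do), and the function name count_words is made up for illustration; it is not part of the original answer.

```shell
#!/usr/bin/env bash
# count_words DIR: print the total number of words in all regular files
# under DIR (default: current directory). Hypothetical helper name.
count_words() {
    local dir=${1:-.}
    # -print0 / -0 pass file names NUL-delimited, so spaces are safe.
    # "wc -w < file" prints only the number, with no file name and no
    # "total" line, so awk can simply add up every line it receives.
    find "$dir" -type f -print0 |
        xargs -0 sh -c 'for f in "$@"; do wc -w < "$f"; done' sh |
        awk '{ sum += $1 } END { print sum + 0 }'
}
```

It could be called as count_words path, where path is the project directory.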

Answered by janos

You could use find to print the content of every file and pipe it to wc:

find path -type f -exec cat {} \; -exec echo \; | wc -w

Note: the -exec echo \; is needed in case a file doesn't end with a newline character; without it, the last word of one file and the first word of the next would not be separated.
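
The effect is easy to reproduce. Here is a small sketch using two throwaway files written with printf, which adds no trailing newline; the temporary directory and file names are arbitrary and exist only for the demonstration.

```shell
#!/usr/bin/env bash
# Demonstration only: the directory and file names are made up.
demo_dir=$(mktemp -d)
printf 'one two' > "$demo_dir/a.txt"     # no trailing newline
printf 'three four' > "$demo_dir/b.txt"  # no trailing newline

# Plain concatenation glues the last word of one file to the first
# word of the next, so one word goes missing from the count.
without_echo=$(find "$demo_dir" -type f -exec cat {} \; | wc -w)

# Emitting a newline after each file keeps the words apart.
with_echo=$(find "$demo_dir" -type f -exec cat {} \; -exec echo \; | wc -w)

echo "without echo: $without_echo"
echo "with echo: $with_echo"
rm -rf "$demo_dir"
```

With these two files the first count should come out as 3 and the second as 4, whichever order find visits them in.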

Or you could run wc on each file with find and use awk to aggregate the counts:

find . -type f -exec wc -w {} \; | awk '{ sum += $1 } END { print sum }'

Answered by rubicks

tldr;

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc

Explanation:

The command find . -type f -exec wc -w {} + runs wc -w on all the files (recursively) contained by . (the current working directory). find will execute wc as few times as possible but as many times as is necessary to comply with ARG_MAX, the system command length limit. When the quantity of files (and/or their constituent lengths) exceeds ARG_MAX, find invokes wc -w more than once, giving multiple total lines:

$ find . -type f -exec wc -w {} + | awk '/total/{print $0}'
8264577 total
654892 total
1109527 total
149522 total
174922 total
181897 total
1229726 total
2305504 total
1196390 total
5509702 total
9886665 total

Isolate these partial sums by printing only the first whitespace-delimited field of each total line:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}'
8264577
654892
1109527
149522
174922
181897
1229726
2305504
1196390
5509702
9886665

paste the partial sums with a + delimiter to give an infix summation:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+
8264577+654892+1109527+149522+174922+181897+1229726+2305504+1196390+5509702+9886665

Evaluate the infix summation using bc, which supports both infix expressions and arbitrary precision:

$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
30663324

Answered by miken32

If there's one thing I've learned from all the bash questions on SO, it's that a filename with a space will mess you up. This script will work even if you have whitespace in the file names.

#!/usr/bin/env bash

# globstar makes ** match files in all subdirectories (bash 4 or later)
shopt -s globstar
count=0
for f in **/*.txt
do
    # wc prints "count filename"; keep only the count
    words=$(wc -w "$f" | awk '{print $1}')
    count=$(($count + $words))
done
echo $count

Answered by Yeikel

Assuming you don't need to recursively count the words and that you want to include all the files in the current directory, you can use a simple approach such as:

wc -l *

10  000292_0
500 000297_0
510 total

If you want to count the words for only a specific extension in the current directory, you could try:

cat *.txt | wc -l
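
Note that wc -l counts lines rather than words, so for an actual word count the same commands would be run with -w instead. The sketch below shows the difference on a single throwaway file; the temporary directory and the file name sample.txt are assumptions made for the demonstration.

```shell
#!/usr/bin/env bash
# Demonstration only: the directory and file name are made up.
sample_dir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$sample_dir/sample.txt"

# -l counts newline characters, i.e. lines.
lines=$(cat "$sample_dir"/*.txt | wc -l)

# -w counts whitespace-separated words.
words=$(cat "$sample_dir"/*.txt | wc -w)

echo "lines: $lines"
echo "words: $words"
rm -rf "$sample_dir"
```

For this file the line count should be 2 and the word count 5.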