Original question: http://stackoverflow.com/questions/12654026/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to count all the human readable files in Bash?
Asked by Rekson
I'm taking an intro course to UNIX and have a homework question that follows:
How many files in the previous question are text files? A text file is any file containing human-readable content. (TRICK QUESTION. Run the file command on a file to see whether the file is a text file or a binary data file! If you simply count the number of files with the .txt extension you will get no points for this question.)
The previous question simply asked how many regular files there were, which was easy to figure out by doing find . -type f | wc -l.
I'm just having trouble determining what "human readable content" is, since I'm assuming it means anything besides binary/assembly, but I thought that's what -type f displays. Maybe that's what the professor meant by saying "trick question"?
This question has a follow-up later that also asks "What text files contain the string 'csc' in any mix of upper and lower case?". Obviously "text" refers to more than just .txt files, but I need to figure out the first question to determine this!
Answered by John Kugelman
Quotes added for clarity:
Run the "file" command on a file to see whether the file is a text file or a binary data file!
The file command will inspect files and tell you what kind of file they appear to be. The word "text" will (almost) always be in the description for text files.
For example:
desktop.ini: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
tw2-wasteland.jpg: JPEG image data, JFIF standard 1.02
So the first part is asking you to run the file command and parse its output.
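As a rough illustration of "run file and parse its output", here is a minimal sketch. The helper name is_text and the two throwaway files are made up for this example and are not part of the assignment. It assumes the file utility is installed and that its description contains the word "text" for text files, which, as noted above, is almost always the case but not guaranteed.

```shell
# Hypothetical helper: succeeds when file(1) describes its argument as text.
is_text() {
    # -b omits the "name:" prefix, so a file called "text.jpg" cannot
    # match just because of its name.
    file -b "$1" | grep -q 'text'
}

# Demo on two throwaway files (names invented for illustration).
demo_dir=$(mktemp -d)
printf 'hello world\n' > "$demo_dir/notes"                     # text, no extension
printf '\000\001\002\003\377\376\375' > "$demo_dir/blob.txt"   # binary despite .txt

for f in "$demo_dir"/*; do
    if is_text "$f"; then
        echo "text:   ${f##*/}"
    else
        echo "binary: ${f##*/}"
    fi
done
rm -rf "$demo_dir"
```

The point of the demo is that the extension plays no role: the decision comes entirely from what file reports about the contents.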
I'm just having trouble determining what "human readable content" is, since I'm assuming it means anything besides binary/assembly, but I thought that's what -type f displays.
find -type f finds files. It filters out other filesystem objects like directories, symlinks, and sockets. It will match any type of file, though: binary files, text files, anything.
Maybe that's what the professor meant by saying "trick question"?
It sounds like he's just saying don't do find -name '*.txt' or some such command to find text files. Don't assume a particular file extension. File extensions have much less meaning in UNIX than they do in Windows. Lots of files don't even have file extensions!
I'm thinking the professor wants us to be able to run the file command on all files and count the number of ones with 'text' in it.
How about a multi-part answer? I'll give the straightforward solution in #1, which is probably what your professor is looking for. And if you are interested I'll explain its shortcomings and how you can improve upon it.
One way is to use xargs, if you've learned about that. xargs runs another command, using the data from stdin as that command's arguments.
$ find . -type f | xargs file
./netbeans-6.7.1.desktop: ASCII text
./VMWare.desktop: a /usr/bin/env xdg-open script text executable
./VMWare: cannot open `./VMWare' (No such file or directory)
(copy).desktop: cannot open `(copy).desktop' (No such file or directory)
./Eclipse.desktop: a /usr/bin/env xdg-open script text executable
That works. Sort of. It'd be good enough for a homework assignment. But not good enough for a real world script.
Notice how it broke on the file VMWare (copy).desktop because it has a space in it. This is due to xargs's default behavior of splitting the arguments on whitespace. We can fix that by using find -print0 together with xargs -0, which separates file names with NUL characters instead of whitespace. File names can't contain NUL characters, so this will be able to handle anything.
$ find . -type f -print0 | xargs -0 file
./netbeans-6.7.1.desktop: ASCII text
./VMWare.desktop: a /usr/bin/env xdg-open script text executable
./VMWare (copy).desktop: a /usr/bin/env xdg-open script text executable
./Eclipse.desktop: a /usr/bin/env xdg-open script text executable
This is good enough for a production script, and is something you'll encounter a lot. But I personally prefer an alternative syntax which doesn't require a pipe.
$ find . -type f -exec file {} \;
./netbeans-6.7.1.desktop: ASCII text
./VMWare.desktop: a /usr/bin/env xdg-open script text executable
./VMWare (copy).desktop: a /usr/bin/env xdg-open script text executable
./Eclipse.desktop: a /usr/bin/env xdg-open script text executable
To understand that, -exec calls file repeatedly, replacing {} with each file name it finds. The semi-colon \; marks the end of the file command. Because \; starts one file process per file name, it is not faster than xargs on a large tree; ending the command with + instead of \; passes many names to each invocation, much as xargs does.
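To go from listing to counting, the commands above can be combined with grep. The following is a minimal sketch rather than the professor's expected answer: the function names count_text_files and list_text_files_matching are invented here, and the approach assumes that every text file has the word "text" in its file description. The second function is one possible take on the follow-up question about files containing "csc" in any mix of case.

```shell
# Count regular files under a directory that file(1) describes as text.
# count_text_files is a made-up name for this sketch.
count_text_files() {
    # "+" batches many names per file invocation; -b omits the file names,
    # leaving one description per line for grep -c to count.
    find "${1:-.}" -type f -exec file -b {} + | grep -c 'text'
}

# Print the text files under directory $1 whose contents match pattern $2,
# ignoring case. list_text_files_matching is also a made-up name.
list_text_files_matching() {
    find "$1" -type f -exec sh -c '
        pattern=$1; shift
        for f do
            if file -b "$f" | grep -q text && grep -qi -- "$pattern" "$f"; then
                printf "%s\n" "$f"
            fi
        done' sh "$2" {} +
}

# Example: count the text files below the current directory.
count_text_files .
# Example (not run here): list_text_files_matching . csc
```

Using file -b matters for the count: without it, a binary file whose name happens to contain "text" would be counted too, because the name appears on each output line.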
Answered by gts
There's a nice and easy way to determine whether a file is a human-readable text file: just use file --mime-type <filename> and look for 'text/plain'. It works whether the file has no extension or an extension other than .txt.
So you would do something like:
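#!/bin/bash
# Count regular files under $YOUR_DIR whose MIME type is text/plain.
fileTotal=0
# Read NUL-delimited names so spaces and newlines in file names are safe.
while IFS= read -r -d '' file; do
    # find already prints the full path, so $YOUR_DIR is not prepended again.
    # -b prints only the MIME type, without the "name:" prefix.
    mime=$(file -b --mime-type "$file")
    if [ "$mime" = "text/plain" ]; then
        fileTotal=$(( fileTotal + 1 ))
        echo "$fileTotal - $file"
    fi
done < <(find "$YOUR_DIR" -type f -print0)
echo "$fileTotal human readable files found!"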
and the output would be something like:
1 - /sampledir/samplefile
2 - /sampledir/anothersamplefile
....
23 human readable files found!
If you want to extend this to other MIME types that are human readable (e.g. do HTML and/or XML count?), have a look at http://www.feedforall.com/mime-types.htm
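Before deciding which MIME types to count, it can help to see which ones actually occur in the directory. The sketch below uses two made-up function names, mime_tally and count_text_mime, and treats every type starting with text/ as human readable. That is an assumption, not a rule: some readable formats may be reported under application/ instead, depending on the version of file.

```shell
# Tally the MIME types found under a directory, most frequent first.
# mime_tally is a made-up name for this sketch.
mime_tally() {
    find "${1:-.}" -type f -exec file -b --mime-type {} + | sort | uniq -c | sort -rn
}

# Count files whose MIME type starts with text/ (text/plain, text/html, ...).
# Assumption: anything under text/ counts as human readable.
count_text_mime() {
    find "${1:-.}" -type f -exec file -b --mime-type {} + | grep -c '^text/'
}

# Example: show the breakdown for the current directory.
mime_tally .
```

Running mime_tally first shows whether matching only text/plain would miss files such as shell scripts or HTML pages, which file usually reports under other text/ subtypes.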

