Fast Linux File Count for a large number of files

Disclaimer: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, note the original address and author information, and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/1427032/
Asked by
I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (> 100,000).

When there are that many files, performing ls | wc -l takes quite a long time to execute. I believe this is because it's returning the names of all the files. I'm trying to take up as little disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. Any ideas?
Answered by igustin

Did you try find? For example:

find . -name "*.ext" | wc -l
Answered by Peter van der Heijden
Answered by mark4o

By default ls sorts the names, which can take a while if there are a lot of them. Also, there will be no output until all of the names are read and sorted. Use the ls -f option to turn off sorting:

ls -f | wc -l

Note that this will also enable -a, so ., .., and other files starting with . will be counted.
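As a quick sketch of that caveat (the throwaway directory and file names below are my own, not from the answer), the two dot entries can simply be subtracted from the ls -f total:

```shell
# Create a throwaway directory with three known files (illustrative names)
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# ls -f counts ".", ".." and the three files: 5 entries in total
total=$(ls -f "$demo" | wc -l)

# Subtract the "." and ".." entries to get the real file count
echo $((total - 2))   # prints 3
```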
Answered by Bogdan Stăncescu

Surprisingly for me, a bare-bones find is very much comparable to ls -f:

> time ls -f my_dir | wc -l
17626

real    0m0.015s
user    0m0.011s
sys     0m0.009s

versus

> time find my_dir -maxdepth 1 | wc -l
17625

real    0m0.014s
user    0m0.008s
sys     0m0.010s

Of course, the values in the third decimal place shift around a bit every time you execute any of these, so they're basically identical. Notice however that find returns one extra unit, because it counts the actual directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).
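A related variant (my own sketch, not part of the answer): adding -mindepth 1 drops the starting directory from find's output, so the total matches the real entry count:

```shell
# Throwaway directory with two known files (illustrative names)
demo=$(mktemp -d)
touch "$demo/x" "$demo/y"

# Plain find lists the directory itself as well: 3 lines
find "$demo" -maxdepth 1 | wc -l            # prints 3

# -mindepth 1 excludes the starting directory: 2 lines
find "$demo" -mindepth 1 -maxdepth 1 | wc -l   # prints 2
```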
Answered by Thomas

find, ls, and perl tested against 40,000 files: same speed (though I didn't try to clear the cache):

[user@server logs]$ time find . | wc -l
42917

real    0m0.054s
user    0m0.018s
sys     0m0.040s

[user@server logs]$ time /bin/ls -f | wc -l
42918

real    0m0.059s
user    0m0.027s
sys     0m0.037s

and with perl opendir/readdir, same time:

[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918

real    0m0.057s
user    0m0.024s
sys     0m0.033s

Note: I used /bin/ls -f to make sure to bypass any alias, which might slow things down a little bit, and -f to avoid file ordering. ls without -f is twice as slow as find/perl; when ls is used with -f, it seems to take the same time:

[user@server logs]$ time /bin/ls . | wc -l
42916

real    0m0.109s
user    0m0.070s
sys     0m0.044s

I would also like to have some script that asks the file system directly, without all the unnecessary information.

Tests based on the answers of Peter van der Heijden, glenn jackman and mark4o.

Thomas
Answered by mightybs

You can change the output based on your requirements, but here is a bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.

dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$i => $(find ${dir}${i} -type f | wc -l),"; }

This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command could make the kind of files you're looking to count more specific, etc.

Results in something like this:
1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
Answered by user2546874

First 10 directories with the highest number of files:

dir=/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$(find ${dir}${i} \
-type f | wc -l) => $i,"; } | sort -nr | head -10
Answered by Benubird

Just adding this for the sake of completeness. The correct answer of course has already been posted by someone else, but you can also get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took 4.8 seconds on my computer for tree to count my whole home dir, which was 24777 directories, 238680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

As long as you don't have any subfolders, tree is a quick and easy way to count the files.

Also, and purely for the fun of it, you can use tree | grep '^├' to only show the files/folders in the current directory - this is basically a much-slower version of ls.
Answered by Christopher Schultz

The fastest way is a purpose-built program, like this:
#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count = 0;

    dir = opendir(argv[1]);
    while((ent = readdir(dir)))
        ++count;
    closedir(dir);

    printf("%s contains %ld files\n", argv[1], count);
    return 0;
}
From my testing without regard to cache, I ran each of these about 50 times against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):

ls -1  | wc -    0:01.67
ls -f1 | wc -    0:00.14
find   | wc -    0:00.22
dircnt | wc -    0:00.04

That last one, dircnt, is the program compiled from the above source.
EDIT 2016-09-26
Due to popular demand, I've re-written this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively, unless you want the code to become very complex (e.g. managing a subdirectory stack and processing that in a single loop). Since we have to check file types, differences between different OSs, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.
There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky and have a system where dirent contains the file type already). I'm not paranoid about checking the total length of the subdir pathnames, but theoretically, the system shouldn't allow any path name that is longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.
#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>

#if defined(WIN32) || defined(_WIN32)
#define PATH_SEPARATOR '\\'
#else
#define PATH_SEPARATOR '/'
#endif

/* A custom structure to hold separate file and directory counts */
struct filecount {
    long dirs;
    long files;
};

/*
 * counts the number of files and directories in the specified directory.
 *
 * path - relative pathname of a directory whose files should be counted
 * counts - pointer to struct containing file/dir counts
 */
void count(char *path, struct filecount *counts) {
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
/* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
#if !defined ( _DIRENT_HAVE_D_TYPE )
    struct stat statbuf;     /* buffer for stat() info */
#endif

/*  fprintf(stderr, "Opening dir %s\n", path); */
    dir = opendir(path);

/* opendir failed... file likely doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(path);
        return;
    }

    while((ent = readdir(dir))) {
        if (strlen(path) + 1 + strlen(ent->d_name) > PATH_MAX) {
            fprintf(stdout, "path too long (%ld) %s%c%s", (strlen(path) + 1 + strlen(ent->d_name)), path, PATH_SEPARATOR, ent->d_name);
            return;
        }

/* Use dirent.d_type if present, otherwise use stat() */
#if defined ( _DIRENT_HAVE_D_TYPE )
/*      fprintf(stderr, "Using dirent.d_type\n"); */
        if(DT_DIR == ent->d_type) {
#else
/*      fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
        sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
        if(lstat(subpath, &statbuf)) {
            perror(subpath);
            return;
        }

        if(S_ISDIR(statbuf.st_mode)) {
#endif
/* Skip "." and ".." directory entries... they are not "real" directories */
            if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
/*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
            } else {
                sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
                counts->dirs++;
                count(subpath, counts);
            }
        } else {
            counts->files++;
        }
    }

/*  fprintf(stderr, "Closing dir %s\n", path); */
    closedir(dir);
}

int main(int argc, char *argv[]) {
    struct filecount counts;
    counts.files = 0;
    counts.dirs = 0;
    count(argv[1], &counts);

/* If we found nothing, this is probably an error which has already been printed */
    if(0 < counts.files || 0 < counts.dirs) {
        printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
    }

    return 0;
}
EDIT 2017-01-17
I've incorporated two changes suggested by @FlyingCodeMonkey:
- Use lstat instead of stat. This will change the behavior of the program if you have symlinked directories in the directory you are scanning. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory will count as a single file, and its contents will not be counted.
- If the path of a file is too long, an error message will be emitted and the program will halt.
EDIT 2017-06-29
With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull-request from GitHub.

The source is available under Apache License 2.0. Patches* welcome!

* "patch" is what old people like me call a "pull request".
Answered by Mohammad Anini

ls spends more time sorting the file names; using -f to disable the sorting will save some time:

ls -f | wc -l

or you can use find:

find . -type f | wc -l
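One caveat with all the wc -l approaches (my own note, not from the answer): a filename containing an embedded newline would be counted twice. With GNU find and tr, counting NUL terminators instead avoids this:

```shell
# Throwaway directory with two known files (illustrative names)
demo=$(mktemp -d)
touch "$demo/plain" "$demo/also plain"

# -print0 terminates each name with a NUL byte; keep only the NULs and count them
find "$demo" -type f -print0 | tr -dc '\0' | wc -c   # prints 2
```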