Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/724148/

Date: 2020-08-04 22:35:36 · Source: igfitidea

Is there a faster way to scan through a directory recursively in .NET?

Tags: c#, .net, filesystems

Asked by Sam Saffron

I am writing a directory scanner in .NET.

For each File/Dir I need the following info.

   class Info {
        public bool IsDirectory;
        public string Path;
        public DateTime ModifiedDate;
        public DateTime CreatedDate;
    }

I have this function:

      static List<Info> RecursiveMovieFolderScan(string path){

        var info = new List<Info>();
        var dirInfo = new DirectoryInfo(path);
        foreach (var dir in dirInfo.GetDirectories()) {
            info.Add(new Info() {
                IsDirectory = true,
                CreatedDate = dir.CreationTimeUtc,
                ModifiedDate = dir.LastWriteTimeUtc,
                Path = dir.FullName
            });

            info.AddRange(RecursiveMovieFolderScan(dir.FullName));
        }

        foreach (var file in dirInfo.GetFiles()) {
            info.Add(new Info()
            {
                IsDirectory = false,
                CreatedDate = file.CreationTimeUtc,
                ModifiedDate = file.LastWriteTimeUtc,
                Path = file.FullName
            });
        }

        return info; 
    }

Turns out this implementation is quite slow. Is there any way to speed it up? I'm thinking of hand-coding this with FindFirstFileW, but would like to avoid that if there is a built-in way that is faster.

Accepted answer by Sam Saffron

This implementation, which needs a bit of tweaking, is 5-10X faster.

    static List<Info> RecursiveScan2(string directory) {
        IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
        WIN32_FIND_DATAW findData;
        IntPtr findHandle = INVALID_HANDLE_VALUE;

        var info = new List<Info>();
        try {
            findHandle = FindFirstFileW(directory + @"\*", out findData);
            if (findHandle != INVALID_HANDLE_VALUE) {

                do {
                    if (findData.cFileName == "." || findData.cFileName == "..") continue;

                    string fullpath = directory + (directory.EndsWith("\\") ? "" : "\\") + findData.cFileName;

                    bool isDir = false;

                    if ((findData.dwFileAttributes & FileAttributes.Directory) != 0) {
                        isDir = true;
                        info.AddRange(RecursiveScan2(fullpath));
                    }

                    info.Add(new Info()
                    {
                        CreatedDate = findData.ftCreationTime.ToDateTime(),
                        ModifiedDate = findData.ftLastWriteTime.ToDateTime(),
                        IsDirectory = isDir,
                        Path = fullpath
                    });
                }
                while (FindNextFile(findHandle, out findData));

            }
        } finally {
            if (findHandle != INVALID_HANDLE_VALUE) FindClose(findHandle);
        }
        return info;
    }

extension method:

 public static class FILETIMEExtensions {
        public static DateTime ToDateTime(this System.Runtime.InteropServices.ComTypes.FILETIME filetime ) {
            long highBits = filetime.dwHighDateTime;
            highBits = highBits << 32;
            // Cast the low half through uint: a direct (long) cast would
            // sign-extend whenever its top bit is set, corrupting the time.
            return DateTime.FromFileTimeUtc(highBits + (uint)filetime.dwLowDateTime);
        }
    }
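The low 32 bits of a FILETIME must be treated as unsigned when reassembling the 64-bit value. A standalone round-trip check of just that arithmetic (not tied to the interop code above):

```csharp
using System;

class FileTimeRoundTrip
{
    static void Main()
    {
        // DateTime -> 64-bit file time -> the two 32-bit halves Windows
        // stores -> reassemble -> DateTime again.
        var original = new DateTime(2020, 8, 4, 22, 35, 36, DateTimeKind.Utc);
        long fileTime = original.ToFileTimeUtc();

        int high = (int)(fileTime >> 32);
        int low = (int)(fileTime & 0xFFFFFFFF); // top bit may be set, i.e. negative as int

        // The low half must go through uint; (long)low would sign-extend
        // and corrupt the value whenever that top bit is set.
        long reassembled = ((long)high << 32) + (uint)low;

        Console.WriteLine(DateTime.FromFileTimeUtc(reassembled) == original); // True
    }
}
```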

interop defs are:

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr FindFirstFileW(string lpFileName, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll")]
    public static extern bool FindClose(IntPtr hFindFile);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    public struct WIN32_FIND_DATAW {
        public FileAttributes dwFileAttributes;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public int nFileSizeHigh;
        public int nFileSizeLow;
        public int dwReserved0;
        public int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

Answered by Jimmy

try this (i.e. do the initialization first, and then reuse your list and your directoryInfo objects):

  static List<Info> RecursiveMovieFolderScan1(string path) {
      var info = new List<Info>();
      var dirInfo = new DirectoryInfo(path);
      RecursiveMovieFolderScan(dirInfo, info);
      return info;
  } 

  static List<Info> RecursiveMovieFolderScan(DirectoryInfo dirInfo, List<Info> info){

    foreach (var dir in dirInfo.GetDirectories()) {

        info.Add(new Info() {
            IsDirectory = true,
            CreatedDate = dir.CreationTimeUtc,
            ModifiedDate = dir.LastWriteTimeUtc,
            Path = dir.FullName
        });

        RecursiveMovieFolderScan(dir, info);
    }

    foreach (var file in dirInfo.GetFiles()) {
        info.Add(new Info()
        {
            IsDirectory = false,
            CreatedDate = file.CreationTimeUtc,
            ModifiedDate = file.LastWriteTimeUtc,
            Path = file.FullName
        });
    }

    return info; 
}

Answered by tylerl

Depending on how much time you're trying to shave off the function, it may be worth your while to call the Win32 API functions directly, since the existing API does a lot of extra processing to check things that you may not be interested in.

If you haven't done so already, and assuming you don't intend to contribute to the Mono project, I would strongly recommend downloading Reflector and having a look at how Microsoft implemented the API calls you're currently using. This will give you an idea of what you need to call and what you can leave out.

You might, for example, opt to create an iterator that yields directory names instead of a function that returns a list, that way you don't end up iterating over the same list of names two or three times through all the various levels of code.

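One way to sketch that idea, reusing the question's Info class and the managed DirectoryInfo API (the names here are illustrative, not from any answer):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Info
{
    public bool IsDirectory;
    public string Path;
    public DateTime ModifiedDate;
    public DateTime CreatedDate;
}

static class LazyScanner
{
    // Yields entries one at a time instead of building nested lists, so the
    // caller can start consuming (or stop early via Take/First) without
    // waiting for the whole tree to be walked.
    public static IEnumerable<Info> Scan(string path)
    {
        foreach (var entry in new DirectoryInfo(path).EnumerateFileSystemInfos())
        {
            bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
            yield return new Info
            {
                IsDirectory = isDir,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            };
            if (isDir)
                foreach (var child in Scan(entry.FullName))
                    yield return child;
        }
    }
}

class Program
{
    static void Main()
    {
        // Stop after the first 10 entries; the rest of the tree is never touched.
        foreach (var item in LazyScanner.Scan(".").Take(10))
            Console.WriteLine($"{(item.IsDirectory ? "D" : "F")}  {item.Path}");
    }
}
```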
Answered by Bertvan

I'd use or base myself on this multi-threaded library: http://www.codeproject.com/KB/files/FileFind.aspx

Answered by Robert Paulson

"It's pretty shallow: 371 dirs with an average of 10 files in each directory. Some dirs contain other sub dirs."

This is just a comment, but your numbers do appear to be quite high. I ran the below using essentially the same recursive method you are using and my times are far lower despite creating string output.

    // Counter fields referenced below (implied by the original snippet).
    private int _dirCounter;
    private int _fileCounter;
    private int _maxDepth;

    public void RecurseTest(DirectoryInfo dirInfo, 
                            StringBuilder sb, 
                            int depth)
    {
        _dirCounter++;
        if (depth > _maxDepth)
            _maxDepth = depth;

        var array = dirInfo.GetFileSystemInfos();
        foreach (var item in array)
        {
            sb.Append(item.FullName);
            if (item is DirectoryInfo)
            {
                sb.Append(" (D)");
                sb.AppendLine();

                RecurseTest(item as DirectoryInfo, sb, depth+1);
            }
            else
            { _fileCounter++; }

            sb.AppendLine();
        }
    }

I ran the above code on a number of different directories. On my machine the 2nd call to scan a directory tree was usually faster due to caching either by the runtime or the file system. Note that this system isn't anything too special, just a 1yr old development workstation.

// cached call
Dirs = 150, files = 420, max depth = 5
Time taken = 53 milliseconds

// cached call
Dirs = 1117, files = 9076, max depth = 11
Time taken = 433 milliseconds

// first call
Dirs = 1052, files = 5903, max depth = 12
Time taken = 11921 milliseconds

// first call
Dirs = 793, files = 10748, max depth = 10
Time taken = 5433 milliseconds (2nd run 363 milliseconds)

Concerned that I wasn't getting the creation and modified dates, I modified the code to output them as well, with the following times.

// now grabbing last update and creation time.
Dirs = 150, files = 420, max depth = 5
Time taken = 103 milliseconds (2nd run 93 milliseconds)

Dirs = 1117, files = 9076, max depth = 11
Time taken = 992 milliseconds (2nd run 984 milliseconds)

Dirs = 793, files = 10748, max depth = 10
Time taken = 1382 milliseconds (2nd run 735 milliseconds)

Dirs = 1052, files = 5903, max depth = 12
Time taken = 936 milliseconds (2nd run 595 milliseconds)

Note: the System.Diagnostics.Stopwatch class was used for timing.

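That timing pattern, sketched generically (the Sleep is just a stand-in for the scan being measured):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimingDemo
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        Thread.Sleep(50); // stand-in for the directory scan being timed
        sw.Stop();
        Console.WriteLine("Time taken = {0} milliseconds", sw.ElapsedMilliseconds);
    }
}
```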
Answered by Jim Mischel

I just ran across this. Nice implementation of the native version.

This version, while still slower than the version that uses FindFirst and FindNext, is quite a bit faster than your original .NET version.

    static List<Info> RecursiveMovieFolderScan(string path)
    {
        var info = new List<Info>();
        var dirInfo = new DirectoryInfo(path);
        foreach (var entry in dirInfo.GetFileSystemInfos())
        {
            bool isDir = (entry.Attributes & FileAttributes.Directory) != 0;
            if (isDir)
            {
                info.AddRange(RecursiveMovieFolderScan(entry.FullName));
            }
            info.Add(new Info()
            {
                IsDirectory = isDir,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            });
        }
        return info;
    }

It should produce the same output as your native version. My testing shows that this version takes about 1.7 times as long as the version that uses FindFirst and FindNext. Timings were obtained in release mode, running without the debugger attached.

Curiously, changing the GetFileSystemInfos to EnumerateFileSystemInfos adds about 5% to the running time in my tests. I rather expected it to run at the same speed or possibly faster because it didn't have to create the array of FileSystemInfo objects.

The following code is shorter still, because it lets the Framework take care of recursion. But it's a good 15% to 20% slower than the version above.

    static List<Info> RecursiveScan3(string path)
    {
        var info = new List<Info>();

        var dirInfo = new DirectoryInfo(path);
        foreach (var entry in dirInfo.EnumerateFileSystemInfos("*", SearchOption.AllDirectories))
        {
            info.Add(new Info()
            {
                IsDirectory = (entry.Attributes & FileAttributes.Directory) != 0,
                CreatedDate = entry.CreationTimeUtc,
                ModifiedDate = entry.LastWriteTimeUtc,
                Path = entry.FullName
            });
        }
        return info;
    }

Again, if you change that to GetFileSystemInfos, it will be slightly (but only slightly) faster.

For my purposes, the first solution above is quite fast enough. The native version runs in about 1.6 seconds. The version that uses DirectoryInfo runs in about 2.9 seconds. I suppose if I were running these scans very frequently, I'd change my mind.

Answered by csharptest.net

There is a long history of the .NET file enumeration methods being slow. The issue is there is not an instantaneous way of enumerating large directory structures. Even the accepted answer here has its issues with GC allocations.

The best I've been able to do is wrapped up in my library and exposed as the FindFile (source) class in the CSharpTest.Net.IO namespace. This class can enumerate files and folders without unneeded GC allocations and string marshaling.

The usage is simple enough, and the RaiseOnAccessDenied property will skip the directories and files the user does not have access to:

    private static long SizeOf(string directory)
    {
        var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
        fcounter.RaiseOnAccessDenied = false;

        long size = 0, total = 0;
        fcounter.FileFound +=
            (o, e) =>
            {
                if (!e.IsDirectory)
                {
                    Interlocked.Increment(ref total);
                    size += e.Length;
                }
            };

        Stopwatch sw = Stopwatch.StartNew();
        fcounter.Find();
        Console.WriteLine("Enumerated {0:n0} files totaling {1:n0} bytes in {2:n3} seconds.",
                          total, size, sw.Elapsed.TotalSeconds);
        return size;
    }

For my local C:\ drive this outputs the following:

Enumerated 810,046 files totaling 307,707,792,662 bytes in 232.876 seconds.

Your mileage may vary by drive speed, but this is the fastest method I've found of enumerating files in managed code. The event parameter is a mutating class of type FindFile.FileFoundEventArgs, so be sure you do not keep a reference to it, as its values will change for each event raised.

You might also note that the DateTimes exposed are in UTC only. The reason is that the conversion to local time is semi-expensive. You might consider using UTC times to improve performance rather than converting these to local time.

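The relative cost is easy to see with a rough micro-benchmark (a sketch only; absolute numbers will vary with the machine and its time-zone data):

```csharp
using System;
using System.Diagnostics;

class UtcVsLocalCost
{
    static void Main()
    {
        const int N = 1_000_000;
        DateTime utc = DateTime.UtcNow;
        long sink = 0; // accumulate so the loops are not optimized away

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            sink += utc.Ticks; // UTC value used as-is: no time-zone lookup
        long utcMs = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < N; i++)
            sink += utc.ToLocalTime().Ticks; // each call consults time-zone/DST rules
        long localMs = sw.ElapsedMilliseconds;

        Console.WriteLine($"utc-only: {utcMs} ms, with ToLocalTime: {localMs} ms (sink={sink})");
    }
}
```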
Answered by user3918709

Recently I had the same question. I think it also works well to output all folders and files into a text file, then use a StreamReader to read the text file back and do whatever processing you want with multiple threads.

cmd.exe /u /c dir "M:\" /s /b >"c:\flist1.txt"
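A runnable sketch of the read-back side of that idea. The listing file and the paths in it are stand-ins generated by the demo itself rather than real `dir` output, so the shape of the processing loop is what matters here:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ListingReader
{
    static void Main()
    {
        // Stand-in for the "dir /s /b > flist1.txt" output: one full path per line.
        string listing = Path.Combine(Path.GetTempPath(), "flist_demo.txt");
        File.WriteAllLines(listing, new[]
        {
            @"M:\movies\a.mkv",
            @"M:\movies\b.mkv",
            @"M:\movies\sub\c.mkv"
        });

        int processed = 0;
        // Stream the listing line by line and fan the per-path work out
        // to the thread pool.
        Parallel.ForEach(File.ReadLines(listing), path =>
        {
            // ... real per-file work would go here ...
            Interlocked.Increment(ref processed);
        });

        Console.WriteLine(processed); // 3
    }
}
```

Note that the `/u` switch makes cmd.exe write UTF-16 output, so a real listing file should be read back with the matching encoding.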

[update] Hi Moby, you are correct. My approach is slower due to the overhead of reading back the output text file. I actually took some time to test the top answer and cmd.exe against 2 million files.

The top answer: 2010100 files, time: 53023
cmd.exe method: 2010100 files, cmd time: 64907, scan output file time: 19832.

The top answer's method (53023) is faster than cmd.exe (64907), not to mention the extra work of improving how the output text file is read back. Although my original aim was just to provide a not-too-bad answer, I still feel sorry, ha.

Answered by Justin Shidell

I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.

This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.

P/Invoke:

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFile(
    string lpFileName,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindNextFile(
    IntPtr hFindFile,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
    IntPtr hFindFile
    );

Method:

public static Tuple<long, long> CountFilesDirectories(
    string path,
    CancellationToken token
    )
{
    if (String.IsNullOrWhiteSpace(path))
        throw new ArgumentNullException("path", "The provided path is NULL or empty.");

    // If the provided path doesn't end in a backslash, append one.
    if (path.Last() != '\\')
        path += '\\';

    IntPtr hFile = IntPtr.Zero;
    Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();

    long files = 0;
    long dirs = 0;

    try
    {
        hFile = Win32.Kernel32.FindFirstFile(
            path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
            ref fd
            );

        // If we encounter an error, or there are no files/directories, we return no entries.
        if (hFile.ToInt64() == -1)
            return Tuple.Create<long, long>(0, 0);

        //
        // Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
        //

        List<string> directories = new List<string>();

        do
        {
            // If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
            // count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
            if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
                (fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
                fd.cFileName != "." && fd.cFileName != "..")
            {
                directories.Add(System.IO.Path.Combine(path, fd.cFileName));
                dirs++;
            }
            // Otherwise, if this is a file ("archive"), increment the file count.
            else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
            {
                files++;
            }
        }
        while (Win32.Kernel32.FindNextFile(hFile, ref fd));

        // Iterate over each discovered directory in parallel to maximize file/directory counting performance,
        // calling itself recursively to traverse each directory completely.
        Parallel.ForEach(
            directories,
            new ParallelOptions()
            {
                CancellationToken = token
            },
            directory =>
            {
                var count = CountFilesDirectories(
                    directory,
                    token
                    );

                lock (directories)
                {
                    files += count.Item1;
                    dirs += count.Item2;
                }
            });
    }
    catch (Exception)
    {
        // Handle as desired.
    }
    finally
    {
        // Don't call FindClose on a null handle or INVALID_HANDLE_VALUE (-1).
        if (hFile != IntPtr.Zero && hFile.ToInt64() != -1)
            Win32.Kernel32.FindClose(hFile);
    }

    return Tuple.Create<long, long>(files, dirs);
}

On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster—45 minutes vs. 90 seconds to access a remote directory of ~40k files, ~40 GB in size.

This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.

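For example, the total-size variant needs the per-entry size, which WIN32_FIND_DATA reports as two 32-bit halves (nFileSizeHigh/nFileSizeLow). A standalone sketch of just that arithmetic (the helper name is illustrative):

```csharp
using System;

class FindDataFileSize
{
    // Combine the two 32-bit halves, routing the low half through uint so
    // its top bit is not sign-extended into the upper 32 bits.
    static long CombineFileSize(int nFileSizeHigh, int nFileSizeLow) =>
        ((long)nFileSizeHigh << 32) | (uint)nFileSizeLow;

    static void Main()
    {
        // A 5 GiB file: high = 1, low = 0x40000000.
        Console.WriteLine(CombineFileSize(1, 0x40000000)); // 5368709120
    }
}
```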