Directory file size calculation - how to make it faster?

Note: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/2979432/

Asked by Jey Geethan
Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder, sum up their sizes, check whether there are subdirectories, and then recurse into them.
I also tried another way to do this: using FSO (obj.GetFolder(path).Size). There is not much difference in time between these two approaches.
Now the problem is, I have tens of thousands of files in a particular folder and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 seconds). I think Windows is caching the file sizes.
Is there any way I can bring down the time taken when I run the program the first time?
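For reference, a minimal sketch of the recursive approach described in the question; the method name is a placeholder of mine, not from the post:

// A minimal sketch of the naive recursive approach described above.
using System.IO;

static long GetDirectorySize(string path)
{
    long size = 0;
    // Sum the sizes of the files directly in this folder.
    foreach (string file in Directory.GetFiles(path))
    {
        size += new FileInfo(file).Length;
    }
    // Then recurse into each subdirectory.
    foreach (string dir in Directory.GetDirectories(path))
    {
        size += GetDirectorySize(dir);
    }
    return size;
}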
Answered by spookycoder
I fiddled with it for a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times on a quad-core). I don't know if it is valid in all cases, but give it a try...

.NET 4.0 code (or use 3.5 with the Task Parallel Library)
// Requires: using System.IO; using System.Threading; using System.Threading.Tasks;
private static long DirSize(string sourceDir, bool recurse)
{
    long size = 0;
    string[] fileEntries = Directory.GetFiles(sourceDir);

    foreach (string fileName in fileEntries)
    {
        Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
    }

    if (recurse)
    {
        string[] subdirEntries = Directory.GetDirectories(sourceDir);

        // Thread-local subtotals are merged into the shared total in the final delegate.
        Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            // Skip reparse points (junctions/symlinks) to avoid cycles.
            if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
            {
                subtotal += DirSize(subdirEntries[i], true);
            }
            return subtotal;
        },
        (x) => Interlocked.Add(ref size, x)
        );
    }
    return size;
}
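For illustration, a possible call site (the folder path is a placeholder, not from the answer):

// Hypothetical usage; the folder path is a placeholder.
long totalBytes = DirSize(@"C:\SomeFolder", true);
Console.WriteLine("Total size: {0} bytes", totalBytes);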
Answered by stuck
Hard disks are an interesting beast - sequential access (reading a big contiguous file, for example) is super zippy, on the order of 80 megabytes/sec. Random access, however, is very slow. This is what you're bumping into - recursing into the folders won't read much data (in terms of quantity), but will require many random reads. The reason you're seeing zippy perf the second go-around is because the MFT is still in RAM (you're correct about the caching).
The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is that you read and parse the MFT in one linear pass, building the information you need as you go. The end result will be something much closer to 15 seconds on an HD that is very full.
Some good reading:
NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx
Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1
FWIW: this method is very complicated, as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that figuring out which folders/files are needed requires a lot of head movement on the disk. It would be very tough for Microsoft to build a general solution to the problem you describe.
Answered by Evan
The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size and all parent directory sizes on each file write. However, that would make file writes a slower operation. Since file writes are much more common than reads of directory sizes, this is a reasonable tradeoff.
I am not sure what exact problem is being solved, but if it is file system monitoring it might be worth checking out: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx
Answered by AMissico
Performance will suffer using any method when scanning a folder with tens of thousands of files.

- Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.
- Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.
- How you handle the results of any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.
- You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.
- I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using BIOS).
- Yes, the operating system caches file access.
Suggestions

- I wonder if BackupRead would help in your case?
- What if you shell out to DIR and capture then parse its output? (You are not really parsing, because each DIR line is fixed-width, so it is just a matter of calling Substring.)
- What if you shell out to DIR /B > NULL on a background thread, then run your program? While DIR is running, you will benefit from the cached file access. (A rough sketch of this idea follows the list.)
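A possible sketch of that last suggestion, assuming a helper method and placeholder path of my own (not from the answer); it simply runs DIR in the background so the file-system metadata ends up in the cache:

// Hypothetical sketch: warm the file-system cache by running DIR on a background thread.
// The method name and path are placeholders.
// Requires: using System.Diagnostics; using System.Threading.Tasks;
static Task WarmCacheAsync(string path)
{
    return Task.Factory.StartNew(() =>
    {
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /b \"" + path + "\" > NUL")
        {
            CreateNoWindow = true,
            UseShellExecute = false
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
        }
    });
}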
Answered by Hans Olsson
I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.
I don't think there's any really quick way of doing it, however. For comparison purposes you could try doing dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.
PInvoke has a sample of how to use the API.
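A condensed sketch of summing file sizes with FindFirstFile/FindNextFile via P/Invoke, adapted from the usual pinvoke.net pattern rather than taken from this answer; it skips "." and ".." entries and reparse points:

// Requires: using System; using System.IO; using System.Runtime.InteropServices;
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
struct WIN32_FIND_DATA
{
    public FileAttributes dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
}

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll")]
static extern bool FindClose(IntPtr hFindFile);

static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

static long NativeDirSize(string dir)
{
    long size = 0;
    WIN32_FIND_DATA findData;
    IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out findData);
    if (handle == INVALID_HANDLE_VALUE) return 0;
    try
    {
        do
        {
            // Skip the "." and ".." pseudo-entries.
            if (findData.cFileName == "." || findData.cFileName == "..") continue;
            if ((findData.dwFileAttributes & FileAttributes.Directory) != 0)
            {
                // Recurse into subdirectories, skipping reparse points (junctions/symlinks).
                if ((findData.dwFileAttributes & FileAttributes.ReparsePoint) == 0)
                    size += NativeDirSize(Path.Combine(dir, findData.cFileName));
            }
            else
            {
                // Combine the high and low halves of the 64-bit size.
                size += ((long)findData.nFileSizeHigh << 32) | findData.nFileSizeLow;
            }
        } while (FindNextFile(handle, out findData));
    }
    finally
    {
        FindClose(handle);
    }
    return size;
}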
Answered by Adam Calvet Bohl
Based on the answer by spookycoder, I found this variation (using DirectoryInfo) at least 2 times faster (and up to 10 times faster on complex folder structures!):
// Requires: using System.IO; using System.Threading; using System.Threading.Tasks;
public static long CalcDirSize(string sourceDir, bool recurse = true)
{
    return _CalcDirSize(new DirectoryInfo(sourceDir), recurse);
}

private static long _CalcDirSize(DirectoryInfo di, bool recurse = true)
{
    long size = 0;
    FileInfo[] fiEntries = di.GetFiles();
    foreach (var fiEntry in fiEntries)
    {
        Interlocked.Add(ref size, fiEntry.Length);
    }

    if (recurse)
    {
        DirectoryInfo[] diEntries = di.GetDirectories("*.*", SearchOption.TopDirectoryOnly);
        System.Threading.Tasks.Parallel.For<long>(0, diEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            // Skip reparse points (junctions/symlinks) to avoid cycles.
            if ((diEntries[i].Attributes & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint) return subtotal;
            subtotal += _CalcDirSize(diEntries[i], true);
            return subtotal;
        },
        (x) => Interlocked.Add(ref size, x)
        );
    }
    return size;
}
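Like spookycoder's version, this accumulates each subtree's size into a thread-local subtotal inside Parallel.For and only touches the shared total through Interlocked.Add in the final merge step, so the worker threads do not contend on a single counter.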
Answered by Adrian Regan
I gave up on the .NET implementations (for performance reasons) and used the native function GetFileAttributesEx(...).

Try this:
// Requires: using System.Runtime.InteropServices;
[StructLayout(LayoutKind.Sequential)]
public struct WIN32_FILE_ATTRIBUTE_DATA
{
    public uint fileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
    public uint fileSizeHigh;
    public uint fileSizeLow;
}

public enum GET_FILEEX_INFO_LEVELS
{
    GetFileExInfoStandard,
    GetFileExMaxInfoLevel
}

public class NativeMethods
{
    [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
    public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS level, out WIN32_FILE_ATTRIBUTE_DATA data);
}
Now simply do the following:

WIN32_FILE_ATTRIBUTE_DATA data;
if (NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
{
    // Combine the high and low 32-bit halves into one 64-bit size.
    long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
}
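A hedged sketch of how this might be combined with directory enumeration to total one folder; the path is a placeholder of mine, and the enumeration itself still goes through .NET here:

// Hypothetical usage: total the files in one folder via the native call.
// Requires: using System.IO; plus the declarations above.
long total = 0;
foreach (string file in Directory.GetFiles(@"C:\SomeFolder"))
{
    WIN32_FILE_ATTRIBUTE_DATA data;
    if (NativeMethods.GetFileAttributesEx(file, GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
    {
        total += ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
    }
}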
Answered by Chris Kemp
With tens of thousands of files, you're not going to win with a head-on assault. You need to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes to calculate the size, the files have changed and your data is already wrong.
So, you need to move the load somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.
It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, and then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember, all data is stale!).
Then, the only thing to look out for would be that you would have multiple resources trying to access the resulting output file. So just make sure that you take that into account.
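A rough sketch of the watcher side of that idea, assuming a single directory and an in-memory index; the path is a placeholder and the service plumbing, event buffering, and persistence are omitted:

// Hypothetical sketch: keep per-file sizes in a dictionary and derive the total from it,
// updated from FileSystemWatcher events.
// Requires: using System.Collections.Concurrent; using System.IO; using System.Linq;
var sizes = new ConcurrentDictionary<string, long>();
string root = @"C:\SomeFolder";

// Initial full scan (slow once, then kept current by events).
foreach (string file in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
    sizes[file] = new FileInfo(file).Length;

var watcher = new FileSystemWatcher(root)
{
    IncludeSubdirectories = true,
    NotifyFilter = NotifyFilters.FileName | NotifyFilters.Size | NotifyFilters.LastWrite
};

FileSystemEventHandler update = (s, e) =>
{
    if (File.Exists(e.FullPath))
        sizes[e.FullPath] = new FileInfo(e.FullPath).Length;  // created or changed
    else
    {
        long removed;
        sizes.TryRemove(e.FullPath, out removed);             // deleted
    }
};
watcher.Created += update;
watcher.Changed += update;
watcher.Deleted += update;
watcher.EnableRaisingEvents = true;

// The "directory size" is now just a sum over the index.
long totalSize = sizes.Values.Sum();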