Fast folder size calculation in Python on Windows

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1987119/

Tags: python, windows, size, directory

Asked by Laurent Luce

I am looking for a fast way to calculate the size of a folder in Python on Windows. This is what I have so far:

import platform
import win32con
import win32file   # pywin32, Windows-only

DIR_EXCLUDES = ['.', '..']  # not shown in the original post; the accepted answer suggests using a set

def get_dir_size(path):
  total_size = 0
  if platform.system() == 'Windows':
    try:
      # One call returns a WIN32_FIND_DATA tuple for every entry in the folder.
      items = win32file.FindFilesW(path + r'\*')
    except Exception:
      return 0

    # Add the size or perform recursion on folders.
    for item in items:
      attr = item[0]
      name = item[-2]  # file name
      size = item[5]   # low 32 bits of the file size

      if (attr & win32con.FILE_ATTRIBUTE_DIRECTORY) and \
         not (attr & win32con.FILE_ATTRIBUTE_SYSTEM):  # skip system dirs
        if name not in DIR_EXCLUDES:
          total_size += get_dir_size("%s\\%s" % (path, name))

      total_size += size

  return total_size

This is not good enough when the folder size is over 100 GB. Any ideas on how to improve it?

On a fast machine (2 GHz+, 5 GB of RAM), it took 72 seconds to go over 422 GB in 226,001 files and 12,043 folders. It takes 40 seconds using the Explorer properties option.

I know I am being a bit greedy, but I am hoping for a better solution.

Laurent Luce

Accepted answer by Peter Hansen

A quick profiling of your code suggests that over 90% of the time is consumed in the FindFilesW() call alone. This means any improvements by tweaking the Python code would be minor.

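For anyone who wants to reproduce that kind of measurement, Python's built-in cProfile module is enough. The snippet below is only an illustrative sketch added here (the folder path is a placeholder, and get_dir_size is assumed to be defined in the same script), not part of the original answer:

import cProfile
import pstats

# Profile a single call of get_dir_size() and print the ten entries with the
# largest cumulative time.
cProfile.run("get_dir_size(r'C:\\some\\folder')", 'dirsize.prof')
pstats.Stats('dirsize.prof').sort_stats('cumulative').print_stats(10)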

Tiny tweaks (if you were to stick with FindFilesW) could include ensuring DIR_EXCLUDES is a set instead of a list, avoiding the repeated lookups on other modules, and indexing into item[] lazily, as well as moving the sys.platform check outside. The version below incorporates these changes and others, but it won't give more than a 1-2% speedup:

import pywintypes
import win32con
import win32file   # pywin32

DIR_EXCLUDES = set(['.', '..'])
MASK = win32con.FILE_ATTRIBUTE_DIRECTORY | win32con.FILE_ATTRIBUTE_SYSTEM
REQUIRED = win32con.FILE_ATTRIBUTE_DIRECTORY
FindFilesW = win32file.FindFilesW  # avoid repeated attribute lookups in the loop

def get_dir_size(path):
    total_size = 0
    try:
        items = FindFilesW(path + r'\*')
    except pywintypes.error:
        return total_size

    for item in items:
        total_size += item[5]               # low 32 bits of the file size
        if (item[0] & MASK) == REQUIRED:    # a directory, but not a system one
            name = item[8]                  # file name
            if name not in DIR_EXCLUDES:
                total_size += get_dir_size(path + '\\' + name)

    return total_size

The only significant speedup would come from using a different API, or a different technique. You mentioned in a comment that you do this in the background, so you could structure it to do an incremental update using one of the packages for monitoring changes in folders. Possibly the FindFirstChangeNotification API or something like it. You could set up to monitor the entire tree, or, depending on how that routine works (I haven't used it), you might be better off registering multiple requests on various subsets of the full tree, if that reduces the amount of searching you have to do (when notified) to figure out what actually changed and what size it is now.

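As a very rough illustration of that idea (this sketch is not from the original answer; the flags and calls should be double-checked against the pywin32 documentation), a change-notification loop with pywin32 might look something like this:

import win32con
import win32event
import win32file

WATCH_FLAGS = (win32con.FILE_NOTIFY_CHANGE_FILE_NAME |
               win32con.FILE_NOTIFY_CHANGE_DIR_NAME |
               win32con.FILE_NOTIFY_CHANGE_SIZE)

def watch_tree(path):
    # Ask Windows to signal this handle whenever anything under 'path' changes.
    handle = win32file.FindFirstChangeNotification(path, True, WATCH_FLAGS)
    try:
        while True:
            rc = win32event.WaitForSingleObject(handle, 5000)  # 5 second timeout
            if rc == win32event.WAIT_OBJECT_0:
                # Something changed; an incremental re-scan of the affected
                # subtree (rather than the whole drive) would go here.
                print('change detected under %s' % path)
                win32file.FindNextChangeNotification(handle)
    finally:
        win32file.FindCloseChangeNotification(handle)

A real version would pair this with a cached per-folder size table so that only the folders that actually changed get re-scanned.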

Edit: I asked in a comment whether you were taking into account the heavy filesystem metadata caching that Windows XP and later do. I just checked the performance of your code (and mine) against Windows itself, selecting all items in my C:\ folder and hitting Alt-Enter to bring up the properties window. After doing this once (using your code) and getting a 40s elapsed time, I now get 20s elapsed from both methods. In other words, your code is actually just as fast as Windows itself, at least on my machine.

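To see that caching effect yourself, timing the same call twice back to back is enough; this snippet is only an illustration added here (the folder path is a placeholder), not part of the answer:

import time

def timed(label, func, *args):
    start = time.time()
    result = func(*args)
    print('%s: %.1f s, %d bytes' % (label, time.time() - start, result))
    return result

# The second run benefits from the filesystem metadata Windows cached during the first.
timed('first run', get_dir_size, r'C:\some\folder')
timed('second run', get_dir_size, r'C:\some\folder')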

Answer by jbochi

You don't need to use a recursive algorithm if you use os.walk. Please check this question.

You should time both approaches, but this is supposed to be much faster:

import os

def get_dir_size(root):
    size = 0
    # os.walk does the recursion for us, yielding a (path, dirs, files) triple per folder.
    for path, dirs, files in os.walk(root):
        for f in files:
            size += os.path.getsize(os.path.join(path, f))
    return size

Answer by ephemient

I don't have a Windows box to test on at the moment, but the documentation states that win32file.FindFilesIterator is "similar to win32file.FindFiles, but avoid the creation of the list for huge directories". Does that help?

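If it does, the change to the accepted answer's function would be small. The following is only an untested sketch of how it might look (assuming the same item layout as FindFilesW):

import win32con
import win32file

DIR_EXCLUDES = set(['.', '..'])

def get_dir_size_iter(path):
    total_size = 0
    # FindFilesIterator yields one WIN32_FIND_DATA tuple at a time instead of
    # building the whole list up front, which should help on huge directories.
    for item in win32file.FindFilesIterator(path + r'\*'):
        total_size += item[5]
        if item[0] & win32con.FILE_ATTRIBUTE_DIRECTORY:
            name = item[8]
            if name not in DIR_EXCLUDES:
                total_size += get_dir_size_iter(path + '\\' + name)
    return total_size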

Answer by Jürgen A. Erhard

It's a whopper of a directory tree. As others have said, I'm not sure you can speed it up... not like that, cold, without data. And that means...

If you can cache data, somehow (not sure what the actual implication is), then you could speed things up (I think... as always, measure, measure, measure).

I don't think I have to tell you how to do caching, I guess; you seem like a knowledgeable person. And I wouldn't know off the cuff for Windows anyway. ;-)

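For what it's worth, one very simple-minded way to cache folder sizes (purely an illustration, not something this answer prescribes) is to key them on each folder's modification time and only re-scan when that changes:

import os

_size_cache = {}  # folder path -> (mtime when computed, computed size)

def cached_dir_size(path, compute_size):
    # 'compute_size' is whatever expensive routine actually walks the folder.
    # A folder's mtime only changes when its direct entries change, so this is
    # an approximation rather than an exact invalidation scheme.
    mtime = os.path.getmtime(path)
    cached = _size_cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    size = compute_size(path)
    _size_cache[path] = (mtime, size)
    return size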

Answer by Robert P

This jumps out at me:

try:
  items = win32file.FindFilesW(path + r'\*')
except Exception:
  return 0

Exception handling can add significant time to your algorithm. If you can specify the path differently, in a way that you always know is safe, and thus prevent the need to capture exceptions (e.g., checking first to see if the given path is a folder before finding files in that folder), you may find a significant speedup.

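As a sketch of that suggestion (using os.path.isdir here rather than the raw Win32 attribute calls; an access-denied folder could still raise, so some error handling may remain necessary):

import os
import win32file

def list_folder(path):
    # Skip the FindFilesW call entirely for anything that is not a directory,
    # so the common case never has to raise and catch an exception.
    if not os.path.isdir(path):
        return []
    return win32file.FindFilesW(path + r'\*')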

Answer by Boubakr Nour

# Size of File Folder/Directory in MBytes

import os

# pick a folder you have ...
folder = r'D:\zz1'
folder_size = 0
for (path, dirs, files) in os.walk(folder):
  for file in files:
    filename = os.path.join(path, file)
    folder_size += os.path.getsize(filename)

print "Folder = %0.1f MB" % (folder_size/(1024*1024.0))