Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/397293/
Fast file/directory scan method for Windows?
Asked by Parand
I'm looking for a high performance method or library for scanning all files on disk or in a given directory and grabbing their basic stats - filename, size, and modification date.
I've written a Python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.
Any ideas for speeding up the file scan? Here's my Python code, but keep in mind that I'm not at all married to os.walk and am perfectly willing to use other APIs (including Windows native APIs) if there are better alternatives.
for root, dirs, files in os.walk(top, topdown=False):
for name in files:
...
I should also note I realize the Python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.
Answered by rob
Well, I would expect this to be a heavily I/O-bound task. As such, optimizations on the Python side would be quite ineffective; the only optimization I could think of is some different way of accessing/listing files, in order to reduce the actual reads from the file system. This of course requires a deep knowledge of the file system, which I do not have, and which I do not expect Python's developers to have had while implementing os.walk.
What about spawning a command prompt, and then issuing 'dir' and parsing the results? It could be a bit of an overkill, but with any luck, 'dir' is making some effort towards such optimizations.
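A rough sketch of that idea, assuming Windows' built-in dir command and the standard subprocess module; actually parsing dir's locale-dependent output is left out here:

import subprocess

# /s recurses into subdirectories; /-c drops the thousands separators from sizes
output = subprocess.check_output('dir /s /-c', shell=True, universal_newlines=True)
for line in output.splitlines():
    print(line)  # extracting name/size/date from each line is locale-dependent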
Answered by Ole
It seems as if os.walk has been considerably improved in Python 2.5, so you might check if you're running that version.
Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not by a margin that would actually justify using it.
Answered by Benjamin Peterson
You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").
Answered by panofish
Use the scandir Python module (formerly betterwalk) by Ben Hoyt on GitHub.
http://github.com/benhoyt/scandir
It is much faster than os.walk, but uses the same syntax. Just import scandir and change os.walk() to scandir.walk(). That's it. It is the fastest way to traverse directories and files in Python.
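A minimal sketch of the drop-in replacement, assuming the scandir package has been installed (e.g. via pip install scandir):

import os
import scandir

# drop-in for os.walk: same signature, same results, faster directory scanning
for root, dirs, files in scandir.walk('.'):
    for name in files:
        print(os.path.join(root, name))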
Answered by Ian Clark
Python 3.5 just introduced os.scandir (see PEP 471), which avoids a number of non-required system calls such as stat() and GetFileAttributes() to provide a significantly quicker file-system iterator.
os.walk() will now be implemented using os.scandir() as its iterator, and so you should see potentially large performance improvements whilst continuing to use os.walk().
Example usage:
for entry in os.scandir(path):
if not entry.name.startswith('.') and entry.is_file():
print(entry.name)
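Since the original question also wants size and modification date, here is a sketch building on os.scandir (Python 3.5+; the scan helper name is just for illustration). Per PEP 471, on Windows DirEntry.stat() can usually be served from data the directory listing already returned, so no extra system call per file should be needed:

import os

def scan(path):
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            # recurse into subdirectories
            yield from scan(entry.path)
        elif entry.is_file(follow_symlinks=False):
            st = entry.stat()  # on Windows this typically reuses the listing data
            yield entry.path, st.st_size, st.st_mtime

for path, size, mtime in scan('.'):
    print(path, size, mtime)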
Answered by Triptych
I'm wondering if you might want to group your I/O operations.
For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.
If your OS stores these two kinds of data in different locations (directory structure in one place, file stats in another), then this might be a significant optimization.
Anyway, that's something I'd try before digging further.
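A minimal sketch of that two-phase idea, assuming top names the directory to scan; whether it pays off depends entirely on how the file system lays out its metadata:

import os

top = '.'  # assumed: the root directory to scan

# phase 1: walk the directory structure only, remembering every file path
paths = []
for root, dirs, files in os.walk(top):
    for name in files:
        paths.append(os.path.join(root, name))

# phase 2: fetch the file statistics in one pass over the stored paths
stats = [(path, os.stat(path)) for path in paths]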
Answered by S.Lott
When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.
For example, the following is only a hair faster than os.walk.
import os
import stat

# bind the lookups to local names once, to avoid repeated attribute access
listdir = os.listdir
pathjoin = os.path.join
fstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = fstat(nm).st_mode
        if is_dir(s):
            # recurse into subdirectories
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield f  # note: yields the bare name, not the joined path nm
        else:
            pass  # ignore anything that is neither a regular file nor a directory
Consequently, the overheads must be in the os module itself. You'll have to resort to making direct Windows API calls.
Look at the Python for Windows Extensions.
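For instance, a hedged sketch using win32file from those extensions; FindFilesW wraps the Win32 FindFirstFile/FindNextFile calls, so one directory read yields name, size and timestamps together, and the tuple indices below follow the documented WIN32_FIND_DATA layout (the find_files helper is just for illustration):

import os
import win32con
import win32file

def find_files(top):
    for data in win32file.FindFilesW(os.path.join(top, '*')):
        attrs, name = data[0], data[8]  # dwFileAttributes, cFileName
        if name in ('.', '..'):
            continue
        path = os.path.join(top, name)
        if attrs & win32con.FILE_ATTRIBUTE_DIRECTORY:
            for item in find_files(path):  # recurse into subdirectories
                yield item
        else:
            size = (data[4] << 32) + data[5]  # nFileSizeHigh, nFileSizeLow
            yield path, size, data[3]  # data[3] is ftLastWriteTime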
Answered by S.Lott
The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.
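For reference, a minimal sketch of os.path.walk; note that it is Python 2 only (it was removed in Python 3) and uses a callback instead of a generator:

import os.path

def visit(arg, dirname, names):
    # invoked once per directory with the names that directory contains
    print dirname, len(names)

os.path.walk('.', visit, None)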
Answered by tzot
This is only partial help, more like pointers; however:
I believe you need to do the following:
fp = open("C:/$MFT", "rb")
using an account that includes SYSTEM permissions, because even as an admin, you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. After you succeed in that, you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname), and off you go for super-high-speed reading of the disk structure.
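A minimal sketch of just that first step, under the stated assumptions (SYSTEM-level rights, and 1024 bytes as the common record size); decoding the records themselves requires the NTFS on-disk format documentation:

RECORD_SIZE = 1024  # the commonly used MFT record size; verify for your volume

with open("C:/$MFT", "rb") as fp:  # raises a permission error without SYSTEM rights
    record = fp.read(RECORD_SIZE)
    print(record[:4])  # each in-use record starts with the b"FILE" magic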