Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18394147/
Recursive sub folder search and return files in a list python
Asked by user2709514
I am working on a script to recursively go through the subfolders in a main folder and build a list of files of a certain file type. I am having an issue with the script. It's currently set up as follows:
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt"):
            fileNamePath = str(os.path.join(root, subFolder, item))
The problem is that the subFolder variable is pulling in a list of subfolders rather than the folder where the ITEM file is located. I was thinking of running a for loop over the subfolders first and joining the first part of the path, but I figured I'd double check to see if anyone has any suggestions before doing that. Thanks for your help!
Answered by John La Rooy
You should be using the dirpath, which you call root. The dirnames are supplied so you can prune it if there are folders that you don't wish os.walk to recurse into.
import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
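To make the pruning point above concrete, here is a small sketch (the skipped folder name ".git" and the helper name find_txt_skipping are illustrative, not from the answer): slice-assigning into dirnames edits the very list os.walk is iterating, so the skipped folders are never descended into.

```python
import os
import tempfile

def find_txt_skipping(path, skip=(".git",)):
    """Collect .txt files under path, pruning folders named in skip."""
    result = []
    for dirpath, dirnames, filenames in os.walk(path):
        # Slice assignment mutates the list in place, which is what
        # tells os.walk not to recurse into those folders.
        dirnames[:] = [d for d in dirnames if d not in skip]
        result.extend(os.path.join(dirpath, f)
                      for f in filenames if f.endswith(".txt"))
    return result

# Tiny demo tree: one .txt at the top level, one inside a ".git"
# folder that the walk should skip.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, ".git"))
open(os.path.join(root, "notes.txt"), "w").close()
open(os.path.join(root, ".git", "ignored.txt"), "w").close()

found = find_txt_skipping(root)  # only notes.txt is picked up
```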
Edit:
After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.
import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
Also a generator version:
from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))
Edit2 for Python 3.4+
from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
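The character classes in the pattern above make the extension match case-insensitively. A quick sketch of what that buys you (the file names here are illustrative):

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with mixed-case extensions.
root = Path(tempfile.mkdtemp())
(root / "a.txt").touch()
(root / "b.TXT").touch()
(root / "c.md").touch()

# The [tT][xX][tT] classes match each extension letter in either case,
# so both .txt and .TXT files are found (but not .md).
matched = sorted(p.name for p in root.rglob("*.[tT][xX][tT]"))
```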
Answered by Rotareti
Changed in Python 3.5: Support for recursive globs using “**”.
glob.glob() got a new recursive parameter.
If you want to get every .txt file under my_path (recursively including subdirs):
import glob
files = glob.glob(my_path + '/**/*.txt', recursive=True)
# my_path/ the dir
# **/ every file and dir under my_path
# *.txt every file that ends with '.txt'
If you need an iterator you can use iglob as an alternative:
for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    # ...
Answered by dermen
It's not the most Pythonic answer, but I'll put it here for fun because it's a neat lesson in recursion.
import os

def find_files(files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [os.path.join(d, f) for f in os.listdir(d)]
        except OSError:
            if os.path.splitext(d)[1] in extensions:
                files.append(d)
    if new_dirs:
        find_files(files, new_dirs, extensions)
    else:
        return
On my machine I have two folders, root and root2:
mender@multivax ]ls -R root root2
root:
temp1 temp2
root/temp1:
temp1.1 temp1.2
root/temp1/temp1.1:
f1.mid
root/temp1/temp1.2:
f.mi f.mid
root/temp2:
tmp.mid
root2:
dummie.txt temp3
root2/temp3:
song.mid
Let's say I want to find all .txt and all .mid files in either of these directories; then I can just do
files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)
#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']
Answered by Jefferson Lima
I will translate John La Rooy's list comprehension to nested for loops, just in case anyone else has trouble understanding it.
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
Should be equivalent to:
import os
import glob

result = []
for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)
Here's the documentation for list comprehensions and the functions os.walk and glob.glob.
Answered by Emre
The new pathlib library simplifies this to one line:
from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))
You can also use the generator version:
from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass
This returns Path objects, which you can use for pretty much anything, or get the file name as a string with file.name.
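To illustrate that last point, here is a short sketch of the most common Path attributes (the path itself is made up for the example):

```python
from pathlib import Path

# An illustrative path; the same attributes work on every Path
# object returned by glob('**/*.txt').
p = Path("docs") / "notes" / "todo.txt"

name = p.name        # file name with extension: 'todo.txt'
stem = p.stem        # file name without extension: 'todo'
suffix = p.suffix    # extension with the dot: '.txt'
parent = p.parent    # containing directory as a Path
text = str(p)        # plain string form of the whole path
```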
Answered by prosti
The recursive parameter is new in Python 3.5, so it won't work on Python 2.7. Here is an example that uses raw (r) strings, so you just need to provide the path as-is on either Windows or Linux:
import glob

mypath = r"C:\Users\dj\Desktop\nba"
files = glob.glob(mypath + r'\**\*.py', recursive=True)

# print(files)  # as list
for f in files:
    print(f)  # nice looking single line per file
Note: it will list all files, no matter how deep they are nested.
Answered by Yossarian42
This function will recursively put only files into a list. Hope this helps you.
import os

def ls_files(dir):
    files = list()
    for item in os.listdir(dir):
        abspath = os.path.join(dir, item)
        try:
            if os.path.isdir(abspath):
                files = files + ls_files(abspath)
            else:
                files.append(abspath)
        except FileNotFoundError as err:
            print('invalid directory\n', 'Error: ', err)
    return files
Answered by WilliamCanin
You can do it this way to get a list of files with their absolute paths.
import os

def list_files_recursive(path):
    """
    Receives a directory path as a parameter.
    :return: a list of files with their absolute paths
    """
    files = []
    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))
    return files

if __name__ == '__main__':
    result = list_files_recursive('/tmp')
    print(result)
Answered by Minh Nguyen
If you don't mind installing an additional lightweight library, you can do this:
pip install plazy
Usage:
import plazy
txt_filter = lambda x : True if x.endswith('.txt') else False
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)
The result should look something like this:
['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']
It works on both Python 2.7 and Python 3.
Github: https://github.com/kyzas/plazy#list-files
Disclaimer: I'm an author of plazy.
Answered by user136036
This seems to be the fastest solution I could come up with; it is faster than os.walk and a lot faster than any glob solution.
- It will also give you a list of all nested subfolders at basically no cost.
- You can search for several different extensions.
- You can also choose to return either full paths or just the names of the files by changing f.path to f.name (do not change it for subfolders!).
Args: dir: str, ext: list. The function returns two lists: subfolders, files.
See below for a detailed speed analysis.
import os

def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)

    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])
Speed analysis
A comparison of various methods for getting all files with a specific file extension inside all subfolders and the main folder.
tl;dr:
- fast_scandir clearly wins and is twice as fast as all other solutions, except os.walk.
- os.walk is second place, slightly slower.
- Using glob will greatly slow down the process.
- None of the results use natural sorting. This means results will be sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at https://stackoverflow.com/a/48030307/2441026
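As an aside on the natural-sorting caveat above, the usual fix (sketched here from memory; see the linked answer for the original) is a sort key that splits each name into digit and non-digit runs and compares the digit runs numerically:

```python
import re

def natural_key(s):
    # Split on digit runs; the capturing group in '(\d+)' keeps the
    # digits in the result, and converting them to int makes 2 sort
    # before 10.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", s)]

names = ["file10.jpg", "file2.jpg", "file1.jpg"]
ordered = sorted(names, key=natural_key)
```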
Results:
fast_scandir took 499 ms. Found files: 16596. Found subfolders: 439
os.walk took 589 ms. Found files: 16596
find_files took 919 ms. Found files: 16596
glob.iglob took 998 ms. Found files: 16596
glob.glob took 1002 ms. Found files: 16596
pathlib.rglob took 1041 ms. Found files: 16596
os.walk-glob took 1043 ms. Found files: 16596
Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir was written by myself and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).
# -*- coding: utf-8 -*-
# Python 3

import time
import os
from glob import glob, iglob
from pathlib import Path

directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
              os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [os.path.join(d, f) for f in os.listdir(d)]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)
    if new_dirs:
        find_files(files, new_dirs, extensions)
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)

    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()

    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")

    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")