
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/18394147/

Date: 2020-08-19 10:38:05 · Source: igfitidea

Recursive sub folder search and return files in a list python

Tags: python, list, recursion, os.walk

Asked by user2709514

I am working on a script to recursively go through subfolders in a main folder and build a list of files of a certain type. I am having an issue with the script. It's currently set up as follows:


for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

The problem is that the subFolder variable is pulling in a list of subfolders rather than the folder that the ITEM file is located in. I was thinking of running a for loop over the subfolders beforehand and joining the first part of the path, but I figured I'd double-check to see if anyone has any suggestions first. Thanks for your help!


Answered by John La Rooy

You should be using the dirpath, which you call root. The dirnames are supplied so you can prune them if there are folders that you don't wish os.walk to recurse into.


import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
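As a sketch of the pruning the answer mentions, dirnames can be edited in place to stop os.walk from descending into unwanted folders (the SKIP set here is a hypothetical example):

```python
import os

SKIP = {".git", "__pycache__"}  # hypothetical directory names to exclude

def walk_txt(path, skip=SKIP):
    """Collect .txt files recursively, pruning unwanted directories."""
    result = []
    for dirpath, dirnames, filenames in os.walk(path):
        # Assigning to dirnames[:] mutates the list os.walk iterates over,
        # so pruned directories are never visited
        dirnames[:] = [d for d in dirnames if d not in skip]
        result.extend(os.path.join(dirpath, f)
                      for f in filenames if f.endswith(".txt"))
    return result
```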

Edit:


After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.


import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Also, a generator version:


from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))

Edit2 for Python 3.4+


from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
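A quick check (using a hypothetical temp directory) that the [tT][xX][tT] character classes make the match case-insensitive:

```python
from pathlib import Path
import tempfile

d = tempfile.mkdtemp()
Path(d, "a.txt").touch()
Path(d, "b.TXT").touch()

# The character classes match any casing of the extension
found = sorted(p.name for p in Path(d).rglob("*.[tT][xX][tT]"))
print(found)  # ['a.txt', 'b.TXT']
```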

Answered by Rotareti

Changed in Python 3.5: Support for recursive globs using “**”.


glob.glob() got a new recursive parameter.


If you want to get every .txt file under my_path (recursively including subdirs):


import glob

files = glob.glob(my_path + '/**/*.txt', recursive=True)

# my_path/     the dir
# **/       every file and dir under my_path
# *.txt     every file that ends with '.txt'

If you need an iterator, you can use iglob as an alternative:


for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    # process each file lazily, without building the full list
    ...

Answered by dermen

It's not the most Pythonic answer, but I'll put it here for fun because it's a neat lesson in recursion.


import os

def find_files(files, dirs=[], extensions=[]):
    # Results accumulate in the `files` list passed in by the caller
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [os.path.join(d, f) for f in os.listdir(d)]
        except OSError:
            # os.listdir failed, so d is a file; keep it if the extension matches
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions)
    else:
        return

On my machine I have two folders, root and root2


mender@multivax ]ls -R root root2
root:
temp1 temp2

root/temp1:
temp1.1 temp1.2

root/temp1/temp1.1:
f1.mid

root/temp1/temp1.2:
f.mi  f.mid

root/temp2:
tmp.mid

root2:
dummie.txt temp3

root2/temp3:
song.mid

Let's say I want to find all .txt and all .mid files in either of these directories; then I can just do


files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)

#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']

Answered by Jefferson Lima

I will translate John La Rooy's list comprehension to nested for loops, just in case anyone else has trouble understanding it.


result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Should be equivalent to:


import os
import glob

result = []

for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)

Here's the documentation for list comprehensions and the functions os.walk and glob.glob.


Answered by Emre

The new pathlib library simplifies this to one line:


from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))

You can also use the generator version:


from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass

This returns Path objects, which you can use for pretty much anything, or get the file name as a string via file.name.
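A few of the Path attributes in play (a small sketch with a hypothetical path):

```python
from pathlib import Path

p = Path("docs") / "notes" / "readme.txt"
print(p.name)    # 'readme.txt'
print(p.stem)    # 'readme'
print(p.suffix)  # '.txt'
print(str(p))    # plain string form, for APIs that expect one
```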


Answered by prosti

The recursive parameter is new in Python 3.5, so it won't work on Python 2.7. Here is an example that uses raw (r) strings, so you just need to provide the path as-is on either Windows or Linux:


import glob

mypath=r"C:\Users\dj\Desktop\nba"

files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
    print(f) # nice looking single line per file

Note: it will list all matching files, no matter how deeply nested they are.


Answered by Yossarian42

This function will recursively put only files into a list. Hope this helps.


import os


def ls_files(dir):
    files = list()
    for item in os.listdir(dir):
        abspath = os.path.join(dir, item)
        try:
            if os.path.isdir(abspath):
                files = files + ls_files(abspath)
            else:
                files.append(abspath)
        except FileNotFoundError as err:
            print('invalid directory\n', 'Error: ', err)
    return files

Answered by WilliamCanin

You can do it this way to get back a list of absolute-path files.


def list_files_recursive(path):
    """
    Function that receives as a parameter a directory path
    :return list_: File List and Its Absolute Paths
    """

    import os

    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))

    return files


if __name__ == '__main__':

    result = list_files_recursive('/tmp')
    print(result)

Answered by Minh Nguyen

If you don't mind installing a small additional library, you can do this:


pip install plazy

Usage:


import plazy

txt_filter = lambda x: x.endswith('.txt')
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)

The result should look something like this:


['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']

It works on both Python 2.7 and Python 3.


Github: https://github.com/kyzas/plazy#list-files


Disclaimer: I'm an author of plazy.


Answered by user136036

This seems to be the fastest solution I could come up with; it is faster than os.walk and a lot faster than any glob solution.


  • It will also give you a list of all nested subfolders at basically no cost.
  • You can search for several different extensions.
  • You can also choose to return either full paths or just the names of the files by changing f.path to f.name (do not change it for subfolders!).

Args: dir: str, ext: list.
The function returns two lists: subfolders, files.


See below for a detailed speed analysis.


import os

def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])



Speed analysis


for various methods to get all files with a specific file extension inside all subfolders and the main folder.


tl;dr:
- fast_scandir clearly wins and is twice as fast as all other solutions, except os.walk.
- os.walk is in second place, slightly slower.
- Using glob will greatly slow down the process.
- None of the results use natural sorting. This means results will be sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at https://stackoverflow.com/a/48030307/2441026
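As a sketch of the natural sorting mentioned above (splitting names into digit and non-digit runs so numbers compare numerically):

```python
import re

def natural_key(s):
    # Digit runs become ints, so '10' sorts after '2' instead of before it
    return [int(t) if t.isdigit() else t.lower()
            for t in re.split(r'(\d+)', s)]

files = ["f10.txt", "f2.txt", "f1.txt"]
print(sorted(files, key=natural_key))  # ['f1.txt', 'f2.txt', 'f10.txt']
```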



Results:



fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir was written by me and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).




# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")