Python Boto3 S3: Get files without getting folders
Disclaimer: the content below is taken from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not this page): StackOverflow.
Original question: http://stackoverflow.com/questions/42673764/
Boto3 S3: Get files without getting folders
Asked by Vingtoft
Using boto3, how can I retrieve all files in my S3 bucket without retrieving the folders?
Consider the following file structure:
file_1.txt
folder_1/
    file_2.txt
    file_3.txt
    folder_2/
        folder_3/
            file_4.txt
In this example I'm only interested in the 4 files.
EDIT:
A manual solution is:
def count_files_in_folder(prefix):
    total = 0
    keys = s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)
    for key in keys['Contents']:
        if key['Key'][-1:] != '/':
            total += 1
    return total
In this case total would be 4.
If I just did
count = len(s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)['Contents'])
the result would be 7 objects (4 files and 3 folders):
file.txt
folder_1/
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/
folder_1/folder_2/folder_3/
folder_1/folder_2/folder_3/file_4.txt
I JUST want:
file.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
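The filtering in the manual solution can be exercised without any AWS call by running the same check over a stubbed list_objects-style response (the sample Contents list below is an assumption for illustration only):

```python
def count_files(contents):
    """Count keys that do not end in '/', i.e. real files rather than folder markers."""
    return sum(1 for item in contents if not item['Key'].endswith('/'))

# Stubbed response, mimicking s3_client.list_objects(...)['Contents']
sample_contents = [
    {'Key': 'file.txt'},
    {'Key': 'folder_1/'},
    {'Key': 'folder_1/file_2.txt'},
    {'Key': 'folder_1/file_3.txt'},
    {'Key': 'folder_1/folder_2/'},
    {'Key': 'folder_1/folder_2/folder_3/'},
    {'Key': 'folder_1/folder_2/folder_3/file_4.txt'},
]

print(count_files(sample_contents))  # 4 files among 7 objects
```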
Answered by mootmoot
S3 is an OBJECT STORE. It does NOT store files/objects in a directory tree. Newcomers are regularly confused by the "folder" option the tools offer, which is in fact just an arbitrary prefix on the object key.
An object PREFIX is a way to retrieve objects organised by a predefined, fixed key (file-name) prefix structure.
You can imagine a file system that doesn't let you create directories, but does let you create file names containing a slash "/" or backslash "\" as a delimiter; you can then denote the "level" of a file by a common prefix.
Thus in S3, you can use any of the following to "simulate a directory" that is not actually a directory:
folder1-folder2-folder3-myobject
folder1/folder2/folder3/myobject
folder1\folder2\folder3\myobject
As you can see, an object name can be stored in S3 regardless of which arbitrary folder separator (delimiter) you use.
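To illustrate that "folders" are nothing more than common key prefixes, here is a small sketch that groups a list of object keys by their first path segment; the key list is an assumption matching the question's example:

```python
from collections import defaultdict

keys = [
    'file_1.txt',
    'folder_1/file_2.txt',
    'folder_1/file_3.txt',
    'folder_1/folder_2/folder_3/file_4.txt',
]

# Group keys by everything before the first delimiter; keys with no
# delimiter fall under the bucket "root" (empty prefix).
groups = defaultdict(list)
for key in keys:
    prefix, _, rest = key.partition('/')
    groups[prefix if rest else ''].append(key)

print(dict(groups))
```

No directory objects are involved anywhere; the grouping is purely string manipulation on the keys.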
However, to help users transfer files to S3 in bulk, tools such as the AWS CLI and the s3transfer API simplify this step and create object names that follow your local folder structure.
So if you are sure that all your S3 objects use / or \ as the separator, you can use tools like s3transfer or the AWS CLI to make a simple download by key name.
Here is quick-and-dirty code using the resource iterator. Bucket.objects.filter returns an iterator that doesn't have the same 1000-keys-per-call limit as list_objects()/list_objects_v2().
import os
import boto3

s3 = boto3.resource('s3')
mybucket = s3.Bucket("mybucket")
# if a blank prefix is given, everything is returned
bucket_prefix = "/some/prefix/here"
objs = mybucket.objects.filter(Prefix=bucket_prefix)
for obj in objs:
    path, filename = os.path.split(obj.key)
    # download_file throws an exception if the local folder doesn't exist,
    # so create it first (skip when the key has no folder part)
    if path:
        os.makedirs(path, exist_ok=True)
    mybucket.download_file(obj.key, obj.key)
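The local directory-creation step in that snippet can be checked without boto3 or AWS credentials; this sketch reproduces just the os.path.split/os.makedirs logic for one sample key (the key itself is an assumption):

```python
import os
import tempfile

key = 'folder_1/folder_2/folder_3/file_4.txt'  # sample key, an assumption

with tempfile.TemporaryDirectory() as root:
    local_path = os.path.join(root, key)
    path, filename = os.path.split(local_path)
    # exist_ok=True avoids the try/except FileExistsError dance entirely
    os.makedirs(path, exist_ok=True)
    # a real script would call mybucket.download_file(key, local_path) here
    created = os.path.isdir(path)

print(created, filename)
```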
Answered by garnaat
There are no folders in S3. What you have is four files named:
file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
Those are the actual names of the objects in S3. If what you want is to end up with:
file_1.txt
file_2.txt
file_3.txt
file_4.txt
all sitting in the same directory on a local file system, you would need to manipulate the object name to strip it down to just the file name. Something like this would work:
import os.path
full_name = 'folder_1/folder_2/folder_3/file_4.txt'
file_name = os.path.basename(full_name)
The variable file_name would then contain 'file_4.txt'.
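Applying the same idea to every key in the example gives the flat file list; a worthwhile extra check (since two different keys can share a basename) is whether flattening would cause collisions. The key list below is an assumption matching the question:

```python
import os.path

keys = [
    'file_1.txt',
    'folder_1/file_2.txt',
    'folder_1/file_3.txt',
    'folder_1/folder_2/folder_3/file_4.txt',
]

flat_names = [os.path.basename(k) for k in keys]
# Two distinct keys may map to the same basename; check before writing
# everything into a single local directory.
has_collisions = len(flat_names) != len(set(flat_names))
print(flat_names, has_collisions)
```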
Answered by btomtom5
One way to filter out folders is to check the last character of the object key, provided you are certain that no file names end in a forward slash:
for object_summary in objects.all():
    if object_summary.key[-1] == "/":
        continue
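The same check can be run against plain strings instead of ObjectSummary instances (the sample keys are assumptions), using the slightly more readable str.endswith:

```python
keys = [
    'file.txt',
    'folder_1/',
    'folder_1/file_2.txt',
    'folder_1/folder_2/',
]

# str.endswith('/') is equivalent to key[-1] == "/" for non-empty keys
files = [k for k in keys if not k.endswith('/')]
print(files)
```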
Answered by airborne
As stated in the other answers, S3 does not actually have directory trees. But there is a convenient workaround that uses paginators and takes advantage of the fact that S3 "folders" have zero size. This code snippet will print the desired output if all the files in your bucket have size > 0 (you will of course need to adapt the region):
bucket_name = "bucketname"
s3 = boto3.client('s3', region_name='eu-central-1')
paginator = s3.get_paginator('list_objects')
for obj in paginator.paginate(Bucket=bucket_name).search("Contents[?Size > `0`][]"):
    print(obj['Key'])
The filtering is done using JMESPath.
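What the JMESPath expression selects can be mimicked in plain Python over stubbed pages, which makes the behaviour easy to verify offline; the two sample pages below are assumptions shaped like the paginator's output:

```python
# Two stubbed pages, as the list_objects paginator would yield them
pages = [
    {'Contents': [
        {'Key': 'file_1.txt', 'Size': 11},
        {'Key': 'folder_1/', 'Size': 0},
    ]},
    {'Contents': [
        {'Key': 'folder_1/file_2.txt', 'Size': 22},
        {'Key': 'folder_1/folder_2/', 'Size': 0},
    ]},
]

# Plain-Python equivalent of .search("Contents[?Size > `0`][]"):
# flatten all pages and keep only objects with a non-zero size.
file_keys = [obj['Key']
             for page in pages
             for obj in page['Contents']
             if obj['Size'] > 0]
print(file_keys)
```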
Note: Of course this would also exclude files with size 0, but usually you don't need storage for empty files.