Python Boto3: Download All Files from an S3 Bucket

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31918960/

Published: 2020-08-19 10:44:54 | Source: igfitidea

Boto3 to download all files from a S3 Bucket

python, amazon-web-services, amazon-s3, boto3

Asked by Shan

I'm using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.


My current code is


#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If a folder is present inside the bucket, it throws an error:


Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this a proper way to download a complete S3 bucket using boto3? How can I download folders?


Accepted answer by Grant Langseth

When working with buckets that have 1000+ objects, it's necessary to implement a solution that uses the NextContinuationToken on sequential sets of, at most, 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.


import boto3
import os

s3_client = boto3.client('s3')

def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents', [])
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)
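
A hypothetical call, with placeholder prefix, local directory, and bucket name:

download_dir('my_folder/', '/tmp/data', 'my_bucket_name')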

Answered by John Rotenstein

Amazon S3 does not have folders/directories. It is a flat file structure.


To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:


  • images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.


I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.


You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that it could be multi-level nested directories.

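For illustration, here is a minimal sketch of the second approach (create any missing local directories before saving each object); the bucket name is a placeholder and only the first 1000 keys are listed for brevity:

import os
import boto3

s3 = boto3.client('s3')
bucket = 'my_bucket_name'  # placeholder bucket name

for obj in s3.list_objects_v2(Bucket=bucket).get('Contents', []):
    key = obj['Key']
    if key.endswith('/'):              # skip zero-byte "folder" placeholder keys
        continue
    local_dir = os.path.dirname(key)
    if local_dir and not os.path.exists(local_dir):
        os.makedirs(local_dir)         # create multi-level nested directories
    s3.download_file(bucket, key, key)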

An easier way would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, e.g.:


aws s3 cp --recursive s3://my_bucket_name local_folder

There's also a sync option that will only copy new and modified files.

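For example, with the same bucket name as above:

aws s3 sync s3://my_bucket_name local_folder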

Answered by Shan

I'm currently achieving the task by using the following:


#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='bucket')['Contents']
for s3_key in list:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        import os
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

Although it does the job, I'm not sure it's a good way to do it. I'm leaving it here to help other users and to prompt further answers with a better way of achieving this.


Answered by glefait

I had the same need and created the following function, which downloads the files recursively.


The directories are created locally only if they contain files.


import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

The function is called this way:


def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')

Answered by Ganatra

It is a very bad idea to get all files in one go; you should rather fetch them in batches.


One implementation which I use to fetch a particular folder (directory) from S3 is:


from boto3.session import Session

def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session (aws_access_key_id, aws_secret_access_key, region_name
    # and bucket_name are assumed to be defined elsewhere)
    session = Session(aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key,
                      region_name=region_name)

    # get instances for client, resource and bucket
    client = session.client('s3')
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)

    for s3_key in client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
        s3_object = s3_key['Key']
        if s3_object not in exclude_file_names:
            bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))
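
A hypothetical call, assuming the credential variables and bucket_name are defined at module level:

get_directory('my_folder/', '/tmp/', ['my_folder/skip_this.txt'])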

And if you still want to get the whole bucket, use the CLI as @John Rotenstein mentioned, as below:


aws s3 cp --recursive s3://bucket_name download_path

Answered by Tushar Niras

import os
import boto3

#initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download file into current directory
for s3_object in my_bucket.objects.all():
    # Split s3_object.key into path and file name, otherwise it will give a "file not found" error.
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)

Answered by ifoukarakis

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you might end up hitting Python's recursion limits. Here's an alternate approach, with a couple of extra checks.


import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Downloads recursively the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result['Contents']:
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')

Answered by mattalxndr

I have a workaround for this that runs the AWS CLI in the same process.


Install awscli as a Python lib:


pip install awscli

Then define this function:


import os
from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:

        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process (CLIDriver.main expects a single list of arguments)
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To execute:


aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')

Answered by Rajesh Rajendran

import os
import boto3

# bucket name assumed from the question
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')

for objs in my_bucket.objects.all():
    print(objs.key)
    # S3 keys always use '/', which equals os.sep on Linux
    path = '/tmp/' + os.sep.join(objs.key.split(os.sep)[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/' + objs.key)
    except FileExistsError as fe:
        print(objs.key + ' exists')

This code will download the content into the /tmp/ directory. If you want, you can change the directory.


Answered by HazimoRa3d

If you want to call a bash script using Python, here is a simple method to load a file from a folder in an S3 bucket into a local folder (on a Linux machine):


import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
for object_summary in my_bucket.objects.filter(Prefix="{}/".format(bucket_folder_name)):
#     print(object_summary.key)
    FILES_NAMES.append(object_summary.key)

# 2. List only new files that do not exist in the local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES )-set(os.listdir(local_folder_path)))

# 3. Time to load files into your destination folder
for new_filename in new_filenames:
    upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name,bucket_folder_name,new_filename ,local_folder_path)

    subprocess_call = subprocess.call([upload_S3files_CMD], shell=True)
    if subprocess_call != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")