How to read a csv file from an s3 bucket using Pandas in Python

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). StackOverflow source: http://stackoverflow.com/questions/30818341/

Date: 2020-08-19 09:01:35  Source: igfitidea


python, amazon-web-services, pandas, amazon-s3

Asked by Paul_M

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:


import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

In order to give complete access I have set the bucket policy on the S3 bucket as follows:


{
    "Version": "2012-10-17",
    "Id": "statement1",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example_bucket"
        }
    ]
}
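One thing worth noting (an observation about S3 policies generally, not part of the original question): a `Resource` of `arn:aws:s3:::example_bucket` only covers bucket-level actions; object-level actions such as `s3:GetObject` also need the `/*` form of the ARN, along these lines:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowObjectReads",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example_bucket/*"
        }
    ]
}
```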

Unfortunately I still get the following error in python:


boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

Could someone help explain how to correctly set the permissions in AWS S3, or how to configure pandas to import the file? Thanks!


Answered by Paul_M

I eventually realised that you also need to set the permissions on each individual object within the bucket in order to extract it by using the following code:


from boto.s3.key import Key

# 'bucket' is a boto Bucket object, e.g. from conn.get_bucket('example_bucket')
k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')

And I also had to modify the address of the bucket in the pd.read_csv command as follows:


data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')
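As a side note (a sketch, not part of the original answer): the path-style public URL follows a predictable pattern, so it can be built from the region, bucket, and key. The names below are the ones used in this question:

```python
def public_s3_url(region, bucket, key):
    # path-style public HTTPS URL for an S3 object (legacy s3-<region> form)
    return 'https://s3-{}.amazonaws.com/{}/{}'.format(region, bucket, key)

url = public_s3_url('ap-southeast-2', 'example_bucket', 'data_1.csv')
# once the object is public-read, pd.read_csv(url) can fetch it directly
```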

Answered by BigDataSaurius

You don't need pandas for this; you can just use Python's built-in csv library:


import csv

import boto.s3
from boto.s3.key import Key

def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # reads a csv from AWS S3 and returns its rows as a list of lists

    # first you establish a connection with your credentials and region id
    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)

    # next you obtain the key of the csv you want to read
    # you will need the bucket name and the csv file name
    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    data = key.get_contents_as_string()
    key.close()

    # the contents come back as a single string, so you will need to split it
    # usually the split characters are '\r\n'; if not, just read the file normally
    # and find out what they are

    reader = csv.reader(data.split('\r\n'))
    data = []
    header = next(reader)
    for row in reader:
        data.append(row)

    return data
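The string-splitting step at the end can be exercised locally without touching S3; a minimal sketch with inline data standing in for `get_contents_as_string()`:

```python
import csv

raw = 'name,age\r\nalice,30\r\nbob,25'
reader = csv.reader(raw.split('\r\n'))
header = next(reader)           # first row is the header
rows = [row for row in reader]  # remaining rows as lists of strings
```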

hope it solved your problem, good luck! :)


Answered by jpobst

Using pandas 0.20.3:


import os
import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

# get your credentials from environment variables
aws_id = os.environ['AWS_ID']
aws_secret = os.environ['AWS_SECRET']

client = boto3.client('s3', aws_access_key_id=aws_id,
        aws_secret_access_key=aws_secret)

bucket_name = 'my_bucket'

object_key = 'my_file.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
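The decode-then-StringIO step can also be tested locally; a small sketch with inline bytes standing in for the S3 body:

```python
import pandas as pd
from io import StringIO

body_bytes = b'a,b\n1,2\n3,4\n'  # stand-in for body.read()
df = pd.read_csv(StringIO(body_bytes.decode('utf-8')))
```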

Answered by kepler

Based on this answer that suggested using smart_open for reading from S3, this is how I used it with Pandas:


import os
import pandas as pd
from smart_open import smart_open

aws_key = os.environ['AWS_ACCESS_KEY']
aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

df = pd.read_csv(smart_open(path))
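One caveat (an assumption worth verifying against the smart_open docs for your version): AWS secret keys can contain `/` or `+`, which break the `s3://key:secret@bucket/key` URL, so it is safer to URL-encode the credentials before embedding them. A sketch with hypothetical placeholder credentials:

```python
from urllib.parse import quote_plus

aws_key = 'AKIAEXAMPLE'     # hypothetical placeholder credentials
aws_secret = 'abc/def+ghi'  # secrets may contain '/' or '+'

path = 's3://{}:{}@{}/{}'.format(quote_plus(aws_key), quote_plus(aws_secret),
                                 'my_bucket', 'my_file.csv')
# the credential part of the path no longer contains raw '/' characters
```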