How to read a csv file from an s3 bucket using Pandas in Python

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). StackOverflow source: http://stackoverflow.com/questions/30818341/

Date: 2020-08-19 09:01:35  Source: igfitidea


python, amazon-web-services, pandas, amazon-s3

Asked by Paul_M

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:


import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

In order to give complete access I have set the bucket policy on the S3 bucket as follows:


{
    "Version": "2012-10-17",
    "Id": "statement1",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example_bucket"
        }
    ]
}
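One thing worth noting (an observation about S3 policies generally, not part of the original question): a `Resource` of `arn:aws:s3:::example_bucket` only covers bucket-level actions; object-level actions such as `s3:GetObject` also need the `/*` form of the ARN, along these lines:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowObjectReads",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example_bucket/*"
        }
    ]
}
```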

Unfortunately I still get the following error in python:


boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

Could someone help explain how to correctly set the permissions in AWS S3, or how to configure pandas to import the file? Thanks!


Answered by Paul_M

I eventually realised that you also need to set the permissions on each individual object within the bucket in order to extract it by using the following code:


from boto.s3.key import Key

# 'bucket' is a boto Bucket object, e.g. from conn.get_bucket('example_bucket')
k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')

And I also had to modify the address of the bucket in the pd.read_csv command as follows:


data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')
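As a side note (a sketch, not part of the original answer): the path-style public URL follows a predictable pattern, so it can be built from the region, bucket, and key. The names below are the ones used in this question:

```python
def public_s3_url(region, bucket, key):
    # path-style public HTTPS URL for an S3 object (legacy s3-<region> form)
    return 'https://s3-{}.amazonaws.com/{}/{}'.format(region, bucket, key)

url = public_s3_url('ap-southeast-2', 'example_bucket', 'data_1.csv')
# once the object is public-read, pd.read_csv(url) can fetch it directly
```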

Answered by BigDataSaurius

You don't need pandas for this; you can just use Python's built-in csv library:


import csv

import boto.s3
from boto.s3.key import Key

def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # reads a csv from AWS S3 and returns its rows as a list of lists

    # first you establish a connection with your credentials and region id
    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)

    # next you obtain the key of the csv you want to read
    # you will need the bucket name and the csv file name
    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    data = key.get_contents_as_string()
    key.close()

    # the contents come back as a single string, so you will need to split it
    # usually the split characters are '\r\n'; if not, just read the file normally
    # and find out what they are

    reader = csv.reader(data.split('\r\n'))
    data = []
    header = next(reader)
    for row in reader:
        data.append(row)

    return data
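The string-splitting step at the end can be exercised locally without touching S3; a minimal sketch with inline data standing in for `get_contents_as_string()`:

```python
import csv

raw = 'name,age\r\nalice,30\r\nbob,25'
reader = csv.reader(raw.split('\r\n'))
header = next(reader)           # first row is the header
rows = [row for row in reader]  # remaining rows as lists of strings
```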

hope it solved your problem, good luck! :)


Answered by jpobst

Using pandas 0.20.3:


import os
import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

# get your credentials from environment variables
aws_id = os.environ['AWS_ID']
aws_secret = os.environ['AWS_SECRET']

client = boto3.client('s3', aws_access_key_id=aws_id,
        aws_secret_access_key=aws_secret)

bucket_name = 'my_bucket'

object_key = 'my_file.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
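The decode-then-StringIO step can also be tested locally; a small sketch with inline bytes standing in for the S3 body:

```python
import pandas as pd
from io import StringIO

body_bytes = b'a,b\n1,2\n3,4\n'  # stand-in for body.read()
df = pd.read_csv(StringIO(body_bytes.decode('utf-8')))
```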

Answered by kepler

Based on this answer that suggested using smart_open for reading from S3, this is how I used it with Pandas:


import os
import pandas as pd
from smart_open import smart_open

aws_key = os.environ['AWS_ACCESS_KEY']
aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

df = pd.read_csv(smart_open(path))
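One caveat (an assumption worth verifying against the smart_open docs for your version): AWS secret keys can contain `/` or `+`, which break the `s3://key:secret@bucket/key` URL, so it is safer to URL-encode the credentials before embedding them. A sketch with hypothetical placeholder credentials:

```python
from urllib.parse import quote_plus

aws_key = 'AKIAEXAMPLE'     # hypothetical placeholder credentials
aws_secret = 'abc/def+ghi'  # secrets may contain '/' or '+'

path = 's3://{}:{}@{}/{}'.format(quote_plus(aws_key), quote_plus(aws_secret),
                                 'my_bucket', 'my_file.csv')
# the credential part of the path no longer contains raw '/' characters
```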