Reading a file from a private S3 bucket to a pandas dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35803601/

Tags: amazon-web-services, pandas

Asked by IgorK

I'm trying to read a CSV file from a private S3 bucket to a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv')

I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403: Forbidden error.

I have configured the AWS credentials using aws configure.

I can download a file from a private bucket using boto3, which uses the AWS credentials. It seems that I need to configure pandas to use AWS credentials, but I don't know how.

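For reference, here is the kind of boto3 download that works for me (a minimal sketch; the bucket and file names are the same placeholders as above):

import boto3

s3 = boto3.client('s3')  # picks up the credentials from aws configure
# downloading the object directly succeeds, so the credentials themselves are fine
s3.download_file('mybucket', 'file.csv', 'file.csv')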

Answered by TomAugspurger

Pandas uses boto (not boto3) inside read_csv. You might be able to install boto and have it work correctly.

There are some troubles with boto and Python 3.4.4 / 3.5.1. If you're on those platforms, and until those are fixed, you can use boto3 as follows:

import boto3
import pandas as pd

s3 = boto3.client('s3')  # uses the credentials configured via aws configure
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])  # obj['Body'] is a file-like streaming body

That obj has a .read method (which returns a stream of bytes), and that is enough for pandas.

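If pandas complains that the stream is not seekable (for example when it has to infer compression from the content), buffering the bytes first works as well. A minimal sketch, reusing the same placeholder bucket and key:

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
# read the whole object into memory and hand pandas a seekable buffer
df = pd.read_csv(io.BytesIO(obj['Body'].read()))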

Answered by spitfiredd

Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections; the pandas docs say:

pandas now uses s3fs for handling S3 connections. This shouldn't break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

import os

import pandas as pd
from s3fs.core import S3FileSystem

# aws keys stored in an ini file in the same path
# refer to the boto3 docs for config settings
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'  # S3 keys use forward slashes, not backslashes
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))

Answered by Isaac

Update for pandas 0.22 and up:

If you have already installed s3fs (pip install s3fs), then you can read the file directly from the S3 path, without any extra imports:

data = pd.read_csv('s3://bucket....csv')

See the stable pandas docs for details.

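In newer pandas versions (1.2 and up), credentials can also be passed explicitly through storage_options, which pandas forwards to s3fs; a sketch with placeholder credentials:

import pandas as pd

# key/secret follow the s3fs parameter names; replace with real values
df = pd.read_csv(
    's3://bucket/file.csv',
    storage_options={'key': 'YOUR_ACCESS_KEY', 'secret': 'YOUR_SECRET_KEY'},
)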

Answered by kepler

Based on this answer, I found smart_open to be much simpler to use:

import pandas as pd
from smart_open import smart_open

initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))
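
Note that newer smart_open releases (1.8.1 and up) expose open instead of the deprecated smart_open function; assuming such a version, the equivalent call would be:

import pandas as pd
from smart_open import open  # replaces smart_open() in newer releases

initial_df = pd.read_csv(open('s3://bucket/file.csv'))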

Answered by jpobst

Update for pandas 0.20.3 without using s3fs:

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
body = obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
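
The utf-8 above is an assumption about the file's encoding; if your CSV uses another codec, decode with that codec instead, for example:

csv_string = body.read().decode('latin-1')  # assuming a Latin-1 encoded file
df = pd.read_csv(StringIO(csv_string))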

Answered by Saeed Rahman

import pandas as pd
import boto3
from io import StringIO

# Read CSV (endpoint_url and the key pair are placeholders to fill in)
s3 = boto3.client('s3',
                  endpoint_url=endpoint_url,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)
read_file = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(read_file['Body'], sep=',')

# Write CSV
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3.put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())
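
If s3fs is installed, recent pandas can also round-trip directly with s3:// paths instead of building the client by hand; a sketch with placeholder paths:

# requires s3fs; pandas resolves the s3:// shorthand through it
df = pd.read_csv('s3://your-bucket/path/file.csv')
df.to_csv('s3://your-bucket/path/output.csv', index=False)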

Answered by fmguler

In addition to the other awesome answers, if a custom endpoint is required, it is possible to keep the pd.read_csv('s3://...') syntax by monkey patching the s3fs init method.

import s3fs

s3fsinit = s3fs.S3FileSystem.__init__
def s3fsinit_patched(self, *k, **kw):
    s3fsinit(self, *k, client_kwargs={'endpoint_url': 'https://yourcustomendpoint'}, **kw)
s3fs.S3FileSystem.__init__ = s3fsinit_patched

Or, a more elegant way:

import os

import s3fs

class S3FileSystemPatched(s3fs.S3FileSystem):
    def __init__(self, *k, **kw):
        super(S3FileSystemPatched, self).__init__(*k,
                                                  key=os.environ['aws_access_key_id'],
                                                  secret=os.environ['aws_secret_access_key'],
                                                  client_kwargs={'endpoint_url': 'https://yourcustomendpoint'},
                                                  **kw)
        print('S3FileSystem is patched')

s3fs.S3FileSystem = S3FileSystemPatched
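
Either way, after the patch runs (and before pandas first touches S3), the usual shorthand goes through the custom endpoint. A quick usage sketch with a placeholder path:

import pandas as pd

# pandas instantiates s3fs.S3FileSystem internally, so it now picks up the patch
df = pd.read_csv('s3://bucket/file.csv')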

Also see: s3fs custom endpoint url

Answered by MCMZL

Note that if your bucket is private AND on an aws-like provider, you will meet errors, as s3fs does not load the profile config file at ~/.aws/config like awscli does.

One solution is to define the following environment variables:

export AWS_S3_ENDPOINT="myEndpoint"
export AWS_DEFAULT_REGION="MyRegion"
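
Whether s3fs honours these variables can depend on its version, so an alternative (an assumption on my part, using the s3fs client_kwargs parameter) is to hand the endpoint to S3FileSystem explicitly:

import os

import pandas as pd
import s3fs

# pass the endpoint from the environment straight to s3fs
s3 = s3fs.S3FileSystem(
    anon=False,
    client_kwargs={'endpoint_url': os.environ['AWS_S3_ENDPOINT']},
)
df = pd.read_csv(s3.open('bucket/file.csv', mode='rb'))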