Reading a file from a private S3 bucket to a pandas dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35803601/

Tags: amazon-web-services, pandas

Asked by IgorK

I'm trying to read a CSV file from a private S3 bucket to a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv')

I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403: Forbidden error.

I have configured the AWS credentials using aws configure.

I can download a file from a private bucket using boto3, which uses the AWS credentials. It seems that I need to configure pandas to use AWS credentials, but I don't know how.

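For reference, here is the kind of boto3 download that works for me (a minimal sketch; the bucket and file names are the same placeholders as above):

import boto3

s3 = boto3.client('s3')  # picks up the credentials from aws configure
# downloading the object directly succeeds, so the credentials themselves are fine
s3.download_file('mybucket', 'file.csv', 'file.csv')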

Answered by TomAugspurger

Pandas uses boto (not boto3) inside read_csv. You might be able to install boto and have it work correctly.

There are some troubles with boto and Python 3.4.4 / 3.5.1. If you're on those platforms, and until those are fixed, you can use boto3 as follows:

import boto3
import pandas as pd

s3 = boto3.client('s3')  # uses the credentials configured via aws configure
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])  # obj['Body'] is a file-like streaming body

That obj has a .read method (which returns a stream of bytes), and that is enough for pandas.

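If pandas complains that the stream is not seekable (for example when it has to infer compression from the content), buffering the bytes first works as well. A minimal sketch, reusing the same placeholder bucket and key:

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
# read the whole object into memory and hand pandas a seekable buffer
df = pd.read_csv(io.BytesIO(obj['Body'].read()))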

Answered by spitfiredd

Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections; the pandas docs say:

pandas now uses s3fs for handling S3 connections. This shouldn't break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

import os

import pandas as pd
from s3fs.core import S3FileSystem

# aws keys stored in an ini file in the same path
# refer to the boto3 docs for config settings
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'  # S3 keys use forward slashes, not backslashes
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))

Answered by Isaac

Update for pandas 0.22 and up:

If you have already installed s3fs (pip install s3fs), then you can read the file directly from the S3 path, without any extra imports:

data = pd.read_csv('s3://bucket....csv')

See the stable pandas docs for details.

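In newer pandas versions (1.2 and up), credentials can also be passed explicitly through storage_options, which pandas forwards to s3fs; a sketch with placeholder credentials:

import pandas as pd

# key/secret follow the s3fs parameter names; replace with real values
df = pd.read_csv(
    's3://bucket/file.csv',
    storage_options={'key': 'YOUR_ACCESS_KEY', 'secret': 'YOUR_SECRET_KEY'},
)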

Answered by kepler

Based on this answer, I found smart_open to be much simpler to use:

import pandas as pd
from smart_open import smart_open

initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))
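
Note that newer smart_open releases (1.8.1 and up) expose open instead of the deprecated smart_open function; assuming such a version, the equivalent call would be:

import pandas as pd
from smart_open import open  # replaces smart_open() in newer releases

initial_df = pd.read_csv(open('s3://bucket/file.csv'))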

Answered by jpobst

Update for pandas 0.20.3 without using s3fs:

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
body = obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
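
The utf-8 above is an assumption about the file's encoding; if your CSV uses another codec, decode with that codec instead, for example:

csv_string = body.read().decode('latin-1')  # assuming a Latin-1 encoded file
df = pd.read_csv(StringIO(csv_string))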

Answered by Saeed Rahman

import pandas as pd
import boto3
from io import StringIO

# Read CSV (endpoint_url and the key pair are placeholders to fill in)
s3 = boto3.client('s3',
                  endpoint_url=endpoint_url,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)
read_file = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(read_file['Body'], sep=',')

# Write CSV
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3.put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())
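
If s3fs is installed, recent pandas can also round-trip directly with s3:// paths instead of building the client by hand; a sketch with placeholder paths:

# requires s3fs; pandas resolves the s3:// shorthand through it
df = pd.read_csv('s3://your-bucket/path/file.csv')
df.to_csv('s3://your-bucket/path/output.csv', index=False)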

Answered by fmguler

In addition to the other awesome answers, if a custom endpoint is required, it is possible to keep the pd.read_csv('s3://...') syntax by monkey patching the s3fs init method.

import s3fs

s3fsinit = s3fs.S3FileSystem.__init__
def s3fsinit_patched(self, *k, **kw):
    s3fsinit(self, *k, client_kwargs={'endpoint_url': 'https://yourcustomendpoint'}, **kw)
s3fs.S3FileSystem.__init__ = s3fsinit_patched

Or, a more elegant way:

import os

import s3fs

class S3FileSystemPatched(s3fs.S3FileSystem):
    def __init__(self, *k, **kw):
        super(S3FileSystemPatched, self).__init__(*k,
                                                  key=os.environ['aws_access_key_id'],
                                                  secret=os.environ['aws_secret_access_key'],
                                                  client_kwargs={'endpoint_url': 'https://yourcustomendpoint'},
                                                  **kw)
        print('S3FileSystem is patched')

s3fs.S3FileSystem = S3FileSystemPatched
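
Either way, after the patch runs (and before pandas first touches S3), the usual shorthand goes through the custom endpoint. A quick usage sketch with a placeholder path:

import pandas as pd

# pandas instantiates s3fs.S3FileSystem internally, so it now picks up the patch
df = pd.read_csv('s3://bucket/file.csv')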

Also see: s3fs custom endpoint url

Answered by MCMZL

Note that if your bucket is private AND on an aws-like provider, you will meet errors, as s3fs does not load the profile config file at ~/.aws/config like awscli does.

One solution is to define the following environment variables:

export AWS_S3_ENDPOINT="myEndpoint"
export AWS_DEFAULT_REGION="MyRegion"
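
Whether s3fs honours these variables can depend on its version, so an alternative (an assumption on my part, using the s3fs client_kwargs parameter) is to hand the endpoint to S3FileSystem explicitly:

import os

import pandas as pd
import s3fs

# pass the endpoint from the environment straight to s3fs
s3 = s3fs.S3FileSystem(
    anon=False,
    client_kwargs={'endpoint_url': os.environ['AWS_S3_ENDPOINT']},
)
df = pd.read_csv(s3.open('bucket/file.csv', mode='rb'))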