Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/43345907/

Date: 2020-09-14 03:22:34  Source: igfitidea

Read csv from Amazon s3 using python2.7

Tags: python, python-2.7, csv, pandas, amazon-s3

Asked by lucy

I can easily list the bucket names from S3, but when I try to read a CSV file from S3, I get an error every time.

import boto3
import pandas as pd

s3 = boto3.client('s3',
         aws_access_key_id='yyyyyyyy',
         aws_secret_access_key='xxxxxxxxxxx')
# Call S3 to list current buckets
response = s3.list_buckets()
for bucket in response['Buckets']:
    print bucket['Name']

Output:

s3-bucket-data
...

import pandas as pd
import StringIO
from boto.s3.connection import S3Connection

AWS_KEY = 'yyyyyyyyyy'
AWS_SECRET = 'xxxxxxxxxx'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('s3-bucket-data')

fileName = "data.csv"

content = bucket.get_key(fileName).get_contents_as_string()
reader = pd.read_csv(StringIO.StringIO(content))

This gives the following error:

boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request

How can I read the CSV from S3?

Answered by muon

You can use the s3fs package.

s3fs also supports AWS profiles in credential files.

Here is an example (you don't have to chunk it, but I just had this example handy):

import os
import pandas as pd
import s3fs
import gzip

chunksize = 999999
usecols = ["Col1", "Col2"]

filename = 'some_csv_file.csv.gz'
s3_bucket_name = 'some_bucket_name'

AWS_KEY = 'yyyyyyyyyy'
AWS_SECRET = 'xxxxxxxxxx'
s3f = s3fs.S3FileSystem(
    anon=False,
    key=AWS_KEY,
    secret=AWS_SECRET)

# or if you have a profile defined in credentials file:
#aws_shared_credentials_file = 'path/to/aws/credentials/file/'
#os.environ['AWS_SHARED_CREDENTIALS_FILE'] = aws_shared_credentials_file
#s3f = s3fs.S3FileSystem(
#    anon=False,
#    profile_name=s3_profile)

# S3 paths always use forward slashes; os.path.join would use backslashes on Windows
filepath = '{}/{}'.format(s3_bucket_name, filename)
with s3f.open(filepath, 'rb') as f:
    gz = gzip.GzipFile(fileobj=f)  # Decompress data with gzip

    chunks = pd.read_csv(gz,
                            usecols=usecols,
                            chunksize=chunksize,
                            iterator=True,
                            )

    # Each chunk holds a consecutive block of rows, so stack them row-wise
    df = pd.concat(chunks, axis=0)
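
As the answer notes, chunking is optional. A minimal sketch of the non-chunked variant follows, assuming the file fits in memory. `read_gzipped_csv` is a hypothetical helper name, and the in-memory buffer only stands in for the object returned by `s3f.open(filepath, 'rb')` so the snippet runs without S3 access.

```python
import gzip
import io

import pandas as pd


def read_gzipped_csv(fileobj, usecols=None):
    """Read a whole gzip-compressed CSV from a binary file-like object."""
    gz = gzip.GzipFile(fileobj=fileobj)
    try:
        return pd.read_csv(gz, usecols=usecols)
    finally:
        gz.close()  # closes the gzip wrapper, not the underlying file object


# Local stand-in for the handle returned by s3f.open(filepath, 'rb')
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode='wb') as out:
    out.write(b"Col1,Col2,Col3\n1,2,3\n4,5,6\n")
raw.seek(0)

df = read_gzipped_csv(raw, usecols=["Col1", "Col2"])
print(df)
```

With a real bucket, the same helper would presumably be called inside the `with s3f.open(filepath, 'rb') as f:` block, passing `f` instead of `raw`.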

Answered by rrmerugu

boto is one thing I love when it comes to handling data on S3 with Python.

Install boto using pip install boto.

import boto
from boto.s3.key import Key

keyId ="your_aws_key_id"
sKeyId="your_aws_secret_key_id"
srcFileName="abc.txt" # filename on S3
destFileName="s3_abc.txt" # output file name
bucketName="mybucket001" # S3 bucket name 

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)

#Get the Key object of the given key, in the bucket
k = Key(bucket,srcFileName)

#Get the contents of the key into a file 
k.get_contents_to_filename(destFileName)
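
The snippet above only downloads the object. To get back to the original question, the downloaded file can then be loaded with pandas. A small sketch, assuming the file is comma-separated; the temporary file written here is just a stand-in for whatever `get_contents_to_filename` saved, and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the file written by k.get_contents_to_filename(destFileName)
tmp_dir = tempfile.mkdtemp()
destFileName = os.path.join(tmp_dir, "s3_abc.txt")
with open(destFileName, "w") as handle:
    handle.write("id,name\n1,alpha\n2,beta\n")

# Load the downloaded file; pass sep="\t" (or similar) if it is not comma-separated
df = pd.read_csv(destFileName)
print(df)
```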

Answered by Manas Gaur

I experienced this issue with a few AWS regions. I created a bucket in us-east-1 and the following code worked fine:

import boto
from boto.s3.key import Key
import StringIO
import pandas as pd
keyId ="xxxxxxxxxxxxxxxxxx"
sKeyId="yyyyyyyyyyyyyyyyyy"
srcFileName="zzzzz.csv"
bucketName="elasticbeanstalk-us-east-1-aaaaaaaaaaaa"

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)
k = Key(bucket,srcFileName)
content = k.get_contents_as_string()
reader = pd.read_csv(StringIO.StringIO(content))

Try creating a new bucket in us-east-1 and see if it works.
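
Before creating a new bucket, it may be worth checking which region the existing one lives in. The sketch below assumes a boto3 S3 client like the one in the question's first snippet; `bucket_region` and `client_for_bucket` are hypothetical helper names, not part of boto3. It relies on `get_bucket_location`, which reports an empty `LocationConstraint` for buckets in us-east-1 and the legacy value `EU` for some buckets in eu-west-1.

```python
def bucket_region(s3_client, bucket_name):
    """Return the region name of a bucket, given a boto3 S3 client."""
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    location = response.get('LocationConstraint')
    if not location:
        # Buckets in us-east-1 come back with an empty LocationConstraint
        return 'us-east-1'
    if location == 'EU':
        # Legacy alias that S3 may still return for eu-west-1
        return 'eu-west-1'
    return location


def client_for_bucket(bucket_name, **credentials):
    """Build a boto3 S3 client pinned to the bucket's own region."""
    import boto3  # imported lazily so bucket_region works without boto3 installed

    probe = boto3.client('s3', **credentials)
    region = bucket_region(probe, bucket_name)
    return boto3.client('s3', region_name=region, **credentials)
```

If the bucket turns out to be outside us-east-1, that would be consistent with the region-dependent behavior described in this answer, though it does not by itself prove that the region caused the 400 error.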

Answered by sepideh

Try the following:

import io

import boto3
import pandas as pd
from botocore.client import Config

# Replace 'XXXX' with your bucket's region and the signature version it requires
session = boto3.session.Session(region_name='XXXX')
s3client = session.client('s3', config=Config(signature_version='XXXX'))
response = s3client.get_object(Bucket='myBucket', Key='myKey')

# The response body is a byte stream; wrap it in BytesIO so pandas can parse it
dataset = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='utf8')
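
The same pattern can be wrapped in a small reusable function. This is only a sketch: `read_csv_from_s3` is a hypothetical helper name, and it assumes the client passed in behaves like a boto3 S3 client, that is, `get_object` returns a dict whose `'Body'` has a `read()` method. It loads the whole object into memory, so it suits files of moderate size.

```python
import io

import pandas as pd


def read_csv_from_s3(s3client, bucket, key, **read_csv_kwargs):
    """Fetch one object through a boto3-style S3 client and parse it as CSV."""
    response = s3client.get_object(Bucket=bucket, Key=key)
    # response['Body'] is a stream; read it fully into memory before parsing
    payload = response['Body'].read()
    return pd.read_csv(io.BytesIO(payload), **read_csv_kwargs)
```

With the boto3 client `s3` from the question's first snippet, a call might look like `read_csv_from_s3(s3, 's3-bucket-data', 'data.csv')`, which avoids the older boto library altogether. Extra keyword arguments such as `usecols` are passed straight through to `pd.read_csv`.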