Read a csv file from aws s3 using boto and pandas

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/43355074/

Date: 2020-08-19 22:56:19 · Source: igfitidea


python, python-2.7, pandas, amazon-s3, boto

Asked by Drj

I have already read through the answers available here and here and these do not help.

I am trying to read a csv object from an S3 bucket and have been able to successfully read the data using the following code.

from boto.s3.connection import S3Connection
from boto.s3.key import Key

srcFileName = "gossips.csv"

def on_session_started():
  print("Starting new session.")
  conn = S3Connection()
  my_bucket = conn.get_bucket("randomdatagossip", validate=False)
  print("Bucket Identified")
  print(my_bucket)
  key = Key(my_bucket, srcFileName)
  key.open()
  print(key.read())
  conn.close()

on_session_started()

However, if I try to read the same object using pandas as a data frame, I get an error. The most common one is S3ResponseError: 403 Forbidden.

import smart_open
from boto.s3.connection import S3Connection

def on_session_started2():
  print("Starting Second new session.")
  conn = S3Connection()
  my_bucket = conn.get_bucket("randomdatagossip", validate=False)
  #     url = "https://s3.amazonaws.com/randomdatagossip/gossips.csv"
  #     urllib2.urlopen(url)

  for line in smart_open.smart_open('s3://my_bucket/gossips.csv'):
     print(line)
  #     data = pd.read_csv(url)
  #     print(data)

on_session_started2()

What am I doing wrong? I am on Python 2.7 and cannot use Python 3.

Answered by Drj

Here is what I have done to successfully read a df from a csv on S3.

import pandas as pd
import boto3

bucket = "yourbucket"
file_name = "your_file.csv"

# create an S3 client using the default credentials and config
s3 = boto3.client('s3')

# fetch the object (the file, identified by its key) from the bucket
obj = s3.get_object(Bucket=bucket, Key=file_name)

# 'Body' is a file-like StreamingBody; pandas can read from it directly
initial_df = pd.read_csv(obj['Body'])
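Note that boto3.client('s3') with no arguments assumes your credentials are already available, e.g. from environment variables or from ~/.aws/credentials; if they are missing or wrong, you will see exactly the 403 from the question. A minimal credentials file looks like the sketch below (placeholder values, not real keys):

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```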

Answered by aidan.plenert.macdonald

This worked for me.


import pandas as pd
import boto3
import io

s3_file_key = 'data/test.csv'
bucket = 'data-bucket'

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=s3_file_key)

initial_df = pd.read_csv(io.BytesIO(obj['Body'].read()))
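The io.BytesIO step can be sanity-checked locally without touching S3, by feeding pandas the same kind of in-memory bytes that obj['Body'].read() would return (the csv content below is made up for illustration):

```python
import io

import pandas as pd

# stand-in for obj['Body'].read(): raw csv bytes as they would come from S3
raw_bytes = b"name,score\nalice,10\nbob,7\n"

df = pd.read_csv(io.BytesIO(raw_bytes))
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'score']
```

This confirms the pattern works for any file-like bytes source, independent of where the bytes came from.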