Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38154040/

Save Dataframe to csv directly to s3 Python

Tags: python, csv, amazon-s3, dataframe, boto3

Asked by user2494275

I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don't want to save the file locally before transferring it to s3. Is there any method like to_csv for writing the dataframe to s3 directly? I am using boto3.
Here is what I have so far:

import boto3
import pandas as pd

s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
# get_object requires keyword arguments; bucket and key names are placeholders
read_file = s3.get_object(Bucket='my-bucket', Key='my-key')
df = pd.read_csv(read_file['Body'])

# Make alterations to DataFrame

# Then export DataFrame to CSV through direct transfer to s3

Answered by Stefan

You can use:

from io import StringIO  # Python 3 (use BytesIO on Python 2)
import boto3

bucket = 'my_bucket_name' # already created on S3
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'df.csv').put(Body=csv_buffer.getvalue())

Answered by yardstick17

You can directly use the S3 path. I am using Pandas 0.24.1

In [1]: import pandas as pd

In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])

In [3]: df
Out[3]:
   a  b  c
0  1  1  1
1  2  2  2

In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)

In [5]: pd.__version__
Out[5]: '0.24.1'

In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')

In [7]: new_df
Out[7]:
   a  b  c
0  1  1  1
1  2  2  2

Release Note:

S3 File Handling

pandas now uses s3fs for handling S3 connections. This shouldn't break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. GH11915.

Answered by michcio1234

I like s3fs, which lets you use S3 (almost) like a local filesystem.

You can do this:

import s3fs

# serialize the DataFrame to CSV bytes in memory (no local file)
bytes_to_write = df.to_csv(None).encode()
fs = s3fs.S3FileSystem(key=key, secret=secret)
with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
    f.write(bytes_to_write)

s3fs supports only rb and wb modes for opening files, which is why I did this bytes_to_write business.

Answered by erncyp

This is a more up-to-date answer:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

# Use 'w' for py3, 'wb' for py2
with s3.open('<bucket-name>/<filename>.csv','w') as f:
    df.to_csv(f)

The problem with StringIO is that it will eat away at your memory. With this method, you stream the file to S3 rather than converting it to a string and then writing that string into S3. Holding the pandas DataFrame and its string copy in memory at the same time seems very inefficient.

If you are working on an EC2 instance, you can give it an IAM role that allows writing to S3, so you don't need to pass in credentials directly. However, you can also connect to a bucket by passing credentials to the S3FileSystem() function, as sketched below. See the documentation: https://s3fs.readthedocs.io/en/latest/

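For example, a minimal sketch of the credentials variant (the key and secret values are hypothetical placeholders):

import s3fs

# placeholder credentials; prefer an IAM role where possible
s3 = s3fs.S3FileSystem(key='YOUR_ACCESS_KEY', secret='YOUR_SECRET_KEY')

with s3.open('<bucket-name>/<filename>.csv', 'w') as f:
    df.to_csv(f)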
Answered by mhawke

If you pass None as the first argument to to_csv(), the data will be returned as a string. From there it's an easy step to upload that to S3 in one go.

It should also be possible to pass a StringIO object to to_csv(), but using a string will be easier.

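A minimal sketch of this approach (the bucket and key names are placeholders):

import boto3
import pandas as pd

# to_csv(None) returns the CSV content as a str instead of writing a file
csv_string = df.to_csv(None)

# upload the whole string in one call
boto3.client('s3').put_object(Bucket='my-bucket', Key='df.csv', Body=csv_string)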
Answered by gabra

You can also use the AWS Data Wrangler:

import awswrangler

session = awswrangler.Session()
session.pandas.to_csv(
    dataframe=df,
    path="s3://...",
)

Note that it will split the upload into several parts, since it uploads them in parallel.

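This snippet uses the old, pre-1.0 awswrangler API. In recent versions (the project is now also known as AWS SDK for pandas), the equivalent call is roughly the following sketch, with a placeholder path:

import awswrangler as wr

# assumes awswrangler >= 1.0; the path is a placeholder
wr.s3.to_csv(df=df, path='s3://bucket/prefix/df.csv', index=False)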
Answered by Harry_pb

I found this can also be done using client, not just resource.

from io import StringIO
import boto3

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)
csv_buf = StringIO()
df.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key='path/test.csv')

Answered by jerrytim

Since you are using boto3.client(), try:

import boto3
from io import StringIO  # Python 3
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
def copy_to_s3(client, df, bucket, filepath):
    csv_buf = StringIO()
    df.to_csv(csv_buf, header=True, index=False)
    csv_buf.seek(0)
    client.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key=filepath)
    print(f'Copy {df.shape[0]} rows to S3 Bucket {bucket} at {filepath}, Done!')

copy_to_s3(client=s3, df=df_to_upload, bucket='abc', filepath='def/test.csv')

Answered by Antoine Krajnc

I found a very simple solution that seems to work:

import boto3

s3 = boto3.client("s3")

# note: this reads an existing local CSV, so the DataFrame must already
# have been written to disk (e.g. with df.to_csv("filename.csv"))
s3.put_object(
    Body=open("filename.csv").read(),
    Bucket="your-bucket",
    Key="your-key"
)

Hope that helps!

Answered by Jamir Josimar Huamán Campos

I read a CSV with two columns from an S3 bucket, and put the contents of the CSV file into a pandas DataFrame.

Example:

config.json

{
  "credential": {
    "access_key": "xxxxxx",
    "secret_key": "xxxxxx"
  },
  "s3": {
    "bucket": "mybucket",
    "key": "csv/user.csv"
  }
}

cls_config.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json

class cls_config(object):

    def __init__(self,filename):

        self.filename = filename


    def getConfig(self):

        fileName = os.path.join(os.path.dirname(__file__), self.filename)
        with open(fileName) as f:
            config = json.load(f)
        return config

cls_pandas.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import io

class cls_pandas(object):

    def __init__(self):
        pass

    def read(self,stream):

        df = pd.read_csv(io.StringIO(stream), sep = ",")
        return df

cls_s3.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import boto3
import json

class cls_s3(object):

    def __init__(self, access_key, secret_key):

        self.s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    def getObject(self,bucket,key):

        read_file = self.s3.get_object(Bucket=bucket, Key=key)
        body = read_file['Body'].read().decode('utf-8')
        return body

test.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from cls_config import *
from cls_s3 import *
from cls_pandas import *

class test(object):

    def __init__(self):
        self.conf = cls_config('config.json')

    def process(self):

        conf = self.conf.getConfig()

        bucket = conf['s3']['bucket']
        key = conf['s3']['key']

        access_key = conf['credential']['access_key']
        secret_key = conf['credential']['secret_key']

        s3 = cls_s3(access_key,secret_key)
        ob = s3.getObject(bucket,key)

        pa = cls_pandas()
        df = pa.read(ob)

        print(df)

if __name__ == '__main__':
    test = test()
    test.process()