Write a pandas DataFrame as a compressed CSV directly to an Amazon S3 bucket?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow source: http://stackoverflow.com/questions/43729224/


Write pandas dataframe as compressed CSV directly to Amazon s3 bucket?

python, csv, pandas, amazon-web-services, amazon-s3

Asked by rosstripi

I currently have a script that reads the existing version of a csv saved to s3, combines that with the new rows in the pandas dataframe, and then writes that directly back to s3.


    # s3_resource is a boto3 S3 resource; ticker_csv_file_name and curr_df are defined earlier in the script
    try:
        # fetch the existing CSV (if any) and decode it
        csv_prev_content = str(s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read(), 'utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        # no existing object yet, so start from an empty string
        csv_prev_content = ''

    # append the new rows and write the combined CSV back to S3
    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    s3_resource.Object('bucket-name', ticker_csv_file_name).put(Body=csv_output)

Is there a way that I can do this but with a gzip-compressed CSV? I want to read an existing .gz compressed CSV on S3 if there is one, concatenate it with the contents of the dataframe, and then overwrite the .gz with the new combined compressed CSV directly in S3, without having to make a local copy.

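A minimal sketch of that read, append, and rewrite flow, assuming the same bucket-name, ticker_csv_file_name, curr_df, and s3_resource names as in the snippet above; it decompresses the existing object in memory with the standard-library gzip module and uploads the recompressed bytes back:

    import gzip

    obj = s3_resource.Object('bucket-name', ticker_csv_file_name)
    try:
        # download the existing .gz object and decompress it entirely in memory
        csv_prev_content = gzip.decompress(obj.get()['Body'].read()).decode('utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        csv_prev_content = ''

    # append the new rows, recompress, and overwrite the object in place
    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    obj.put(Body=gzip.compress(csv_output.encode('utf8')))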

Answered by ramhiser

Here's a solution in Python 3.5.2 using Pandas 0.20.1.


The source DataFrame can be read from S3, a local CSV, or anywhere else.


import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# write the CSV into an in-memory gzip buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# upload the compressed bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
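To verify the round trip, the compressed object can be read straight back into pandas; a small sketch, reusing the s3_resource and the ramey/new-file.csv.gz key from the code above:

# fetch the gzipped CSV back from S3 and let pandas decompress it
body = s3_resource.Object('ramey', 'new-file.csv.gz').get()['Body'].read()
df_check = pd.read_csv(BytesIO(body), compression='gzip')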

Answered by user582175

If you want streaming writes (so the (de)compressed CSV is never held fully in memory), you can do this:


import s3fs
import io
import gzip

def write_df_to_s3(df, filename, path):
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        # stream the gzip output straight into the S3 file handle
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz, encoding='utf-8')
        df.to_csv(buf, index=False)
        buf.flush()  # push any text still buffered in the wrapper into the gzip stream
        gz.close()
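
A possible call, with the bucket and key names purely illustrative:

# the path is "bucket/key"; both parts here are placeholders
write_df_to_s3(df, 'test.csv', 'ramey/test.csv.gz')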

TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827

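On recent pandas versions with s3fs installed, the compress-and-upload step may also work as a single call; a minimal sketch, reusing the hypothetical ramey bucket:

# pandas applies gzip compression and writes directly to S3 via s3fs
df.to_csv('s3://ramey/new-file.csv.gz', index=False, compression='gzip')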