Write a pandas DataFrame as a compressed CSV directly to an Amazon S3 bucket?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow source: http://stackoverflow.com/questions/43729224/


Write pandas dataframe as compressed CSV directly to Amazon s3 bucket?

python, csv, pandas, amazon-web-services, amazon-s3

Asked by rosstripi

I currently have a script that reads the existing version of a csv saved to s3, combines that with the new rows in the pandas dataframe, and then writes that directly back to s3.


    # s3_resource is a boto3 S3 resource; ticker_csv_file_name and curr_df are defined earlier in the script
    try:
        # fetch the existing CSV (if any) and decode it
        csv_prev_content = str(s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read(), 'utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        # no existing object yet, so start from an empty string
        csv_prev_content = ''

    # append the new rows and write the combined CSV back to S3
    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    s3_resource.Object('bucket-name', ticker_csv_file_name).put(Body=csv_output)

Is there a way that I can do this but with a gzip-compressed CSV? I want to read an existing .gz compressed CSV on S3 if there is one, concatenate it with the contents of the dataframe, and then overwrite the .gz with the new combined compressed CSV directly in S3, without having to make a local copy.

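A minimal sketch of that read, append, and rewrite flow, assuming the same bucket-name, ticker_csv_file_name, curr_df, and s3_resource names as in the snippet above; it decompresses the existing object in memory with the standard-library gzip module and uploads the recompressed bytes back:

    import gzip

    obj = s3_resource.Object('bucket-name', ticker_csv_file_name)
    try:
        # download the existing .gz object and decompress it entirely in memory
        csv_prev_content = gzip.decompress(obj.get()['Body'].read()).decode('utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        csv_prev_content = ''

    # append the new rows, recompress, and overwrite the object in place
    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    obj.put(Body=gzip.compress(csv_output.encode('utf8')))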

Answered by ramhiser

Here's a solution in Python 3.5.2 using Pandas 0.20.1.


The source DataFrame can be read from S3, a local CSV, or anywhere else.


import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# write the CSV into an in-memory gzip buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# upload the compressed bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
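To verify the round trip, the compressed object can be read straight back into pandas; a small sketch, reusing the s3_resource and the ramey/new-file.csv.gz key from the code above:

# fetch the gzipped CSV back from S3 and let pandas decompress it
body = s3_resource.Object('ramey', 'new-file.csv.gz').get()['Body'].read()
df_check = pd.read_csv(BytesIO(body), compression='gzip')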

Answered by user582175

If you want streaming writes (so the (de)compressed CSV is never held fully in memory), you can do this:


import s3fs
import io
import gzip

def write_df_to_s3(df, filename, path):
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        # stream the gzip output straight into the S3 file handle
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz, encoding='utf-8')
        df.to_csv(buf, index=False)
        buf.flush()  # push any text still buffered in the wrapper into the gzip stream
        gz.close()
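
A possible call, with the bucket and key names purely illustrative:

# the path is "bucket/key"; both parts here are placeholders
write_df_to_s3(df, 'test.csv', 'ramey/test.csv.gz')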

TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827

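On recent pandas versions with s3fs installed, the compress-and-upload step may also work as a single call; a minimal sketch, reusing the hypothetical ramey bucket:

# pandas applies gzip compression and writes directly to S3 via s3fs
df.to_csv('s3://ramey/new-file.csv.gz', index=False, compression='gzip')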