Pandas to_csv() slow saving large dataframe

Note: This content is taken from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors on StackOverflow, not the translator. Original question: http://stackoverflow.com/questions/40660331/

Tags: python, csv, pandas, dataframe

Asked by Kimi Merroll

I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                     , sep=','
                     , index_col=False
                     , header=0
                     , low_memory=False
                     , iterator=True
                     , chunksize=100000
                     , compression='gzip'
                     , memory_map=True
                     , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.

Thank you,

Answered by Frane

You are reading compressed files and writing a plain-text file, so this could be an I/O bottleneck.

Writing a compressed file could speed up writing by up to 10x:

stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').

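As a rough way to compare those options on your own data, here is a minimal timing sketch (the sample frame and output file names are made up for illustration, and the actual speed-ups will depend on the data and the pandas version):

import time

import numpy as np
import pandas as pd

# Hypothetical wide, mostly-numeric frame, roughly matching the question's shape
df = pd.DataFrame(np.random.rand(100000, 100))

for comp, ext in [(None, 'csv'), ('gzip', 'csv.gz'), ('bz2', 'csv.bz2'), ('xz', 'csv.xz')]:
    start = time.time()
    df.to_csv('sample_output.' + ext, sep='|', index=False,
              chunksize=100000, compression=comp)
    print('%s: %.1f s' % (comp, time.time() - start))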

Answered by Amir F

Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, where the CSV option took between 30 and 55 seconds.)

stage.to_hdf(r'path/file.h5', key='stage', mode='w')
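To read the frame back later, something along these lines should work (this assumes the optional PyTables dependency, the tables package, is installed, since to_hdf/read_hdf rely on it):

stage = pd.read_hdf(r'path/file.h5', key='stage')  # load the frame back using the same key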

Answered by lucas F

You said "[...] of mostly numeric (decimal) data.". Do you have any column with time and/or dates?

I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two Dates columns. So what I would recommend is converting each date column to a string before saving. The following command is enough:

df['Column'] = df['Column'].astype(str) 
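If there are several date columns, a small sketch along these lines could convert them all in one pass before writing (assuming they are already parsed as datetime64 dtypes; nothing here is specific to the original data):

# Convert every datetime column to a string up front, instead of letting to_csv format each value
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].astype(str)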

I hope that this answer helps you.

P.S.: I understand that saving as an .hdf file solved the problem. But sometimes we do need a .csv file anyway.
