Pandas to_csv() slow saving large dataframe

Note: This content is taken from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors on StackOverflow, not the translator. Original question: http://stackoverflow.com/questions/40660331/

Tags: python, csv, pandas, dataframe

Asked by Kimi Merroll

I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                     , sep=','
                     , index_col=False
                     , header=0
                     , low_memory=False
                     , iterator=True
                     , chunksize=100000
                     , compression='gzip'
                     , memory_map=True
                     , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.

Thank you,

Answered by Frane

You are reading compressed files and writing a plain-text file, so this could be an I/O bottleneck.

Writing a compressed file could speed up writing by up to 10x:

stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').

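As a rough way to compare those options on your own data, here is a minimal timing sketch (the sample frame and output file names are made up for illustration, and the actual speed-ups will depend on the data and the pandas version):

import time

import numpy as np
import pandas as pd

# Hypothetical wide, mostly-numeric frame, roughly matching the question's shape
df = pd.DataFrame(np.random.rand(100000, 100))

for comp, ext in [(None, 'csv'), ('gzip', 'csv.gz'), ('bz2', 'csv.bz2'), ('xz', 'csv.xz')]:
    start = time.time()
    df.to_csv('sample_output.' + ext, sep='|', index=False,
              chunksize=100000, compression=comp)
    print('%s: %.1f s' % (comp, time.time() - start))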

Answered by Amir F

Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, where the CSV option took between 30 and 55 seconds.)

stage.to_hdf(r'path/file.h5', key='stage', mode='w')
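To read the frame back later, something along these lines should work (this assumes the optional PyTables dependency, the tables package, is installed, since to_hdf/read_hdf rely on it):

stage = pd.read_hdf(r'path/file.h5', key='stage')  # load the frame back using the same key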

Answered by lucas F

You said "[...] of mostly numeric (decimal) data.". Do you have any column with time and/or dates?

I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two Dates columns. So what I would recommend is converting each date column to a string before saving. The following command is enough:

df['Column'] = df['Column'].astype(str) 
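If there are several date columns, a small sketch along these lines could convert them all in one pass before writing (assuming they are already parsed as datetime64 dtypes; nothing here is specific to the original data):

# Convert every datetime column to a string up front, instead of letting to_csv format each value
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].astype(str)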

I hope that this answer helps you.

P.S.: I understand that saving as an .hdf file solved the problem. But sometimes we do need a .csv file anyway.
