Python: Writing large Pandas DataFrames to a CSV file in chunks

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/38531195/

Date: 2020-08-19 21:05:28  Source: igfitidea

Tags: python, pandas, dataframe, export-to-csv, large-data

Asked by Korean_Of_the_Mountain

How do I write out a large data file to a CSV file in chunks?

I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of that data are of interest to me.

I want to make things easier by making copies of these files with only the columns of interest, so I have smaller files to work with for post-processing. So I plan to read each file into a dataframe, then write it to a CSV file.

I've been looking into reading large data files in chunks into a dataframe. However, I haven't been able to find anything on how to write out the data to a csv file in chunks.

Here is what I'm trying now, but this doesn't append to the CSV file:

import os
import pandas as pd

with open(os.path.join(folder, filename), 'r') as src:
    df = pd.read_csv(src, sep='\t', skiprows=(0, 1, 2), header=0, chunksize=1000)
    for chunk in df:
        # The default mode='w' overwrites the output file on every iteration
        chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                     columns=['TIME', 'STUFF'])

Answered by Scratch'N'Purr

Solution:

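# chunks is the iterator returned by pd.read_csv(..., chunksize=1000)
chunks = pd.read_csv(os.path.join(folder, filename), sep='\t',
                     skiprows=(0, 1, 2), header=0, chunksize=1000)

header = True
for chunk in chunks:
    # columns= takes a flat list of names (the old cols= keyword is no longer accepted)
    chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                 header=header, columns=['TIME', 'STUFF'], mode='a')
    header = False  # write the column header only for the first chunk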

Notes:

  • mode='a' tells pandas to append to the file instead of overwriting it.
  • We only write a column header on the first chunk.
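
Putting the pieces together, a minimal self-contained sketch of the read-in-chunks, append-on-write pattern might look like the following. The helper name copy_columns_in_chunks, its parameters, and the sample file are hypothetical; the tab separator and the three skipped preamble lines mirror the question's files. Because mode='a' appends, the sketch deletes any existing output first so that reruns do not duplicate rows.

```python
import os
import tempfile

import pandas as pd


def copy_columns_in_chunks(src_path, dst_path, columns, chunksize=1000):
    """Copy selected columns from a tab-separated file to a CSV, chunk by chunk.

    Hypothetical helper for illustration; assumes three preamble lines
    precede the header row, as in the question.
    """
    # Start from a clean file, since mode='a' would append to old output.
    if os.path.exists(dst_path):
        os.remove(dst_path)

    chunks = pd.read_csv(src_path, sep='\t', skiprows=3, header=0,
                         usecols=columns, chunksize=chunksize)
    header = True
    for chunk in chunks:
        chunk.to_csv(dst_path, columns=columns, header=header,
                     mode='a', index=False)
        header = False  # only the first chunk writes the column names


# Small made-up input to exercise the helper.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "data.txt")
dst = os.path.join(tmp_dir, "new_file_data.csv")
with open(src, "w") as f:
    f.write("preamble 1\npreamble 2\npreamble 3\n")
    f.write("TIME\tSTUFF\tOTHER\n")
    for i in range(10):
        f.write(f"{i}\t{i * 2}\t{i * 3}\n")

copy_columns_in_chunks(src, dst, ["TIME", "STUFF"], chunksize=4)
result = pd.read_csv(dst)
```

Here index=False is a choice made for the sketch so the copy contains only the requested columns; drop it if the row index should be written as well.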

Answered by Alex

Check out the chunksize argument of the to_csv method; it is described in the pandas docs for DataFrame.to_csv.

Writing to a file would look like:

df.to_csv("path/to/save/file.csv", chunksize=1000, columns=['TIME', 'STUFF'])
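
As a small check of what chunksize does in to_csv, the sketch below writes the same made-up DataFrame with and without it and compares the files. The parameter only controls how many rows are written at a time, so the output should be identical. Note that the whole DataFrame must already be in memory here, so this alone does not avoid loading the source file. The data and file names are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the real file's columns.
df = pd.DataFrame({
    "TIME": range(2500),
    "STUFF": [i * 0.5 for i in range(2500)],
    "OTHER": ["x"] * 2500,
})

tmp_dir = tempfile.mkdtemp()
chunked_path = os.path.join(tmp_dir, "chunked.csv")
plain_path = os.path.join(tmp_dir, "plain.csv")

# Write only the columns of interest, 1000 rows at a time.
df.to_csv(chunked_path, chunksize=1000, columns=["TIME", "STUFF"], index=False)
# Same output written in one go, for comparison.
df.to_csv(plain_path, columns=["TIME", "STUFF"], index=False)

with open(chunked_path) as a, open(plain_path) as b:
    same_content = a.read() == b.read()
```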

Answered by Alexander

Why don't you read only the columns of interest and then save them?

file_in = os.path.join(folder, filename)
file_out = os.path.join(folder, new_folder, 'new_file' + filename)

# usecols selects columns by name; names= would only relabel columns, not filter them
df = pd.read_csv(file_in, sep='\t', skiprows=(0, 1, 2), header=0, usecols=['TIME', 'STUFF'])
df.to_csv(file_out)
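
To illustrate reading only the columns of interest, here is a minimal sketch that uses the usecols parameter of pd.read_csv on a small in-memory sample. The sample text is invented; it assumes a tab-separated layout with three lines to skip before the header, as in the question.

```python
import io

import pandas as pd

# Invented sample: three preamble lines, then a tab-separated table.
raw = (
    "line to skip\n"
    "line to skip\n"
    "line to skip\n"
    "TIME\tSTUFF\tOTHER\n"
    "0\t1.5\ta\n"
    "1\t2.5\tb\n"
)

# usecols selects columns by name, so OTHER is never loaded into the DataFrame.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=3, header=0,
                 usecols=["TIME", "STUFF"])

# Write the reduced table to an in-memory buffer instead of a real file.
out = io.StringIO()
df.to_csv(out, index=False)
csv_text = out.getvalue()
```

For files too large to load even after dropping columns, usecols can be combined with chunksize in the same read_csv call.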