How can I split a large CSV file (7GB) in Python

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/20033861/

Tags: python, csv, split

Asked by Sohail

I have a 7GB CSV file which I'd like to split into smaller chunks, so it is readable and faster to analyze in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?

Answer by jonrsharpe

See the Python docs on file objects (the object returned by open(filename)) - you can choose to read() a specified number of bytes, or use readline() to work through one line at a time.

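A minimal sketch of that approach, aimed at the ~250MB sample the question asks for (the file names are placeholders): copy lines from the big file into a smaller one until a byte budget is reached, so no row is cut in half.

max_bytes = 250 * 1024 * 1024  # ~250MB budget for the sample

with open('big.csv', 'r') as infile, open('sample.csv', 'w') as outfile:
    written = 0
    for line in infile:  # iterating a file object yields one line at a time
        outfile.write(line)
        written += len(line)  # character count; approximates bytes for ASCII data
        if written >= max_bytes:
            break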

Answer by Thomas Orozco

You don't need Python to split a CSV file. Using your shell:

$ split -l 100 data.csv

Would split data.csv into chunks of 100 lines.

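By default, split writes the pieces to files named xaa, xab, and so on in the current directory. Note that if data.csv has a header row, only the first piece will contain it.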

Answer by dstromberg

Maybe something like this?

#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10  # rows per output file

outfileno = 1
outfile = None
writer = None

with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        # Start a new output file every `divisor` rows
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

if outfile is not None:
    outfile.close()  # close the final chunk
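
One caveat: if big.csv begins with a header row, the snippet above writes it only into the first chunk. A small variation on the same idea (my sketch, not part of the original answer) repeats the header at the top of every chunk, so each piece is a valid standalone CSV:

import csv

rows_per_chunk = 10  # illustrative chunk size

with open('big.csv', 'r', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)  # assumes the first row is a header
    outfile = None
    writer = None
    for index, row in enumerate(reader):
        if index % rows_per_chunk == 0:
            if outfile is not None:
                outfile.close()
            outfile = open('big-{}.csv'.format(index // rows_per_chunk + 1), 'w', newline='')
            writer = csv.writer(outfile)
            writer.writerow(header)  # repeat the header in each chunk
        writer.writerow(row)
    if outfile is not None:
        outfile.close()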

Answer by Quentin Febvre

I had to do a similar task and used the pandas package:

import pandas as pd

# Read the CSV lazily in chunks of 500,000 rows; write each chunk to its own file
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
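
Each chunk here is an ordinary pandas DataFrame of up to 500,000 rows, so you can also filter or transform it before writing if you only need part of the data.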

Answer by Jimmy

I agree with @jonrsharpe: readline should be able to read one line at a time, even for big files.

If you are dealing with big CSV files, might I suggest using pandas.read_csv? I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.

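For the original goal of grabbing a small sample rather than splitting the whole file, pandas can also stop after a fixed number of rows; a minimal sketch (the file name and row count are placeholders):

import pandas as pd

# Read only the first million rows, then write them out as a smaller sample
sample = pd.read_csv('bigfile.csv', nrows=1000000)
sample.to_csv('sample.csv', index=False)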

Hope it helps.