How can I split a large csv file (7GB) in Python
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20033861/
Asked by Sohail
I have a 7GB csv file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?
Answered by jonrsharpe
See the Python docs on file objects (the object returned by open(filename)); you can choose to read a specified number of bytes, or use readline to work through one line at a time.
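For instance, a minimal sketch of both approaches (the filename big.csv and the 250MB figure are borrowed from the question):

# Read a fixed amount in one go, roughly the 250MB sample asked for.
# Note: in text mode, read(n) counts characters, which matches bytes for ASCII
# data, and the slice may end mid-line, so you may want to drop the last line.
with open('big.csv', 'r') as f:
    sample = f.read(250 * 1024 * 1024)

# Or use readline to work through the file one line at a time.
with open('big.csv', 'r') as f:
    line = f.readline()
    while line:
        # process the line here
        line = f.readline()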
Answered by Thomas Orozco
You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
Would split data.csv into chunks of 100 lines each.
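By default, split writes the pieces to files named xaa, xab, and so on in the current directory; since it splits purely by line count, only the first piece will contain the csv header row.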
Answered by dstromberg
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10  # number of rows per output file
outfileno = 1
outfile = None
writer = None

with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        if index % divisor == 0:
            # Start a new chunk: close the previous file and open the next one.
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

# Close the last chunk, which the loop above leaves open.
if outfile is not None:
    outfile.close()
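With divisor = 10 this produces big-1.csv, big-2.csv, and so on, each holding 10 rows (the last file may hold fewer); raise divisor to get fewer, larger chunks.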
Answered by Quentin Febvre
I had to do a similar task, and used the pandas package:
import pandas as pd

# Stream the file in 500,000-row chunks and write each chunk to its own csv.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
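Each chunk arrives as an ordinary DataFrame of up to 500,000 rows, and to_csv writes the header into every output file, so each piece stays independently readable.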
Answered by Jimmy
I agree with @jonrsharpe that readline should be able to read one line at a time, even for big files.
If you are dealing with big csv files, might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.
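As a rough illustration, a minimal sketch of grabbing just a fixed-size sample with pandas (the filename and row count here are assumptions; tune nrows until the output lands near the 250MB the question mentions):

import pandas as pd

# Read only the first 1,000,000 rows instead of the whole 7GB file.
sample = pd.read_csv('bigfile.csv', nrows=1000000)
sample.to_csv('sample.csv', index=False)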
Hope it helps.

