How can I split a large csv file (7GB) in Python
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20033861/
Asked by Sohail
I have a 7GB csv file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?
Answered by jonrsharpe
See the Python docs on file objects (the object returned by open(filename)); you can choose to read a specified number of bytes, or use readline to work through one line at a time.
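For instance, a minimal sketch of both approaches (the filename big.csv and the 250MB figure are borrowed from the question):

# Read a fixed amount in one go, roughly the 250MB sample asked for.
# Note: in text mode, read(n) counts characters, which matches bytes for ASCII
# data, and the slice may end mid-line, so you may want to drop the last line.
with open('big.csv', 'r') as f:
    sample = f.read(250 * 1024 * 1024)

# Or use readline to work through the file one line at a time.
with open('big.csv', 'r') as f:
    line = f.readline()
    while line:
        # process the line here
        line = f.readline()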
Answered by Thomas Orozco
You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
Would split data.csv into chunks of 100 lines each.
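By default, split writes the pieces to files named xaa, xab, and so on in the current directory; since it splits purely by line count, only the first piece will contain the csv header row.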
Answered by dstromberg
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10  # number of rows per output file
outfileno = 1
outfile = None
writer = None

with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        if index % divisor == 0:
            # Start a new chunk: close the previous file and open the next one.
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

# Close the last chunk, which the loop above leaves open.
if outfile is not None:
    outfile.close()
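With divisor = 10 this produces big-1.csv, big-2.csv, and so on, each holding 10 rows (the last file may hold fewer); raise divisor to get fewer, larger chunks.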
Answered by Quentin Febvre
I had to do a similar task, and used the pandas package:
import pandas as pd

# Stream the file in 500,000-row chunks and write each chunk to its own csv.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
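Each chunk arrives as an ordinary DataFrame of up to 500,000 rows, and to_csv writes the header into every output file, so each piece stays independently readable.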
Answered by Jimmy
I agree with @jonrsharpe that readline should be able to read one line at a time, even for big files.
If you are dealing with big csv files, might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.
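As a rough illustration, a minimal sketch of grabbing just a fixed-size sample with pandas (the filename and row count here are assumptions; tune nrows until the output lands near the 250MB the question mentions):

import pandas as pd

# Read only the first 1,000,000 rows instead of the whole 7GB file.
sample = pd.read_csv('bigfile.csv', nrows=1000000)
sample.to_csv('sample.csv', index=False)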
Hope it helps.

