pandas: Divide a .csv file into chunks with Python
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/32743051/
Divide .csv file into chunks with Python
Asked by invoker
I have a large .csv file that is well over 300 GB. I would like to chunk it into smaller files of 100,000,000 rows each (each row has approximately 55-60 bytes).
I wrote the following code:
import pandas as pd

df = pd.read_csv('/path/to/really/big.csv', header=None, chunksize=100000000)
count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=None, index=None)
    print(count)
    count += 1
This code works fine, and I have plenty of disk space to store the approximately 5.5-6 GB at a time, but it's slow.
Is there a better way?
EDIT
I have written the following iterative solution:
import csv

with open('/path/to/really/big.csv', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 0  # rows written so far
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row in read_rows:
        # rejoin the parsed fields with commas and restore the newline
        f.write(','.join(row) + '\n')
        row_count += 1
        if row_count % 100000000 == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()
EDIT 2
I would like to call attention to Vor's comment about using the Unix/Linux split command; it is the fastest solution I have found.
Accepted answer by babbageclunk
You don't really need to read all that data into a pandas DataFrame just to split the file - you don't even need to read the data all into memory at all. You could seek to the approximate offset you want to split at, then scan forward until you find a line break, and loop reading much smaller chunks from the source file into a destination file between your start and end offsets. (This approach assumes your CSV doesn't have any column values with embedded newlines.)
SMALL_CHUNK = 100000

def write_chunk(source_file, start, end, dest_name):
    # Copy the byte range [start, end) from the already-open source file
    # into dest_name, reading in small pieces so the chunk is never held
    # in memory all at once.
    pos = start
    source_file.seek(pos)
    with open(dest_name, 'w') as dest_file:
        for chunk_start in range(start, end, SMALL_CHUNK):
            chunk_end = min(chunk_start + SMALL_CHUNK, end)
            dest_file.write(source_file.read(chunk_end - chunk_start))
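The snippet above assumes the start and end offsets already fall on row boundaries. A minimal sketch of the boundary-finding loop the answer describes (seek to an approximate offset, scan forward to the next line break, then copy) might look like the following; the names next_line_start and split_file, the binary-mode file handling, and the byte-based target size are illustrative choices, not part of the original answer:

import os

def next_line_start(source_file, approx_offset):
    # Jump to an approximate byte offset, then scan forward past the next
    # newline so that every chunk starts at the beginning of a row.
    source_file.seek(approx_offset)
    source_file.readline()
    return source_file.tell()

def split_file(source_name, dest_pattern, target_bytes, small_chunk=100000):
    total = os.path.getsize(source_name)
    with open(source_name, 'rb') as source_file:
        start, count = 0, 1
        while start < total:
            end = min(start + target_bytes, total)
            if end < total:
                end = next_line_start(source_file, end)
            # Copy bytes [start, end) into the next output file in small pieces.
            source_file.seek(start)
            with open(dest_pattern % count, 'wb') as dest_file:
                for chunk_start in range(start, end, small_chunk):
                    chunk_end = min(chunk_start + small_chunk, end)
                    dest_file.write(source_file.read(chunk_end - chunk_start))
            start = end
            count += 1

# For roughly 100,000,000 rows of 55-60 bytes, the target is about 5.5-6 GB per file:
# split_file('/path/to/really/big.csv',
#            '/output/to/this/directory/file_%s.csv', 6 * 10**9)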
Actually, an intermediate solution could be to use the csv module - that would still parse all of the lines in the file, which isn't strictly necessary, but would avoid reading huge arrays into memory for each chunk.
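A minimal sketch of that intermediate csv-based approach, assuming the helper name split_with_csv and the 100,000,000-row default (taken from the question, not the answer) are purely illustrative:

import csv

def split_with_csv(source_name, dest_pattern, rows_per_file=100000000):
    # Stream rows one at a time; only the current row is ever held in memory.
    with open(source_name, 'r', newline='') as source_file:
        reader = csv.reader(source_file)
        writer = None
        for row_number, row in enumerate(reader):
            if row_number % rows_per_file == 0:
                if writer is not None:
                    dest_file.close()
                file_count = row_number // rows_per_file + 1
                dest_file = open(dest_pattern % file_count, 'w', newline='')
                writer = csv.writer(dest_file)
            writer.writerow(row)
        if writer is not None:
            dest_file.close()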
Answered by karakfa
There is an existing tool for this in Unix/Linux.
split -l 100000 -d source destination
This will add a two-digit numeric suffix to the destination prefix for each chunk.
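For the row count in the question, the invocation and resulting file names might look like this (the paths and the file_ prefix are illustrative, not from the answer):

split -l 100000000 -d /path/to/really/big.csv /output/to/this/directory/file_
# produces file_00, file_01, file_02, ... each with 100,000,000 lines
# (the last file holds whatever rows remain)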

