Efficient merging of many huge CSV files with pandas
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/38799704/
Efficient merge for many huge csv files
Asked by Stonecraft
I have a script that takes all the CSV files in a directory and merges them side-by-side, using an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear as to whether or how it could be applied to this situation. The other alternative I can think of is to install MySQL, learn the basics, and do this there. Obviously I'd rather do this in Python if possible because I'm already learning it. An R-based solution would also be acceptable.
Here is my code:
import os
import glob
import pandas as pd

os.chdir(r"\path\containing\files")  # raw string so the backslashes aren't treated as escape sequences
files = glob.glob("*.csv")

# Start from the first file, then outer-join each remaining file onto it
sdf = pd.read_csv(files[0], sep=',')
for filename in files[1:]:
    df = pd.read_csv(filename, sep=',')
    sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])
Any advice for how to do this with files too big for my computer's memory would be greatly appreciated.
Answered by Kartik
Use HDF5; in my opinion it would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.
import os
import glob
import pandas as pd

os.chdir(r"\path\containing\files")
files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    # complevel/complib compress the final file at level 5 with blosc.
    # You can drop that or change it as per your needs.
    for filename in files:
        # data_columns stores Factor1/Factor2 as queryable columns so they
        # can be indexed and filtered on later
        store.append('table_name', pd.read_csv(filename, sep=','),
                     index=False, data_columns=['Factor1', 'Factor2'])
    # Then create the indexes, if you need them
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                             optlevel=9, kind='full')
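Note that pd.read_csv(filename, sep=',') in the loop above still loads each whole file into memory before it is appended, which can itself fail on 6-12 GB inputs. A minimal sketch of a more incremental variant, assuming the same hdf_path and 'table_name' as above, an arbitrary chunk size of one million rows, and a placeholder filter value 'some_value', would stream each CSV in chunks and then query the result out-of-core:

import os
import glob
import pandas as pd

os.chdir(r"\path\containing\files")
files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for filename in files:
        # chunksize makes read_csv return an iterator of DataFrames, so only
        # about one million rows of each file are in memory at any time
        for chunk in pd.read_csv(filename, sep=',', chunksize=1_000_000):
            # (for string columns you may need min_itemsize in append() to
            # reserve enough column width across chunks)
            store.append('table_name', chunk, index=False,
                         data_columns=['Factor1', 'Factor2'])
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                             optlevel=9, kind='full')

# Out-of-core query: only rows matching the condition are read back into memory
with pd.HDFStore(hdf_path, mode='r') as store:
    subset = store.select('table_name', where='Factor1 == "some_value"')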
Answered by Mike Graham
Answered by Shawn K
You should be able to do this with Python, but I don't think reading the CSVs all at once will be the most efficient use of your memory.
How to read a CSV file from a stream and process each line as it is written?
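The linked question is about streaming a file and handling one row at a time instead of materializing the whole thing. As a rough illustration of that idea (the file name 'huge_file.csv' is just a placeholder), the standard-library csv module keeps memory use flat regardless of file size:

import csv

row_count = 0
with open('huge_file.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Each row is a dict keyed by the header, e.g. row['Factor1'];
        # only one row is held in memory at a time
        row_count += 1

print(row_count)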