Efficient merging of many huge CSV files with pandas
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/38799704/
Efficient merge for many huge csv files
Asked by Stonecraft
I have a script that takes all the CSV files in a directory and merges them side-by-side, using an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear as to whether or how it could be applied to this situation. The other alternative I can think of is to install MySQL, learn the basics, and do this there. Obviously I'd rather do this in Python if possible because I'm already learning it. An R-based solution would also be acceptable.
Here is my code:
import os
import glob
import pandas as pd

os.chdir(r"\path\containing\files")  # raw string so the backslashes aren't treated as escape sequences
files = glob.glob("*.csv")

# Start from the first file, then outer-join each remaining file onto it
sdf = pd.read_csv(files[0], sep=',')
for filename in files[1:]:
    df = pd.read_csv(filename, sep=',')
    sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])
Any advice for how to do this with files too big for my computer's memory would be greatly appreciated.
Answered by Kartik
Use HDF5; in my opinion it would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.
import os
import glob
import pandas as pd

os.chdir(r"\path\containing\files")
files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    # complevel/complib compress the final file at level 5 with blosc.
    # You can drop that or change it as per your needs.
    for filename in files:
        # data_columns stores Factor1/Factor2 as queryable columns so they
        # can be indexed and filtered on later
        store.append('table_name', pd.read_csv(filename, sep=','),
                     index=False, data_columns=['Factor1', 'Factor2'])
    # Then create the indexes, if you need them
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                             optlevel=9, kind='full')
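Note that pd.read_csv(filename, sep=',') in the loop above still loads each whole file into memory before it is appended, which can itself fail on 6-12 GB inputs. A minimal sketch of a more incremental variant, assuming the same hdf_path and 'table_name' as above, an arbitrary chunk size of one million rows, and a placeholder filter value 'some_value', would stream each CSV in chunks and then query the result out-of-core:

import os
import glob
import pandas as pd

os.chdir(r"\path\containing\files")
files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for filename in files:
        # chunksize makes read_csv return an iterator of DataFrames, so only
        # about one million rows of each file are in memory at any time
        for chunk in pd.read_csv(filename, sep=',', chunksize=1_000_000):
            # (for string columns you may need min_itemsize in append() to
            # reserve enough column width across chunks)
            store.append('table_name', chunk, index=False,
                         data_columns=['Factor1', 'Factor2'])
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                             optlevel=9, kind='full')

# Out-of-core query: only rows matching the condition are read back into memory
with pd.HDFStore(hdf_path, mode='r') as store:
    subset = store.select('table_name', where='Factor1 == "some_value"')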
Answered by Mike Graham
Answered by Shawn K
You should be able to do this with Python, but I don't think reading the CSVs all at once will be the most efficient use of your memory.
How to read a CSV file from a stream and process each line as it is written?
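The linked question is about streaming a file and handling one row at a time instead of materializing the whole thing. As a rough illustration of that idea (the file name 'huge_file.csv' is just a placeholder), the standard-library csv module keeps memory use flat regardless of file size:

import csv

row_count = 0
with open('huge_file.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Each row is a dict keyed by the header, e.g. row['Factor1'];
        # only one row is held in memory at a time
        row_count += 1

print(row_count)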