Efficiently writing large Pandas data frames to disk

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19639596/



Tags: python, pandas

Asked by user2928791

I am trying to find the best way to efficiently read and write large data frames (250MB+) to and from disk using Python/Pandas. I've tried all of the methods in Python for Data Analysis, but the performance has been very disappointing.


This is part of a larger project exploring migrating our current analytic/data management environment from Stata to Python. When I compare the read/write times in my tests to those that I get with Stata, Python and Pandas are typically taking more than 20 times as long.

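For reference, a comparison like this can be made concrete with a small timing harness along the following lines. This is a sketch, not the asker's original test: the frame shape, file names, and the CSV round trip are illustrative assumptions.

```python
import time

import numpy as np
import pandas as pd

# Illustrative frame of roughly the size in question (the shape is an assumption).
df = pd.DataFrame(np.random.randn(2_000_000, 15))

def time_io(label, write, read):
    """Time one write and one read of the same file, in seconds."""
    t0 = time.perf_counter()
    write()
    t1 = time.perf_counter()
    read()
    t2 = time.perf_counter()
    print(f"{label}: write {t1 - t0:.2f}s, read {t2 - t1:.2f}s")

# CSV is the usual baseline, and tends to be slow at this size.
time_io("csv",
        lambda: df.to_csv("test.csv"),
        lambda: pd.read_csv("test.csv", index_col=0))
```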

I strongly suspect that I am the problem, not Python or Pandas.


Any suggestions?


Answered by Jeff

Using HDFStore is your best bet (it is not covered in much depth in the book, and has changed quite a lot). You will find the performance is MUCH better than any other serialization method.

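For illustration, here is a minimal sketch of round-tripping a frame through HDFStore. It assumes the PyTables package is installed (e.g. `pip install tables`); the file name and key are placeholders.

```python
import numpy as np
import pandas as pd

# Stand-in for the real 250MB+ data.
df = pd.DataFrame(np.random.randn(2_000_000, 15))

# Write: format="fixed" (the default) is the fastest for whole-frame I/O;
# format="table" is slower but supports appending and on-disk queries.
with pd.HDFStore("test.h5", mode="w") as store:
    store.put("df", df, format="fixed")

# Read the whole frame back.
with pd.HDFStore("test.h5", mode="r") as store:
    df2 = store["df"]
```

The same round trip is also available through the `DataFrame.to_hdf` and `pd.read_hdf` convenience wrappers.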