Serialization of a pandas DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/16971803/
Serialization of a pandas DataFrame
Asked by James Bond
Is there a fast way to do serialization of a DataFrame?
I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.
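One way to do that aggregation step, once each job's result has been loaded back into memory, is pd.concat (a minimal sketch with placeholder per-job results):

```python
import pandas as pd

# Placeholder results standing in for what the grid jobs would return.
job_results = [
    pd.DataFrame({"job_id": [i], "score": [i * 0.1]})
    for i in range(3)
]

# Stack the per-job frames into one giant DataFrame.
combined = pd.concat(job_results, ignore_index=True)
```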
How can I save a DataFrame in a binary format that can be loaded rapidly?
Answered by Andy Hayden
The easiest way is just to use to_pickle (as a pickle); see pickling on the docs API page:
df.to_pickle(file_name)
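A minimal round trip, assuming file_name points to a writable path:

```python
import pandas as pd

# Stand-in for a real grid-job result.
df = pd.DataFrame({"job_id": [1, 2], "score": [0.9, 0.7]})

file_name = "result.pkl"              # hypothetical path
df.to_pickle(file_name)               # serialize to a binary pickle file
restored = pd.read_pickle(file_name)  # load it back

assert restored.equals(df)
```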
Another option is to use HDF5, slightly more work to get started but much richer for querying.
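For example, a minimal HDF5 sketch (the path, key, and column names are arbitrary; the optional PyTables dependency must be installed):

```python
import pandas as pd

df = pd.DataFrame({"job_id": range(5), "score": [0.1, 0.5, 0.9, 0.3, 0.7]})

# "table" format plus data_columns makes the score column queryable on read.
df.to_hdf("results.h5", key="results", mode="w", format="table",
          data_columns=["score"])

all_rows = pd.read_hdf("results.h5", key="results")
good = pd.read_hdf("results.h5", key="results", where="score > 0.5")
```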
Answered by osa
DataFrame.to_msgpack is experimental and not without some issues, e.g. with Unicode, but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still way faster than unpickling.
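For illustration, a minimal sketch assuming an older pandas release where the msgpack format still exists (to_msgpack was deprecated in pandas 0.25 and removed in 1.0):

```python
import pandas as pd

df = pd.DataFrame({"job_id": range(3), "score": [0.2, 0.4, 0.6]})

# Only works on pandas < 1.0; later releases removed the msgpack format.
df.to_msgpack("results.msg")
restored = pd.read_msgpack("results.msg")
```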
Answered by Achim
Have you timed the available IO functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
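A rough timing harness along these lines can settle it for your own data (a minimal sketch; the frame size and file names are arbitrary, and to_hdf needs PyTables installed):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

timed("to_pickle", lambda: df.to_pickle("bench.pkl"))
timed("to_hdf", lambda: df.to_hdf("bench.h5", key="df", mode="w"))
timed("read_pickle", lambda: pd.read_pickle("bench.pkl"))
timed("read_hdf", lambda: pd.read_hdf("bench.h5", key="df"))
```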

