Serialization of a pandas DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/16971803/
Serialization of a pandas DataFrame
Asked by James Bond
Is there a fast way to do serialization of a DataFrame?
I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.
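One way to do that aggregation step, once each job's result has been loaded back into memory, is pd.concat (a minimal sketch with placeholder per-job results):

```python
import pandas as pd

# Placeholder results standing in for what the grid jobs would return.
job_results = [
    pd.DataFrame({"job_id": [i], "score": [i * 0.1]})
    for i in range(3)
]

# Stack the per-job frames into one giant DataFrame.
combined = pd.concat(job_results, ignore_index=True)
```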
How can I save a DataFrame in a binary format that can be loaded rapidly?
Answered by Andy Hayden
The easiest way is just to use to_pickle (as a pickle); see pickling on the docs API page:
df.to_pickle(file_name)
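A minimal round trip, assuming file_name points to a writable path:

```python
import pandas as pd

# Stand-in for a real grid-job result.
df = pd.DataFrame({"job_id": [1, 2], "score": [0.9, 0.7]})

file_name = "result.pkl"              # hypothetical path
df.to_pickle(file_name)               # serialize to a binary pickle file
restored = pd.read_pickle(file_name)  # load it back

assert restored.equals(df)
```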
Another option is to use HDF5, slightly more work to get started but much richer for querying.
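For example, a minimal HDF5 sketch (the path, key, and column names are arbitrary; the optional PyTables dependency must be installed):

```python
import pandas as pd

df = pd.DataFrame({"job_id": range(5), "score": [0.1, 0.5, 0.9, 0.3, 0.7]})

# "table" format plus data_columns makes the score column queryable on read.
df.to_hdf("results.h5", key="results", mode="w", format="table",
          data_columns=["score"])

all_rows = pd.read_hdf("results.h5", key="results")
good = pd.read_hdf("results.h5", key="results", where="score > 0.5")
```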
Answered by osa
DataFrame.to_msgpack is experimental and not without some issues, e.g. with Unicode, but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still way faster than unpickling.
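For illustration, a minimal sketch assuming an older pandas release where the msgpack format still exists (to_msgpack was deprecated in pandas 0.25 and removed in 1.0):

```python
import pandas as pd

df = pd.DataFrame({"job_id": range(3), "score": [0.2, 0.4, 0.6]})

# Only works on pandas < 1.0; later releases removed the msgpack format.
df.to_msgpack("results.msg")
restored = pd.read_msgpack("results.msg")
```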
Answered by Achim
Have you timed the available IO functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
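A rough timing harness along these lines can settle it for your own data (a minimal sketch; the frame size and file names are arbitrary, and to_hdf needs PyTables installed):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

timed("to_pickle", lambda: df.to_pickle("bench.pkl"))
timed("to_hdf", lambda: df.to_hdf("bench.h5", key="df", mode="w"))
timed("read_pickle", lambda: pd.read_pickle("bench.pkl"))
timed("read_hdf", lambda: pd.read_hdf("bench.h5", key="df"))
```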

