pandas: How can I save a pandas DataFrame directly in gzipped format?

Question

Asked by Curious2learn

I have a pandas data frame, called df.

I want to save this in a gzipped format. One way to do this is the following:

import gzip
import shutil
import pandas

# df.save() was removed from later pandas releases; to_pickle() replaces it
df.to_pickle('filename.pickle')
with open('filename.pickle', 'rb') as f_in, gzip.open('filename.pickle.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

However, this requires me to first create a file called filename.pickle. Is there a way to do this more directly, i.e., without creating the filename.pickle?

When I want to load a dataframe that has been gzipped, I have to go through the same step of creating filename.pickle. For example, to read the file filename2.pickle.gz, which is a gzipped pandas dataframe, I know of the following method:

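import gzip
import shutil
import pandas

# Decompress to a plain file: the output must be opened with open(), not gzip.open()
with gzip.open('filename2.pickle.gz', 'rb') as f_in, open('filename2.pickle', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

df2 = pandas.read_pickle('filename2.pickle')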

Can this be done without creating filename2.pickle first?

Answer 1

Accepted answer by Wes McKinney

We plan to add better serialization with compression eventually. Stay tuned to pandas development.

Answer 2

Answered by Seanny123

Better serialization with compression has since been added to pandas (starting in pandas 0.20.0). Here is an example of how it can be used:

df.to_csv("my_file.gz", compression="gzip")

For more information, such as different forms of compression available, check out the docs.
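
The same `compression` argument is also accepted by `to_pickle` and `read_pickle` (pandas 0.20.0 and later), which covers the pickle round trip from the question without an intermediate uncompressed file. A minimal sketch, assuming a small made-up frame and a temporary directory chosen only for illustration:

```python
import os
import tempfile

import pandas as pd

# Example frame; the column names and values are made up for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Hypothetical output location inside a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "filename.pickle.gz")

# Write the pickle straight into a gzip stream; no uncompressed file is created.
df.to_pickle(path, compression="gzip")

# Read it back, decompressing on the fly.
df2 = pd.read_pickle(path, compression="gzip")
```

Passing `compression="gzip"` explicitly avoids relying on the file extension to select the codec.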

Answer 3

Answered by Mark Adler

For some reason, the Python zlib module has the ability to decompress gzip data, but it does not have the ability to directly compress to that format, at least as far as what is documented. This is despite the remarkably misleading documentation page header "Compression compatible with gzip".

You can compress to the zlib format instead using zlib.compress or zlib.compressobj, and then strip the zlib header and trailer and add a gzip header and trailer, since both the zlib and gzip formats use the same compressed data format. This will give you data in the gzip format. The zlib header is fixed at two bytes and the trailer at four bytes, so those are easy to strip. Then you can prepend a basic gzip header of ten bytes, "\x1f\x8b\x08\0\0\0\0\0\0\xff" (C string format), and append a four-byte CRC in little-endian order, followed by the four-byte uncompressed length modulo 2^32 (also little-endian), which the gzip trailer requires as well. The CRC can be computed using zlib.crc32.
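
The rewrapping steps described above can be sketched in a few lines of Python. This is an illustration rather than a recommended production path: the function name `zlib_to_gzip` and the default compression level are assumptions made here, and the trailer includes the uncompressed length that the gzip format (RFC 1952) places after the CRC.

```python
import struct
import zlib


def zlib_to_gzip(data: bytes, level: int = 6) -> bytes:
    """Compress `data` with zlib and rewrap the result as a gzip member."""
    zlib_stream = zlib.compress(data, level)
    # Drop the 2-byte zlib header and the 4-byte Adler-32 trailer,
    # leaving only the raw deflate data.
    raw_deflate = zlib_stream[2:-4]
    # Minimal 10-byte gzip header: magic bytes, deflate method, no flags,
    # zero mtime, no extra flags, OS field "unknown" (0xff).
    header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
    # gzip trailer: CRC-32 of the uncompressed data, then its length
    # modulo 2**32, both little-endian.
    trailer = struct.pack(
        "<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF
    )
    return header + raw_deflate + trailer
```

The output can be checked by passing it to `gzip.decompress`, which should return the original bytes.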

Answer 4

Answered by Viacheslav Nefedov

You can serialize the dataframe to a byte string using pickle.dumps and then write it to disk through the gzip module:

import gzip
import pickle

# The third positional argument is the compression level (1-9)
with gzip.GzipFile('filename.pickle.gz', 'wb', 3) as file:
    file.write(pickle.dumps(df))
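
The loading half of the question can be handled the same way, by unpickling directly from a gzip stream. The sketch below is one possible arrangement; the helper names `save_gz_pickle` and `load_gz_pickle` are hypothetical, and `obj` can be a dataframe or any other picklable object. As with any pickle, only load files from a trusted source.

```python
import gzip
import pickle


def save_gz_pickle(obj, path, compresslevel=3):
    """Pickle `obj` directly into a gzip-compressed file."""
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gz_pickle(path):
    """Load an object from a gzip-compressed pickle (trusted files only)."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

With these helpers, `save_gz_pickle(df, 'filename.pickle.gz')` followed by `load_gz_pickle('filename.pickle.gz')` round-trips the object without ever writing an uncompressed filename.pickle.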
