Original question: http://stackoverflow.com/questions/13033270/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
How to save a pandas dataframe in gzipped format directly?
Asked by Curious2learn
I have a pandas data frame, called df.
I want to save this in a gzipped format. One way to do this is the following:
import gzip
import pandas

# df.save() in the original post; later pandas versions use df.to_pickle()
df.to_pickle('filename.pickle')
f_in = open('filename.pickle', 'rb')
f_out = gzip.open('filename.pickle.gz', 'wb')
f_out.writelines(f_in)
f_in.close()
f_out.close()
However, this requires me to first create a file called filename.pickle.
Is there a way to do this more directly, i.e., without creating the filename.pickle?
When I want to load a dataframe that has been gzipped, I have to go through the same step of creating filename.pickle. For example, to read a file filename2.pickle.gz, which is a gzipped pandas dataframe, I know of the following method:
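f_in = gzip.open('filename2.pickle.gz', 'rb')
# Write the decompressed bytes with plain open(), not gzip.open()
f_out = open('filename2.pickle', 'wb')
f_out.writelines(f_in)
f_in.close()
f_out.close()
# pandas.load() in the original post; later pandas versions use pandas.read_pickle()
df2 = pandas.read_pickle('filename2.pickle')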
Can this be done without creating filename2.pickle first?
Accepted answer by Wes McKinney
We plan to add better serialization with compression eventually. Stay tuned to pandas development.
Answer by Seanny123
Better serialization with compression was added in pandas 0.20.0. Here is an example of how it can be used:
df.to_csv("my_file.gz", compression="gzip")
For more information, such as different forms of compression available, check out the docs.
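Since the question is about pickled data frames rather than CSV, the same compression option can be sketched with to_pickle and read_pickle, which accept a compression argument from pandas 0.20.0 onward. The example data and the file name frame.pickle.gz below are made up for illustration; this is a minimal sketch, not the only way to do it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data; any DataFrame would do.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pickle.gz")
    # Write the pickle through gzip in one step, with no intermediate file.
    df.to_pickle(path, compression="gzip")
    # Read it back, decompressing on the fly.
    df2 = pd.read_pickle(path, compression="gzip")

# True when the round trip preserved values, dtypes, and index.
round_trip_ok = df.equals(df2)
```

The compression argument is passed explicitly on both sides here so the sketch does not rely on the file extension being recognized.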
Answer by Mark Adler
For some reason, the Python zlib module has the ability to decompress gzip data, but it does not have the ability to directly compress to that format, at least as far as what is documented. This is despite the remarkably misleading documentation page header "Compression compatible with gzip".
You can compress to the zlib format instead using zlib.compress or zlib.compressobj, and then strip the zlib header and trailer and add a gzip header and trailer, since both the zlib and gzip formats use the same compressed data format. This will give you data in the gzip format. The zlib header is fixed at two bytes and the trailer at four bytes, so those are easy to strip. Then you can prepend a basic gzip header of ten bytes: "\x1f\x8b\x08\0\0\0\0\0\0\xff" (C string format) and append a four-byte CRC in little-endian order, followed by the four-byte uncompressed length (modulo 2^32), also in little-endian order. The CRC can be computed using zlib.crc32.
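The header-swapping procedure described in this answer can be sketched roughly as follows. The function name zlib_to_gzip is hypothetical. One assumption goes beyond a bare CRC: the gzip trailer defined in RFC 1952 holds the CRC-32 followed by the uncompressed length modulo 2^32, so the sketch appends both.

```python
import gzip
import struct
import zlib

# Basic ten-byte gzip header: magic bytes, deflate method, no flags,
# zero mtime, no extra flags, unknown OS.
GZIP_HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"


def zlib_to_gzip(data: bytes, level: int = 6) -> bytes:
    """Compress data with zlib, then rewrap the deflate stream as gzip."""
    zlib_stream = zlib.compress(data, level)
    # Strip the two-byte zlib header and the four-byte Adler-32 trailer.
    raw_deflate = zlib_stream[2:-4]
    crc = zlib.crc32(data) & 0xFFFFFFFF
    size = len(data) & 0xFFFFFFFF  # uncompressed length modulo 2**32
    # gzip trailer: CRC-32, then length, both little-endian.
    return GZIP_HEADER + raw_deflate + struct.pack("<II", crc, size)


# The standard gzip module should accept the result as a valid gzip stream.
blob = zlib_to_gzip(b"example payload " * 10)
restored = gzip.decompress(blob)
```

In practice the gzip module does this wrapping for you; the sketch is only meant to show that the two formats share the same deflate payload.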
Answer by Viacheslav Nefedov
You can dump the dataframe to a byte string using pickle.dumps and then write it to disk with the gzip module:
import gzip
import pickle

# The third positional argument of GzipFile is compresslevel
with gzip.GzipFile('filename.pickle.gz', 'wb', 3) as file:
    file.write(pickle.dumps(df))
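Reading the file back works the same way in reverse, which covers the second half of the question. The helper names save_gzipped_pickle and load_gzipped_pickle are hypothetical, and a plain dict stands in for the dataframe so the sketch runs with the standard library alone; a DataFrame should work the same way, assuming it pickles normally.

```python
import gzip
import os
import pickle
import tempfile


def save_gzipped_pickle(obj, path, compresslevel=3):
    """Pickle obj straight into a gzip file, with no intermediate file."""
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gzipped_pickle(path):
    """Unpickle an object directly from a gzip file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a DataFrame; any picklable object behaves the same here.
record = {"a": [1, 2, 3], "b": ["x", "y", "z"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.pickle.gz")
    save_gzipped_pickle(record, path)
    restored = load_gzipped_pickle(path)
```

As with any pickle, only load files from a source you trust, since unpickling can execute arbitrary code.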

