Original question: http://stackoverflow.com/questions/26348095/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Writing a formatted binary file from a Pandas DataFrame
Asked by jbssm
I've seen some ways to read a formatted binary file into Pandas from Python. Specifically, I'm using the code below, which reads the file with NumPy's fromfile, using a structure described by a dtype.
import numpy as np
import pandas as pd

input_file_name = 'test.hst'

# Layout of the 96-byte file header
dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

# Layout of each 44-byte record that follows the header
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])

with open(input_file_name, 'rb') as input_file:
    # np.frombuffer replaces the deprecated np.fromstring for binary data
    header = np.frombuffer(input_file.read(96), dt_header)
    records = np.fromfile(input_file, dt_records)

df_records = pd.DataFrame(records)

# Now, do some changes in the individual values of df_records
# and then write it back to a binary file
Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or in Pandas) that lets me specify exactly which bytes to use for each field when writing.
Answered by ebarr
It isn't clear to me whether the DataFrame is a view or a copy, but assuming it is a copy, you can use the to_records method of the DataFrame.
This gives you back a record array that you can then put to disk using tofile.
For example:
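df_records = pd.DataFrame(records)
# do some stuff
# index=False keeps the DataFrame index from being written as an extra field
new_recarray = df_records.to_records(index=False)
# tofile writes raw bytes only (not NumPy's .npy format, despite the extension)
new_recarray.tofile("myfile.npy")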
The data will reside in memory as packed bytes with the format described by the recarray dtype.
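As a rough sketch of the full round trip, the snippet below copies the DataFrame columns into a structured array with an explicit dtype, so each field is written with exactly the byte width the format expects even if an edit changed a column's type along the way. The helper frame_to_records, the sample values, the all-zero placeholder header, and the output file name are assumptions made for illustration; they are not part of the original answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Explicit little-endian ('<') field types mirroring the question's record layout.
dt_records = np.dtype([('ctm', '<i4'), ('open', '<f8'), ('low', '<f8'),
                       ('high', '<f8'), ('close', '<f8'), ('volume', '<f8')])


def frame_to_records(df, dtype):
    """Copy DataFrame columns into a packed structured array of the given dtype."""
    out = np.empty(len(df), dtype=dtype)
    for name in dtype.names:
        out[name] = df[name].to_numpy()  # cast to the field's exact type
    return out


# Hypothetical sample data standing in for the contents of test.hst.
records = np.zeros(3, dtype=dt_records)
records['ctm'] = [1, 2, 3]
records['close'] = [1.5, 2.5, 3.5]
header_bytes = bytes(96)  # placeholder for the 96-byte header read earlier

df_records = pd.DataFrame(records)
df_records['close'] = df_records['close'] * 2  # some change to the values

out_path = os.path.join(tempfile.mkdtemp(), 'out.hst')
with open(out_path, 'wb') as f:
    f.write(header_bytes)                                # header first
    frame_to_records(df_records, dt_records).tofile(f)   # 44 bytes per record

# Read the file back the same way the question does, to check the layout.
with open(out_path, 'rb') as f:
    f.read(96)
    round_trip = np.fromfile(f, dt_records)
```

With a real file, the header bytes read at the start would be written back in place of the placeholder, for example via header.tobytes().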
Answered by JosiahYoder-deactive except..
Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best for quick file storage where you do not expect the file to be used on a different machine, where the data may have a different endianness (big-/little-endian).
Format Type   Data Description       Reader           Writer
text          CSV                    read_csv         to_csv
text          JSON                   read_json        to_json
text          HTML                   read_html        to_html
text          Local clipboard        read_clipboard   to_clipboard
binary        MS Excel               read_excel       to_excel
binary        HDF5 Format            read_hdf         to_hdf
binary        Feather Format         read_feather     to_feather
binary        Parquet Format         read_parquet     to_parquet
binary        Msgpack                read_msgpack     to_msgpack     (both removed in pandas 1.0)
binary        Stata                  read_stata       to_stata
binary        SAS                    read_sas         (none)
binary        Python Pickle Format   read_pickle      to_pickle
SQL           SQL                    read_sql         to_sql
SQL           Google Big Query       read_gbq         to_gbq
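To make the endianness concern with tofile() concrete, here is a small sketch (not from the original answer) showing how the same integer is laid out under each byte order, and what happens when the bytes are read back with the wrong one. Spelling out the byte order in the dtype ('<' for little-endian, '>' for big-endian) is one way to keep a raw file readable across machines, since tofile() writes the array's bytes exactly as the dtype describes them.

```python
import numpy as np

value = np.array([1], dtype='<i4')    # explicitly little-endian
little = value.tobytes()
big = value.astype('>i4').tobytes()   # same number, big-endian layout

# Reading big-endian bytes with the wrong byte order yields a different number.
misread = int(np.frombuffer(big, dtype='<i4')[0])
correct = int(np.frombuffer(big, dtype='>i4')[0])

print(little, big, misread, correct)
```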
I'm currently using HDF5, but if I were on Amazon, I would be using parquet.
Example of using to_hdf:
df.to_hdf('tmp.hdf', key='df', mode='w')
df2 = pd.read_hdf('tmp.hdf', key='df')
However, the HDF5 format may not be the best choice for long-term archival, since it is fairly complex: it has a 150-page specification and only one implementation, roughly 300,000 lines of C.

