Writing a formatted binary file from a Pandas DataFrame

Question

Asked by jbssm

I've seen some ways to read a formatted binary file into Pandas from Python. Specifically, I'm using the code below, which reads the file with NumPy's fromfile and a structure described by a dtype.

import numpy as np
import pandas as pd

input_file_name = 'test.hst'

# 96-byte file header: 4 + 64 + 12 + 4 + 4 + 4 + 4 bytes
dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

# One 44-byte record: a 4-byte integer followed by five 8-byte floats
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])

with open(input_file_name, 'rb') as input_file:
    # np.frombuffer replaces the deprecated binary mode of np.fromstring
    header = np.frombuffer(input_file.read(96), dtype=dt_header)
    # Everything after the header is a sequence of records
    records = np.fromfile(input_file, dtype=dt_records)

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file

Now, my issue is how to write this back to a new file. I can't find any function in NumPy (or in Pandas) that lets me specify exactly which bytes to use for each field when writing.

Answer 1

Answered by ebarr

It isn't clear to me whether the DataFrame is a view or a copy, but assuming it is a copy, you can use the to_records method of the DataFrame.

This gives you back a record array that you can then put to disk using tofile.

For example:

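df_records = pd.DataFrame(records)
# do some stuff
# index=False keeps the DataFrame index out of the written records
new_recarray = df_records.to_records(index=False)
# tofile writes raw bytes with no header, so this is not a .npy file
new_recarray.tofile("myfile.bin")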

The data will reside in memory as packed bytes with the format described by the recarray dtype.
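
To round out the answer above, here is a minimal sketch of a full write-back that also restores the header. The helper name write_records, the sample values, and the all-zero placeholder header are assumptions made for illustration; only the record dtype comes from the question. Each column is cast back to its on-disk type before writing, in case an edit changed a column's dtype.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Record layout copied from the question.
dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])


def write_records(df, header_bytes, path, dtype=dt_records):
    """Hypothetical helper: write header bytes, then df rows packed as dtype."""
    # index=False keeps the DataFrame index out of the record layout.
    rec = df[list(dtype.names)].to_records(index=False)
    # Copy field by field so every column returns to its on-disk type.
    packed = np.empty(len(rec), dtype=dtype)
    for name in dtype.names:
        packed[name] = rec[name]
    with open(path, 'wb') as out:
        out.write(header_bytes)
        out.flush()          # make sure the header is on disk before tofile
        packed.tofile(out)   # raw bytes, laid out exactly as dtype describes
    return packed


# Usage with made-up sample data and a placeholder 96-byte header.
sample = np.zeros(3, dtype=dt_records)
sample['ctm'] = [1, 2, 3]
sample['close'] = [1.5, 2.5, 3.5]
df_records = pd.DataFrame(sample)
df_records['close'] = df_records['close'] * 2   # edit some values

header_bytes = bytes(96)
path = os.path.join(tempfile.mkdtemp(), 'out.hst')
write_records(df_records, header_bytes, path)

# Read the records back the same way the question does.
with open(path, 'rb') as f:
    f.seek(96)
    reread = np.fromfile(f, dtype=dt_records)
```

In real use the header bytes would be the 96 bytes read from the original file rather than zeros. The file size should come out as 96 plus 44 bytes per record, which is a quick way to check that no index column or padding slipped in.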

Answer 2

Answered by JosiahYoder-deactive except..

Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best for quick file storage where you do not expect the file to be used on a different machine where the data may have a different endianness (big-/little-endian).
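
If the raw layout has to be kept, one way to reduce the endianness concern is to pin the byte order in the dtype itself. The sketch below is an illustration under that assumption, using a shortened two-field record rather than the full layout from the question. A type code such as 'i4' with no prefix means native byte order, while '<' and '>' request little-endian and big-endian explicitly.

```python
import numpy as np

# '<' pins each field to little-endian; '>' pins it to big-endian.
dt_le = np.dtype([('ctm', '<i4'), ('close', '<f8')])
dt_be = np.dtype([('ctm', '>i4'), ('close', '>f8')])

rec = np.array([(1, 2.5)], dtype=dt_le)

# Same values, two different byte layouts.
le_bytes = rec.tobytes()
be_bytes = rec.astype(dt_be).tobytes()

# Reading with the matching dtype recovers the values on any machine.
back = np.frombuffer(be_bytes, dtype=dt_be)
```

With the byte order fixed in the dtype, the bytes produced by tofile or tobytes no longer depend on the machine that writes them, as long as the reader uses the same dtype.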

Format Type Data Description     Reader         Writer
text        CSV                  read_csv       to_csv
text        JSON                 read_json      to_json
text        HTML                 read_html      to_html
text        Local clipboard      read_clipboard to_clipboard
binary      MS Excel             read_excel     to_excel
binary      HDF5 Format          read_hdf       to_hdf
binary      Feather Format       read_feather   to_feather
binary      Parquet Format       read_parquet   to_parquet
binary      Msgpack              read_msgpack   to_msgpack   (removed in pandas 1.0)
binary      Stata                read_stata     to_stata
binary      SAS                  read_sas    
binary      Python Pickle Format read_pickle    to_pickle
SQL         SQL                  read_sql       to_sql
SQL         Google Big Query     read_gbq       to_gbq

I'm currently using HDF5, but if I were on Amazon, I would be using Parquet.

Example of using to_hdf:

# df is an existing DataFrame; HDF5 support needs the PyTables package
df.to_hdf('tmp.hdf', key='df', mode='w')
df2 = pd.read_hdf('tmp.hdf', key='df')

However, the HDF5 format may not be best for long-term archival, since it is fairly complex. It has a 150-page specification, and only one 300,000-line C implementation.
