Python: Reading Binary Data into Pandas
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16573089/
Reading binary data into pandas
Asked by kasperhj
I have some binary data and I was wondering how I can load it into pandas.
Can I somehow load it while specifying the format it is in and what the individual columns are called?
Edit:
The format is
int, int, int, float, int, int[256]
Each comma-separated entry represents a column in the data, i.e. the last 256 integers form a single column.
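For concreteness, a tiny file in this layout could be generated with something like the following sketch (the file name sample.bin and the record values are made up for illustration):

import struct

# One record: int, int, int, float, int, followed by 256 ints
entry_format = 'iiifi256i'

with open('sample.bin', 'wb') as out:
    for i in range(3):  # write three made-up records
        out.write(struct.pack(entry_format, i, i + 1, i + 2, 1.5 * i, i * 10, *range(256)))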
Accepted answer by mowen
Even though this is an old question, I was wondering the same thing and I didn't see a solution I liked.
When reading binary data with Python I have found numpy.fromfile or numpy.fromstring to be much faster than using the Python struct module. Binary data with mixed types can be efficiently read into a numpy array using the methods above, as long as the data format is constant and can be described with a numpy data type object (numpy.dtype).
import numpy as np
import pandas as pd
# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'),
               ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame(data)
# Or if you want to explicitly set the column names
df = pd.DataFrame(data, columns=data.dtype.names)
Edits:
- Removed unnecessary conversion of data.to_list(). Thanks fxx
- Added example of leaving off the columns argument
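Since the 'f' field then holds a (256,)-shaped sub-array per row, a possible follow-up (a sketch, assuming the data array and pd import from the snippet above; the f_0 … f_255 column names are made up) is to split it into its own wide DataFrame:

# Scalar columns a-e in one frame, the 256-int block as 256 separate columns in another
scalars = pd.DataFrame({name: data[name] for name in ('a', 'b', 'c', 'd', 'e')})
f_cols = pd.DataFrame(data['f'], columns=['f_%d' % i for i in range(256)])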
Answered by Brian Cain
Here's something to get you started.
import os
from struct import unpack, calcsize
from pandas import DataFrame

entry_format = 'iiifi256i'  # int, int, int, float, int, int[256]
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)

records = []
with open(input_filename, mode='rb') as f:
    entry_count = os.fstat(f.fileno()).st_size // entry_size
    for i in range(entry_count):
        record = f.read(entry_size)
        entry = unpack(entry_format, record)
        # Keep columns a-e as scalars and collect the trailing 256 ints into column f
        records.append(dict(zip(field_names, entry[:5] + (list(entry[5:]),))))
df = DataFrame(records, columns=field_names)
Answered by Albert-Jan
The following uses a compiled struct, which is a lot faster than the plain (uncompiled) struct functions. An alternative is to use np.fromstring or np.fromfile, as mentioned above.
import struct, ctypes, os
import numpy as np, pandas as pd

mystruct = struct.Struct('iiifi256i')
buff = ctypes.create_string_buffer(mystruct.size)
dtype = 'i4,i4,i4,f4,i4,(256,)i4'

with open(input_filename, mode='rb') as f:
    nrows = os.fstat(f.fileno()).st_size // mystruct.size
    array = np.empty((nrows,), dtype=dtype)
    for row in range(nrows):
        buff.raw = f.read(mystruct.size)
        record = mystruct.unpack_from(buff, 0)
        # Group the trailing 256 ints into the last field of the structured row
        array[row] = record[:5] + (record[5:],)
        #array[row] = np.fromstring(buff.raw, dtype=dtype)[0]
df = pd.DataFrame(array)
see also http://pymotw.com/2/struct/
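On Python 3, a similar precompiled path is struct.iter_unpack, which walks a whole buffer of packed records without manual buffer management; a minimal sketch, assuming the same input_filename and record layout as above:

import struct
import pandas as pd

with open(input_filename, mode='rb') as f:
    raw = f.read()  # buffer length must be a multiple of the record size

# Each record is a 261-tuple; keep the first five values and group the 256 ints
rows = [rec[:5] + (list(rec[5:]),) for rec in struct.iter_unpack('iiifi256i', raw)]
df = pd.DataFrame(rows, columns=['a', 'b', 'c', 'd', 'e', 'f'])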
Answered by NicoBernard
Recently I was confronted with a similar problem, though with a much bigger structure. I think I found an improvement on mowen's answer using the utility method DataFrame.from_records. In the example above, this would give:
import numpy as np
import pandas as pd
# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'), ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame.from_records(data)
In my case, it significantly sped up the process. I assume the improvement comes from not having to create an intermediate Python list, but rather from creating the DataFrame directly from the NumPy structured array.
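For files too large to read in one go, a related sketch (assuming the same dt dtype and file path as above) is to memory-map the structured records and build the frame from the map:

# Memory-map the binary file with the structured dtype, then build the DataFrame lazily
mm = np.memmap(file, dtype=dt, mode='r')
df = pd.DataFrame.from_records(mm)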

