Python: Reading Binary Data into Pandas
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16573089/
Reading binary data into pandas
Asked by kasperhj
I have some binary data and I was wondering how I can load it into pandas.
Can I somehow load it while specifying the format it is in and what the individual columns are called?
Edit:
The format is
int, int, int, float, int, int[256]
Each comma-separated entry represents a column in the data, i.e. the last 256 integers form a single column.
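For concreteness, a tiny file in this layout could be generated with something like the following sketch (the file name sample.bin and the record values are made up for illustration):

import struct

# One record: int, int, int, float, int, followed by 256 ints
entry_format = 'iiifi256i'

with open('sample.bin', 'wb') as out:
    for i in range(3):  # write three made-up records
        out.write(struct.pack(entry_format, i, i + 1, i + 2, 1.5 * i, i * 10, *range(256)))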
Accepted answer by mowen
Even though this is an old question, I was wondering the same thing and I didn't see a solution I liked.
When reading binary data with Python I have found numpy.fromfile or numpy.fromstring to be much faster than using the Python struct module. Binary data with mixed types can be efficiently read into a numpy array using the methods above, as long as the data format is constant and can be described with a numpy data type object (numpy.dtype).
import numpy as np
import pandas as pd
# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'),
               ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame(data)
# Or if you want to explicitly set the column names
df = pd.DataFrame(data, columns=data.dtype.names)
Edits:
- Removed unnecessary conversion of data.to_list(). Thanks fxx
- Added example of leaving off the columns argument
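Since the 'f' field then holds a (256,)-shaped sub-array per row, a possible follow-up (a sketch, assuming the data array and pd import from the snippet above; the f_0 … f_255 column names are made up) is to split it into its own wide DataFrame:

# Scalar columns a-e in one frame, the 256-int block as 256 separate columns in another
scalars = pd.DataFrame({name: data[name] for name in ('a', 'b', 'c', 'd', 'e')})
f_cols = pd.DataFrame(data['f'], columns=['f_%d' % i for i in range(256)])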
Answered by Brian Cain
Here's something to get you started.
import os
from struct import unpack, calcsize
from pandas import DataFrame

entry_format = 'iiifi256i'  # int, int, int, float, int, int[256]
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)

records = []
with open(input_filename, mode='rb') as f:
    entry_count = os.fstat(f.fileno()).st_size // entry_size
    for i in range(entry_count):
        record = f.read(entry_size)
        entry = unpack(entry_format, record)
        # Keep columns a-e as scalars and collect the trailing 256 ints into column f
        records.append(dict(zip(field_names, entry[:5] + (list(entry[5:]),))))
df = DataFrame(records, columns=field_names)
Answered by Albert-Jan
The following uses a compiled struct, which is a lot faster than the plain (uncompiled) struct functions. An alternative is to use np.fromstring or np.fromfile, as mentioned above.
import struct, ctypes, os
import numpy as np, pandas as pd

mystruct = struct.Struct('iiifi256i')
buff = ctypes.create_string_buffer(mystruct.size)
dtype = 'i4,i4,i4,f4,i4,(256,)i4'

with open(input_filename, mode='rb') as f:
    nrows = os.fstat(f.fileno()).st_size // mystruct.size
    array = np.empty((nrows,), dtype=dtype)
    for row in range(nrows):
        buff.raw = f.read(mystruct.size)
        record = mystruct.unpack_from(buff, 0)
        # Group the trailing 256 ints into the last field of the structured row
        array[row] = record[:5] + (record[5:],)
        #array[row] = np.fromstring(buff.raw, dtype=dtype)[0]
df = pd.DataFrame(array)
see also http://pymotw.com/2/struct/
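On Python 3, a similar precompiled path is struct.iter_unpack, which walks a whole buffer of packed records without manual buffer management; a minimal sketch, assuming the same input_filename and record layout as above:

import struct
import pandas as pd

with open(input_filename, mode='rb') as f:
    raw = f.read()  # buffer length must be a multiple of the record size

# Each record is a 261-tuple; keep the first five values and group the 256 ints
rows = [rec[:5] + (list(rec[5:]),) for rec in struct.iter_unpack('iiifi256i', raw)]
df = pd.DataFrame(rows, columns=['a', 'b', 'c', 'd', 'e', 'f'])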
Answered by NicoBernard
Recently I was confronted with a similar problem, though with a much bigger structure. I think I found an improvement on mowen's answer using the utility method DataFrame.from_records. In the example above, this would give:
import numpy as np
import pandas as pd
# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'), ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame.from_records(data)
In my case, it significantly sped up the process. I assume the improvement comes from not having to create an intermediate Python list, but rather from creating the DataFrame directly from the NumPy structured array.
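For files too large to read in one go, a related sketch (assuming the same dt dtype and file path as above) is to memory-map the structured records and build the frame from the map:

# Memory-map the binary file with the structured dtype, then build the DataFrame lazily
mm = np.memmap(file, dtype=dt, mode='r')
df = pd.DataFrame.from_records(mm)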

