Python: How to read part of a binary file with numpy?

Notice: This page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14245094/

Time: 2020-08-18 10:51:28  Source: igfitidea

How to read part of binary file with numpy?

python, numpy, scipy

Asked by brorfred

I'm converting a Matlab script to numpy, but I'm having some problems reading data from a binary file. Is there an equivalent to fseek when using fromfile to skip the beginning of the file? These are the kinds of extractions I need to do:

fid = fopen(fname);
fseek(fid, 8, 'bof');
second = fread(fid, 1, 'schar');
fseek(fid, 100, 'bof');
total_cycles = fread(fid, 1, 'uint32', 0, 'l');
start_cycle = fread(fid, 1, 'uint32', 0, 'l');

Thanks!

Accepted answer by tom10

You can use seek with a file object in the normal way, and then use this file object in fromfile. Here's a full example:

import os

import numpy as np

# np.int was removed from NumPy; use an explicit 4-byte integer type
data = np.arange(100, dtype=np.int32)
data.tofile("temp")  # save the data

with open("temp", "rb") as f:  # reopen the file
    f.seek(256, os.SEEK_SET)  # skip 256 bytes, i.e. 64 int32 values
    x = np.fromfile(f, dtype=np.int32)  # read the rest into numpy

print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#  89 90 91 92 93 94 95 96 97 98 99]
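
Applied to the Matlab snippet in the question, the same seek-then-fromfile pattern might look like the sketch below. The demo file and its contents are made up for illustration; only the offsets (8 and 100) and the types (signed char, little-endian uint32) come from the question.

```python
import os
import tempfile

import numpy as np

# Build a small demo file (hypothetical contents, for illustration only):
# byte 8 holds a signed char, bytes 100-107 hold two little-endian uint32s.
buf = bytearray(108)
buf[8] = 0xFE  # -2 when read as a signed 8-bit integer
buf[100:108] = np.array([1000, 7], dtype="<u4").tobytes()

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "demo.bin")
    with open(fname, "wb") as f:
        f.write(buf)

    with open(fname, "rb") as fid:
        fid.seek(8, os.SEEK_SET)  # fseek(fid, 8, 'bof')
        second = np.fromfile(fid, dtype="i1", count=1)[0]
        fid.seek(100, os.SEEK_SET)  # fseek(fid, 100, 'bof')
        # Two consecutive fread calls become one read of two values
        total_cycles, start_cycle = np.fromfile(fid, dtype="<u4", count=2)

print(second, total_cycles, start_cycle)
```

The count argument limits how many items are read, and "<u4" spells out the little-endian byte order that Matlab's 'l' flag requests. Recent NumPy versions also accept an offset argument (in bytes) to np.fromfile, which can replace the explicit seek when reading from a file name.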

Answer by abarnert

There is probably a better answer… But when I was faced with this problem, I had a file whose different parts I already wanted to access separately, which gave me an easy solution.

For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use fromstring instead (np.frombuffer in current NumPy, where the binary mode of fromstring is deprecated):

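import mmap
import numpy as np

with open('chunkyfoo.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        # np.frombuffer replaces the deprecated binary mode of np.fromstring (default dtype: float64)
        a1 = np.frombuffer(m[6:1030])
        a2 = np.frombuffer(m[1030:])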

This sounds like exactly what you want to do. Except, of course, that in real life the offsets and lengths of a1 and a2 probably depend on the header, rather than being fixed constants.

The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.
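
As a rough sketch of header-driven offsets, the example below parses a 6-byte header with the struct module and uses it to locate the two arrays. The header layout (a uint16 version followed by a uint32 byte length) and the float64 payload are assumptions made up for this illustration; the answer does not say what the real header contains.

```python
import mmap
import os
import struct
import tempfile

import numpy as np

# Hypothetical 6-byte header: uint16 version, uint32 byte length of the first array
HEADER = struct.Struct("<HI")

first_arr = np.arange(128, dtype="<f8")       # 1024 bytes
second_arr = np.arange(128, dtype="<f8") * 2  # 1024 bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunkyfoo.bin")
    with open(path, "wb") as f:
        f.write(HEADER.pack(1, first_arr.nbytes))
        f.write(first_arr.tobytes())
        f.write(second_arr.tobytes())

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
            version, a1_nbytes = HEADER.unpack(m[:HEADER.size])
            start = HEADER.size
            # Slicing an mmap copies the bytes, so the arrays outlive the mapping
            a1 = np.frombuffer(m[start:start + a1_nbytes], dtype="<f8")
            a2 = np.frombuffer(m[start + a1_nbytes:], dtype="<f8")

print(version, a1_nbytes, a1.shape, a2.shape)
```

Passing an explicit dtype avoids relying on the float64 default, and the "<" prefix keeps both the header and the payload independent of the machine's native byte order.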

An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:

class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged

I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those. But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.array, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)
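
If an existing mmap object is not required, one way around that difficulty is to let numpy.memmap open the file itself, since it takes a byte offset directly. This is a sketch under assumptions rather than what the answer did: the placeholder header bytes and the float64 payload are invented for the demo.

```python
import os
import tempfile

import numpy as np

payload = np.arange(256, dtype="<f8")  # two 1024-byte halves

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunkyfoo.bin")
    with open(path, "wb") as f:
        f.write(b"HEADER")  # 6 placeholder header bytes
        f.write(payload.tobytes())

    # offset is given in bytes; shape is given in elements of dtype
    mm = np.memmap(path, dtype="<f8", mode="r", offset=6, shape=(2, 128))
    a1 = np.array(mm[0])  # copy out so the data survives closing the mapping
    a2 = np.array(mm[1])
    del mm  # drop the mapping before the temporary directory is removed

print(a1[:3], a2[:3])
```

Without the copies, mm[0] and mm[1] would stay lazily paged views of the file, which is the behaviour the parenthetical above is after.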

Answer by Theodros Zelleke

This is what I do when I have to read arbitrary fields from a heterogeneous binary file.
Numpy lets you interpret a bit pattern in an arbitrary way by changing the dtype of the array. The Matlab code in the question reads a char and two uint values.

Read this paper (easy reading at user level, not for scientists) on what one can achieve by changing the dtype, stride, and dimensionality of an array.

import numpy as np

# np.int was removed from NumPy; int32 gives the 4-byte items shown below
data = np.arange(10, dtype=np.int32)
data.tofile('f')

x = np.fromfile('f', dtype='u1')  # read the whole file as raw bytes
print(x.size)
# 40

second = x[8]
print('second', second)
# second 2

total_cycles = x[8:12]
print('total_cycles', total_cycles)
# .view() reinterprets the same bytes without copying
total_cycles = total_cycles.view('u4')
print('total_cycles', total_cycles)
# total_cycles [2 0 0 0]       !endianness
# total_cycles [2]

start_cycle = x[12:16].view('u4')
print('start_cycle', start_cycle)
# start_cycle [3]

x = x.view('u4')
print('x', x)
# x [0 1 2 3 4 5 6 7 8 9]

x[3] = 423  # the views share memory, so start_cycle changes too
print('start_cycle', start_cycle)
# start_cycle [423]
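
The same reinterpretation can also be written once as a structured dtype with explicit byte offsets, which may be easier to keep in sync with a file format description. The sketch below is an assumption-laden illustration: the field names and offsets follow the Matlab code in the question, while the 108-byte record and its contents are invented for the demo.

```python
import numpy as np

# Hypothetical 108-byte record matching the offsets used in the question
raw = bytearray(108)
raw[8] = 0xFE  # -2 when read as a signed 8-bit integer
raw[100:108] = np.array([1000, 7], dtype="<u4").tobytes()

# One field per value of interest; bytes not named here are simply skipped
layout = np.dtype({
    "names": ["second", "total_cycles", "start_cycle"],
    "formats": ["i1", "<u4", "<u4"],
    "offsets": [8, 100, 104],
    "itemsize": 108,
})

rec = np.frombuffer(bytes(raw), dtype=layout, count=1)[0]
print(rec["second"], rec["total_cycles"], rec["start_cycle"])
```

Spelling the byte order as "<u4" addresses the endianness caveat noted in the output above, since plain 'u4' uses the machine's native order.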