How to read binary files in Python using NumPy?

Question

Asked by Suyash Shetty

I know how to read binary files in Python using NumPy's np.fromfile() function. The issue I'm faced with is that when I do so, the array contains exceedingly large numbers, of the order of 10^100 or so, along with random nan and inf values.

I need to apply machine learning algorithms to this dataset, and I cannot work with the data as it is. I cannot normalise the dataset because of the nan values.

I've tried np.nan_to_num(), but that doesn't seem to work. After doing so, my min and max values are 3e-38 and 3e+38 respectively, so I could not normalize it.

Is there any way to scale this data down? If not, how should I deal with this?

Thank you.

EDIT:

Some context: I'm working on a malware classification problem. My dataset consists of live malware binaries. They are files of types such as .exe and .apk. My idea is to store these binaries as a NumPy array, convert them to grayscale images, and then perform pattern analysis on them.

Answer 1

Answered by John1024

If you want to make an image out of a binary file, you need to read it in as integer, not float. Currently, the most common format for images is unsigned 8-bit integers.
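The huge values, nan and inf described in the question are what arbitrary bytes look like when they are reinterpreted as floating-point numbers: np.fromfile() defaults to dtype=float, so every 8 bytes become one 64-bit float. A minimal sketch of the difference, assuming a small temporary file of made-up bytes stands in for a real binary (the file name sample.bin and the random seed are hypothetical):

```python
import os
import tempfile

import numpy as np

# Made-up bytes standing in for a real binary file (an assumption for illustration).
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=8000, dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # hypothetical file name
    raw.tofile(path)

    # Default dtype is float (64-bit): every 8 bytes are reinterpreted as one double.
    as_float = np.fromfile(path)
    # One array element per byte, always in the range 0..255.
    as_bytes = np.fromfile(path, dtype=np.uint8)

print(as_float.shape, as_bytes.shape)
print(as_bytes.min(), as_bytes.max())
```

Read as uint8, the data needs no nan handling at all and is already bounded, so scaling to [0, 1] is just a division by 255.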

As an example, let's make an image out of the first 10,000 bytes of /bin/bash:

>>> import numpy as np
>>> import cv2
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> xbash.shape
(1086744,)
>>> cv2.imwrite('bash1.png', xbash[:10000].reshape(100,100))

In the above, we used the OpenCV library to write the integers to a PNG file. Any of several other imaging libraries could have been used.

The resulting bash1.png shows what the first 10,000 bytes of bash "look" like.
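The example above slices exactly 10,000 bytes so that they fit a 100x100 image. For whole files of arbitrary size, one possible approach is to zero-pad the last row. The sketch below assumes a fixed image width of 256 columns; the helper name bytes_to_grayscale and the width are illustrative choices, not part of the original answer:

```python
import numpy as np


def bytes_to_grayscale(data, width=256):
    """Reshape a 1-D uint8 array into a 2-D image with `width` columns.

    The last row is zero-padded when the byte count is not a multiple of `width`.
    """
    data = np.asarray(data, dtype=np.uint8)
    height = -(-data.size // width)  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:data.size] = data
    return padded.reshape(height, width)


# Stand-in for np.fromfile(path, dtype='uint8'); 1000 bytes do not fill 4 rows of 256.
fake_file = (np.arange(1000) % 256).astype(np.uint8)
image = bytes_to_grayscale(fake_file)
print(image.shape)  # (4, 256)
```

The returned array can be passed to cv2.imwrite() in the same way as the reshaped slice above.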

Answer 2

Answered by Sayali Sonawane

EDIT 2

Refer to this answer: https://stackoverflow.com/a/11548224/6633975
It states: NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it seems it will be at least 6 months to a year before NumPy gets these features.
Source: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

From the question "Numpy integer nan", the accepted answer states: NaN can't be stored in an integer array. A nan is a special value for float arrays only. There is talk of introducing a special bit that would allow non-float arrays to store what in practice would correspond to a nan, but so far (2012/10) it is only talk. In the meantime, you may want to consider the numpy.ma package: instead of picking an invalid integer like -99999, you could use the special numpy.ma.masked value to represent an invalid value.

>>> import numpy as np
>>> a = np.ma.array([1,2,3,4,5], dtype=int)
>>> a[1] = np.ma.masked
>>> a
masked_array(data = [1 -- 3 4 5],
             mask = [False  True False False False],
       fill_value = 999999)
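To make the contrast concrete, here is a small sketch (the variable names are illustrative) showing that assigning NaN to a plain integer array fails, while a masked element is simply skipped by reductions and can be filled with a chosen value afterwards:

```python
import numpy as np

ints = np.array([1, 2, 3, 4, 5], dtype=int)
try:
    ints[1] = np.nan  # NaN has no integer representation
    stored_nan = True
except ValueError:
    stored_nan = False

masked = np.ma.array([1, 2, 3, 4, 5], dtype=int)
masked[1] = np.ma.masked  # flag the second element as invalid

print(stored_nan)        # False
print(masked.mean())     # 3.25, the masked element is skipped
print(masked.filled(0))  # [1 0 3 4 5]
```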

EDIT 1

To read a binary file:

Read the binary file content like this:

with open(fileName, mode='rb') as file: # b is important -> binary
    fileContent = file.read()

After that, you can "unpack" the binary data using struct.unpack.
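As a rough illustration of struct.unpack, the sketch below builds a few bytes in memory instead of reading a real file. The header layout (a 4-byte magic string, a uint16 version and a uint32 length, little-endian) is a made-up assumption; a real file format defines its own layout:

```python
import struct

# Hypothetical header layout: 4-byte magic, uint16 version, uint32 payload length.
# "<" selects little-endian byte order with no padding between fields.
header_format = "<4sHI"

# Stand-in for the fileContent read from disk above.
file_content = struct.pack(header_format, b"DEMO", 2, 1024) + b"payload..."

header_size = struct.calcsize(header_format)
magic, version, length = struct.unpack(header_format, file_content[:header_size])
print(header_size, magic, version, length)  # 10 b'DEMO' 2 1024
```

struct.unpack requires the buffer to be exactly the size the format describes, which is why the content is sliced to header_size first.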

If you are using the np.fromfile() function:
numpy.fromfile can read data from both text and binary files. You would first construct a data type that represents your file format using numpy.dtype, and then read this type from the file using numpy.fromfile.
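A minimal sketch of that two-step approach, assuming a made-up record layout (a uint32 id, a float32 score and a 4-byte tag) and a temporary file named records.bin; both are hypothetical and would be replaced by the actual file format:

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: uint32 id, float32 score, 4-byte tag (little-endian).
record = np.dtype([("id", "<u4"), ("score", "<f4"), ("tag", "S4")])

written = np.array([(1, 0.5, b"abcd"), (2, 1.5, b"wxyz")], dtype=record)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")  # hypothetical file name
    written.tofile(path)                      # raw bytes, no header
    loaded = np.fromfile(path, dtype=record)  # one element per 12-byte record

print(loaded["id"], loaded["score"], loaded["tag"])
```

Each field can then be accessed by name, and the dtype must match the file's byte layout exactly for the values to be meaningful.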
