Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/18655648/

Time: 2020-08-19 11:21:25  Source: igfitidea

What exactly is the point of memoryview in Python?

python buffer memoryview

Asked by Basel Shishani

Checking the documentation on memoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

class memoryview(obj)

Create a memoryview that references obj. obj must support the buffer protocol. Built-in objects that support the buffer protocol include bytes and bytearray.

Then we are given the sample code:

>>> v = memoryview(b'abcefg')
>>> v[1]
98
>>> v[-1]
103
>>> v[1:4]
<memory at 0x7f3ddc9f4350>
>>> bytes(v[1:4])
b'bce'

Quotation over; now let's take a closer look:

>>> b = b'long bytes stream'
>>> b.startswith(b'long')
True
>>> v = memoryview(b)
>>> vsub = v[5:]
>>> vsub.startswith(b'bytes')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'memoryview' object has no attribute 'startswith'
>>> bytes(vsub).startswith(b'bytes')
True
>>> 

So what I gather from the above:

We create a memoryview object to expose the internal data of a buffer object without copying. However, in order to do anything useful with the object (by calling the methods provided by the object), we have to create a copy!

Usually memoryview (or the old buffer object) would be needed when we have a large object, and the slices can be large too. The need for better efficiency arises if we are making large slices, or making small slices a large number of times.

With the above scheme, I don't see how it can be useful for either situation, unless someone can explain to me what I'm missing here.

Edit1:

We have a large chunk of data, and we want to process it by advancing through it from start to end, for example extracting tokens from the start of a string buffer until the buffer is consumed. In C terms, this is advancing a pointer through the buffer, and the pointer can be passed to any function expecting the buffer type. How can something similar be done in Python?

People suggest workarounds, for example many string and regex functions take position arguments that can be used to emulate advancing a pointer. There are two issues with this: first, it's a workaround, and you are forced to change your coding style to overcome the shortcomings; second, not all functions have position arguments. For example, regex functions and startswith do, but encode()/decode() don't.
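
For reference, the position-argument workaround described above might look like the following minimal sketch. The sample data and the word pattern are assumptions made for illustration only; `Pattern.match(string, pos)` and `bytes.startswith(prefix, start)` are the standard-library calls that accept a start offset.

```python
import re

data = b"alpha beta gamma"      # illustrative input, not from the question
word = re.compile(rb"\w+")      # hypothetical token pattern

pos = 0
tokens = []
while pos < len(data):
    # Start matching at offset pos; no slice of data is created.
    m = word.match(data, pos)
    if m is None:
        break
    tokens.append(m.group())
    pos = m.end()
    # startswith also takes a start offset, so the separator check is copy-free.
    if data.startswith(b" ", pos):
        pos += 1

print(tokens)
```

The offset `pos` plays the role of the C pointer, which is exactly the change in coding style the question objects to.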

Others might suggest loading the data in chunks, or processing the buffer in small segments larger than the max token. Okay, so we are aware of these possible workarounds, but we are supposed to work in a more natural way in Python without trying to bend the coding style to fit the language, aren't we?

Edit2:

A code sample would make things clearer. This is what I want to do, and what I assumed memoryview would allow me to do at first glance. Let's use pmview (proper memory view) for the functionality I'm looking for:

# pmview, vslice, get_string and get_token are hypothetical names
tokens = []
xlarge_str = get_string()
xlarge_str_view = pmview(xlarge_str)

while True:
    token = get_token(xlarge_str_view)
    if token:
        xlarge_str_view = xlarge_str_view.vslice(len(token))
        # vslice: view slice: default stop parameter at end of buffer
        tokens.append(token)
    else:
        break

Answered by Martijn Pieters

memoryview objects are great when you need subsets of binary data that only need to support indexing. Instead of having to take slices (and create new, potentially large objects) to pass to another API, you can just take a memoryview object.
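
As a rough illustration of passing a view to another API, here is a minimal sketch of the question's token loop using the standard `re` module, which accepts bytes-like objects in Python 3. The `TOKEN` pattern and the `tokenize` helper are hypothetical names chosen for this example, not part of any library.

```python
import re

# Hypothetical token grammar: words separated by optional whitespace.
TOKEN = re.compile(rb"\s*(\w+)")

def tokenize(data):
    """Collect tokens by advancing a memoryview instead of slicing the bytes."""
    view = memoryview(data)
    tokens = []
    while True:
        m = TOKEN.match(view)
        if m is None:
            break
        # Only the small token itself is copied out of the buffer.
        tokens.append(bytes(m.group(1)))
        # Slicing the view yields another view over the same buffer.
        view = view[m.end():]
    return tokens

print(tokenize(b"alpha beta gamma"))
```

Only the extracted tokens are copied; the remainder of the buffer is never duplicated as the view advances.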

One such API example would be the struct module. Instead of passing in a slice of the large bytes object to parse out packed C values, you pass in a memoryview of just the region you need to extract values from.
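
A minimal sketch of that idea follows. The record layout (a 4-byte tag followed by a little-endian unsigned short and unsigned int) is invented purely for illustration.

```python
import struct

# Hypothetical layout: 4-byte tag, then "<HI" (uint16 + uint32), then payload.
blob = b"HEAD" + struct.pack("<HI", 7, 123456) + b"trailing payload"
view = memoryview(blob)

# Unpack from a slice of the view; slicing the view copies no data.
count, size = struct.unpack("<HI", view[4:10])

# struct.unpack_from takes an offset, so it works on the whole view as well.
same = struct.unpack_from("<HI", view, 4)

print(count, size, same)
```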

memoryview objects, in fact, support struct unpacking natively; you can target a region of the underlying bytes object with a slice, then use .cast() to 'interpret' the underlying bytes as long integers, or floating point values, or n-dimensional lists of integers. This makes for very efficient binary file format interpretation, without having to create more copies of the bytes.
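
The following sketch shows `.cast()` on sliced views. The packed data is made up for the example, and native sizes are assumed for the `i` and `d` format codes.

```python
import struct

# Illustrative data: four native ints followed by two native doubles.
raw = struct.pack("4i", 1, 2, 3, 4) + struct.pack("2d", 0.5, 1.5)
split = 4 * struct.calcsize("i")   # byte offset where the doubles start

view = memoryview(raw)
ints = view[:split].cast("i")                  # reinterpret bytes as ints
doubles = view[split:].cast("d")               # reinterpret bytes as doubles
grid = view[:split].cast("i", shape=[2, 2])    # same ints as a 2x2 layout

print(ints.tolist())
print(doubles.tolist())
print(grid.tolist())
```

Each cast produces a new view over the same buffer; values are converted to Python objects only when they are read, here through `tolist()`.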

Answered by Antimony

One reason memoryviews are useful is that they can be sliced without copying the underlying data, unlike bytes/str.

For example, take the following toy example (written for Python 2).

import time
for n in (100000, 200000, 300000, 400000):
    data = 'x'*n
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print 'bytes', n, time.time()-start

for n in (100000, 200000, 300000, 400000):
    data = 'x'*n
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print 'memoryview', n, time.time()-start

On my computer, I get

bytes 100000 0.200068950653
bytes 200000 0.938908100128
bytes 300000 2.30898690224
bytes 400000 4.27718806267
memoryview 100000 0.0100269317627
memoryview 200000 0.0208270549774
memoryview 300000 0.0303030014038
memoryview 400000 0.0403470993042

You can clearly see the quadratic complexity of the repeated string slicing. Even with only 400000 iterations, it's already unmanageable. Meanwhile, the memoryview version has linear complexity and is lightning fast.

Edit: Note that this was done in CPython. There was a bug in PyPy up to 4.0.1 that caused memoryviews to have quadratic performance.

Answered by jimaf

Here is the Python 3 code.

#!/usr/bin/env python3

import time
for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print ('bytes {:d} {:f}'.format(n,time.time()-start))

for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print ('memview {:d} {:f}'.format(n,time.time()-start))

Answered by gwideman

Let me make plain where the glitch in understanding lies here.

The questioner, like myself, expected to be able to create a memoryview that selects a slice of an existing array (for example a bytes or bytearray). We therefore expected something like:

desired_slice_view = memoryview(existing_array, start_index, end_index)

Alas, there is no such constructor, and the docs don't make a point of what to do instead.

The key is that you have to first make a memoryview that covers the entire existing array. From that memoryview you can create a second memoryview that covers a slice of the existing array, like this:

whole_view = memoryview(existing_array)
desired_slice_view = whole_view[10:20]

In short, the purpose of the first line is simply to provide an object whose slice implementation (dunder-getitem) returns a memoryview.
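
To check that the second view really is a window onto the original array rather than a copy, here is a small sketch. It assumes a bytearray with made-up contents, so that the buffer is writable.

```python
existing_array = bytearray(b"0123456789abcdefghij")  # illustrative data

whole_view = memoryview(existing_array)
desired_slice_view = whole_view[10:20]

# The sliced view still refers to the original object, not to a copy.
assert desired_slice_view.obj is existing_array

# Writing through the slice changes the underlying array.
desired_slice_view[0] = ord("A")
print(existing_array)

# The two steps can also be collapsed into a single expression.
one_liner = memoryview(existing_array)[10:20]
print(bytes(one_liner))
```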

That might seem untidy, but one can rationalize it a couple of ways:

  1. Our desired output is a memoryview that is a slice of something. Normally we get a sliced object from an object of that same type, by using the slice operator [10:20] on it. So there's some reason to expect that we need to get our desired_slice_view from a memoryview, and that therefore the first step is to get a memoryview of the whole underlying array.

  2. The naive expectation of a memoryview constructor with start and end arguments fails to consider that the slice specification really needs all the expressivity of the usual slice operator (including things like [3::2] or [:-4] etc). There is no way to just use the existing (and understood) operator in that one-liner constructor. You can't attach it to the existing_array argument, as that will make a slice of that array, instead of telling the memoryview constructor some slice parameters. And you can't use the operator itself as an argument, because it's an operator and not a value or object.

Conceivably, a memoryview constructor could take a slice object:

desired_slice_view = memoryview(existing_array, slice(1, 5, 2))

... but that's not very satisfactory, since users would have to learn about the slice object and what its constructor's parameters mean, when they already think in terms of the slice operator's notation.
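
For what it's worth, a slice object can already be used with the real API by indexing the whole-array view with it, which is equivalent to the operator notation. A minimal sketch, with sample data invented for illustration:

```python
existing_array = b"abcdefgh"   # illustrative data
whole_view = memoryview(existing_array)

# Indexing with a slice object is the same as using the [1:5:2] notation.
via_object = whole_view[slice(1, 5, 2)]
via_operator = whole_view[1:5:2]

print(via_object.tobytes(), via_operator.tobytes())
```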

Answered by user2494386

Excellent example by Antimony. Actually, in Python 3, you can replace data = 'x'*n with data = bytes(n) and add parentheses to the print statements, as below:

import time
for n in (100000, 200000, 300000, 400000):
    #data = 'x'*n
    data = bytes(n)
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print('bytes', n, time.time()-start)

for n in (100000, 200000, 300000, 400000):
    #data = 'x'*n
    data = bytes(n)
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print('memoryview', n, time.time()-start)