C++: Aligned and unaligned memory accesses?
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/1063809/
Aligned and unaligned memory accesses?
Asked by Can Bal
What is the difference between aligned and unaligned memory access?
I work on a TMS320C64x DSP and want to use its intrinsic functions (C functions that map to assembly instructions). Among them it has:
ushort & _amem2(void *ptr);
ushort & _mem2(void *ptr);
where _amem2 does an aligned access of 2 bytes and _mem2 does an unaligned access.
When should I use which?
Accepted answer by Doug
An aligned memory access means that the pointer (as an integer) is a multiple of a type-specific value called the alignment. The alignment is the natural address multiple where the type must be, or should be stored (e.g. for performance reasons) on a CPU. For example, a CPU might require that all two-byte loads or stores are done through addresses that are multiples of two. For small primitive types (under 4 bytes), the alignment is almost always the size of the type. For structs, the alignment is usually the maximum alignment of any member.
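To make "the pointer (as an integer) is a multiple of the alignment" concrete, here is a minimal C11 sketch. The `struct sample` type and the `is_aligned` helper are made up for illustration and are not part of any TI header; the sketch also assumes the platform provides `uintptr_t` and that converting a pointer to it yields the address.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct, used only for illustration. */
struct sample {
    uint8_t  tag;    /* alignment 1 */
    uint32_t value;  /* alignment is typically 4 */
    uint16_t count;  /* alignment is typically 2 */
};

/* Returns 1 if ptr, viewed as an integer, is a multiple of `alignment`.
   `alignment` must be a power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

For any declared `struct sample s`, `is_aligned(&s.value, alignof(uint32_t))` should hold, and `alignof(struct sample)` is at least `alignof(uint32_t)` because the struct must satisfy its strictest member.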
The C compiler always puts variables that you declare at addresses which satisfy the "correct" alignment. So if ptr points to e.g. a uint16_t variable, it will be aligned and you can use _amem2. You need to use _mem2 only if you are accessing e.g. a packed byte array received via I/O, or bytes in the middle of a string.
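The "packed byte array received via I/O" case can be sketched in portable C. This is not how `_mem2` is implemented; it is a stand-in that uses `memcpy` so the read is valid at any byte offset, and the name `read_u16_unaligned` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 16-bit value in host byte order from any byte offset of a buffer.
   memcpy makes no alignment assumption about buf + offset. */
static uint16_t read_u16_unaligned(const uint8_t *buf, size_t offset)
{
    uint16_t value;
    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

On the C64x you would reach for `_mem2` here instead; the point is only that a field at an odd offset inside a received packet cannot be assumed to be aligned.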
Answer by Avi
Many computer architectures organize memory in "words" of several bytes each. For example, the Intel 32-bit architecture uses 32-bit words, that is, 4 bytes each. Memory is addressed at the single-byte level, however; therefore an address can be "aligned", meaning it starts at a word boundary, or "unaligned", meaning it doesn't.
On certain architectures, some memory operations are slower on unaligned addresses, or are not allowed at all.
So, if you know your addresses are properly aligned, you can use _amem2() for speed. Otherwise, you should use _mem2().
Answer by old_timer
I know this is an old question with a selected answer, but I didn't see anyone explain what the difference between aligned and unaligned memory access actually is...
Memory, be it DRAM, SRAM, flash or other, is built out of bits. Take an SRAM as a simple example: a specific SRAM is a fixed number of bits wide and a fixed number of rows deep. Let's say 32 bits wide and several/many rows deep.
If I do a 32-bit write to address 0x0000 in this SRAM, the memory controller around this SRAM can simply do a single write cycle to row 0.
If I do a 32-bit write to address 0x0001 in this SRAM, assuming that is allowed, the controller will need to read row 0, modify three of the bytes while preserving one, and write that back to row 0; then read row 1, modify one byte while leaving the other three as found, and write that back. Which bytes get modified depends on the endianness of the system.
The former is aligned and the latter unaligned. There is clearly a performance difference, plus the extra logic needed to do the four memory cycles and merge the byte lanes.
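The one-cycle versus four-cycle difference can be sketched with a toy model of the 32-bit-wide SRAM described above. Everything here (`struct sram`, `sram_write32`, the row count, the little-endian byte order within a row) is an assumption made for illustration; real controllers differ.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 4u

/* Toy model: an SRAM of ROWS rows, each 32 bits (4 bytes) wide. */
struct sram {
    uint8_t  row[ROWS][4];
    unsigned cycles;   /* row reads plus row writes performed so far */
};

/* Writes the 4 bytes of `data` starting at byte address `addr`.
   Byte order within a row is little-endian here, an arbitrary choice. */
static void sram_write32(struct sram *m, unsigned addr, uint32_t data)
{
    unsigned first = addr / 4, offset = addr % 4;
    assert(first < ROWS && (offset == 0 || first + 1 < ROWS));

    if (offset == 0) {
        /* Aligned: the whole row is replaced in a single write cycle. */
        for (unsigned i = 0; i < 4; i++)
            m->row[first][i] = (uint8_t)(data >> (8 * i));
        m->cycles += 1;
        return;
    }
    /* Unaligned: read-modify-write on two rows, four cycles in total.
       Bytes outside the written range keep their old contents. */
    for (unsigned i = 0; i < 4; i++) {
        unsigned a = addr + i;
        m->row[a / 4][a % 4] = (uint8_t)(data >> (8 * i));
    }
    m->cycles += 4;
}
```

Writing at 0x0000 costs one cycle in this model, while writing at 0x0001 costs four and leaves byte 0 of row 0 and bytes 1 to 3 of row 1 untouched.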
If I were to read 32 bits from address 0x0000, then a single read of row 0 and done. But read from 0x0001 and I have to do two reads, row 0 and row 1. Depending on the system design, the controller either just sends those 64 bits back to the processor, possibly taking two bus clocks instead of one, or it has the extra logic to align the 32 bits on the data bus in one bus cycle.
16-bit reads are a little better. A read from 0x0000, 0x0001 or 0x0002 is only a read from row 0. Depending on the system/processor design, either those 32 bits are sent back and the processor extracts the bytes it wants, or the memory controller shifts them so that they land on specific byte lanes and the processor doesn't have to rotate them around. One or the other has to do it, if not both. A read from 0x0003, though, is like the case above: you have to read row 0 and row 1, as one of your bytes is in each, and then either send 64 bits back for the processor to extract from, or the memory controller combines the bits into one 32-bit bus response (assuming the bus between the processor and memory controller is 32 bits wide for these examples).
A 16-bit write, though, always ends up with at least one read-modify-write in this example SRAM. For address 0x0000, 0x0001 or 0x0002: read row 0, modify two bytes and write back. For address 0x0003: read two rows, modify one byte in each and write both back.
For 8 bits you only need to read the one row containing that byte; writes, though, are a read-modify-write of one row.
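The read and write cases above can be summarized as a small cycle-counting sketch for the same 32-bit-wide example SRAM. The function names are hypothetical, and the model assumes a controller that can only read or write whole rows, as in the description.

```c
#include <assert.h>

/* Number of 32-bit rows touched by an access of `size` bytes
   starting at byte address `addr`. */
static unsigned rows_touched(unsigned addr, unsigned size)
{
    assert(size >= 1 && size <= 4);
    return (addr % 4 + size > 4) ? 2u : 1u;
}

/* Memory cycles for a read: one row read per row touched. */
static unsigned read_cycles(unsigned addr, unsigned size)
{
    return rows_touched(addr, size);
}

/* Memory cycles for a write: an aligned full-row write is one plain
   write; anything narrower or straddling two rows needs a
   read-modify-write (two cycles) on each row touched. */
static unsigned write_cycles(unsigned addr, unsigned size)
{
    if (size == 4 && addr % 4 == 0)
        return 1;
    return 2 * rows_touched(addr, size);
}
```

For instance, `write_cycles(0x0003, 2)` gives 4 (two rows, each read and written back), while `read_cycles(0x0002, 2)` gives 1.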
ARMv4 didn't like unaligned accesses; you could disable the trap, but the result was not what you would expect from the above (not important here). Current ARMs allow unaligned accesses and give you the above behavior; you can change a bit in a control register and then they will abort unaligned transfers. MIPS used to not allow it; not sure what they do now. x86, 68K, etc. allowed it, and the memory controller may have had to do most of the work.
The designs that don't permit it clearly do so for performance and less logic, at what some would say is a burden on the programmers; others might say it is no extra work for the programmer, or even easier. Aligned or not, you can also see why it can be better not to try to save memory by making 8-bit variables, but to go ahead and burn a 32-bit word, or whatever the natural size of a register or the bus is. It may help your performance at a small cost of some bytes. Not to mention the extra code the compiler would need to add to make a, let's say, 32-bit register mimic an 8-bit variable: masking and sometimes sign extension. When using register-native sizes those additional instructions are not required. You can also pack multiple things into a bus/memory-wide location and do one memory cycle to collect or write them, then use some extra instructions to manipulate them between registers; that costs no RAM and is possibly a wash on the number of instructions.
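The "masking and sometimes sign extension" overhead can be written out by hand. This is only a sketch of the kind of extra work a compiler may emit when an 8-bit variable lives in a 32-bit register; the function names are hypothetical and actual code generation depends on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 8-bit arithmetic in a 32-bit register: the result has to
   be masked back down to 8 bits after the add. */
static uint32_t add_u8_in_reg(uint32_t a, uint32_t b)
{
    assert(a <= 0xFFu && b <= 0xFFu);
    return (a + b) & 0xFFu;
}

/* Signed 8-bit value in a 32-bit register: the low 8 bits have to be
   sign-extended before the register can be used as a signed value. */
static int32_t sign_extend8(uint32_t reg)
{
    uint32_t low = reg & 0xFFu;
    return (int32_t)(low ^ 0x80u) - 0x80;
}
```

With a register-sized variable neither the mask nor the sign extension is needed, which is the saving the paragraph above refers to.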
I don't agree that the compiler will always align the data right for the target; there are ways to break that. And if the target doesn't support unaligned accesses you will hit the fault. Programmers would never need to talk about this if the compiler always got it right for any legal code you could come up with; there would be no reason for this question unless it was for performance. If you don't control whether the void ptr address is aligned, then you have to use the _mem2() unaligned access all the time, or you have to do an if-then-else in your code based on the value of ptr, as nik pointed out. By declaring it as void, the C compiler now has no way to correctly deal with your alignment, and it won't be guaranteed. If you take a char *ptr and feed it to these functions, all bets are off on the compiler getting it right without you adding extra code, either buried in the _mem2() function or outside these two functions. So as written in your question, _mem2() is the only correct answer.
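The if-then-else on the pointer value might look like the sketch below. It does not call the TI intrinsics: a direct `uint16_t` load stands in for `_amem2` and `memcpy` stands in for `_mem2`, and both `load_u16` and its `took_aligned_path` output are hypothetical names added so the chosen branch can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Loads a 16-bit value in host byte order from ptr, choosing the
   access method from the address itself. */
static uint16_t load_u16(const void *ptr, int *took_aligned_path)
{
    uint16_t value;
    assert(ptr != NULL && took_aligned_path != NULL);

    if (((uintptr_t)ptr & 1u) == 0) {
        /* Even address: stand-in for _amem2(ptr). Assumes the memory
           may legitimately be accessed as a uint16_t object. */
        value = *(const uint16_t *)ptr;
        *took_aligned_path = 1;
    } else {
        /* Odd address: stand-in for _mem2(ptr). */
        memcpy(&value, ptr, sizeof value);
        *took_aligned_path = 0;
    }
    return value;
}
```

Whether the branch is worth it depends on how much cheaper the aligned access is on the target; if the check costs as much as the unaligned access, always taking the unaligned path is simpler.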
DRAM, say as used in your desktop/laptop, tends to be 64 or 72 (with ECC) bits wide, and every access to it is aligned, even though the memory sticks are actually made up of 8-, 16- or 32-bit-wide chips (this may be changing with phones/tablets for various reasons). The memory controller and ideally at least one cache sit in front of this DRAM, so that the unaligned accesses, or even aligned accesses smaller than the bus width, have their read-modify-writes dealt with in the cache SRAM, which is way faster, and the DRAM accesses are all aligned, full-bus-width accesses. If you have no cache in front of the DRAM and the controller is designed for full-width accesses, that is the worst performance. If it is designed for lighting up the byte lanes separately (assuming 8-bit-wide chips), then you don't have the read-modify-writes, but you have a more complicated controller. If the typical use case is with a cache (if there is one in the design), then it may not make sense to have that additional work in the controller for each byte lane, but rather have it just know how to do transfers of full bus width or multiples thereof.
Answer by nik
Aligned addresses are those which are multiples of the access size in question.
- Access of 4-byte words at addresses that are multiples of 4 is aligned
- Access of 4 bytes from address 3 (say) is an unaligned access
It is very likely that the _mem2 function, which also works for unaligned accesses, is less optimal because of the extra work its code does to handle any alignment correctly. This means that the _mem2 function is likely to be costlier than its _amem2 version.
So, when you need performance (particularly when you know that the access latency is high), it would be prudent to identify when you can use the aligned access. The _amem2 function exists for this very purpose: to give you performance when you know the access is aligned.
When it comes to 2-byte accesses, identifying aligned operations is very simple.
If all the access addresses for the operation are 'even' (that is, their LSB is zero), you have 2-byte alignment. This can be easily checked with:
if (address & 1) {
    /* odd address: not 2-byte aligned */
} else {
    /* even address: aligned to 2 bytes */
}
Answer by laalto
Many processors have alignment restrictions on memory access. Unaligned access either generates an exception interrupt (e.g. ARM), or is just slower (e.g. x86).
_mem2 is probably implemented as fetching two bytes and using shift and bitwise OR operations to make a 16-bit ushort out of them.
_amem2 probably just reads the 16-bit ushort from the specified ptr.
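The guessed byte-wise approach can be sketched as follows. This is not TI's actual implementation of `_mem2`, only an illustration of the shift-and-OR technique; `load_u16_le` is a hypothetical name, and little-endian byte order is assumed for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds a 16-bit value from two single-byte loads, so the address
   p never needs to be even. Assumes little-endian byte order:
   p[0] is the low byte, p[1] the high byte. */
static uint16_t load_u16_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

Two byte loads plus a shift and an OR are more work than a single 16-bit load, which is where the expected cost of the unaligned variant comes from.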
I don't know the TMS320C64x specifically, but I'd guess it requires 16-bit alignment for 16-bit memory accesses. So you can always use _mem2, at a performance penalty, and use _amem2 when you can guarantee that ptr is an even address.
Answer by Laurence Gonsalves
_mem2 is more general: it works whether or not ptr is aligned. _amem2 is stricter: it requires that ptr be aligned (though it is presumably slightly more efficient). So use _mem2 unless you can guarantee that ptr is always aligned.