Java: array of structures, or structure of arrays?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1125626/

Time: 2020-10-29 15:15:53  Source: igfitidea

array of structures, or structure of arrays?

java data-structures

Asked by Jason S

Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:

public class Record {
  private final int field1;
  private final int field2;
  private final long field3;
  /* constructor & accessors here */
}

List<Record> records = new ArrayList<Record>();

If I end up using a large number (> 10^6) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare on storage cost with an approach optimized for storage:

public class OptimizedRecordStore {
  private final int[] field1;
  private final int[] field2;
  private final long[] field3;

  Record getRecord(int i) { return new Record(field1[i],field2[i],field3[i]); }
  /* constructor and other accessors & methods */
}

edit:

  • assume the # of records is something that is changed infrequently or never
  • I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
  • obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
  • kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.
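
The column access mentioned in the last bullet can be illustrated with a rough sketch. The class and method names below (ColumnStore, TimedRecord, ColumnAccess) are hypothetical and not part of the question; the point is only that a structure of arrays can hand out a whole column directly, while an array of structures has to copy the field out of each record first.

```java
import java.util.List;

// Hypothetical record type with "time" and "position" fields (illustrative names).
class TimedRecord {
    final int time;
    final int position;

    TimedRecord(int time, int position) {
        this.time = time;
        this.position = position;
    }
}

// Hypothetical structure-of-arrays store: each field is already a column.
class ColumnStore {
    private final int[] time;
    private final int[] position;

    ColumnStore(int[] time, int[] position) {
        if (time.length != position.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        this.time = time.clone();
        this.position = position.clone();
    }

    // Returns a defensive copy of the whole column.
    int[] timeColumn() { return time.clone(); }
    int[] positionColumn() { return position.clone(); }
}

class ColumnAccess {
    // With an array of structures, a column has to be gathered record by record.
    static int[] timesOf(List<TimedRecord> records) {
        int[] out = new int[records.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = records.get(i).time;
        }
        return out;
    }
}
```

Whether the defensive copies are worth their cost depends on how the arrays are consumed; returning the backing array directly is cheaper but exposes the store's internals.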

Answered by flux

The answers that give the general "optimise when you have to" advice are unhelpful in this case because, IMHO, programmers (particularly API writers) should always be aware of how different design choices perform when a choice can lead to an order-of-magnitude performance penalty.

The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.

Answered by cletus

The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects, and it will do so on every access. That could lead to a lot of garbage (from a GC point of view). Not good.
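
To make the resizing point concrete, here is a minimal sketch of what an append looks like once the structure-of-arrays store has to grow. GrowableRecordStore and its doubling policy are assumptions made for illustration; the question's OptimizedRecordStore has fixed final arrays and no such method.

```java
import java.util.Arrays;

// Hypothetical growable structure-of-arrays store; names and growth policy are illustrative.
class GrowableRecordStore {
    private int[] field1 = new int[4];
    private int[] field2 = new int[4];
    private long[] field3 = new long[4];
    private int size = 0;

    // Appending may have to reallocate and copy all three backing arrays.
    void add(int f1, int f2, long f3) {
        if (size == field1.length) {
            int newCapacity = field1.length * 2;
            field1 = Arrays.copyOf(field1, newCapacity);
            field2 = Arrays.copyOf(field2, newCapacity);
            field3 = Arrays.copyOf(field3, newCapacity);
        }
        field1[size] = f1;
        field2[size] = f2;
        field3[size] = f3;
        size++;
    }

    int size() { return size; }
    int capacity() { return field1.length; }
    long field3At(int i) { return field3[i]; }
}

class GrowableRecordStoreDemo {
    public static void main(String[] args) {
        GrowableRecordStore store = new GrowableRecordStore();
        for (int i = 0; i < 10; i++) {
            store.add(i, 2 * i, 100L * i);
        }
        System.out.println(store.size() + " records, capacity " + store.capacity());
    }
}
```

An ArrayList performs the same kind of copy internally, but only for its single array of references.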

Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields or three arrays. Which one more accurately depicts what you're modeling? By this I mean, when you insert or delete an item, are you doing one of the three arrays or all three as a block?

I suspect it's the latter in which case the former makes far more sense.

If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.

Answered by Jaan

If you have millions of records, the second approach has several advantages:

  • Memory usage: the first approach uses more memory because a) every Java object on the heap has a header (containing class id, lock state, etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or on 32-bit JVMs) or 8 bytes (on 64-bit JVMs without Compressed OOPs). See e.g. CompressedOops for more details. So the first approach takes about twice as much memory (more precisely: according to my benchmark, an object with 16 bytes of payload plus a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 without compressed OOPs).
  • Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e.g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; in any case, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability must be checked (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in a real situation, one should run a benchmark oneself.
  • Serialization speed: the speed of (de)serializing a large array of primitives to or from disk is limited only by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).

Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.
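
As a back-of-the-envelope check, the per-record figures quoted in the memory bullet above can be scaled up to a million records. This is only arithmetic on those quoted numbers, not a measurement; the class name MemoryEstimate is made up for the sketch, and the three array headers of the second approach are ignored as negligible.

```java
// Rough estimate using the per-record byte counts quoted in the answer above.
class MemoryEstimate {
    // Payload of one record: two ints and one long.
    static final int PAYLOAD_BYTES = 4 + 4 + 8;

    // Structure of arrays: payload only (array headers ignored).
    static long soaBytes(long records) {
        return records * PAYLOAD_BYTES;
    }

    // Array of structures: quoted bytes per object plus the reference to it.
    static long aosBytes(long records, int bytesPerObjectAndReference) {
        return records * bytesPerObjectAndReference;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println("SoA:                         " + soaBytes(n) + " bytes");
        System.out.println("AoS, 32-bit JVM:             " + aosBytes(n, 28) + " bytes");
        System.out.println("AoS, 64-bit compressed OOPs: " + aosBytes(n, 36) + " bytes");
        System.out.println("AoS, 64-bit plain OOPs:      " + aosBytes(n, 40) + " bytes");
    }
}
```

With these figures, a million records come to roughly 16 MB as three primitive arrays versus roughly 28 to 40 MB as individual objects, which is where the "about twice" estimate comes from.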

Answered by twolfe18

I was curious, so I actually ran a benchmark. If you don't re-create the object on every access as your getRecord does [1], then SoA beats AoS by 5-100% depending on workload [2]. See my code here:

https://gist.github.com/twolfe18/8168262c5420c7a62d39

[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.

[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.
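
For readers who want to try a comparison of this kind themselves, a crude sketch follows. It is not the code from the gist: the class names (BenchRecord, AosVsSoaSketch), the workload (summing one field), and the warm-up count are all assumptions, and timing with System.nanoTime is only indicative. A proper harness such as JMH would give more trustworthy numbers.

```java
// Hypothetical record type and benchmark; this is not the code from the gist.
class BenchRecord {
    final int field1;
    final int field2;
    final long field3;

    BenchRecord(int field1, int field2, long field3) {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }
}

class AosVsSoaSketch {
    // Array of structures: one object per record.
    static BenchRecord[] buildAos(int n) {
        BenchRecord[] records = new BenchRecord[n];
        for (int i = 0; i < n; i++) {
            records[i] = new BenchRecord(i, 2 * i, 3L * i);
        }
        return records;
    }

    // Structure of arrays: only the column under test is built here.
    static long[] buildSoaField3(int n) {
        long[] field3 = new long[n];
        for (int i = 0; i < n; i++) {
            field3[i] = 3L * i;
        }
        return field3;
    }

    static long sumAos(BenchRecord[] records) {
        long sum = 0;
        for (BenchRecord r : records) {
            sum += r.field3;
        }
        return sum;
    }

    static long sumSoa(long[] field3) {
        long sum = 0;
        for (long v : field3) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        BenchRecord[] aos = buildAos(n);
        long[] soa = buildSoaField3(n);

        // Warm up so the JIT has a chance to compile both loops.
        long sink = 0;
        for (int i = 0; i < 20; i++) {
            sink += sumAos(aos) + sumSoa(soa);
        }

        long t0 = System.nanoTime();
        sink += sumAos(aos);
        long t1 = System.nanoTime();
        sink += sumSoa(soa);
        long t2 = System.nanoTime();

        System.out.println("AoS: " + (t1 - t0) + " ns, SoA: " + (t2 - t1)
                + " ns (checksum " + sink + ")");
    }
}
```

The checksum is printed only so the JIT cannot discard the loops as dead code.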

Answered by fortran

How are you going to access the data? If accesses to the fields are always coupled, use the first option; if you are going to process the fields on their own, the second option is better.

See this article on Wikipedia: Parallel array

A good example of when separate arrays are more convenient is a simulation in which the numerical data is packed together in one array, while other attributes such as name, colour, etc., which are accessed only for presenting the data, are kept in another array.

Answered by mikera

I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.

First version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.

Second version uses less memory overall, but the allocation of a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.

Some other possibilities to consider:

An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
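
A minimal sketch of this "extreme" variant, assuming the three fields from the question: the accessors return primitives and the algorithm loops over the backing arrays directly, so no Record object is ever allocated. The class name DirectRecordStore and the example query are hypothetical.

```java
// Hypothetical store whose accessors and algorithms work on the backing arrays directly.
class DirectRecordStore {
    private final int[] field1;
    private final int[] field2;
    private final long[] field3;

    DirectRecordStore(int[] field1, int[] field2, long[] field3) {
        if (field1.length != field2.length || field1.length != field3.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }

    int size() { return field1.length; }

    // Per-field accessors return primitives, so nothing is allocated on access.
    int field1(int i) { return field1[i]; }
    int field2(int i) { return field2[i]; }
    long field3(int i) { return field3[i]; }

    // Example algorithm written against the arrays themselves.
    long sumField3WhereField1AtLeast(int threshold) {
        long sum = 0;
        for (int i = 0; i < field1.length; i++) {
            if (field1[i] >= threshold) {
                sum += field3[i];
            }
        }
        return sum;
    }
}
```

The cost is the one the answer describes: callers now deal in indices rather than records, so every algorithm is coupled to the storage layout.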

A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. This might make sense if only a small fraction of objects are ever likely to be accessed, but all the data is needed "just in case".
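
One possible shape for this hybrid, as a sketch only: the data lives in three primitive arrays, and a HashMap keyed by index holds the record objects that have already been requested. CachedRecord and CachingRecordStore are hypothetical names, the sketch is not thread-safe, and each cached entry adds a boxed Integer key plus map overhead, so it only pays off when few indices are touched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable record, mirroring the question's three fields.
class CachedRecord {
    final int field1;
    final int field2;
    final long field3;

    CachedRecord(int field1, int field2, long field3) {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }
}

// Hypothetical hybrid store: arrays for storage, a map for records already materialised.
class CachingRecordStore {
    private final int[] field1;
    private final int[] field2;
    private final long[] field3;
    private final Map<Integer, CachedRecord> cache = new HashMap<>();

    CachingRecordStore(int[] field1, int[] field2, long[] field3) {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }

    // Builds the record on first access to an index and reuses it afterwards.
    CachedRecord getRecord(int i) {
        return cache.computeIfAbsent(i,
                k -> new CachedRecord(field1[k], field2[k], field3[k]));
    }

    int cachedCount() { return cache.size(); }
}
```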

Answered by TofuBeer

(Not a direct answer, but one that I think should be given)

From your comment,

"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"

You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization, in fact in ~20 years of professional coding I have cared about optimization a total of two times:

  1. Writing a program that had its primary purpose to be faster than another product
  2. Writing a smartphone app with the intention of reducing the amount of data sent between the client and server

In the first case I wrote some code and then ran it through a profiler. When I wanted to do something and was not sure which approach was best (for speed/memory), I would code it one way and see the result in the profiler, then code it the other way and see the result. Then I would choose the faster of the two. This works, and you learn a lot about low-level decisions. I did not, however, allow it to impact the higher-level classes.

In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.

If your code is clear then it will be easier to speed up once you find out it is slow. As Cletus said in his answer, you are resizing one time -vs- three times... one time will be faster than three. From a higher point of view the one time is simpler to understand than the three times, thus it is more likely to be correct.

Personally I'd rather get the right answer slowly than the wrong answer quickly. Once I know how to get the right answer, I can find out where the system is slow and replace those parts of it with faster implementations.

Answered by EFraim

Notice that the second approach might have a negative impact on caching behaviour. If you want to access a single record at a time, you'd better not have that record scattered all over the place.

Also, the only memory you save in the second approach is (possibly) due to member alignment (and to not having to allocate a separate object per record). Otherwise, the two have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO.

Answered by Greg Reynolds

Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimised the number of objects floating around in your system, as instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C-style, and it was far too slow using objects.

Answered by akarnokd

I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need column-like access to the values? What is the scenario behind your question?

Edit: You could also use a common long[][] matrix. I don't know how you pass the columns to MATLAB, but I guess you won't gain much speed with column-based storage; more likely you will lose speed in the Java computation.
