Java 内存高效数据存储的 HashMap 替代方案
声明：本页面是 StackOverFlow 热门问题的中英对照翻译，遵循 CC BY-SA 4.0 协议。如果您需要使用它，必须同样遵循 CC BY-SA 许可，注明原文地址和作者信息，并将其归于原作者（而不是我）：StackOverFlow
原文地址: http://stackoverflow.com/questions/3972127/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
HashMap alternatives for memory-efficient data storage
提问by Brad Mace
I've currently got a spreadsheet type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to use 5x more memory than the data itself.
我目前有一个电子表格类型的程序，它将数据保存在由 HashMap 组成的 ArrayList 中。当我告诉你这并不理想时，你无疑会"大吃一惊"。开销似乎是数据本身所占内存的 5 倍。
This question asks about efficient collections libraries, and the answer was to use Google Collections. My follow-up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions).
这个问题询问了高效的集合库，答案是使用 Google Collections。我的后续问题是"哪一部分？"。我一直在阅读文档，但感觉它并没有很好地说明哪些类适合这种场景。（我也愿意接受其他库或建议。）
So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.
所以我正在寻找可以让我以最小的内存开销存储密集电子表格类型数据的东西。
- My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings
- Some columns will have a lot of repeated values
- primary operations are to update or remove records based on values of certain fields, and also adding/removing/combining columns
- 我的列当前由 Field 对象引用，行由索引引用，值是 Object，几乎总是 String
- 有些列会有很多重复的值
- 主要操作是根据某些字段的值更新或删除记录,以及添加/删除/组合列
I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.
我知道 H2 和 Derby 之类的选项,但在这种情况下,我不打算使用嵌入式数据库。
EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.
编辑：如果您推荐某个库，还请指出其中适用于此场景的一两个具体的类，我将不胜感激。Sun 的文档通常会说明哪些操作是 O(1)、哪些是 O(N)，而第三方库中很少有这类信息，也几乎没有说明哪些类最适合做什么。
采纳答案by Steve B.
So I'm assuming that you have a map of Map<ColumnName,Column>, where the column is actually something like ArrayList<Object>.
所以我假设您有一个 Map<ColumnName,Column>，其中的 Column 实际上类似于 ArrayList<Object>。
A few possibilities -
几种可能——
Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.
You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load-factor combinations that may help. I've messed around with this in the past; you might get a 30% reduction in memory if you're lucky.
What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?
你完全确定内存是一个问题吗?如果您只是普遍担心大小,那么值得确认这确实是正在运行的程序中的一个问题。填满 JVM 需要大量的行和映射。
您可以用集合库中不同类型的 Map 来测试您的数据集。根据您的数据，还可以用预设的初始容量/负载因子组合来初始化 Map，这可能会有所帮助。我过去尝试过这种做法，运气好的话内存可以减少 30%。
将您的数据存储在单个类似矩阵的数据结构(现有的库实现或类似列表列表的包装器)中,并使用将列键映射到矩阵列的单个映射怎么样?
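The third suggestion above (one matrix-like structure plus a single column-name-to-index map) might be sketched as follows; class and method names here are illustrative, not from any library.
上面第三条建议（单个类矩阵结构加一个列名到下标的映射）可以粗略示意如下；类名与方法名均为示例，并非来自任何库。

```java
import java.util.*;

// One dense row-major "matrix" (a List of row Lists) plus a single map
// from column name to column index, instead of one HashMap per row.
public class DenseSheet {
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<List<Object>> rows = new ArrayList<>();

    public DenseSheet(String... columns) {
        for (String c : columns) {
            columnIndex.put(c, columnIndex.size());
        }
    }

    public void addRow(Object... values) {
        rows.add(new ArrayList<>(Arrays.asList(values)));
    }

    public Object get(int row, String column) {
        return rows.get(row).get(columnIndex.get(column));
    }

    public void set(int row, String column, Object value) {
        rows.get(row).set(columnIndex.get(column), value);
    }
}
```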
回答by Brian Agnew
Some columns will have a lot of repeated values
有些列会有很多重复的值
immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
无论您为集合选择哪种方案，这都立即让我想到可以使用享元（Flyweight）模式。
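A minimal flyweight sketch for repeated cell values: a canonicalizing pool that hands back one shared instance per distinct value. The class name is illustrative; for Strings alone, String.intern() is a built-in alternative.
针对重复单元格值的最小享元示意：一个规范化池，对每个不同的值只返回一个共享实例。类名为示例；如果只处理 String，String.intern() 是现成的替代方案。

```java
import java.util.*;

// Canonicalizing pool: repeated equal values share a single object,
// so a column full of "yes"/"no" stores each distinct string once.
public class CellPool {
    private final Map<String, String> pool = new HashMap<>();

    public String canonical(String value) {
        // putIfAbsent returns the previously stored instance, or null if new
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }
}
```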
回答by Hyman
Trove collections should take particular care about the space occupied (I think they also have tailored data structures if you stick to primitive types); take a look here.
Trove 集合对占用空间有特别的考量（我记得如果您只用原始类型，它们还有专门定制的数据结构）；可以看看这里。
Otherwise you can try with Apache collections.. just do your benchmarks!
否则您可以试试 Apache 集合；自己做基准测试吧！
In any case, if you've got many references to the same elements, try to apply a suitable pattern (like flyweight)
无论如何，如果您有很多指向相同元素的引用，请尝试采用合适的模式（例如享元模式）
回答by Nikita Rybak
keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap will already allocate 16 * (size of a pointer) bytes (16 stands for the default initial capacity), plus some variables for the hash object (14 + psize). If you have a lot of sparsely filled rows, this could be a big problem.
将其数据保存在由 HashMap 组成的 ArrayList 中
这部分在我看来非常低效。空的 HashMap 就已经会分配 16 * 指针大小 的字节（16 是默认初始容量），再加上散列对象本身的一些变量（14 + psize）。如果您有很多稀疏填充的行，这可能是个大问题。
One option would be to use a single large hash with composite key (combining row and column). Although, that doesn't make operations on whole rows very effective.
一种选择是使用带复合键（组合行和列）的单个大散列。不过，这样对整行的操作就不太高效了。
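A sketch of that composite-key idea: one big map keyed by (row, column) instead of one HashMap per row. Names are illustrative, and the record requires Java 16+.
复合键思路的示意：用一个以 (行, 列) 为键的大 Map 取代每行一个 HashMap。命名均为示例，record 语法需要 Java 16 及以上。

```java
import java.util.*;

// One map for the whole sheet; the composite key gets value-based
// equals/hashCode from the record, combining row and column.
public class CellKeySheet {
    record CellKey(int row, String column) {}

    private final Map<CellKey, Object> cells = new HashMap<>();

    public void put(int row, String column, Object value) {
        cells.put(new CellKey(row, column), value);
    }

    public Object get(int row, String column) {
        // absent cells simply return null; no per-row map is ever allocated
        return cells.get(new CellKey(row, column));
    }
}
```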
Also, since you don't mention the operation of adding cells, you can create hashes with only the necessary inner storage (the initialCapacity parameter).
另外，由于您没有提到添加单元格的操作，您可以只按必要的内部存储来创建散列（initialCapacity 参数）。
I don't know much about google collections, so can't help there. Also, if you find any useful optimization, please do post here! It would be interesting to know.
我对 Google Collections 不太了解，所以这方面帮不上忙。另外，如果您发现任何有用的优化，请在这里发出来！我很想知道。
回答by leonbloy
From your description, it seems that instead of an ArrayList of HashMaps you rather want a (Linked)HashMap of ArrayList(each ArrayList would be a column).
从您的描述来看，您想要的似乎是一个由 ArrayList 组成的 (Linked)HashMap，而不是由 HashMap 组成的 ArrayList（每个 ArrayList 是一列）。
I'd add a double map from field-name to column-number, and some clever getters/setters that never throw IndexOutOfBoundsException.
我会添加一个从字段名到列号的双向映射，以及一些从不抛出 IndexOutOfBoundsException 的巧妙 getter/setter。
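One possible sketch of this column-oriented layout, with getters/setters that pad or return null instead of throwing IndexOutOfBoundsException; names are illustrative.
这种按列组织布局的一种可能示意，其中 getter/setter 通过补位或返回 null 来避免抛出 IndexOutOfBoundsException；命名均为示例。

```java
import java.util.*;

// A (Linked)HashMap from field name to a column (ArrayList<Object>).
// Rows are implicit: the i-th element of each column belongs to row i.
public class ColumnSheet {
    private final Map<String, ArrayList<Object>> columns = new LinkedHashMap<>();

    public Object get(String field, int row) {
        ArrayList<Object> col = columns.get(field);
        // never throws: missing column or short column reads as null
        return (col == null || row >= col.size()) ? null : col.get(row);
    }

    public void set(String field, int row, Object value) {
        ArrayList<Object> col = columns.computeIfAbsent(field, k -> new ArrayList<>());
        while (col.size() <= row) {
            col.add(null); // grow with nulls instead of throwing
        }
        col.set(row, value);
    }
}
```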
You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
您还可以使用 ArrayList<ArrayList<Object>>（基本上是一个锯齿状的、可动态增长的矩阵），并把到字段（列）名的映射放在外部。
Some columns will have a lot of repeated values
有些列会有很多重复的值
I doubt this matters, especially if they are Strings (they are interned) and your collection would store references to them.
我觉得这不太要紧，尤其当它们是 String 时（它们会被驻留），您的集合存储的只是对它们的引用。
回答by whiskeysierra
回答by Peter Lawrey
Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to look up which column refers to which cell. This way you have only 4-8 bytes of overhead per cell.
假设您的所有行都有大部分相同的列,您可以只为每一行使用一个数组,并使用 Map<ColumnKey, Integer> 来查找哪些列引用哪个单元格。这样每个单元只有 4-8 个字节的开销。
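A sketch of this row-array layout, assuming one column-to-slot map shared by all rows; names are illustrative.
这种行数组布局的示意，假设所有行共享同一个列到下标的映射；命名均为示例。

```java
import java.util.*;

// Each row is a plain Object[]; a single shared map translates a column
// key to the slot in every row array, so the only per-cell overhead is
// the array reference slot itself (4-8 bytes).
public class RowArraySheet {
    private final Map<String, Integer> slotFor = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    public RowArraySheet(String... columns) {
        for (String c : columns) {
            slotFor.put(c, slotFor.size());
        }
    }

    public void addRow(Object... values) {
        // copyOf pads missing trailing cells with null
        rows.add(Arrays.copyOf(values, slotFor.size()));
    }

    public Object get(int row, String column) {
        return rows.get(row)[slotFor.get(column)];
    }
}
```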
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.
如果字符串经常重复,您可以使用字符串池来减少字符串的重复。其他不可变类型的对象池可能有助于减少内存消耗。
EDIT: You can structure your data as either row based or column based. If it's row based (one array of cells per row), adding/removing a row is just a matter of removing that row. If it's column based, you can have an array per column. This can make handling primitive types much more efficient, i.e. you can have one column which is int[] and another which is double[]; it's much more common for an entire column to have the same data type than for a whole row to.
编辑：您可以把数据组织成基于行或基于列的结构。如果基于行（每行一个单元格数组），添加/删除行只是删掉那一行的问题。如果基于列，则每列可以有一个数组，这样处理原始类型会高效得多。也就是说，您可以让一列是 int[]、另一列是 double[]；整列同为一种数据类型，比整行同为一种数据类型常见得多。
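A sketch of the column-based variant with typed primitive arrays. Capacity is fixed here for brevity (a real version would grow the arrays), and the column names are illustrative.
使用原始类型数组的按列变体示意。为简洁起见这里容量固定（实际实现需要扩容数组），列名均为示例。

```java
// Each column is a primitive array of its own type: an int column costs
// 4 bytes per cell and a double column 8 bytes, with no boxing at all.
public class TypedColumns {
    private final int[] ages;
    private final double[] scores;

    public TypedColumns(int capacity) {
        ages = new int[capacity];
        scores = new double[capacity];
    }

    public void setRow(int row, int age, double score) {
        ages[row] = age;
        scores[row] = score;
    }

    public int age(int row) { return ages[row]; }
    public double score(int row) { return scores[row]; }
}
```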
However, either way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.
但是，无论采用哪种方式组织数据，它都只会针对行或列其中一种修改进行优化，执行另一种类型的添加/删除将导致整个数据集的重建。
(Something I do is keep row-based data and add columns at the end; if a row isn't long enough, the column has a default value. This avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
（我的做法是使用基于行的数据并在末尾添加列：如果某行不够长，该列就取默认值，这样添加列时就无需重建。删除列时，我也不真正删除，而是提供一种忽略它的方法。）
回答by Brad Mace
I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.
我一直在尝试使用 Colt 项目的 SparseObjectMatrix2D。我的数据其实相当密集，但它们的 Matrix 类并没有提供放大的方法，所以我用了一个设为最大尺寸的稀疏矩阵。
It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.
对于相同的数据,它似乎使用的内存减少了大约 10%,加载速度提高了大约 15%,并提供了一些巧妙的操作方法。仍然对其他选择感兴趣。
回答by NiranjanBhat
Why don't you try using a cache implementation like EHCache?
This turned out to be very effective for me, when I hit the same situation.
You can simply store your collection within the EHcache implementation.
There are configurations like:
为什么不试试像 EHCache 这样的缓存实现呢？当我遇到同样的情况时，它对我非常有效。
您可以简单地将您的集合存储在 EHcache 实现中。有如下配置:
Maximum bytes to be used from the local heap.
Once the bytes used by your application overflow what is configured in the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using the Least Recently Used algorithm.
You can be sure of avoiding any out-of-memory errors by using this type of cache implementation.
It only increases the IO operations of your application by a small degree.
This is just a birds eye view of the configuration. There are a lot of configurations to optimize your requirements.
一旦应用程序使用的字节溢出缓存中配置的字节,缓存实现就会负责将数据写入磁盘。您还可以配置使用最近最少使用算法将对象写入磁盘的时间量。使用这种类型的缓存实现,您可以确保避免任何内存不足错误。它只会在很小的程度上增加应用程序的 IO 操作。
这只是配置的鸟瞰图。有很多配置可以优化您的需求。
回答by leventov
Chronicle Map could have an overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).
Chronicle Map 每个条目的开销可以少于 20 字节（请参阅证明这一点的测试）。作为对比，java.util.HashMap 的开销在启用 -XX:+UseCompressedOops 时为 37-42 字节，未启用压缩 oops 时为 58-69 字节（参考）。
Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted for in HashMap's overhead above. Chronicle Map integrates with Chronicle-Values, a library for generating flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.
此外，Chronicle Map 将键和值存储在堆外，因此不存储对象头（上面 HashMap 的开销并未计入对象头）。Chronicle Map 与 Chronicle-Values 集成，后者是一个用于生成接口的享元实现的库，也就是 Brian Agnew 在另一个答案中建议的模式。