How to handle large data sets in Java without using too much memory
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/3560837/
Asked by Tyler
I'm working in Java. I have the requirement that I must essentially compare two database queries. To do this, I take each row of the result set and assign it to a HashTable with the field name as the 'key' and the data in the field as the 'value'. I then group the entire result set of HashTables into a single Vector just as a container. So essentially to compare two queries I'm really iterating through two Vectors of HashTables.
I've come to find that this approach works really well for me but requires a lot of memory. Because of other design requirements, I have to do this comparison via a Vector-HashTable-like structure, and not some DB side procedure.
Does anyone have any suggestions for optimization? The optimal solution would be one that is somewhat similar to what I am doing now as most of the code is already designed around it.
Thanks
Accepted answer by Noel M
Have you looked at the Flyweight Pattern? Do you have lots of equal objects?
Perhaps this pattern might be appropriate for your 'Key', as I imagine the field names are going to be repeated for each row? If they're Strings, you can call intern() so that they'll share the same memory location with other equal Strings, as Strings are immutable.
Another possible optimization - not memory but speed - if concurrency is not an issue would be to use an ArrayList rather than a Vector, as ArrayLists are not synchronized, so accesses should be a little faster. Similarly, HashMap isn't synchronized and Hashtable is, so using the former might be faster too.
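As a rough illustration of both suggestions, the sketch below stores each row in a HashMap with interned keys and collects the rows in an ArrayList. The class name `InternedRows`, the helper `toRow`, and the sample columns are made up for the example, and it assumes the column names arrive as separate String instances for each row (simulated here with `new String(...)`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InternedRows {
    /** Builds one row map; keys are interned so equal column names share one String object. */
    static Map<String, Object> toRow(String[] columnNames, Object[] values) {
        Map<String, Object> row = new HashMap<>();
        for (int i = 0; i < columnNames.length; i++) {
            row.put(columnNames[i].intern(), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Unsynchronized container instead of Vector (single-threaded use assumed).
        List<Map<String, Object>> rows = new ArrayList<>();
        // new String(...) simulates a fresh String instance per row.
        rows.add(toRow(new String[] { new String("id") }, new Object[] { 1 }));
        rows.add(toRow(new String[] { new String("id") }, new Object[] { 2 }));

        String key1 = rows.get(0).keySet().iterator().next();
        String key2 = rows.get(1).keySet().iterator().next();
        System.out.println(key1 == key2); // true: both rows reference the same interned key
    }
}
```

How much this saves depends on how many distinct key strings the rows would otherwise hold; the values are not interned here.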
Answer by erickson
Specify the same ORDER BY clause (based on the "key") for both result sets. Then you only have to have one record from each result set in memory at once.
For example, say your results are res1 and res2.
If the key field of res1 is less than the key field of res2, res2 is missing some records; iterate res1 until its key field is equal to or greater than the key of res2.
Likewise, if the key field of res1 is greater than the key field of res2, res1 is missing some records; iterate res2 instead.
If the key fields of the current records are equal, you can compare their values, then iterate both result sets.
You can see, in this manner, that only one record from each result is required to be held in memory at a given time.
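The walk described above can be sketched roughly as follows. To keep the example self-contained, two in-memory iterators stand in for the two JDBC result sets, and each record is reduced to an integer key plus one string value; `SortedDiff`, `row`, and `diff` are hypothetical names, and both inputs are assumed to be sorted by key already.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class SortedDiff {
    /** Convenience factory for a (key, value) record. */
    static Map.Entry<Integer, String> row(int key, String value) {
        return new AbstractMap.SimpleEntry<Integer, String>(key, value);
    }

    /** Walks two key-sorted record streams, holding one record from each at a time. */
    static List<String> diff(Iterator<Map.Entry<Integer, String>> res1,
                             Iterator<Map.Entry<Integer, String>> res2) {
        List<String> report = new ArrayList<>();
        Map.Entry<Integer, String> r1 = res1.hasNext() ? res1.next() : null;
        Map.Entry<Integer, String> r2 = res2.hasNext() ? res2.next() : null;
        while (r1 != null || r2 != null) {
            int cmp;
            if (r2 == null) {
                cmp = -1;               // res2 exhausted: the rest of res1 is unmatched
            } else if (r1 == null) {
                cmp = 1;                // res1 exhausted: the rest of res2 is unmatched
            } else {
                cmp = r1.getKey().compareTo(r2.getKey());
            }
            if (cmp < 0) {              // key exists only in res1
                report.add("only in res1: " + r1.getKey());
                r1 = res1.hasNext() ? res1.next() : null;
            } else if (cmp > 0) {       // key exists only in res2
                report.add("only in res2: " + r2.getKey());
                r2 = res2.hasNext() ? res2.next() : null;
            } else {                    // same key: compare values, advance both
                if (!Objects.equals(r1.getValue(), r2.getValue())) {
                    report.add("differs: " + r1.getKey());
                }
                r1 = res1.hasNext() ? res1.next() : null;
                r2 = res2.hasNext() ? res2.next() : null;
            }
        }
        return report;
    }
}
```

With real queries, the iterators would be replaced by calls to `next()` on each ResultSet, and the comparison must use the same ordering as the ORDER BY clause.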
Answer by OscarRyz
You don't specify what kind of comparison you need, but I would reduce the amount of data held by the HashMap/Vector by transforming the row information into a single hash number.
Something like this:
import java.sql.ResultSet;
import java.sql.SQLException;

class RowHash {
    private final int id;       // the row id
    private final int hashCode; // summary of the whole row info

    public RowHash(ResultSet rs) throws SQLException {
        this.id = rs.getInt("id");
        // concatenate the string form of every column, then hash the result
        this.hashCode = new StringBuilder()
            .append(rs.getString("field1"))
            .append(rs.getString("field2"))
            .append(rs.getString("fieldN"))
            .toString().hashCode();
    }

    @Override
    public final boolean equals(Object other) {
        // rows with equal hashes are treated as equal; collisions are possible
        return other instanceof RowHash
            && this.hashCode == ((RowHash) other).hashCode;
    }

    @Override
    public final int hashCode() {
        return hashCode;
    }
}
Then store it in an ArrayList, which is not synchronized, instead of a Vector.
...
ResultSet rs = ...
while (rs.next()) {
    arrayList.add(new RowHash(rs));
}
Well, that's the idea: depending on the comparison you need, compute a number representing the whole record, and then use that single number to see if the other query has it.
Bear in mind that this is just a concept; you'll have to modify it to suit your needs.
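To show how the single number per row might then be used, here is a small sketch that keeps only the hashes of the first query in a HashSet and probes it with the rows of the second. It is not the code above: rows are plain lists of strings rather than a ResultSet, the names `RowHashCompare`, `rowHash`, and `countMissing` are invented, and `List.hashCode()` is swapped in for the string concatenation. A matching hash only suggests the rows are equal, since different rows can collide.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RowHashCompare {
    /** Reduces one row (a list of column values) to a single int. */
    static int rowHash(List<String> row) {
        // List.hashCode() combines the hash codes of all elements in order.
        return row.hashCode();
    }

    /** Counts rows of 'second' whose hash does not occur among the rows of 'first'. */
    static int countMissing(List<List<String>> first, List<List<String>> second) {
        Set<Integer> seen = new HashSet<>();
        for (List<String> row : first) {
            seen.add(rowHash(row));      // keep only the hash, not the row itself
        }
        int missing = 0;
        for (List<String> row : second) {
            if (!seen.contains(rowHash(row))) {
                missing++;
            }
        }
        return missing;
    }
}
```

If false matches are not acceptable, rows whose hashes agree would still need a field-by-field comparison.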
Another (probably simpler) way to reduce the amount of memory used by a program that uses a lot of strings is to call intern().
See this answer to compare the impact, but really it depends on your data.
Here's a before/after screenshot using intern on that answer (screenshots: Before, After).
The area in blue is the memory used: around 2 GB in the first, under 25 MB in the second.
Answer by Skarab
If you can sort the results of both queries, you should adapt the sort-merge join algorithm.
Answer by u290629
You could define your own class, for instance a 'MyRecord', which is smaller than a HashMap; the result set then becomes a List of 'MyRecord'.
If you have to use HashMap, use new HashMap(7,1) instead of the default constructor. That could save memory, since you said there are a fixed '8 key-value pairs' in each map.
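A dedicated row class could look something like the sketch below. The three fields are placeholders, since the real columns are not given in the question; the point is only that a plain object with fixed fields avoids the per-entry overhead of a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical row type: id, name and city stand in for the real columns.
final class MyRecord {
    final int id;
    final String name;
    final String city;

    MyRecord(int id, String name, String city) {
        this.id = id;
        this.name = name;
        this.city = city;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MyRecord)) return false;
        MyRecord other = (MyRecord) o;
        return id == other.id
            && Objects.equals(name, other.name)
            && Objects.equals(city, other.city);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name, city);
    }
}

class MyRecordDemo {
    public static void main(String[] args) {
        List<MyRecord> rows = new ArrayList<>();
        rows.add(new MyRecord(1, "Ann", "Oslo"));
        // equals() compares field by field, so an equal row is found
        System.out.println(rows.contains(new MyRecord(1, "Ann", "Oslo"))); // prints "true"
    }
}
```

If the map-based layout has to stay, the pre-sized variant from the answer would be written as `new HashMap<String, Object>(7, 1f)`.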
Answer by Nakedible
If your dataset does not fit into memory, then do an external sort, and then the sort-merge join, as already pointed out in another answer.
If your dataset does fit into memory, then just use a lot of memory - it's fastest that way.
Or, if you are interested in specific optimizations that just do what you already do a little bit better - I can't help you.
Answer by Thorbjørn Ravn Andersen
If you do not have the memory, you will need external storage backing your data structure, which is hard to do correctly (maps of weak references to your data, which all need to be rolled out to disk, etc.), and you will probably still end up with bad performance when scaling.
If you really have lots and lots of data, I would suggest embedding a SQL database. Then you can generate two tables containing your data and ask the database to find out any differences, and drop the tables afterwards. I've previously played with Derby, which I found nice, but others exist.


