Best practice for holding huge lists of data in Java

Note: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/27943897/
Asked by Aviadjo
I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a Feature Selection process in order to select the most discriminative features.
The Feature Extraction process for a single file returns a Map which contains, for each unique feature, its number of occurrences in the file. I merge all the files' Maps into one Map that contains the Document Frequency (DF) of every unique feature extracted from all the files. The unified Map can contain more than 10,000,000 entries.
Currently the Feature Extraction process is working great, and I want to perform Feature Selection, for which I need to implement Information Gain or Gain Ratio. I will have to sort the Map first, perform the computations, and save the results, in order to finally get a list of (for each feature, its Feature Selection score).
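To make the merge step concrete, here is a minimal sketch of aggregating per-file maps into one DF map. The concrete types (Map<String, Integer>, features as Strings) are assumptions, since the original post does not show them:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DfMerger {
    // Aggregate per-file feature maps into one document-frequency map:
    // a feature's DF is the number of files in which it occurs at least once.
    public static Map<String, Integer> mergeDocumentFrequencies(List<Map<String, Integer>> perFileMaps) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> fileMap : perFileMaps) {
            for (String feature : fileMap.keySet()) {
                df.merge(feature, 1, Integer::sum); // +1 per file containing the feature
            }
        }
        return df;
    }
}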
My question is: what are the best practice and the best data structure for holding this large amount of data (~10M entries) and performing computations on it?
Answered by Malt
This is a very broad question, so the answer is going to be broad too. The solution depends on (at least) these three things:
- The size of your entries
Storing 10,000,000 integers will require about 40MiB of memory, while storing 10,000,000 x 1KiB records will require more than 9GiB. These are two different problems. Ten million integers are trivial to store in memory in any stock Java collection, while keeping 9GiB in memory will force you to tweak and tune the Java heap and garbage collector. If the entries are even larger, say 1MiB, then you can forget about in-memory storage entirely. Instead, you'll need to focus on finding a good disk-backed data structure, maybe a database.
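For a back-of-the-envelope check of those numbers (assuming 4-byte primitive ints; boxed Integer objects inside a HashMap cost several times more per entry because of object headers, references, and bucket overhead):

public class SizeEstimate {
    public static void main(String[] args) {
        long entries = 10_000_000L;
        long intPayload = entries * Integer.BYTES; // 40,000,000 bytes ~= 38 MiB
        long recordPayload = entries * 1024L;      // ~9.5 GiB for 1 KiB records
        System.out.printf("int payload:    ~%d MiB%n", intPayload >> 20);
        System.out.printf("record payload: ~%d GiB%n", recordPayload >> 30);
    }
}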
- The hardware you're using
Storing ten million 1KiB records on a machine with 8GiB of RAM is not the same as storing them on a server with 128GiB. Things that are pretty much impossible with the former machine are trivial with the latter.
- The type of computation(s) you want to do
You've mentioned sorting, so things like TreeMap or maybe PriorityQueue come to mind. But is that the most intensive computation? And what is the key you're using to sort them? Do you plan on locating (getting) entries based on other properties that aren't the key? If so, that requires separate planning. Otherwise you'd need to iterate over all ten million entries.
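If "sorting" really means "give me the N highest-scoring features", a bounded min-heap avoids sorting all ten million entries: it runs in O(n log k) rather than O(n log n). A sketch (the String/Double score types are assumptions):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopK {
    public static List<Map.Entry<String, Double>> topK(Map<String, Double> scores, int k) {
        // Min-heap ordered by score; the root is always the weakest of the current top k.
        PriorityQueue<Map.Entry<String, Double>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest score
            }
        }
        return new ArrayList<>(heap); // unordered; sort the k results if ranking matters
    }
}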
Do your computations run in a single thread or in multiple threads? If you might have concurrent modifications of your data, that requires a separate solution. Data structures such as TreeMap and PriorityQueue would have to be either locked or replaced with concurrent structures such as ConcurrentLinkedHashMap or ConcurrentSkipListMap.
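A minimal illustration of the concurrent alternative (ConcurrentSkipListMap is part of java.util.concurrent; it keeps keys sorted and supports concurrent updates without external locking):

import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class ConcurrentCounts {
    public static void main(String[] args) {
        // Sorted by key and safe to update from multiple threads.
        ConcurrentMap<String, Integer> counts = new ConcurrentSkipListMap<>();
        counts.merge("some_ngram", 1, Integer::sum); // atomic read-modify-write
        System.out.println(counts);
    }
}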
Answered by bachr
You can use a caching system. Check out MapDB: it's very efficient and has a tree map implementation, so you can have your data ordered without any effort. It also provides data stores to save your data to disk when it cannot be held in memory.
// A sample that uses off-heap (direct) memory to back the map (MapDB 1.x API):
import java.util.Map;
import org.mapdb.DBMaker;

Map<String, String> map = DBMaker.newMemoryDirectDB().make().getTreeMap("words");

// put some stuff into the map
map.put("aa", "bb");
map.put("cc", "dd");
Answered by Radu Stoenescu
My intuition is that you could take inspiration from the original MapReduce paradigm: partition your problem into several smaller but similar subproblems, and then aggregate those partial results to reach the complete solution.
If you solve one smaller problem instance at a time (i.e. one file chunk), this guarantees that your space consumption is bounded by the space requirements of that single instance.
This approach of processing the files lazily will work regardless of the data structure you choose.
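A sketch of that idea for the DF use case: the "map" step extracts features from one file at a time, and the "reduce" step folds each per-file result into the aggregate, so peak memory stays bounded by the aggregate plus a single file's features. The whitespace tokenizer here is a placeholder, not the original n-gram extractor:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceStyleDf {
    // "Map" step: one file -> its feature counts (placeholder whitespace tokenizer).
    static Map<String, Integer> extractFeatures(Path file) {
        Map<String, Integer> counts = new HashMap<>();
        try {
            for (String line : Files.readAllLines(file)) {
                for (String token : line.split("\\s+")) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // "Reduce" step: fold per-file results into document frequencies, in parallel.
    public static Map<String, Integer> documentFrequencies(List<Path> files) {
        return files.parallelStream()
                .flatMap(file -> extractFeatures(file).keySet().stream())
                .collect(Collectors.toConcurrentMap(f -> f, f -> 1, Integer::sum));
    }
}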