HashMap in Java, 100 Million entries

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4080370/

Tags: java, hashmap

Asked by ablimit

I want to store 100 million terms and their frequencies (in a text database) into a HashMap<String, Double>. It is giving me an "Out of Memory" error. I tried to increase the heap space to -Xmx15000M, but it runs for half an hour and then throws the same exception again. The file from which I'm trying to read the words and frequencies is 1.7 GB.

Any help would be much appreciated.

Thanks :-)

Accepted answer by Christoffer

For word processing like that, the answer is usually a tree rather than a hashmap, if you can live with the longer lookup times. That structure is quite memory-efficient for natural languages, where many words have common start strings.

Depending on the input, a Patricia tree might be even better.

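For illustration, here is a minimal sketch of such a prefix tree (trie) holding term frequencies. This is not the answerer's implementation, just one way the idea could look in Java; a production version would use a compressed (Patricia) layout rather than a HashMap per node, since the naive form trades key storage for per-node overhead.

import java.util.HashMap;
import java.util.Map;

// A minimal frequency trie: each node holds children keyed by character.
// Common prefixes ("swim", "swimmer", "swimming") share nodes, which is
// where the saving over HashMap<String, Double> comes from.
public class FrequencyTrie {
  private final Map<Character, FrequencyTrie> children = new HashMap<>();
  private double frequency;   // meaningful only when isTerm is true
  private boolean isTerm;

  public void put(String term, double frequency) {
    FrequencyTrie node = this;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.computeIfAbsent(term.charAt(i), c -> new FrequencyTrie());
    }
    node.isTerm = true;
    node.frequency = frequency;
  }

  public Double get(String term) {
    FrequencyTrie node = this;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return null;          // no term with this prefix
      }
    }
    return node.isTerm ? node.frequency : null;
  }
}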

(Also, if these are indeed words from a natural language, are you sure you really need 100,000,000 entries? The number of commonly used words is surprisingly low; commercial solutions (word prediction, spelling correction) rarely use more than 100,000 words, regardless of language.)

Answer by DJClayworth

With 100 million terms you are almost certainly over the limit of what should be stored in memory. Store your terms in a database of some kind. Either use a commercial database, or write something that lets you access the file to get the information you want. If your file format doesn't let you access the file quickly, convert it to one that does: for example, make each record a fixed size, so you can instantly calculate the file offset for any record number. Sorting the records will then let you do a binary search very quickly. You can also write code to hugely speed up access to the files without needing to store the whole file in memory.

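As a rough sketch of that idea (the record layout here is made up: each record is a fixed 40 bytes, 32 bytes of zero-padded UTF-8 term followed by an 8-byte double, and the file is sorted by term using the same ordering as String.compareTo), a binary search over the file could look like this:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Binary search over a file of fixed-size, sorted records.
// Assumed layout per record: 32 bytes of zero-padded UTF-8 term + 8-byte double.
public class TermFile {
  private static final int TERM_BYTES = 32;
  private static final int RECORD_BYTES = TERM_BYTES + Double.BYTES; // 40

  public static Double lookup(RandomAccessFile file, String term) throws IOException {
    long lo = 0;
    long hi = file.length() / RECORD_BYTES - 1;
    byte[] buf = new byte[TERM_BYTES];

    while (lo <= hi) {
      long mid = (lo + hi) >>> 1;
      file.seek(mid * RECORD_BYTES);   // offset is computable because records are fixed-size
      file.readFully(buf);
      String candidate = new String(buf, StandardCharsets.UTF_8).trim();

      int cmp = candidate.compareTo(term);
      if (cmp == 0) {
        return file.readDouble();      // frequency is stored right after the term bytes
      } else if (cmp < 0) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return null; // not found
  }
}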

Answer by AHungerArtist

Trove's THashMap uses a lot less memory. Still, I doubt that would be enough of a reduction in size. You need somewhere other than strictly in memory to store this information for retrieval.

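For reference, a sketch of what that might look like with Trove's primitive-valued map (assuming Trove 3.x, where TObjectDoubleHashMap stores the values as primitive doubles rather than boxed Doubles):

import gnu.trove.map.TObjectDoubleMap;
import gnu.trove.map.hash.TObjectDoubleHashMap;

public class TroveExample {
  public static void main(String[] args) {
    // No java.lang.Double object per entry; only the String keys remain as objects.
    TObjectDoubleMap<String> frequencies = new TObjectDoubleHashMap<>();
    frequencies.put("hello", 12.0);
    frequencies.adjustOrPutValue("hello", 1.0, 1.0); // add 1.0, or insert 1.0 if absent
    System.out.println(frequencies.get("hello"));    // 13.0
  }
}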

Answer by josefx

Your problem is that 1.7 GB of raw text is more than 1500 MB even before the overhead added by the individual String objects. For huge mappings you should use either a database or a file-backed Map; these use disk storage instead of heap.

Update

I don't think allocating 15 GB for the heap is possible for most JVMs. It won't work with any 32-bit JVM, and I originally doubted a 64-bit JVM would work either; however, a 15 GB heap should work on a 64-bit JVM when enough RAM is available.

Answer by exiter2000

As for the reason why it failed, I agree with the above answers.

A DB is a good choice. But even with a commercial-grade DB, you would still be advised to 'partition' the data to work with it effectively.

Depending on your environment, I might suggest distributing your data across multiple nodes connected through a LAN, based on the key value:

Node 01 has keys starting with 'a', Node 02 has keys starting with 'b', ...

So your program suddenly becomes a network programming problem.

Answer by zengr

It's a bad design. Rather than holding 1.7 GB of data in memory in a HashMap, I would have done either of the two:

  1. Persist all the data (file/database) and keep the top 1% or so in memory. Use some algorithm for deciding which IDs will be in memory and when (a minimal caching sketch follows this list).

  2. Use memcached. The easiest way out: an in-memory distributed hash table. This is exactly what DHTs are used for.

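Here is a minimal sketch of option 1's in-memory layer, using LinkedHashMap's access-order mode as a simple LRU cache in front of the persistent store; the on-disk lookup is a placeholder, and the eviction policy is just one possible "algorithm for deciding which IDs will be in memory":

import java.util.LinkedHashMap;
import java.util.Map;

// Keeps only the most recently used terms in memory; everything else
// lives in the file/database and is fetched on demand.
public class FrequencyCache extends LinkedHashMap<String, Double> {
  private final int maxEntries;

  public FrequencyCache(int maxEntries) {
    super(16, 0.75f, true);          // accessOrder = true -> LRU eviction order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
    return size() > maxEntries;      // evict once the cache exceeds its budget
  }

  public double frequency(String term) {
    Double cached = get(term);
    if (cached == null) {
      cached = loadFromDisk(term);   // hypothetical lookup against the persistent store
      put(term, cached);
    }
    return cached;
  }

  private double loadFromDisk(String term) {
    throw new UnsupportedOperationException("query the file/database here");
  }
}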

Answer by Adamski

Other answers have already pointed out that the problem lies with memory usage. Depending on your problem domain, you could design a key class that reduces the overall memory footprint. For example, if your key consists of natural language phrases, you could split and intern the words that make up the phrase; e.g.

import java.util.Arrays;

public class Phrase {
  private final String[] interned;

  public Phrase(String phrase) {
    // Split on whitespace and intern each word so that identical words
    // are shared across all Phrase instances.
    String[] tmp = phrase.split("\\s+");

    this.interned = new String[tmp.length];

    for (int i = 0; i < tmp.length; ++i) {
      this.interned[i] = tmp[i].intern();
    }
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Phrase && Arrays.equals(interned, ((Phrase) o).interned);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(interned);
  }
}
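
A brief usage sketch with made-up data: overlapping phrases reuse the same interned word Strings, so only the small String[] wrappers are duplicated per key.

import java.util.HashMap;
import java.util.Map;

public class PhraseDemo {
  public static void main(String[] args) {
    // "new" and "york" are stored once and shared by both keys.
    Map<Phrase, Double> frequencies = new HashMap<>();
    frequencies.put(new Phrase("new york city"), 42.0);
    frequencies.put(new Phrase("new york state"), 17.0);
    System.out.println(frequencies.size()); // 2
  }
}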

In fact this solution might work even if the Strings do not represent natural language, provided there is significant overlap between the Strings that can be exploited.

Answer by Joshua

If you just want a lightweight key-value (Map) store, I would look into using Redis. It is very fast and can persist the data if needed. The only downside is that you need to run the Redis store on a Linux machine.

If you are limited to Windows, MongoDB is a good option if you can run it in 64-bit.

Answer by Barend

Drop the HashMap and load all that data into HBase or one of the other NoSQL datastores, and write your queries in terms of MapReduce operations. This is the approach taken by Google Search and many other sites dealing with huge amounts of data. It has proven to scale to basically infinite size.

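As a rough illustration of that MapReduce style of query, here is the canonical Hadoop word-count pattern adapted to sum term frequencies; the class names and the "term<TAB>frequency" input format are made up for the example.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each input line is assumed to look like "term<TAB>frequency".
public class TermFrequencyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    context.write(new Text(parts[0]), new DoubleWritable(Double.parseDouble(parts[1])));
  }
}

// Reducer: sums the frequencies seen for each term across the whole dataset.
class TermFrequencyReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  protected void reduce(Text term, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    for (DoubleWritable v : values) {
      sum += v.get();
    }
    context.write(term, new DoubleWritable(sum));
  }
}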

Answer by Ivan

You could also try stemming to increase the number of duplicates.

For instance, cat = Cats = cats = Cat

or

swim = swimming = swims

try Googling "Porter Stemmer"

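A minimal sketch of the idea, with a crude lowercase-and-strip-"s" normaliser standing in for a real Porter stemmer (the merging of counts is the point, not the normalisation rule):

import java.util.HashMap;
import java.util.Map;

public class StemmedCounts {
  private final Map<String, Double> counts = new HashMap<>();

  // Crude stand-in for a real stemmer: lowercase and drop a trailing "s".
  private static String normalise(String word) {
    String w = word.toLowerCase();
    return w.endsWith("s") && w.length() > 1 ? w.substring(0, w.length() - 1) : w;
  }

  // Merging "cat", "Cats", "cats" and "Cat" into one entry cuts the number
  // of distinct keys the map has to hold.
  public void add(String word, double frequency) {
    counts.merge(normalise(word), frequency, Double::sum);
  }

  public Double get(String word) {
    return counts.get(normalise(word));
  }
}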