String Deduplication feature of Java 8

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27949213/

Retrieved: 2020-08-11 05:21:10. Source: igfitidea

Tags: java, string, java-8

Asked by Joe

Since String in Java (as in other languages) consumes a lot of memory because each character takes two bytes, Java 8 has introduced a new feature called String Deduplication, which takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

I have read this example so far, but since I am not a pro Java coder, I am having a hard time grasping the concept.

Here is what it says,

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My First question,

There is still a lack of resources on this topic since it was only recently added in Java 8 update 20. Could anyone here share some practical examples of how it helps reduce the memory consumed by String in Java?

Edit:

The above link says,

As soon as it finds another String which has the same hash code it compares them char by char

My 2nd question,

If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once it is found that the two Strings have the same hash code?

Accepted answer by assylias

Imagine you have a phone book that contains people, each of whom has a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file, those strings are not interned, so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory usage from 2MB to 20 bytes.
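
The phone-book scenario can be reproduced in a few lines. This is only a sketch: the class and method names (PhoneBookSketch, loadFirstNames, distinctObjects, estimatedBytes) are made up for illustration, and the 20 bytes per array is the rough figure assumed in this answer, not a measured value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the phone-book scenario; all names are made up.
class PhoneBookSketch {

    // Simulates reading the same first name from a file or database:
    // every new String(char[]) copies the characters into a fresh object.
    static List<String> loadFirstNames(int count) {
        char[] raw = {'J', 'o', 'h', 'n'};
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(new String(raw));
        }
        return names;
    }

    // Counts distinct String objects by identity (==), not by equals().
    static int distinctObjects(List<String> names) {
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        identities.addAll(names);
        return identities.size();
    }

    // The answer's back-of-the-envelope estimate: copies * bytes per char array.
    static long estimatedBytes(int copies, int bytesPerArray) {
        return (long) copies * bytesPerArray;
    }

    public static void main(String[] args) {
        List<String> names = loadFirstNames(100_000);
        System.out.println(distinctObjects(names));       // 100000 separate String objects
        System.out.println(estimatedBytes(100_000, 20));  // 2000000 bytes, i.e. about 2 MB
    }
}
```

Note that deduplication does not merge the String objects themselves; it only makes them share one backing array, so the identity count above stays the same while the array copies go away.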

You can find a more detailed explanation in the JEP. In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.

Answer by geert3

The strategy they describe is to simply reuse the internal character array of one String in possibly many equal Strings. There's no need for each String to have its own copy if they are equal.

To determine more quickly whether two strings are equal, the hash code is used as a first step, as it is a fast way to determine whether Strings may be equal. Hence their statement:

As soon as it finds another String which has the same hash code it compares them char by char

This is to make a certain (but slower) comparison for equality once possible equality has been determined using the hash code.

In the end, equal Strings will share a single underlying char array.
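
The two-step lookup (hash first, then char by char) can be sketched as a small table in plain Java. This is only an illustration under simplifying assumptions: the real table lives inside the garbage collector and holds weak references, and the names DedupTableSketch and canonicalize are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the JVM's implementation. Strong references are used
// here for simplicity, whereas the quoted description mentions weak references.
class DedupTableSketch {

    // Buckets keyed by hash; each bucket holds the canonical arrays seen so far.
    private final Map<Integer, List<char[]>> buckets = new HashMap<>();

    // Returns the canonical array for the given contents, registering it if new.
    char[] canonicalize(char[] candidate) {
        int hash = Arrays.hashCode(candidate);            // step 1: cheap hash lookup
        List<char[]> bucket = buckets.computeIfAbsent(hash, h -> new ArrayList<>());
        for (char[] known : bucket) {
            if (Arrays.equals(known, candidate)) {        // step 2: char-by-char check
                return known;                             // duplicate: reuse the known array
            }
        }
        bucket.add(candidate);                            // first array with this content
        return candidate;
    }

    public static void main(String[] args) {
        DedupTableSketch table = new DedupTableSketch();
        char[] first = "John".toCharArray();
        char[] second = "John".toCharArray();
        System.out.println(table.canonicalize(first) == first);   // true
        System.out.println(table.canonicalize(second) == first);  // true
    }
}
```

After the second call, the caller could drop its own copy and keep only the returned array, which is the effect described above: equal contents end up sharing one array.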

Java has had String.intern() for a long time, to do more or less the same (i.e. save memory by deduplicating equal Strings). What's novel about this is that it happens during garbage collection and can be externally controlled.
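
For comparison, here is a minimal sketch of what String.intern() does when called explicitly from application code. The class and method names are made up for illustration.

```java
// Hypothetical helper class; only String.intern() itself is a real API.
class InternSketch {

    // intern() returns one canonical String object per distinct content.
    static boolean sameObjectAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("John");
        String b = new String("John");
        System.out.println(a == b);                       // false: two separate objects
        System.out.println(a.equals(b));                  // true: same contents
        System.out.println(sameObjectAfterIntern(a, b));  // true: one canonical object
    }
}
```

The difference is that intern() has to be called by the program and hands back a shared String object, while deduplication is done by the JVM and only shares the underlying char array.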

Answer by mbomb007

Since your first question has already been answered, I'll answer your second question.

The String objects must be compared character by character, because though equal Objects implies equal hashes, the inverse is not necessarily true.

As Holger said in his comment, this represents a hash collision.

The applicable specifications for the hashCode() method are as follows:

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. ...

This means that a character-by-character comparison is necessary to guarantee that the two objects are really equal. They start by comparing hashCodes rather than using equals since they are using a hash table for the references, and this improves performance.
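
A concrete collision makes the point. The strings "Aa" and "BB" are a well-known pair with the same String.hashCode() value; the class and method names below are made up for illustration.

```java
// Hypothetical demo class showing that equal hash codes do not imply equal Strings.
class HashCollisionSketch {

    // True when two strings collide: same hash code, different contents.
    static boolean sameHashButDifferent(String a, String b) {
        return a.hashCode() == b.hashCode() && !a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());                   // 2112 ('A' * 31 + 'a')
        System.out.println("BB".hashCode());                   // 2112 ('B' * 31 + 'B')
        System.out.println(sameHashButDifferent("Aa", "BB"));  // true
    }
}
```

If the deduplication relied on hash codes alone, "Aa" and "BB" would wrongly end up sharing one array, which is why the char-by-char check follows the hash match.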

Answer by Robert Niestroj

@assylias's answer basically tells you how it works and is a very good answer. I have tested a production application with String Deduplication and have some results. The web app heavily uses Strings, so I think the advantage is pretty clear.

To enable String Deduplication you have to add these JVM params (you need at least Java 8u20):

-XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics
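
To confirm from inside the application that these flags were actually passed, one option is to inspect the JVM's input arguments. This is a minimal sketch: the class and method names are hypothetical, and it only checks the command line, so it will not notice a JVM where G1 is already the default collector.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports whether the two required flags were given at startup.
class DedupFlagCheck {

    // Pure check over a list of JVM arguments, kept separate so it is easy to test.
    static boolean dedupRequested(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseG1GC")
                && jvmArgs.contains("-XX:+UseStringDeduplication");
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the program arguments in args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("String deduplication requested: " + dedupRequested(jvmArgs));
    }
}
```

Running it as java -XX:+UseG1GC -XX:+UseStringDeduplication DedupFlagCheck should report that deduplication was requested.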

The last flag, -XX:+PrintStringDeduplicationStatistics, is optional, but as the name says, it shows you the String Deduplication statistics. Here are mine:

[GC concurrent-string-deduplication, 2893.3K->2672.0B(2890.7K), avg 97.3%, 0.0175148 secs]
   [Last Exec: 0.0175148 secs, Idle: 3.2029081 secs, Blocked: 0/0.0000000 secs]
      [Inspected:           96613]
         [Skipped:              0(  0.0%)]
         [Hashed:           96598(100.0%)]
         [Known:                2(  0.0%)]
         [New:              96611(100.0%)   2893.3K]
      [Deduplicated:        96536( 99.9%)   2890.7K( 99.9%)]
         [Young:                0(  0.0%)      0.0B(  0.0%)]
         [Old:              96536(100.0%)   2890.7K(100.0%)]
   [Total Exec: 452/7.6109490 secs, Idle: 452/776.3032184 secs, Blocked: 11/0.0258406 secs]
      [Inspected:        27108398]
         [Skipped:              0(  0.0%)]
         [Hashed:        26828486( 99.0%)]
         [Known:            19025(  0.1%)]
         [New:           27089373( 99.9%)    823.9M]
      [Deduplicated:     26853964( 99.1%)    801.6M( 97.3%)]
         [Young:             4732(  0.0%)    171.3K(  0.0%)]
         [Old:           26849232(100.0%)    801.4M(100.0%)]
   [Table]
      [Memory Usage: 2834.7K]
      [Size: 65536, Min: 1024, Max: 16777216]
      [Entries: 98687, Load: 150.6%, Cached: 415, Added: 252375, Removed: 153688]
      [Resize Count: 6, Shrink Threshold: 43690(66.7%), Grow Threshold: 131072(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
   [Queue]
      [Dropped: 0]

These are the results after running the app for 10 minutes. As you can see, String Deduplication was executed 452 times and "deduplicated" 801.6 MB of Strings. String Deduplication inspected about 27,000,000 Strings. When I compared my memory consumption from Java 7 with the standard Parallel GC to Java 8u20 with the G1 GC and String Deduplication enabled, the heap dropped approximately 50%:

[Figure: Java 7 Parallel GC]

[Figure: Java 8 G1 GC with String Deduplication]