java.lang.String 是否有内存高效的替代品?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/231051/

Date: 2020-08-11 11:48:42  Source: igfitidea

Is there a memory-efficient replacement of java.lang.String?

Tags: java, string, optimization, memory, performance

Asked by the.duckman

After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:


length: 0, {class java.lang.String} size = 40 bytes
length: 7, {class java.lang.String} size = 56 bytes

While the article has some tips to minimize this, I did not find them entirely satisfying. It seems wasteful to use a char[] for storing the data. The obvious improvement for most western languages would be to use a byte[] and an encoding like UTF-8 instead, as you then only need a single byte to store the most frequent characters, instead of two.


Of course one could use String.getBytes("UTF-8") and new String(bytes, "UTF-8"). Even the overhead of the String instance itself would be gone. But then you lose very handy methods like equals(), hashCode(), length(), ...

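
The idea above can be sketched as a small wrapper that stores the text as UTF-8 bytes yet keeps equals(), hashCode(), and length() available. The class name CompactString is made up for illustration, and this assumes Java 7+ for StandardCharsets:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: store text as UTF-8 bytes instead of a char[],
// while still offering the handy String-like methods.
final class CompactString {
    private final byte[] utf8;

    CompactString(String s) {
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
    }

    // O(n): must scan the bytes to count code points.
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    int length() {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) count++;
        }
        return count;
    }

    @Override public boolean equals(Object o) {
        return o instanceof CompactString
            && Arrays.equals(utf8, ((CompactString) o).utf8);
    }

    @Override public int hashCode() { return Arrays.hashCode(utf8); }

    @Override public String toString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Note the trade-off: length() has become an O(n) scan instead of an O(1) field read.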

Sun has a patent on a byte[] representation of Strings, as far as I can tell:


Frameworks for efficient representation of string objects in Java programming environments
... The techniques can be implemented to create Java string objects as arrays of one-byte characters when it is appropriate ...


But I failed to find an API for that patent.


Why do I care?
In most cases I don't. But I worked on applications with huge caches, containing lots of Strings, which would have benefitted from using the memory more efficiently.


Does anybody know of such an API? Or is there another way to keep your memory footprint for Strings small, even at the cost of CPU performance or uglier API?


Please don't repeat the suggestions from the above article:


  • your own variant of String.intern() (possibly with SoftReferences)
  • storing a single char[] and exploiting the current String.subString(.) implementation to avoid data copying (nasty)

Update


I ran the code from the article on Sun's current JVM (1.6.0_10). It yielded the same results as in 2002.


Answered by FlySwat

Out of curiosity, are the few bytes saved really worth it?


Normally, I suggest ditching strings for performance reasons, in favor of StringBuffer (Remember, Strings are immutable).


Are you seriously exhausting your heap from string references?


Answered by jsight

Just compress them all with gzip. :) Just kidding... but I have seen stranger things, and it would give you much smaller data at significant CPU expense.

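
Half-joking or not, the gzip idea is easy to sketch with the JDK's built-in java.util.zip classes. It only pays off for long, repetitive strings, since the gzip header alone costs on the order of 20 bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Trade CPU for memory: keep strings gzip-compressed until needed.
final class GzipStrings {
    static byte[] compress(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    static String decompress(byte[] data) throws IOException {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) > 0) bos.write(buf, 0, n);
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }
}
```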

The only other String implementations that I'm aware of are the ones in the Javolution classes. I don't think that they are more memory efficient, though:


http://www.javolution.com/api/javolution/text/Text.html
http://www.javolution.com/api/javolution/text/TextBuilder.html


Answered by matt b

I think you should be very cautious about basing any ideas and/or assumptions off of a javaworld.com article from 2002. There have been many, many changes to the compiler and JVM in the six years since then. At the very least, test your hypothesis and solution against a modern JVM first to make sure that the solution is even worth the effort.


Answered by nkr1pt

I believe that Strings have been less memory intensive for some time now, because the Java engineers have implemented the flyweight design pattern to share as much as possible. In fact, I believe that Strings that have the same value point to the very same object in memory.


Answered by Sam Stokes

You said not to repeat the article's suggestion of rolling your own interning scheme, but what's wrong with String.intern itself? The article contains the following throwaway remark:


Numerous reasons exist to avoid the String.intern() method. One is that few modern JVMs can intern large amounts of data.


But even if the memory usage figures from 2002 still hold six years later, I'd be surprised if no progress has been made on how much data JVMs can intern.


This isn't purely a rhetorical question - I'm interested to know if there are good reasons to avoid it. Is it implemented inefficiently for highly-multithreaded use? Does it fill up some special JVM-specific area of the heap? Do you really have hundreds of megabytes of unique strings (so interning would be useless anyway)?

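
For reference, the guarantee String.intern() gives is easy to demonstrate: equal strings collapse to one canonical instance, so the duplicate copies become garbage-collectable and == comparison works on the interned references:

```java
// String.intern() returns one canonical instance per distinct content.
final class InternDemo {
    static boolean sameInstanceAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }
}
```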

Answered by Bill K

There is the overhead of creating an object (at least a dispatch table), the overhead of using 2 bytes per letter, and the overhead of a few extra variables that are there to actually improve speed and memory usage in many cases.


If you are going to use OO programming, this is the cost of having clear, usable, maintainable code.


For an answer besides the obvious (which is that if memory usage is that important, you should probably be using C), you could implement your own Strings with an internal representation in BCD byte-arrays.


That actually sounds fun, I might do it just for kicks :)


A Java char array takes 2 bytes per item. A BCD-style packed encoding takes 6 bits per letter IIRC, making your strings significantly smaller. There would be a small conversion cost in time, but not too bad really. The really big problem is that you'd have to convert back to String to do anything with the data.

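
The 6-bits-per-letter idea could look roughly like this. The 64-symbol alphabet below is an assumption made up for illustration; with 6 bits per symbol, four characters pack into every three bytes:

```java
// Pack characters from a 64-symbol alphabet into 6 bits each.
final class SixBitCodec {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .";

    static byte[] encode(String s) {
        byte[] out = new byte[(s.length() * 6 + 7) / 8];
        for (int i = 0; i < s.length(); i++) {
            int code = ALPHABET.indexOf(s.charAt(i)); // 0..63, i.e. 6 bits
            if (code < 0) throw new IllegalArgumentException("not encodable");
            int bit = i * 6;
            for (int j = 0; j < 6; j++, bit++) {
                // write the code's bits MSB-first into the output bit stream
                if ((code & (1 << (5 - j))) != 0) {
                    out[bit / 8] |= (byte) (1 << (7 - bit % 8));
                }
            }
        }
        return out;
    }

    static String decode(byte[] data, int charCount) {
        StringBuilder sb = new StringBuilder(charCount);
        for (int i = 0; i < charCount; i++) {
            int code = 0;
            int bit = i * 6;
            for (int j = 0; j < 6; j++, bit++) {
                code <<= 1;
                if ((data[bit / 8] & (1 << (7 - bit % 8))) != 0) code |= 1;
            }
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }
}
```

A 15-character string shrinks from 30 bytes of char data to 12 packed bytes, at the cost of a conversion on every use.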

You still have the overhead of an object instance to worry about... but that would be better addressed by revamping your design than trying to eliminate instances.


Finally a note. I'm completely against deploying anything like this unless you have 3 things:


  • An implementation done the most readable way
  • Test results and requirements showing how that implementation doesn't meet requirements
  • Test results on how the "improved" implementation DOES meet requirements.

Without all three of those, I'd kick any optimized solution a developer presented to me.


Answered by Mecki

Java chose UTF-16 as a compromise between speed and storage size. Processing UTF-8 data is much more of a pain than processing UTF-16 data. For example, when trying to find the position of character X in the byte array, how are you going to do so quickly if every character can take one, two, three, or four bytes? Ever thought about that? Going over the string byte by byte is not really fast, you see? Of course UTF-32 would be easiest to process, but it wastes twice the storage space. Things have changed since the early Unicode days: now certain characters need 4 bytes even when UTF-16 is used, and handling those correctly makes UTF-16 almost as bad as UTF-8.

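
The indexing problem described above is concrete: in UTF-8 you cannot jump to character n directly, you have to scan for it, skipping continuation bytes (bit pattern 10xxxxxx). A minimal sketch:

```java
// Find the byte offset of the n-th code point in a UTF-8 byte array.
// O(n) per lookup, versus O(1) array indexing into a char[].
final class Utf8Index {
    static int byteOffsetOfCodePoint(byte[] utf8, int n) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            // a byte that is NOT 10xxxxxx starts a new code point
            if ((utf8[i] & 0xC0) != 0x80) {
                if (seen == n) return i;
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("n=" + n);
    }
}
```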

Anyway, rest assured that if you implement a String class with internal storage that uses UTF-8, you might win some memory, but you will lose processing speed on many string methods. Also, your argument takes far too limited a point of view: it will not hold true for someone in Japan, since Japanese characters are not smaller in UTF-8 than in UTF-16 (in fact, they take 3 bytes in UTF-8, while they are only 2 bytes in UTF-16). I don't understand why programmers in such a global world as today's, with the omnipresent Internet, still talk about "western languages", as if that were all that counted, as if only the western world had computers and the rest of it lived in caves. Sooner or later any application gets bitten by the fact that it fails to effectively process non-western characters.


Answered by benjismith

An internal UTF-8 encoding has its advantages (such as the smaller memory footprint that you pointed out), but it has disadvantages too.


For example, determining the character length (rather than the byte length) of a UTF-8 encoded string is an O(n) operation. In a Java String, the cost of determining the character length is O(1), while generating the UTF-8 representation is O(n).


It's all about priorities.


Data-structure design can often be seen as a tradeoff between speed and space. In this case, I think the designers of the Java string API made a choice based on these criteria:


  • The String class must support all possible Unicode characters.

  • Although Unicode characters can require 1, 2, or 4 bytes depending on the encoding, the 4-byte characters are (in practice) pretty rare, so it's okay to represent them as surrogate pairs. That's why Java uses a 2-byte char primitive.

  • When people call the length(), indexOf(), and charAt() methods, they're interested in the character position, not the byte position. In order to create fast implementations of these methods, it's necessary to avoid an internal UTF-8 encoding.

  • Languages like C++ make the programmer's life more complicated by defining three different character types and forcing the programmer to choose between them. Most programmers start off using simple ASCII strings, but when they eventually need to support international characters, the process of modifying the code to use multibyte characters is extremely painful. I think the Java designers made an excellent compromise choice by saying that all strings consist of 2-byte characters.

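
The surrogate-pair compromise can be seen directly with a character outside the Basic Multilingual Plane, e.g. U+1D11E (musical symbol G clef): it occupies two Java chars, so length() and codePointCount() disagree.

```java
// A supplementary character is stored as a high/low surrogate pair.
final class SurrogateDemo {
    static final String CLEF = new String(Character.toChars(0x1D11E));
}
```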

Answered by Stephen Denne

The article points out two things:


  1. Character arrays increase in chunks of 8 bytes.
  2. There is a large difference in size between char[] and String objects.

The overhead is due to including a char[] object reference, and three ints: an offset, a length, and space for storing the String's hashcode, plus the standard overhead of simply being an object.

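
Those fields are enough to reproduce the 40- and 56-byte figures from the question, assuming the 32-bit JVM layout of that era: 8-byte object headers, 4-byte references, 8-byte alignment, and a String holding one char[] reference plus three ints. This arithmetic is a back-of-the-envelope sketch, not a guaranteed layout:

```java
// Estimate the footprint of a String on an early 32-bit HotSpot JVM.
final class StringFootprint {
    static int align8(int n) { return (n + 7) & ~7; }

    static int estimatedSize(int length) {
        // String object: header + char[] ref + offset + count + hash
        int stringObj = align8(8 + 4 + 3 * 4);      // = 24 bytes
        // char[] object: header + length field + 2 bytes per char
        int charArray = align8(8 + 4 + 2 * length);
        return stringObj + charArray;
    }
}
```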

Slightly different from String.intern(), or from the character array shared by String.substring(), is using a single char[] for all Strings. This means you do not need to store the object reference in your wrapper String-like object. You would still need the offset, and you introduce a (large) limit on how many characters you can have in total.


You would no longer need the length if you use a special end of string marker. That saves four bytes for the length, but costs you two bytes for the marker, plus the additional time, complexity, and buffer overrun risks.


The space-time trade-off of not storing the hash may help you if you do not need it often.


For an application that I've worked with, where I needed super fast and memory efficient treatment of a large number of strings, I was able to leave the data in its encoded form, and work with byte arrays. My output encoding was the same as my input encoding, and I didn't need to decode bytes to characters nor encode back to bytes again for output.


In addition, I could leave the input data in the byte array it was originally read into - a memory mapped file.


My objects consisted of an int offset (the limit suited my situation), an int length, and an int hashcode.

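
A sketch of such an object over a shared byte array might look like this. The class name ByteSlice is made up for illustration; the real implementation would sit on top of a memory-mapped buffer rather than a heap array:

```java
// A lightweight String substitute: offset + length + cached hash into
// one shared byte[] (e.g. the buffer a memory-mapped file was read into).
final class ByteSlice {
    private final byte[] data;   // shared backing array, never copied
    private final int offset;
    private final int length;
    private final int hash;      // cached, like String's hashCode

    ByteSlice(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
        int h = 0;
        for (int i = offset; i < offset + length; i++) h = 31 * h + data[i];
        this.hash = h;
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ByteSlice)) return false;
        ByteSlice s = (ByteSlice) o;
        if (length != s.length) return false;
        for (int i = 0; i < length; i++) {
            if (data[offset + i] != s.data[s.offset + i]) return false;
        }
        return true;
    }
}
```

Because the bytes are never decoded to chars, equality and hashing work directly on the encoded form, which is exactly what this answer relies on.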

java.lang.String was the familiar hammer for what I wanted to do, but not the best tool for the job.


Answered by Alex Miller

At Terracotta, we have some cases where we compress big Strings as they are sent around the network and actually leave them compressed until decompression is necessary. We do this by converting the char[] to byte[], compressing the byte[], then encoding that byte[] back into the original char[]. For certain operations like hash and length, we can answer those questions without decoding the compressed string. For data like big XML strings, you can get substantial compression this way.


Moving the compressed data around the network is a definite win. Keeping it compressed is dependent on the use case. Of course, we have some knobs to turn this off and change the length at which compression turns on, etc.


This is all done with bytecode instrumentation on java.lang.String, which we've found to be very delicate due to how early String is used during startup, but it is stable if you follow some guidelines.
