Java: how to deal with big strings and limited memory

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2148394/

Date: 2020-10-29 19:41:46  Source: igfitidea

How to deal with big strings and limited memory

Tags: java, string, memory, out-of-memory

Asked by hsmit

I have a file from which I read data. All the text from this file is stored in a String variable (a very big variable). Then in another part of my app I want to walk through this string and extract useful information, step-by-step (parsing the string).

In the meantime my memory fills up and an OutOfMemory exception keeps me from further processing. I think it would be better to process the data directly while reading the input stream from the file. But for organizational reasons, I would like to pass the String to another part of my application.

What should I do to keep the memory from overflowing?

Answer by Zombies

You should be using a BufferedReader instead of storing all of this in one large string.
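
As a minimal sketch of that idea (the class name LineStreamer and the method forEachLine are made up for this illustration, not part of the answer), the file can be read one line at a time and each line handed to whatever part of the application does the parsing. This keeps the code organized without ever building the full String:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical helper: streams a text source line by line.
final class LineStreamer {

    // Reads the source one line at a time and passes each line to the handler,
    // so only the current line (plus the reader's buffer) is held in memory.
    // Returns the number of lines that were processed.
    static long forEachLine(Reader source, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }
}
```

A caller could pass something like new FileReader("data.txt") (a placeholder file name) together with a lambda that parses one line, which keeps the reading and the parsing in separate parts of the application.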

If what you want to parse happens to be on a single line, then StringTokenizer will work quite nicely; otherwise you have to devise a way to read what you need from the file to pick out statements, and then apply StringTokenizer to each statement.
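
For illustration only, here is a small sketch of tokenizing one line (or one statement) with StringTokenizer. The class name LineTokens and the choice of delimiters are assumptions; the real delimiters depend on the file format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper: splits a single line into tokens.
final class LineTokens {

    // Every character in 'delimiters' is treated as a separator;
    // runs of consecutive separators produce no empty tokens.
    static List<String> tokens(String line, String delimiters) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, delimiters);
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }
}
```

Because only one line is tokenized at a time, the memory needed is bounded by the longest line rather than by the size of the file.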

Answer by Thomas Jung

If you can loosen your requirements a bit, you could implement a java.lang.CharSequence backed by your file.

CharSequence is supported in many places in the JDK (a String is a CharSequence), so this is a good alternative to a Reader-based implementation.
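
One possible way to sketch such a class is shown below. It is not the answerer's implementation: the name FileCharSequence is invented, and it assumes the file uses a single-byte encoding (ISO-8859-1) and is smaller than 2 GB so that one memory mapping can cover it. The file contents stay outside the Java heap and are paged in by the operating system on demand:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical file-backed CharSequence.
// Assumption: one byte per character (ISO-8859-1) and file size < 2 GB.
final class FileCharSequence implements CharSequence {
    private final ByteBuffer bytes; // memory-mapped file contents
    private final int start;        // first byte of this (sub)sequence
    private final int end;          // one past the last byte

    private FileCharSequence(ByteBuffer bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    static FileCharSequence open(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping");
            }
            // The mapping remains valid after the channel is closed.
            ByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return new FileCharSequence(mapped, 0, (int) size);
        }
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Absolute read; the mask turns the signed byte into 0..255.
        return (char) (bytes.get(start + index) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Shares the mapping; no characters are copied.
        return new FileCharSequence(bytes, start + from, start + to);
    }

    @Override
    public String toString() {
        // Copies this (sub)sequence onto the heap; call it only on small pieces.
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

An instance can be handed to APIs that accept a CharSequence, for example Pattern.matcher(...), so the parsing code does not need to know that the text lives in a file. A multi-byte encoding such as UTF-8 would need a different charAt strategy, since character positions no longer match byte positions.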

Answer by Kevin Brock

Others have suggested reading and processing portions of your file one at a time. If possible, one of those ways would be better.

However, if this is not possible and you are able to load the String into memory initially, as you indicate, and it is only the later parsing of this string that creates problems, you may be able to use substrings. In Java a substring maps on top of the original char array and only takes memory for the base Object plus the start and length int fields.

So, when you find a portion of the string that you want to keep separately, use something like:

String piece = largeString.substring(foundStart, foundEnd);

If you instead use this, or code that internally does this, then memory use will increase dramatically:

new String(largeString.substring(foundStart, foundEnd));

Note that you must use String.substring() with care for this very reason. You could have a very large string from which you take a substring and then discard your reference to the original string. The problem is that the substring still references the original large char array, and the GC will not release it until the substring is also unreachable. In cases like this it is useful to actually use new String(...) to ensure the unused large array can be discarded by the GC (this is one of the few cases where you should ever use new String(...)).
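
A small sketch of that defensive copy follows; the class and method names are invented for this example. On JVMs where substring() shares the parent's char array, the copy lets the large string be collected once nothing else refers to it. On JVMs where substring() already copies, the extra new String(...) is redundant but harmless:

```java
// Hypothetical helper illustrating the defensive copy described above.
final class Pieces {

    // Returns a piece of 'large' that holds its own characters, so keeping
    // the piece does not keep the large string's backing array alive.
    static String detachedCopy(String large, int start, int end) {
        return new String(large.substring(start, end));
    }
}
```

This is only worth doing when the piece outlives the large string; while the large string is still in use, the copy just costs extra memory.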

Another technique, if you expect to have lots of little strings around that are likely to have the same values but come from an external source (like a file), is to call .intern() after creating the new string.
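
To illustrate (with invented names), two strings built separately from external data are distinct objects, but their interned forms are the same object, so repeated values are stored only once:

```java
// Hypothetical helper: returns the canonical (pooled) instance of a value.
final class Canonical {

    // Strings with equal contents map to the same interned instance,
    // so duplicates read from a file can share one object.
    static String of(char[] rawChars, int offset, int count) {
        return new String(rawChars, offset, count).intern();
    }
}
```

Interning pays off only when many values repeat; for mostly unique strings it adds lookup cost and grows the intern pool without saving memory.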

Note: This does depend on the implementation of String, which you really shouldn't have to be aware of, but in practice for large applications you sometimes do have to rely on that knowledge. Be aware that later versions of Java can change this: since Java 7 update 6, substring() copies the characters into a new array instead of sharing the original one.

Answer by whiter4bbit

You must review your algorithm for dealing with large data. You must process this data chunk by chunk, or use random file access without storing the data in memory. For example, you can use StringTokenizer or StreamTokenizer, as @Zombies said. You can also look at parser-lexer techniques: when the parser parses an expression, it asks the lexer to read the next lexeme (token), but it doesn't read the whole input stream at once.
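
The pull-style lexing described above can be sketched with StreamTokenizer. The class name TokenPull and the task of summing the numbers in the input are assumptions chosen to keep the example short. The tokenizer reads just enough of the stream to produce the next token, so the whole input is never held in memory:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

// Hypothetical example: the "parser" asks the lexer for one token at a time.
final class TokenPull {

    // Sums every numeric token in the source, ignoring words and punctuation.
    // Uses StreamTokenizer's default syntax (numbers are parsed as doubles).
    static double sumNumbers(Reader source) throws IOException {
        StreamTokenizer lexer = new StreamTokenizer(source);
        double sum = 0;
        while (lexer.nextToken() != StreamTokenizer.TT_EOF) {
            if (lexer.ttype == StreamTokenizer.TT_NUMBER) {
                sum += lexer.nval;
            }
            // TT_WORD tokens are available in lexer.sval; they are skipped here.
        }
        return sum;
    }
}
```

Wrapping a FileReader in a BufferedReader before passing it in keeps the reads efficient while memory use still stays bounded by the buffer size.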