javascript: JSON.parse() on a large array of objects is using way more memory than it should
Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30564728/
Asked by Ahmed Fasih
I generate a ~200'000-element array of objects (using object literal notation inside `map` rather than `new Constructor()`), and I'm saving a `JSON.stringify`'d version of it to disk, where it takes up 31 MB, including newlines and one space per indentation level (`JSON.stringify(arr, null, 1)`).
Then, in a new node process, I read the entire file into a UTF-8 string and pass it to `JSON.parse`:
var fs = require('fs');
var arr1 = JSON.parse(fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}));
Node memory usage is about 1.05 GB according to Mavericks' Activity Monitor! Even typing into a Terminal feels laggier on my ancient 4 GB RAM machine.
But if, in a new node process, I load the file's contents into a string, chop it up at element boundaries, and `JSON.parse` each element individually, ostensibly getting the same object array:
var fs = require('fs');
var arr2 = fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}).trim().slice(1,-3).split('\n },').map(function(s) {return JSON.parse(s+'}');});
node is using just ~200 MB of memory, and no noticeable system lag. This pattern persists across many restarts of node: `JSON.parse`ing the whole array takes a gig of memory, while parsing it element-wise is much more memory-efficient.
Why is there such a huge disparity in memory usage? Is this a problem with `JSON.parse` preventing efficient hidden-class generation in V8? How can I get good memory performance without slicing and dicing strings? Must I use a streaming JSON parser?
For ease of experimentation, I've put the JSON file in question in a Gist, please feel free to clone it.
Accepted answer by Michael Geary
A few points to note:
- You've found that, for whatever reason, it's much more efficient to do individual `JSON.parse()` calls on each element of your array instead of one big `JSON.parse()`.
- The data format you're generating is under your control. Unless I misunderstood, the data file as a whole does not have to be valid JSON, as long as you can parse it.
- It sounds like the only issue with your second, more efficient method is the fragility of splitting the original generated JSON.
This suggests a simple solution: instead of generating one giant JSON array, generate an individual JSON string for each element of your array, with no newlines in the JSON string, i.e. just use `JSON.stringify(item)` with no `space` argument. Then join those JSON strings with newlines (or any character that you know will never appear in your data) and write out that data file.
When you read this data back, split the incoming data on the newlines, then do the `JSON.parse()` on each of those lines individually. In other words, this step is just like your second solution, but with a straightforward string split instead of having to fiddle with character counts and curly braces.
Your code might look something like this (really just a simplified version of what you posted):
var fs = require('fs');
var arr2 = fs.readFileSync(
'JMdict-all.json',
{ encoding: 'utf8' }
).trim().split('\n').map( function( line ) {
return JSON.parse( line );
});
As you noted in an edit, you could simplify this code to:
var fs = require('fs');
var arr2 = fs.readFileSync(
'JMdict-all.json',
{ encoding: 'utf8' }
).trim().split('\n').map( JSON.parse );
But I would be careful about this. It does work in this particular case, but there is a potential danger in the more general case.
The `JSON.parse` function takes two arguments: the JSON text and an optional "reviver" function.
The `[].map()` function passes three arguments to the function it calls: the item value, the array index, and the entire array.
So if you pass `JSON.parse` directly, it is called with the JSON text as the first argument (as expected), but it is also passed a number for the "reviver" argument. `JSON.parse()` ignores that second argument because it is not a function reference, so you're OK here. But you can probably imagine other cases where you could get into trouble - so it's always a good idea to triple-check this when you pass an arbitrary function that you didn't write into `[].map()`.
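A classic illustration of this pitfall (my example, not from the answer) is `parseInt`, whose optional second parameter is a radix; the index that `map` passes in silently lands in that radix slot:

```javascript
// map calls parseInt(value, index, array), so the index becomes the radix.
var naive = ['10', '10', '10'].map(parseInt);
// parseInt('10', 0) → 10  (radix 0 means "auto-detect", here base 10)
// parseInt('10', 1) → NaN (radix 1 is invalid)
// parseInt('10', 2) → 2   (binary)

// Wrapping the call so that only the value is forwarded avoids the surprise.
var safe = ['10', '10', '10'].map(function (s) { return parseInt(s, 10); });
// safe is [10, 10, 10]
```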
Answered by debater
I think a comment hinted at the answer to this question, but I'll expand on it a little. The 1 GB of memory being used presumably includes a large number of allocations of data that is actually 'dead' (in that it has become unreachable and is therefore not really being used by the program any more) but has not yet been collected by the Garbage Collector.
Almost any algorithm processing a large data set is likely to produce a very large amount of detritus in this manner when the programming language/technology used is a typical modern one (e.g. Java/JVM, C#/.NET, JavaScript). Eventually the GC removes it.
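A rough way to observe this yourself is to watch `process.memoryUsage().heapUsed` around a big `JSON.parse`, and, if node was started with `--expose-gc`, again after forcing a collection (a sketch; the synthetic data and exact numbers are my own, and they vary by V8 version):

```javascript
function mb(bytes) { return (bytes / 1024 / 1024).toFixed(1) + ' MB'; }

// Build a biggish pretty-printed JSON text in memory, standing in for the file.
var items = [];
for (var i = 0; i < 100000; i++) items.push({ id: i, word: 'entry' + i });
var text = JSON.stringify(items, null, 1);

var before = process.memoryUsage().heapUsed;
var parsed = JSON.parse(text);
var after = process.memoryUsage().heapUsed;
console.log('heap growth from JSON.parse:', mb(after - before));

// With `node --expose-gc`, force a collection to see how much of that
// growth was ephemeral garbage rather than the live parsed array.
if (global.gc) {
  global.gc();
  console.log('after forced gc:', mb(process.memoryUsage().heapUsed - before));
}
```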
It is interesting to note that techniques can be used to dramatically reduce the amount of ephemeral memory allocation that certain algorithms incur (by having pointers into the middles of strings), but I think these techniques are hard or impossible to employ in JavaScript.