JavaScript: Parse a large JSON file in Node.js

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/11874096/

Parse large JSON file in Nodejs

Tags: javascript, json, file, node.js

Asked by dgh

I have a file which stores many JavaScript objects in JSON form and I need to read the file, create each of the objects, and do something with them (insert them into a db in my case). The JavaScript objects can be represented in a format:

Format A:

[{name: 'thing1'},
....
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure on how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function(item) {
   console.log("Woot, imported objects into the database!");
});

Note, I wish to avoid reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak - I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

I can choose to use Format A or Format B or maybe something else; just please specify in your answer. Thanks!

Accepted answer by josh3736

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

var fs = require('fs');

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line=line.substr(0,line.length-1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices the buffer from the beginning up to the newline and hands it off to processLine. It then checks again whether there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If a carriage return is present, it strips it off (to avoid issues with line endings – LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.

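A quick sketch illustrating the difference:

try {
    JSON.parse("{name:'thing1'}");             // unquoted key and single quotes: not valid JSON
} catch (e) {
    console.log(e instanceof SyntaxError);     // true
}
console.log(JSON.parse('{"name":"thing1"}'));  // { name: 'thing1' }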

Because no more than a chunk of data will ever be in memory at a time, this will be extremely memory efficient. It will also be extremely fast. A quick test showed I processed 10,000 rows in under 15ms.

Answer by josh3736

Just as I was thinking that it would be fun to write a streaming JSON parser, I also thought that maybe I should do a quick search to see if there's one already available.

Turns out there is.

Since I just found it, I've obviously not used it, so I can't comment on its quality, but I'll be interested to hear if it works.

It does work; consider the following JavaScript and _.isString:

var fs = require('fs');
var JSONStream = require('JSONStream');
var _ = require('lodash'); // or underscore, for _.isString

var stream = fs.createReadStream(filePath, { encoding: 'utf8' }); // readable stream over the JSON file

stream.pipe(JSONStream.parse('*'))
  .on('data', (d) => {
    console.log(typeof d);
    console.log("isString: " + _.isString(d));
  });

This will log objects as they come in if the stream is an array of objects. Therefore the only thing being buffered is one object at a time.

Answer by arcseldon

As of October 2014, you can just do something like the following (using JSONStream) - https://www.npmjs.org/package/JSONStream

var fs = require('fs'),
    JSONStream = require('JSONStream');

var getStream = function () {
    var jsonData = 'myData.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err) {
    // handle any errors
});
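
Here, MyTransformToDoWhateverProcessingAsNeeded stands for whatever object-mode stream you want to pipe the parsed items into; a hypothetical sketch of such a transform might be:

const { Transform } = require('stream');

// An object-mode transform that just logs each parsed item and passes it along.
const MyTransformToDoWhateverProcessingAsNeeded = new Transform({
  objectMode: true,
  transform(item, encoding, callback) {
    console.log(item);      // do whatever processing is needed here
    callback(null, item);   // hand the item downstream
  }
});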

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream()
    .pipe(es.mapSync(function (data) {
        console.log(data);
    }));

$ node hello.js
// hello world

Answer by karthick N

I had a similar requirement: I needed to read a large JSON file in Node.js, process the data in chunks, call an API, and save the results in MongoDB. inputFile.json looks like this:

{
 "customers":[
       { /*customer data*/},
       { /*customer data*/},
       { /*customer data*/}....
      ]
}

Now I used JSONStream and event-stream to achieve this, processing the records sequentially.

var fs = require("fs");
var JSONStream = require("JSONStream");
var es = require("event-stream");

var fileStream = fs.createReadStream(filePath, { encoding: "utf8" });
fileStream.pipe(JSONStream.parse("customers.*")).pipe(
  es.through(function(data) {
    console.log("printing one customer object read from file ::");
    console.log(data);
    this.pause();                   // pause the stream until this customer has been processed
    processOneCustomer(data, this);
    return data;
  },
  function end() {
    console.log("stream reading ended");
    this.emit("end");
  })
);

function processOneCustomer(data, es) {
  // DataModel is assumed to be a Mongoose-style model (not shown in the answer)
  DataModel.save(function(err, dataModel) {
    es.resume();                    // resume the stream once the record has been saved
  });
}

Answer by Evan Siroky

I realize that you want to avoid reading the whole JSON file into memory if possible, however if you have the memory available it may not be a bad idea performance-wise. Using node.js's require() on a json file loads the data into memory really fast.

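A rough sketch of that require()-based approach (the attribute printed here is hypothetical):

console.time('load');
var data = require('./geo.json');           // parses the whole file into memory synchronously
console.timeEnd('load');

console.time('print');
data.features.forEach(function (feature) {
    console.log(feature.properties.name);   // print one attribute from each feature
});
console.timeEnd('print');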

I ran two tests to see what the performance looked like when printing out an attribute from each feature in an 81MB geojson file.

In the 1st test, I read the entire geojson file into memory using var data = require('./geo.json'). That took 3330 milliseconds and then printing out an attribute from each feature took 804 milliseconds for a grand total of 4134 milliseconds. However, it appeared that node.js was using 411MB of memory.

In the second test, I used @arcseldon's answer with JSONStream + event-stream. I modified the JSONPath query to select only what I needed. This time the memory never went higher than 82MB, however, the whole thing now took 70 seconds to complete!

Answer by Phil Booth

I wrote a module that can do this, called BFJ. Specifically, the method bfj.match can be used to break up a large stream into discrete chunks of JSON:

const bfj = require('bfj');
const fs = require('fs');

const stream = fs.createReadStream(filePath);

bfj.match(stream, (key, value, depth) => depth === 0, { ndjson: true })
  .on('data', object => {
    // do whatever you need to do with object
  })
  .on('dataError', error => {
    // a syntax error was found in the JSON
  })
  .on('error', error => {
    // some kind of operational error occurred
  })
  .on('end', error => {
    // finished processing the stream
  });

Here, bfj.match returns a readable, object-mode stream that will receive the parsed data items; bfj.match itself is passed 3 arguments:

  1. A readable stream containing the input JSON.

  2. A predicate that indicates which items from the parsed JSON will be pushed to the result stream.

  3. An options object indicating that the input is newline-delimited JSON (this is to process format B from the question, it's not required for format A).

Upon being called, bfj.match will parse JSON from the input stream depth-first, calling the predicate with each value to determine whether or not to push that item to the result stream. The predicate is passed three arguments:

  1. The property key or array index (this will be undefined for top-level items).

  2. The value itself.

  3. The depth of the item in the JSON structure (zero for top-level items).

Of course, a more complex predicate can also be used as necessary. You can also pass a string or a regular expression instead of a predicate function, if you want to perform simple matches against property keys.

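For instance, a sketch of the string form (the 'customers' property key is just a hypothetical example):

const bfj = require('bfj');
const fs = require('fs');

// Push every value whose property key is 'customers' to the result stream.
bfj.match(fs.createReadStream(filePath), 'customers')
  .on('data', value => {
    // do whatever you need to do with the matched value
  })
  .on('end', () => {
    // finished processing the stream
  });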

Answer by Steve Hanov

If you have control over the input file, and it's an array of objects, you can solve this more easily. Arrange to output the file with each record on one line, like this:

[
   {"key": value},
   {"key": value},
   ...

This is still valid JSON.

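If you are generating that file yourself, a minimal sketch of writing it in this one-record-per-line layout (assuming the records exist in memory as an array named records) could be:

var fs = require("fs");

var out = fs.createWriteStream("input.txt");
out.write("[\n");
records.forEach(function (record, i) {
    // one complete JSON object per line, with a comma after every record except the last
    out.write("   " + JSON.stringify(record) + (i < records.length - 1 ? "," : "") + "\n");
});
out.write("]\n");
out.end();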

Then, use the node.js readline module to process them one line at a time.

var fs = require("fs");

var lineReader = require('readline').createInterface({
    input: fs.createReadStream("input.txt")
});

lineReader.on('line', function (line) {
    line = line.trim();

    if (line.charAt(line.length-1) === ',') {
        line = line.substr(0, line.length-1);
    }

    if (line.charAt(0) === '{') {
        processRecord(JSON.parse(line));
    }
});

function processRecord(record) {
    // Process the records one at a time here! 
}

Answer by Brian Leathem

I solved this problem using the split npm module. Pipe your stream into split, and it will "Break up a stream and reassemble it so that each line is a chunk".

Sample code:

var fs = require('fs')
  , split = require('split')
  ;

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var lineStream = stream.pipe(split());
lineStream.on('data', function(chunk) {
    if (!chunk) return;                 // skip empty chunks (e.g. a trailing newline)
    var json = JSON.parse(chunk);
    // ...
});

Answer by Vadim Baryshev

I think you need to use a database. MongoDB is a good choice in this case because it is JSON compatible.

UPDATE: You can use the mongoimport tool to import JSON data into MongoDB.

mongoimport --collection collection --file collection.json
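
If you would rather do the inserts from Node.js than from the command line, a rough sketch using the official mongodb driver (assuming a newline-delimited Format B file, driver 4.x, and placeholder database/collection names) might look like:

const fs = require('fs');
const readline = require('readline');
const { MongoClient } = require('mongodb');

async function importFile(filePath) {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const collection = client.db('mydb').collection('things');

  // readline yields one line at a time, so only one record is held in memory at once
  const rl = readline.createInterface({ input: fs.createReadStream(filePath) });
  for await (const line of rl) {
    if (line.trim().length === 0) continue;        // skip blank lines
    await collection.insertOne(JSON.parse(line));  // one JSON object per line (Format B)
  }

  await client.close();
}

importFile('myData.json').catch(console.error);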