Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4049428/

Date: 2020-08-25 11:49:00  Source: igfitidea

Processing large JSON files in PHP

Tags: php, json, large-files

Asked by The Mighty Rubber Duck

I am trying to process somewhat large (possibly up to 200M) JSON files. The structure of the file is basically an array of objects.

So something along the lines of:

[
  {"property":"value", "property2":"value2"},
  {"prop":"val"},
  ...
  {"foo":"bar"}
]

Each object has arbitrary properties and does not necessarily share them with other objects in the array (that is, the objects need not have the same properties).

I want to process each object in the array, and because the file is potentially huge, I cannot slurp its whole content into memory, decode the JSON, and iterate over the resulting PHP array.

So ideally I would like to read the file, fetch enough info for each object and process it. A SAX-type approach would be OK if there was a similar library available for JSON.

Any suggestions on how best to deal with this problem?

Accepted answer by The Mighty Rubber Duck

I decided to work on an event-based parser. It's not quite done yet, and I will edit the question with a link to my work when I roll out a satisfying version.

EDIT:

I finally worked out a version of the parser that I am satisfied with. It's available on GitHub:

https://github.com/kuma-giyomu/JSONParser

There's probably room for improvement, and I welcome feedback.

Answer by user3942918

I've written a streaming JSON pull parser, pcrov/JsonReader, for PHP 7 with an API based on XMLReader.

It differs significantly from event-based parsers in that instead of setting up callbacks and letting the parser do its thing, you call methods on the parser to move along or retrieve data as desired. Found your desired bits and want to stop parsing? Then stop parsing (and call close() because it's the nice thing to do).

(For a slightly longer overview of pull vs event-based parsers see XML reader models: SAX versus XML pull parser.)
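
To make the pull versus event-based distinction concrete, here is a toy sketch that does not use any real parser. The tokens() generator, parsePush() and findPull() are hypothetical names invented for this illustration; a plain key/value list stands in for a JSON token stream.

```php
<?php
// Toy stand-in for a tokenizer: yields key/value pairs one at a time.
function tokens(): Generator
{
    foreach (['id' => 1, 'name' => 'duck', 'tags' => 'many'] as $key => $value) {
        yield $key => $value;
    }
}

// Event (push) style: the parser drives and calls back for every token.
function parsePush(callable $onToken): int
{
    $visited = 0;
    foreach (tokens() as $key => $value) {
        $visited++;
        $onToken($key, $value);
    }
    return $visited;
}

// Pull style: the caller drives and simply stops asking once it has its data.
function findPull(string $wanted): array
{
    $visited = 0;
    $source = tokens();
    while ($source->valid()) {
        $visited++;
        if ($source->key() === $wanted) {
            return [$source->current(), $visited];
        }
        $source->next();
    }
    return [null, $visited];
}

$found = null;
$pushVisited = parsePush(function ($key, $value) use (&$found) {
    if ($key === 'id') {
        $found = $value;
    }
});
[$pulled, $pullVisited] = findPull('id');

echo "push visited $pushVisited tokens, pull visited $pullVisited\n";
```

In this toy the push loop walks all three tokens while the pull loop stops after the first. Real event-based parsers may offer their own way to stop early, so treat this only as an illustration of who drives the loop.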



Example 1:

Read each object as a whole from your JSON.

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

$reader->read(); // Outer array.
$depth = $reader->depth(); // Check in a moment to break when the array is done.
$reader->read(); // Step to the first object.
do {
    print_r($reader->value()); // Do your thing.
} while ($reader->next() && $reader->depth() > $depth); // Read each sibling.

$reader->close();

Output:

Array
(
    [property] => value
    [property2] => value2
)
Array
(
    [prop] => val
)
Array
(
    [foo] => bar
)

Objects get returned as stringly-keyed arrays due (in part) to edge cases where valid JSON would produce property names that are not allowed in PHP objects. Working around these conflicts isn't worthwhile as an anemic stdClass object brings no value over a simple array anyway.



Example 2:

Read each named element individually.

$reader = new pcrov\JsonReader\JsonReader();
$reader->open("data.json");

while ($reader->read()) {
    $name = $reader->name();
    if ($name !== null) {
        echo "$name: {$reader->value()}\n";
    }
}

$reader->close();

Output:

property: value
property2: value2
prop: val
foo: bar


Example 3:

Read each property of a given name. Bonus: read from a string instead of a URI, plus get data from properties with duplicate names in the same object (which is allowed in JSON, how fun.)

$json = <<<'JSON'
[
    {"property":"value", "property2":"value2"},
    {"foo":"foo", "foo":"bar"},
    {"prop":"val"},
    {"foo":"baz"},
    {"foo":"quux"}
]
JSON;

$reader = new pcrov\JsonReader\JsonReader();
$reader->json($json);

while ($reader->read("foo")) {
    echo "{$reader->name()}: {$reader->value()}\n";
}

$reader->close();

Output:

foo: foo
foo: bar
foo: baz
foo: quux
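
For contrast, a short sketch of what PHP's built-in json_decode() does with the duplicate-name object from the sample above: it keeps a single entry per name, and the later value overwrites the earlier one, so the first "foo" is not recoverable after decoding.

```php
<?php
// The object with a duplicated "foo" name, taken from the sample data above.
$json = '{"foo":"foo", "foo":"bar"}';

// Decode to an associative array; only one "foo" entry can exist in it.
$decoded = json_decode($json, true);

print_r($decoded);
```

This is one reason a streaming reader that reports every name/value pair as it goes can see data that a decode-everything approach silently drops.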


How exactly to best read through your JSON depends on its structure and what you want to do with it. These examples should give you a place to start.

Answer by Aaron Averill

This is a simple, streaming parser for processing large JSON documents. Use it for parsing very large JSON documents to avoid loading the entire thing into memory, which is how just about every other JSON parser for PHP works.

https://github.com/salsify/jsonstreamingparser

Answer by Filip Halaxa

Recently I made a library called JSON Machine, which efficiently parses unpredictably big JSON files. Usage is via simple foreach. I use it myself for my project.

Example:

foreach (JsonMachine::fromFile('employees.json') as $employee) {
    $employee['name']; // etc
}

See https://github.com/halaxa/json-machine

Answer by joni

There exists something like this, but only for C++ and Java. Unless you can access one of these libraries from PHP, there's no implementation for this in PHP other than json_decode(), as far as I know. However, if the JSON is structured that simply, it's easy to just read the file until the next } and then process the JSON received via json_decode(). But you had better do that buffered: read 10 KB, split by }; if none is found, read another 10 KB; otherwise process the found values. Then read the next block, and so on.
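
A minimal sketch of that buffered idea follows. It is not taken from any library: streamObjects() is a hypothetical helper, and it assumes the document is a well-formed array whose elements are all objects. Instead of splitting blindly on }, it tracks brace depth and string state so that nested objects and braces inside strings do not end an element early.

```php
<?php
/**
 * Yield each top-level object of a JSON array, reading the stream in chunks.
 * Sketch only: assumes a well-formed array whose elements are all objects.
 */
function streamObjects($handle, int $chunkSize = 10240): Generator
{
    $buffer = '';      // text of the object currently being collected
    $depth = 0;        // nesting depth of {...} inside the outer array
    $inString = false; // true while inside a JSON string literal
    $escaped = false;  // true right after a backslash inside a string

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];
            if ($depth > 0) {
                $buffer .= $char; // inside an object: keep every byte
            }
            if ($inString) {
                if ($escaped) {
                    $escaped = false;
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                }
                continue;
            }
            if ($char === '"') {
                $inString = true;
            } elseif ($char === '{') {
                if ($depth === 0) {
                    $buffer = '{'; // start of a new top-level object
                }
                $depth++;
            } elseif ($char === '}') {
                $depth--;
                if ($depth === 0) {
                    yield json_decode($buffer, true); // one complete object
                    $buffer = '';
                }
            }
        }
    }
}

// Demo on an in-memory stream; for a real file use fopen('data.json', 'rb').
$handle = fopen('php://memory', 'w+');
fwrite($handle, '[{"property":"value", "property2":"value2"}, {"prop":"val"}, {"foo":"bar"}]');
rewind($handle);
$objects = iterator_to_array(streamObjects($handle, 8), false);
fclose($handle);
print_r($objects);
```

Only one object's text is held in the buffer at a time, so memory use depends on the largest single object rather than on the file size. The tiny chunk size of 8 in the demo is only there to show that objects spanning chunk boundaries are handled.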

Answer by Nigel Ren

I know that the JSON streaming parser https://github.com/salsify/jsonstreamingparser has already been mentioned. But as I have recently(ish) added a new listener to it to try to make it easier to use out of the box, I thought I would (for a change) put out some information about what it does...

There is a very good write-up about the basic parser at https://www.salsify.com/blog/engineering/json-streaming-parser-for-php, but the issue I had with the standard setup was that you always had to write a listener to process a file. This is not always a simple task, and it can also take a certain amount of maintenance if/when the JSON changes. So I wrote the RegexListener.

The basic principle is to let you say which elements you are interested in (via a regex) and give it a callback saying what to do when it finds the data. Whilst reading the JSON, it keeps track of the path to each component, similar to a directory structure. So /name/forename, or for arrays /items/item/2/partid: this is what the regex matches against.
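
The path idea can be sketched without the library. The code below is not how RegexListener is implemented: walkPaths() is a hypothetical helper that walks data already decoded with json_decode() (so it is not streaming), builds a directory-like path for each node, and calls every callback whose regex matches that path.

```php
<?php
/**
 * Walk decoded JSON data, build a path such as /items/0/partid for each node,
 * and invoke every callback whose pattern matches the whole path.
 */
function walkPaths($data, array $matchers, string $path = ''): void
{
    foreach ($matchers as $pattern => $callback) {
        // Anchor the pattern so it must match the complete path.
        if ($path !== '' && preg_match('#^' . $pattern . '$#', $path)) {
            $callback($data, $path);
        }
    }
    if (is_array($data)) {
        foreach ($data as $key => $child) {
            walkPaths($child, $matchers, $path . '/' . $key);
        }
    }
}

$data = json_decode(
    '{"name":{"forename":"Ada"},"items":[{"partid":7},{"partid":9}]}',
    true
);

$seen = [];
$record = function ($value, $path) use (&$seen) {
    $seen[$path] = $value;
};

walkPaths($data, [
    '/name/forename'    => $record,
    '/items/\d+/partid' => $record,
]);

print_r($seen);
```

Here the two patterns pick out /name/forename, /items/0/partid and /items/1/partid. A streaming listener applies the same kind of matching while the file is being read, which is what lets it discard everything that does not match.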

An example is (from the source on github)...

$filename = __DIR__.'/../tests/data/example.json';
$listener = new RegexListener([
    '/1/name' => function ($data): void {
        echo PHP_EOL."Extract the second 'name' element...".PHP_EOL;
        echo '/1/name='.print_r($data, true).PHP_EOL;
    },
    '(/\d*)' => function ($data, $path): void {
        echo PHP_EOL."Extract each base element and print 'name'...".PHP_EOL;
        echo $path.'='.$data['name'].PHP_EOL;
    },
    '(/.*/nested array)' => function ($data, $path): void {
        echo PHP_EOL."Extract 'nested array' element...".PHP_EOL;
        echo $path.'='.print_r($data, true).PHP_EOL;
    },
]);
$parser = new Parser(fopen($filename, 'r'), $listener);
$parser->parse();

Just a couple of explanations...

'/1/name' => function ($data)

So /1 is the second element in an array (0-based), which allows access to particular instances of elements. /name is the name element. The value is then passed to the closure as $data.

"(/\d*)" => function ($data, $path )

This selects each element of an array and passes them one at a time. Because it uses a capture group, the matched path is passed as $path. This means that when a set of records is present in a file, you can process the items one at a time, and also know which element you are on without having to keep track.

The last one

'(/.*/nested array)' => function ($data, $path):

effectively scans for any elements called nested array and passes each one along with where it is in the document.

Another useful feature I found was that if, in a large JSON file, you just want the summary details at the top, you can grab those bits and then just stop...

$filename = __DIR__.'/../tests/data/ratherBig.json';
$listener = new RegexListener();
$parser = new Parser(fopen($filename, 'rb'), $listener);
$listener->setMatch(["/total_rows" => function ($data ) use ($parser) {
    echo "/total_rows=".$data.PHP_EOL;
    $parser->stop();
}]);
// Start parsing (as in the first example); the callback above stops it early.
$parser->parse();

This saves time when you are not interested in the remaining content.

One thing to note is that these react to the content: each one is triggered when the end of the matching content is found, so they may fire in various orders. Also, the parser only keeps track of the content you are interested in and discards everything else.

If you find any interesting features (sometimes horribly known as bugs), please let me know or report an issue on the GitHub page.

Answer by Alex Jasmin

There is http://github.com/sfalvo/php-yajl/. I haven't used it myself.