Parse a large JSON file in PHP

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15373529/

Parse large JSON file

Tags: php, mysql, json

Asked by Dan Ramos

I'm working on a cron script that hits an API, receives a JSON file (a large array of objects) and stores it locally. Once that is complete, another script needs to parse the downloaded JSON file and insert each object into a MySQL database.

I'm currently using file_get_contents() along with json_decode(). This will attempt to read the whole file into memory before trying to process it. This would be fine except for the fact that my JSON files will usually range from 250MB-1GB+. I know I can increase my PHP memory limit, but that doesn't seem to be the greatest answer in my mind. I'm aware that I can run fopen() and fgets() to read the file in line by line, but I need to read the file in by each JSON object.
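
For reference, the approach described above presumably looks something like the sketch below (reconstructed from the description; the file name, DSN and the items/payload table are placeholders). The whole file is decoded into memory at once, which is exactly what breaks down at 250MB-1GB+:

```php
<?php
// Current (memory-heavy) approach: decode the entire file at once.
// File name, credentials and table layout are placeholder examples.
$json    = file_get_contents('large.json');
$objects = json_decode($json, true);   // the whole array now lives in memory

$pdo  = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO items (payload) VALUES (:payload)');

foreach ($objects as $object) {
    $stmt->execute([':payload' => json_encode($object)]);
}
```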

Is there a way to read in the file per object, or is there another similar approach?

Accepted answer by Kovo

This really depends on what the json files contain.

If loading the file into memory in one shot is not an option, your only other option, as you alluded to, is fopen/fgets.

Reading line by line is possible, and if these JSON objects have a consistent structure, you can easily detect when a JSON object in the file starts and ends.

Once you collect a whole object, you insert it into a db, then go on to the next one.

There isn't much more to it. The algorithm to detect the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.
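
A minimal sketch of that fopen()/fgets() loop, assuming the file is a pretty-printed JSON array of flat objects (no nested multi-line structures), so an object starts on a line that trims to "{" and ends on a line that trims to "}" or "},". The DSN and the items/payload table are placeholders; a production version would need real boundary detection (e.g. counting braces outside of strings):

```php
<?php
// Stream the file line by line, buffer one object at a time, insert, repeat.
$pdo  = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO items (payload) VALUES (:payload)');

$handle = fopen('large.json', 'r');
if ($handle === false) {
    throw new RuntimeException('Could not open large.json');
}

$buffer   = '';
$inObject = false;

while (($line = fgets($handle)) !== false) {
    $trimmed = trim($line);

    if (!$inObject && $trimmed === '{') {            // object starts here
        $inObject = true;
        $buffer   = '{';
        continue;
    }

    if ($inObject) {
        if ($trimmed === '}' || $trimmed === '},') { // object ends here
            $item = json_decode($buffer . '}', true);
            if ($item !== null) {
                $stmt->execute([':payload' => json_encode($item)]);
            }
            $buffer   = '';
            $inObject = false;
        } else {
            $buffer .= $line;                        // accumulate the object body
        }
    }
}

fclose($handle);
```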

Answer by Pawel Dubiel

Try this library: https://github.com/shevron/ext-jsonreader

The existing ext/json which is shipped with PHP is very convenient and simple to use - but it is inefficient when working with large amounts of JSON data, as it requires reading the entire JSON data into memory (e.g. using file_get_contents()) and then converting it into a PHP variable at once - for large data sets, this takes up a lot of memory.

JSONReader is designed for memory efficiency - it works on streams and can read JSON data from any PHP stream without loading the entire data into memory. It also allows the developer to extract specific values from a JSON stream without decoding and loading all data into memory.
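
An untested sketch of what that stream-based reading could look like. The class, method, property and constant names below (open(), read(), close(), $reader->tokenType, $reader->value, JSONReader::OBJECT_START, ...) follow the project's README example and should be treated as assumptions to verify against the extension you actually build:

```php
<?php
// Walk the JSON token stream instead of decoding the whole file at once.
$reader = new JSONReader();
$reader->open('large.json');            // reads from a PHP stream

while ($reader->read()) {               // advance to the next token
    switch ($reader->tokenType) {
        case JSONReader::OBJECT_START:
            // start collecting a new object
            break;
        case JSONReader::KEY:
            // $reader->value holds the current key name
            break;
        case JSONReader::VALUE:
            // $reader->value holds the current scalar value
            break;
        case JSONReader::OBJECT_END:
            // one complete object has been seen: insert it into MySQL here
            break;
    }
}

$reader->close();
```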

Answer by Wayne Whitty

Best possible solution:

Use some sort of delimiter (pagination, timestamp, object ID etc) that allows you to read the data in smaller chunks over multiple requests. This solution assumes that you have some sort of control of how these JSON files are generated. I'm basing my assumption on:

This would be fine except for the fact that my JSON files will usually range from 250MB-1GB+.

Reading in and processing 1GB of JSON data is simply ridiculous. A better approach is most definitely needed.
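
A rough sketch of that chunked approach, assuming (hypothetically) that the API accepts page/per_page query parameters and returns a plain JSON array per request; the endpoint URL, parameter names and the items/payload table are all placeholders:

```php
<?php
// Pull the data in small pages so each response decodes comfortably in memory.
$pdo  = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO items (payload) VALUES (:payload)');

$page    = 1;
$perPage = 500;

do {
    $url  = 'https://api.example.com/objects?page=' . $page . '&per_page=' . $perPage;
    $json = file_get_contents($url);
    if ($json === false) {
        break;                                   // request failed
    }

    $objects = json_decode($json, true);
    if (!is_array($objects) || count($objects) === 0) {
        break;                                   // no more data
    }

    foreach ($objects as $object) {
        $stmt->execute([':payload' => json_encode($object)]);
    }

    $page++;
} while (count($objects) === $perPage);          // a short page means we're done
```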
