How do I use Python's 'json' module to read in one JSON object at a time?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/21708192/

Date: 2020-08-18 23:27:18  Source: igfitidea

How do I use the 'json' module to read in one JSON object at a time?

Tags: python, json

Asked by Cam

I have a multi-gigabyte JSON file. The file is made up of JSON objects that are no more than a few thousand characters each, but there are no line breaks between the records.


Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?


The data is in a plain text file. Here is an example of a similar record. The actual records contain many nested dictionaries and lists.


Record in readable format:


{
  "results": {
    "__metadata": {
      "type": "DataServiceProviderDemo.Address"
    },
    "Street": "NE 228th",
    "City": "Sammamish",
    "State": "WA",
    "ZipCode": "98074",
    "Country": "USA"
  }
}

Actual format. New records start one after the other without any breaks.


{"results": {"__metadata": {"type": "DataServiceProviderDemo.Address"}, "Street": "NE 228th", "City": "Sammamish", "State": "WA", "ZipCode": "98074", "Country": "USA"}}{"results": {"__metadata": {"type": "DataServiceProviderDemo.Address"}, "Street": "NE 228th", "City": "Sammamish", "State": "WA", "ZipCode": "98074", "Country": "USA"}}{"results": {"__metadata": {"type": "DataServiceProviderDemo.Address"}, "Street": "NE 228th", "City": "Sammamish", "State": "WA", "ZipCode": "98074", "Country": "USA"}}

Accepted answer by Martijn Pieters

Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse the data in chunks using the JSONDecoder.raw_decode() method.


The following will yield complete objects as the parser finds them:


from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break

This function reads the given file object in chunks of buffersize characters and has the decoder object parse whole JSON objects out of the buffer. Each parsed object is yielded to the caller.


Use it like this:


with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        pass  # process each object here
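To sanity-check the approach, here is a self-contained version of the generator above run against an in-memory stream (io.StringIO stands in for a real file handle, and the back-to-back sample data is made up for illustration):

```python
import io
from functools import partial
from json import JSONDecoder


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                # raw_decode returns the object plus the index where it ended
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode; read another chunk
                break


# Three JSON objects written back-to-back with no separators
data = '{"a": 1}{"b": 2}{"c": 3}'
objects = list(json_parse(io.StringIO(data)))
print(objects)  # [{'a': 1}, {'b': 2}, {'c': 3}]
```

A small buffersize (even smaller than one record) still works, because the generator keeps accumulating chunks until raw_decode succeeds.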

Use this only if your JSON objects are written to a file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document, in which case you can use Loading and parsing a JSON file with multiple JSON objects in Python instead.

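For the JSON Lines case, no incremental decoder is needed; one json.loads call per line is enough. A minimal sketch (io.StringIO with made-up sample data stands in for a real open file):

```python
import io
import json

# JSON Lines: exactly one complete JSON object per line
jsonl = io.StringIO('{"id": 1}\n{"id": 2}\n')

# Skip blank lines, decode the rest one at a time
records = [json.loads(line) for line in jsonl if line.strip()]
print(records)  # [{'id': 1}, {'id': 2}]
```

Because each line is decoded independently, this streams naturally: iterating over a file object reads one line at a time rather than the whole file.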

Answered by poke

If your JSON document contains a list of objects, and you want to read one object at a time, you can use the iterative JSON parser ijson for the job. It will only read more content from the file when it needs to decode the next object.


Note that you should use it with the YAJL library; otherwise you will likely not see any performance increase.


That being said, unless your file is really big, reading it completely into memory and then parsing it with the normal JSON module will probably still be the best option.

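For completeness, the whole-file option this answer refers to is a single json.load call when the document is one JSON array (sketched here with io.StringIO and made-up data in place of a real file):

```python
import io
import json

# A document that is one JSON array of records can be loaded in one call
f = io.StringIO('[{"id": 1}, {"id": 2}]')
items = json.load(f)
print(items)  # [{'id': 1}, {'id': 2}]
```

Note this only works for a single valid JSON document; for the asker's back-to-back objects, json.load would raise a JSONDecodeError on the extra data.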

Answered by unutbu

Here is a slight modification of Martijn Pieters' solution, which will handle JSON strings separated by whitespace.


import json
import functools


def json_parse(fileobj, decoder=json.JSONDecoder(), buffersize=2048,
               delimiters=None):
    remainder = ''
    for chunk in iter(functools.partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while remainder:
            try:
                stripped = remainder.strip(delimiters)
                result, index = decoder.raw_decode(stripped)
                yield result
                remainder = stripped[index:]
            except ValueError:
                # Not enough data to decode, read more
                break


For example, if data.txt contains JSON strings separated by a space:


{"business_id": "1", "Accepts Credit Cards": true, "Price Range": 1, "type": "food"} {"business_id": "2", "Accepts Credit Cards": true, "Price Range": 2, "type": "cloth"} {"business_id": "3", "Accepts Credit Cards": false, "Price Range": 3, "type": "sports"}

then


In [47]: list(json_parse(open('data.txt')))
Out[47]: 
[{'Accepts Credit Cards': True,
  'Price Range': 1,
  'business_id': '1',
  'type': 'food'},
 {'Accepts Credit Cards': True,
  'Price Range': 2,
  'business_id': '2',
  'type': 'cloth'},
 {'Accepts Credit Cards': False,
  'Price Range': 3,
  'business_id': '3',
  'type': 'sports'}]