Python 如何使用 'json' 模块一次读入一个 JSON 对象?
声明:本页面是 Stack Overflow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将它归于原作者(不是我):Stack Overflow
原文地址: http://stackoverflow.com/questions/21708192/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
How do I use the 'json' module to read in one JSON object at a time?
提问 by Cam
I have a multi-gigabyte JSON file. The file is made up of JSON objects that are no more than a few thousand characters each, but there are no line breaks between the records.
我有一个好几 GB 大的 JSON 文件。该文件由一个个 JSON 对象组成,每个对象不超过几千个字符,但记录之间没有换行符。
Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?
使用 Python 3 和json模块,如何一次从文件中读取一个 JSON 对象到内存中?
The data is in a plain text file. Here is an example of a similar record. The actual records contains many nested dictionaries and lists.
数据位于纯文本文件中。下面是一个类似记录的例子。实际记录包含许多嵌套的字典和列表。
Record in readable format:
可读格式的记录:
{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}
}
Actual format. New records start one after the other without any breaks.
实际格式。新的记录一个接一个地开始,中间没有任何分隔。
{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }
采纳答案 by Martijn Pieters
Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse data in chunks using the JSONDecoder.raw_decode() method.
一般来说,将多个 JSON 对象放入一个文件会使该文件成为无效的、损坏的 JSON。话虽如此,您仍然可以使用 JSONDecoder.raw_decode() 方法分块解析数据。
The following will yield complete objects as the parser finds them:
下面的代码会在解析器解析出完整对象时逐个产出它们:
from json import JSONDecoder
from functools import partial


def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break
This function will read chunks from the given file object in buffersize chunks, and have the decoder object parse whole JSON objects from the buffer. Each parsed object is yielded to the caller.
此函数会按 buffersize 大小的块从给定的文件对象中读取数据,并让 decoder 对象从缓冲区中解析出完整的 JSON 对象。每个解析出的对象都会被产出(yield)给调用者。
Use it like this:
像这样使用它:
with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        # process object
        pass
Use this only if your JSON objects are written to a file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is limited to a single line, you have a JSON Lines document, in which case you can use Loading and parsing a JSON file with multiple JSON objects in Python instead.
仅当您的 JSON 对象是背靠背写入文件、中间没有换行符时才使用此方法。如果确实有换行符,并且每个 JSON 对象只占一行,那么您拥有的是一个 JSON Lines 文档,这种情况下可以改用《在 Python 中加载和解析包含多个 JSON 对象的 JSON 文件》中的方法。
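For the JSON Lines case the answer links to, a minimal sketch could look like this (the file name data.jsonl is a placeholder, one object per line):

import json

with open('data.jsonl', 'r') as infh:
    for line in infh:
        line = line.strip()
        if not line:
            continue          # skip blank lines
        obj = json.loads(line)  # each line is a complete JSON document
        print(obj)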
回答 by poke
If your JSON document contains a list of objects, and you want to read one object at a time, you can use the iterative JSON parser ijson for the job. It will only read more content from the file when it needs to decode the next object.
如果您的 JSON 文档包含一个对象列表,并且您想一次读取一个对象,您可以使用迭代 JSON 解析器ijson来完成这项工作。它只会在需要解码下一个对象时从文件中读取更多内容。
Note that you should use it with the YAJL library, otherwise you will likely not see any performance increase.
请注意,您应该将它与YAJL库一起使用,否则您可能看不到任何性能提升。
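As a rough sketch of how ijson is typically used when the file holds one big top-level JSON array (the file name big.json and the structure of the data are illustrative; the 'item' prefix is ijson's documented way of addressing elements of a top-level array via its items() helper):

import ijson  # third-party package: pip install ijson
# For best performance a yajl-based backend can be selected instead, e.g.
# import ijson.backends.yajl2_c as ijson

# Hypothetical file containing one large top-level array:
# [{"business_id": "1", ...}, {"business_id": "2", ...}, ...]
with open('big.json', 'rb') as infh:
    # Objects are decoded one at a time; the whole file is never held in memory.
    for obj in ijson.items(infh, 'item'):
        print(obj['business_id'])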
That being said, unless your file is really big, reading it completely into memory and then parsing it with the normal JSON module will probably still be the best option.
话虽如此,除非您的文件非常大,否则将其完全读入内存然后使用普通 JSON 模块解析它可能仍然是最佳选择。
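For completeness, the simple load-everything approach mentioned above, assuming the file (small.json here is a placeholder) is a single valid JSON document such as a list of objects:

import json

with open('small.json', 'r') as infh:
    data = json.load(infh)   # reads and parses the entire file at once

for obj in data:
    print(obj)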
回答 by unutbu
Here is a slight modification of Martijn Pieters' solution, which will handle JSON strings separated with whitespace.
这是对Martijn Pieters 的解决方案的轻微修改,它将处理用空格分隔的 JSON 字符串。
import functools
import json


def json_parse(fileobj, decoder=json.JSONDecoder(), buffersize=2048,
               delimiters=None):
    remainder = ''
    for chunk in iter(functools.partial(fileobj.read, buffersize), ''):
        remainder += chunk
        while remainder:
            try:
                stripped = remainder.strip(delimiters)
                result, index = decoder.raw_decode(stripped)
                yield result
                remainder = stripped[index:]
            except ValueError:
                # Not enough data to decode, read more
                break
For example, if data.txt contains JSON strings separated by a space:
例如,如果data.txt包含以空格分隔的 JSON 字符串:
{"business_id": "1", "Accepts Credit Cards": true, "Price Range": 1, "type": "food"} {"business_id": "2", "Accepts Credit Cards": true, "Price Range": 2, "type": "cloth"} {"business_id": "3", "Accepts Credit Cards": false, "Price Range": 3, "type": "sports"}
then
然后
In [47]: list(json_parse(open('data.txt')))
Out[47]:
[{u'Accepts Credit Cards': True,
  u'Price Range': 1,
  u'business_id': u'1',
  u'type': u'food'},
 {u'Accepts Credit Cards': True,
  u'Price Range': 2,
  u'business_id': u'2',
  u'type': u'cloth'},
 {u'Accepts Credit Cards': False,
  u'Price Range': 3,
  u'business_id': u'3',
  u'type': u'sports'}]

