Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/20037430/
Reading multiple JSON records into a Pandas dataframe
Asked by seanv507
I'd like to know if there is a memory-efficient way of reading a multi-record JSON file (each line is a JSON dict) into a pandas DataFrame. Below is a two-line example with a working solution; I need it for a potentially very large number of records. An example use would be processing output from Hadoop Pig's JsonStorage function.
import json
import pandas as pd
test='''{"a":1,"b":2}
{"a":3,"b":4}'''
#df=pd.read_json(test,orient='records') doesn't work, expects []
l = [json.loads(line) for line in test.splitlines()]
df = pd.DataFrame(l)
Accepted answer by Andy Hayden
Note: line-separated JSON is now supported in read_json (since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
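For very large files there is a further, hedged option: since pandas 0.21, read_json also accepts a chunksize together with lines=True and returns an iterator of DataFrames, so the whole file never has to be parsed into memory at once. A minimal sketch (the in-memory io.StringIO buffer stands in for a real file path):

```python
import io

import pandas as pd

# Two newline-delimited records, as in the question.
ndjson = '{"a":1,"b":2}\n{"a":3,"b":4}'

# chunksize (pandas >= 0.21) makes read_json return an iterator of
# DataFrames instead of one large frame; io.StringIO stands in for a file.
reader = pd.read_json(io.StringIO(ndjson), lines=True, chunksize=1)
df = pd.concat(reader, ignore_index=True)
print(df["a"].tolist())  # [1, 3]
```

Each chunk can also be processed and discarded inside a loop instead of being concatenated, which is what keeps peak memory low.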
Which is faster will depend on the size of your DataFrames, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower; at around 100 records the two are similar, with significant gains for anything larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 μs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 μs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: within that time, the join itself is surprisingly fast.
Answer by Doctor J
If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:
import json
import pandas as pd

with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)
Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
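The simplejson swap is usually written with an import fallback so the same code runs whether or not simplejson is installed. A small sketch (the speed claim is the answer's own, not verified here):

```python
try:
    import simplejson as json  # C-accelerated drop-in replacement, if installed
except ImportError:
    import json  # otherwise fall back to the standard library

# The rest of the code uses the json name unchanged either way.
record = json.loads('{"a":1,"b":2}')
print(record)  # {'a': 1, 'b': 2}
```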
Answer by Bob Baxley
++++++++Update++++++++++++++
As of v0.19, Pandas supports this natively (see https://github.com/pandas-dev/pandas/pull/13351). Just run:
df=pd.read_json('test.json', lines=True)
++++++++Old Answer++++++++++
The existing answers are good, but for a little variety, here is another way to accomplish your goal: a simple pre-processing step outside of Python so that pd.read_json() can consume the data.
- Install jq: https://stedolan.github.io/jq/
- Create a valid JSON file with:
  cat test.json | jq -c --slurp . > valid_test.json
- Create the dataframe with:
  df = pd.read_json('valid_test.json')
In an IPython notebook, you can run the shell command directly from the cell interface with:
!cat test.json | jq -c --slurp . > valid_test.json
df=pd.read_json('valid_test.json')
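If jq isn't available, the same slurp step — parsing the newline-delimited records and emitting one JSON array — can be sketched in pure Python (the in-memory string below stands in for reading test.json):

```python
import json

# Equivalent of `jq -c --slurp .`: parse each line, emit a single array.
ndjson = '{"a":1,"b":2}\n{"a":3,"b":4}'
valid = json.dumps([json.loads(line) for line in ndjson.splitlines()])
print(valid)  # [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
```

The resulting string (or a file written from it) is plain valid JSON, so pd.read_json can consume it without lines=True.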
Answer by Doctor J
As of pandas 0.19, read_json has native support for line-delimited JSON:
pd.read_json(jsonfile, lines=True)
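An end-to-end sketch of this final form, writing two records to a temporary file and reading them back (the generated temp file stands in for the test.json used in the answers):

```python
import os
import tempfile

import pandas as pd

# Write two newline-delimited JSON records to a temp file, then read
# them back with the native lines=True support (pandas >= 0.19).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"a":1,"b":2}\n{"a":3,"b":4}\n')
    path = f.name

df = pd.read_json(path, lines=True)
os.remove(path)
print(df.shape)  # (2, 2)
```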

