Python 如何从大文件中读取以行分隔的 JSON(逐行)
声明:本页面是 StackOverFlow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 4.0 许可,注明原文地址和作者信息,并将其归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/21533894/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to read line-delimited JSON from large file (line by line)
提问by Cat
I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:
我正在尝试加载一个大文件(2GB),其中包含以换行符分隔的 JSON 字符串。例如:
{
    "key11": value11,
    "key12": value12,
}
{
    "key21": value21,
    "key22": value22,
}
…
The way I'm importing it now is:
我现在导入它的方式是:
content = open(file_path, "r").read() 
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")
Which seems like a hack (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list).
这看起来像是一种取巧的做法(在每个 JSON 字符串之间添加逗号,并在首尾加上方括号,使其成为合法的列表)。
Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?
有没有更好的方法来指定 JSON 分隔符(用换行符 \n 而不是逗号 ,)?
Also, Python can't seem to properly allocate memory for an object built from 2GB of data. Is there a way to construct each JSON object as I'm reading the file line by line? Thanks!
此外,Python 似乎无法为由 2GB 数据构建的对象正确分配内存,有没有办法在逐行读取文件时逐个构造 JSON 对象?谢谢!
回答by njzk2
Just read each line and construct a JSON object as you go:
只需读取每一行,并在读取时构造 JSON 对象即可:
import json

with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)
This way, you load a proper, complete JSON object (provided there is no \n inside a JSON value or in the middle of your JSON object), and you avoid memory issues since each object is created only when needed.
这样,您加载的就是正确、完整的 JSON 对象(前提是 JSON 值内部或 JSON 对象中间没有 \n),并且因为每个对象都是在需要时才创建的,所以可以避免内存问题。
There is also this answer:
还有这个答案:
回答by Dane White
This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.
这将适用于您提供的特定文件格式。如果您的格式发生变化,那么您需要更改解析行的方式。
{
    "key11": 11,
    "key12": 12
}
{
    "key21": 21,
    "key22": 22
}
Just read line-by-line, and build the JSON blocks as you go:
只需逐行阅读,然后构建 JSON 块:
import json

# args.infile is assumed to hold the input file path (e.g. parsed with argparse)
with open(args.infile, 'r') as infile:
    # Variable for building our JSON block
    json_block = []
    for line in infile:
        # Add the line to our JSON block
        json_block.append(line)
        # Check whether we closed our JSON block
        if line.startswith('}'):
            # Do something with the JSON dictionary
            json_dict = json.loads(''.join(json_block))
            print(json_dict)
            # Start a new block
            json_block = []
If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
如果您有兴趣解析一个非常大的 JSON 文件而不将所有内容保存到内存中,您应该考虑使用 json.load API 中的 object_hook 或 object_pairs_hook 回调方法。
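As a rough sketch of that idea (not part of the original answer), object_pairs_hook receives each decoded object's key/value pairs, so you can transform or discard data while json.load is still parsing; the file name and pruning rule below are placeholders:
下面是对这一思路的粗略示意(并非原回答内容):object_pairs_hook 会在解码每个 JSON 对象时收到它的键值对列表,因此可以在 json.load 解析过程中转换或丢弃数据;文件名和裁剪规则仅为示例:
import json

def prune_pairs(pairs):
    # Called for every decoded JSON object (innermost first); whatever is
    # returned replaces that object. Here we drop list-valued entries as an
    # arbitrary example of pruning while decoding.
    return {k: v for k, v in pairs if not isinstance(v, list)}

with open("big.json") as f:
    data = json.load(f, object_pairs_hook=prune_pairs)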
回答by Tjorriemorrie
contents = open(file_path, "r").read() 
data = [json.loads(str(item)) for item in contents.strip().split('\n')]
回答by Jim
Just read it line by line and parse each line through a stream. Your hack (adding commas between each JSON string plus opening and closing square brackets to make it a proper list) isn't memory-friendly if the file is much more than 1GB, since the whole content will land in RAM.
只需逐行读取并以流的方式解析即可。你的取巧做法(在每个 JSON 字符串之间添加逗号,并在首尾加上方括号使其成为合法列表)在文件远超 1GB 时并不节省内存,因为整个内容都会被载入 RAM。
回答by Pranav Kasetti
The line by line reading approach is good, as mentioned in some of the above answers.
正如上面的一些答案中提到的那样,逐行阅读方法很好。
However, across multiple JSON tree structures I would recommend decomposing this into 2 functions to get more robust error handling.
但是,对于多个 JSON 树结构,我建议将其分解为 2 个函数,以获得更健壮的错误处理。
For example,
例如,
import json

def load_cases(file_name):
    with open(file_name) as file:
        cases = (parse_case_line(json.loads(line)) for line in file)
        cases = filter(None, cases)
        return list(cases)
parse_case_line can encapsulate the key parsing logic required in your example above, for example with regex matching or application-specific requirements. It also means you can select which JSON key-values you want to parse out.
parse_case_line 可以封装上面示例所需的关键解析逻辑,例如正则匹配或特定于应用的要求。这也意味着您可以选择要解析出的 JSON 键值。
Another advantage of this approach is that filter handles multiple \n in the middle of your JSON object, and parses the whole file :-).
这种方法的另一个优点是 filter 可以处理 JSON 对象中间的多个 \n,并解析整个文件 :-)。
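A hypothetical parse_case_line to pair with load_cases above (the field names come from the question's example, and the validation rule is only illustrative):
下面是一个与上面 load_cases 配套的假想 parse_case_line(字段名取自题目中的示例,校验规则仅作演示):
def parse_case_line(case):
    # Return None for records that filter(None, ...) should drop.
    if not isinstance(case, dict) or "key11" not in case:
        return None
    # Select only the key-values we are interested in.
    return {"key11": case["key11"], "key12": case.get("key12")}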
回答by Cohen
Had to read some data from AWS S3 and parse a newline-delimited JSONL file. My solution was this, using splitlines.
我需要从 AWS S3 读取一些数据并解析以换行符分隔的 JSONL 文件。我的解决方案是使用 splitlines。
The code:
代码:
for line in json_input.splitlines():
    one_json = json.loads(line)
回答by denson
This expands Cohen's answer:
这扩展了Cohen的回答:
import json
import boto3
import pandas as pd

s3_resource = boto3.resource('s3')
content_object = s3_resource.Object(BucketName, KeyFileName)
# Read the whole object body into one string
file_buffer = content_object.get()['Body'].read().decode('utf-8')
json_lines = []
for line in file_buffer.splitlines():
    j_content = json.loads(line)
    json_lines.append(j_content)
df_readback = pd.DataFrame(json_lines)
This assumes that the entire file will fit in memory. If it is too big then this will have to be modified to read in chunks or use Dask.
这假设整个文件都适合内存。如果它太大,则必须修改它以分块读取或使用Dask。
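For the chunked alternative, here is a minimal Dask sketch (the path is a placeholder; reading directly from S3 additionally requires s3fs):
关于分块读取的替代方案,下面是一个最小的 Dask 示意(路径仅为示例;直接从 S3 读取还需要安装 s3fs):
import json
import dask.bag as db

# Each partition is read and parsed lazily, so the whole file never has to
# sit in memory at once.
records = db.read_text("s3://my-bucket/data.jsonl").map(json.loads)
df_readback = records.to_dataframe().compute()  # pandas DataFrame, if the result fits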

