Python: Trying to Deserialize Multiple JSON objects in a file with each object spanning multiple but consistently spaced number of lines

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/20400818/



Tags: python, json

Asked by horatio1701d

Ok, after nearly a week of research I'm going to give SO a shot. I have a text file that looks as follows (showing 3 separate JSON objects as an example, but the file has 50K of these):


{
"zipcode":"00544",
"current":{"canwc":null,"cig":7000,"class":"observation"},
"triggers":[178,30,176,103,179,112,21,20,48,7,50,40,57]
}
{
"zipcode":"00601",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[12,23,34,28,100]
}
{
"zipcode":"00602",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[13,85,43,101,38,31]
}

I know how to work with JSON objects using the Python json library, but I'm having a challenge with how to create 50 thousand different JSON objects from reading the file. (Perhaps I'm not even thinking about this correctly, but ultimately I need to deserialize them and load them into a database.) I've tried itertools, thinking that I need a generator, so I was able to use:


with open(file) as f:
    for line in itertools.islice(f, 0, 7): #since every 7 lines is a json object
        jfile = json.load(line)

But the above obviously won't work since it is not reading the 7 lines as a single JSON object, and I'm also not sure how to then iterate over the entire file and load the individual JSON objects.

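To see the failure mode concretely, here is a minimal sketch (assuming the sample above is saved as deadzips.json, the name used in the edit below): feeding the whole file to the parser in one go fails as soon as the parser finishes the first object.

import json

with open('deadzips.json') as f:
    json.load(f)
# Raises ValueError (json.JSONDecodeError in Python 3) with a message
# like "Extra data: line N column 1": the parser accepts the first
# complete object and refuses the concatenated objects after it.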

The following would give me a list I can slice:


list(open(file))[:7]

Any help would be really appreciated.




EDIT: Extremely close to what I need, and I think literally one step away, but still struggling a little with iteration. This will finally get me an iterative printout of all of the dataframes, but how do I make it so that I can capture one giant dataframe with all of the pieces essentially concatenated? I could then export that final dataframe to csv etc. (Also, is there a better way to upload this result into a database rather than creating a giant dataframe first?)


import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

def flatten(jfile):
    # iterate over a snapshot of the items so the dict can be mutated safely
    for k, v in list(jfile.items()):
        if isinstance(v, list):
            jfile[k] = ','.join(str(x) for x in v)  # triggers are ints
        elif isinstance(v, dict):
            for kk, vv in v.items():
                jfile[kk] = vv  # hoist nested keys to the top level
            del jfile[k]
    return jfile  # return once, after every key has been processed

with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            jfile = json.loads(chunk)
            print(pd.DataFrame(list(flatten(jfile).items())))
        except ValueError:
            pass  # skip chunks that do not parse
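A hedged sketch of one way to finish that last step, assuming the lines_per_n and flatten helpers above: collect one flattened dict per object and build a single DataFrame at the end, which can then go to CSV or straight into a database via pandas' to_sql (the engine below is a placeholder):

import json

import pandas as pd

rows = []  # one flattened dict per JSON object
with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            rows.append(flatten(json.loads(chunk)))
        except ValueError:
            pass  # skip malformed chunks

df = pd.DataFrame(rows)  # one row per zipcode
df.to_csv('deadzips.csv', index=False)

# Or skip the CSV entirely and write straight to a database; 'engine' is
# a hypothetical SQLAlchemy engine you would create for your database.
# df.to_sql('deadzips', engine, if_exists='replace', index=False)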

Accepted answer by Martijn Pieters

Load 6 extra lines instead, and pass the string to json.loads():


import itertools
import json

with open(file) as f:
    for line in f:
        # slice the next 6 lines from the iterable, as a list.
        lines = [line] + list(itertools.islice(f, 6))
        jfile = json.loads(''.join(lines))

        # do something with jfile

json.load() will slurp up more than just the next object in the file, and islice(f, 0, 7) would read only the first 7 lines, rather than read the file in 7-line blocks.


You can wrap reading a file in blocks of size N in a generator:


from itertools import chain, islice

def lines_per_n(f, n):
    for line in f:
        # group the current line with the next n - 1 lines
        yield ''.join(chain([line], islice(f, n - 1)))
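A quick sanity check of the generator, using an in-memory stream (the six letters are just a hypothetical stand-in for file lines):

import io

sample = io.StringIO('a\nb\nc\nd\ne\nf\n')
print(list(lines_per_n(sample, 3)))
# ['a\nb\nc\n', 'd\ne\nf\n']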

then use that to chunk up your input file:


with open(file) as f:
    for chunk in lines_per_n(f, 7):
        jfile = json.loads(chunk)

        # do something with jfile

Alternatively, if your blocks turn out to be of variable length, read until you have something that parses:


with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # do something with jfile
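One caveat worth hedging: if the file ends with an incomplete value, the next(f) call raises StopIteration. A defensive variant of the same loop (assumption: you would rather surface a clear error than crash):

import json

with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = json.loads(line)
                break
            except ValueError:
                try:
                    line += next(f)  # pull in one more line and retry
                except StopIteration:
                    raise ValueError('file ended mid-object: %r' % line[:80])

        # do something with jfile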

Answer by Jeff Younker

As stated elsewhere, a general solution is to read the file in pieces, append each piece to the last, and try to parse that new chunk. If it doesn't parse, continue until you get something that does. Once you have something that parses, return it, and restart the process. Rinse-lather-repeat until you run out of data.


Here is a succinct generator that will do this:


def load_json_multiple(segments):
    chunk = ""
    for segment in segments:
        chunk += segment
        try:
            yield json.loads(chunk)  # complete value: hand it back...
            chunk = ""               # ...and start accumulating afresh
        except ValueError:
            pass  # not a complete JSON value yet; keep accumulating

Use it like this:


with open('foo.json') as f:
    for parsed_json in load_json_multiple(f):
        print(parsed_json)
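And a quick sanity check with an in-memory stream (two hypothetical objects, the second split across two lines):

import io
import json

stream = io.StringIO('{"zipcode": "00544"}\n'
                     '{"zipcode": "00601",\n'
                     ' "triggers": [1, 2]}\n')
for parsed in load_json_multiple(stream):
    print(parsed)
# {'zipcode': '00544'}
# {'zipcode': '00601', 'triggers': [1, 2]}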

I hope this helps.
