
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4033633/

Date: 2020-08-18 13:55:55 | Source: igfitidea

Handling lazy JSON in Python - 'Expecting property name'

Tags: python, json

Asked by Seidr

Using Python's (2.7) 'json' module I'm looking to process various JSON feeds. Unfortunately some of these feeds do not conform to the JSON standard - specifically, some keys are not wrapped in double quotes ("). This is causing Python to bug out.

Before writing an ugly-as-hell piece of code to parse and repair the incoming data, I thought I'd ask - is there any way to allow Python to either parse this malformed JSON or 'repair' the data so that it would be valid JSON?


Working example


>>> import json
>>> json.loads('{"key1":1,"key2":2,"key3":3}')
{u'key3': 3, u'key2': 2, u'key1': 1}

Broken example


>>> import json
>>> json.loads('{key1:1,key2:2,key3:3}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\json\__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 346, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 362, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)

I've written a small regex to fix the JSON coming from this particular provider, but I foresee this being an issue in the future. Below is what I came up with.


>>> import re
>>> s = '{key1:1,key2:2,key3:3}'
>>> s = re.sub('([{,])([^{:\s"]*):', lambda m: '%s"%s":'%(m.group(1),m.group(2)),s)
>>> s
'{"key1":1,"key2":2,"key3":3}'

Accepted answer by Ned Batchelder

You're trying to use a JSON parser to parse something that isn't JSON. Your best bet is to get the creator of the feeds to fix them.


I understand that isn't always possible. You might be able to fix the data using regexes, depending on how broken it is:


j = re.sub(r"{\s*(\w)", r'{"\1', j)
j = re.sub(r",\s*(\w)", r',"\1', j)
j = re.sub(r"(\w):", r'\1":', j)
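For instance, applied to the broken feed from the question (with the backreferences written out explicitly), the three substitutions produce valid JSON:

```python
import json
import re

j = '{key1:1,key2:2,key3:3}'
j = re.sub(r"{\s*(\w)", r'{"\1', j)   # open a quote after "{"
j = re.sub(r",\s*(\w)", r',"\1', j)   # open a quote after ","
j = re.sub(r"(\w):", r'\1":', j)      # close the quote before ":"
print(j)              # {"key1":1,"key2":2,"key3":3}
print(json.loads(j))
```

Note this only works for keys made of word characters with no embedded braces, commas, or colons - exactly the limitation psanchez points out further down.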

Answer by cheeseinvert

Expanding on Ned's suggestion, the following has been helpful for me:


j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
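As a sketch (again with the backreferences written out), this variant also handles single-quoted keys and single-quoted word-only values:

```python
import json
import re

j = "{ key1 : 'value', key2 : 2 }"
j = re.sub(r"{\s*'?(\w)", r'{"\1', j)               # opening brace, optional quote
j = re.sub(r",\s*'?(\w)", r',"\1', j)               # comma, optional quote
j = re.sub(r"(\w)'?\s*:", r'\1":', j)               # end of key, optional quote, colon
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)  # single-quoted values
print(json.loads(j))
```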

Answer by Joel

Another option is to use the demjson module, which can parse JSON in non-strict mode.


Answer by psanchez

The regular expressions pointed out by Ned and cheeseinvert don't take into account when the match is inside a string.


See the following example (using cheeseinvert's solution):


>>> fixLazyJsonWithRegex ('{ key : "a { a : b }", }')
'{ "key" : "a { "a": b }" }'

The problem is that the expected output is:


'{ "key" : "a { a : b }" }'

Since JSON tokens are a subset of Python tokens, we can use Python's tokenize module.


Please correct me if I'm wrong, but the following code will fix a lazy json string in all the cases:


import tokenize
import token
from StringIO import StringIO

def fixLazyJson (in_text):
  tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

  result = []
  for tokid, tokval, _, _, _ in tokengen:
    # fix unquoted strings
    if (tokid == token.NAME):
      if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
        tokid = token.STRING
        tokval = u'"%s"' % tokval

    # fix single-quoted strings
    elif (tokid == token.STRING):
      if tokval.startswith ("'"):
        tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

    # remove invalid commas
    elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
      if (len(result) > 0) and (result[-1][1] == ','):
        result.pop()

    result.append((tokid, tokval))

  return tokenize.untokenize(result)
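The code above targets Python 2 (the StringIO module, u'' literals). A minimal Python 3 port of the same tokenizer trick might look like this (a sketch, not the answer's original code):

```python
import io
import json
import token
import tokenize

def fix_lazy_json_py3(in_text):
    """Python 3 sketch of the tokenizer approach above."""
    tokengen = tokenize.generate_tokens(io.StringIO(in_text).readline)
    result = []
    for tokid, tokval, _, _, _ in tokengen:
        if tokid == token.NAME and tokval not in (
                'true', 'false', 'null', '-Infinity', 'Infinity', 'NaN'):
            # quote bare names so they become JSON strings
            tokid, tokval = token.STRING, '"%s"' % tokval
        elif tokid == token.STRING and tokval.startswith("'"):
            # convert single-quoted strings, escaping embedded double quotes
            tokval = '"%s"' % tokval[1:-1].replace('"', '\\"')
        elif tokid == token.OP and tokval in ('}', ']'):
            # drop a trailing comma before a closing brace/bracket
            if result and result[-1][1] == ',':
                result.pop()
        result.append((tokid, tokval))
    return tokenize.untokenize(result)

print(json.loads(fix_lazy_json_py3("{ key : 'a { a : b }', }")))
```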

So in order to parse a json string, you might want to encapsulate a call to fixLazyJson once json.loads fails (to avoid performance penalties for well-formed json):


import json

def json_decode (json_string, *args, **kwargs):
  try:
    return json.loads (json_string, *args, **kwargs)
  except ValueError:
    json_string = fixLazyJson (json_string)
    return json.loads (json_string, *args, **kwargs)

The only problem I see when fixing lazy json, is that if the json is malformed, the error raised by the second json.loads won't be referencing the line and column from the original string, but the modified one.


As a final note I just want to point out that it would be straightforward to update any of the methods to accept a file object instead of a string.


BONUS: Apart from this, people usually likes to include C/C++ comments when json is used for configuration files, in this case, you can either remove comments using a regular expression, or use the extended version and fix the json string in one pass:


import tokenize
import token
from StringIO import StringIO

def fixLazyJsonWithComments (in_text):
  """ Same as fixLazyJson but removing comments as well
  """
  result = []
  tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

  sline_comment = False
  mline_comment = False
  last_token = ''

  for tokid, tokval, _, _, _ in tokengen:

    # ignore single line and multi line comments
    if sline_comment:
      if (tokid == token.NEWLINE) or (tokid == tokenize.NL):
        sline_comment = False
      continue

    # ignore multi line comments
    if mline_comment:
      if (last_token == '*') and (tokval == '/'):
        mline_comment = False
      last_token = tokval
      continue

    # fix unquoted strings
    if (tokid == token.NAME):
      if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
        tokid = token.STRING
        tokval = u'"%s"' % tokval

    # fix single-quoted strings
    elif (tokid == token.STRING):
      if tokval.startswith ("'"):
        tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

    # remove invalid commas
    elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
      if (len(result) > 0) and (result[-1][1] == ','):
        result.pop()

    # detect single-line comments
    elif tokval == "//":
      sline_comment = True
      continue

    # detect multiline comments
    elif (last_token == '/') and (tokval == '*'):
      result.pop() # remove previous token
      mline_comment = True
      continue

    result.append((tokid, tokval))
    last_token = tokval

  return tokenize.untokenize(result)

Answer by tzot

In a similar case, I have used ast.literal_eval. AFAIK, this fails only when the constant null (corresponding to Python's None) appears in the JSON.


Given that you know about the null/None predicament, you can:


import ast
decoded_object= ast.literal_eval(json_encoded_text)
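For example (the null workaround shown here is a naive text replace, and assumes 'null' never occurs inside a string value):

```python
import ast

# single-quoted "JSON" that happens to be a valid Python literal
lazy = "{'key1': 1, 'key2': 'two', 'flag': True}"
print(ast.literal_eval(lazy))

# naive null workaround: replace null with None before evaluating
lazy_null = "{'key': null}".replace('null', 'None')
print(ast.literal_eval(lazy_null))
```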

Answer by Stan

In addition to Ned's and cheeseinvert's suggestions, adding (?!/) should avoid the mentioned problem with URLs:


j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:(?!/)", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
j = re.sub(r",\s*]", "]", j)
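For instance, a double-quoted URL value, whose http: would otherwise be mangled by the key-quoting substitution, now survives (a sketch; note the single-quoted-value rule still only matches \w+ values):

```python
import json
import re

j = '{ url: "http://example.com/a", id: 1 }'
j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:(?!/)", r'\1":', j)          # (?!/) leaves "http:" alone
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
j = re.sub(r",\s*]", "]", j)                        # drop trailing commas in arrays
print(json.loads(j))
```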