Handling lazy JSON in Python - 'Expecting property name'

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/4033633/

Asked by Seidr
Using Python's (2.7) 'json' module, I'm looking to process various JSON feeds. Unfortunately some of these feeds do not conform to the JSON standard - specifically, some keys are not wrapped in double quotation marks ("). This is causing Python to bug out.
Before writing an ugly-as-hell piece of code to parse and repair the incoming data, I thought I'd ask - is there any way to allow Python to either parse this malformed JSON or 'repair' the data so that it would be valid JSON?
Working example
>>> import json
>>> json.loads('{"key1":1,"key2":2,"key3":3}')
{'key3': 3, 'key2': 2, 'key1': 1}
Broken example
>>> import json
>>> json.loads('{key1:1,key2:2,key3:3}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\json\__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 346, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 362, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)
I've written a small regex to fix the JSON coming from this particular provider, but I foresee this being an issue in the future. Below is what I came up with.
>>> import re
>>> s = '{key1:1,key2:2,key3:3}'
>>> s = re.sub('([{,])([^{:\s"]*):', lambda m: '%s"%s":'%(m.group(1),m.group(2)),s)
>>> s
'{"key1":1,"key2":2,"key3":3}'
Accepted answer by Ned Batchelder
You're trying to use a JSON parser to parse something that isn't JSON. Your best bet is to get the creator of the feeds to fix them.
I understand that isn't always possible. You might be able to fix the data using regexes, depending on how broken it is:
j = re.sub(r"{\s*(\w)", r'{"', j)
j = re.sub(r",\s*(\w)", r',"', j)
j = re.sub(r"(\w):", r'":', j)
Answered by cheeseinvert
Expanding on Ned's suggestion, the following has been helpful for me:
j = re.sub(r"{\s*'?(\w)", r'{"', j)
j = re.sub(r",\s*'?(\w)", r',"', j)
j = re.sub(r"(\w)'?\s*:", r'":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':""', j)
Answered by psanchez
The regular expressions pointed out by Ned and cheeseinvert don't account for the case where the match is inside a string.
See the following example (using cheeseinvert's solution):
>>> fixLazyJsonWithRegex ('{ key : "a { a : b }", }')
'{ "key" : "a { "a": b }" }'
The problem is that the expected output is:
'{ "key" : "a { a : b }" }'
Since JSON tokens are a subset of Python tokens, we can use Python's tokenize module.
Please correct me if I'm wrong, but the following code will fix a lazy JSON string in all cases:
import tokenize
import token
from StringIO import StringIO

def fixLazyJson (in_text):
    tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

    result = []
    for tokid, tokval, _, _, _ in tokengen:
        # fix unquoted strings
        if (tokid == token.NAME):
            if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
                tokid = token.STRING
                tokval = u'"%s"' % tokval

        # fix single-quoted strings
        elif (tokid == token.STRING):
            if tokval.startswith ("'"):
                tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

        # remove invalid commas
        elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
            if (len(result) > 0) and (result[-1][1] == ','):
                result.pop()

        result.append((tokid, tokval))

    return tokenize.untokenize(result)
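A quick round trip (my test, not part of the original answer), covering an unquoted key, a single-quoted value, and a trailing comma:

>>> import json
>>> fixed = fixLazyJson("{ key1: 1, key2: 'two', }")
>>> json.loads(fixed)['key2']
u'two'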
So in order to parse a JSON string, you might want to encapsulate a call to fixLazyJson once json.loads fails (to avoid performance penalties for well-formed JSON):
import json

def json_decode (json_string, *args, **kwargs):
    try:
        return json.loads (json_string, *args, **kwargs)
    except ValueError:
        json_string = fixLazyJson (json_string)
        return json.loads (json_string, *args, **kwargs)
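With that in place (and fixLazyJson from above in scope), strict and lazy documents go through the same call (my illustration):

>>> json_decode('{"a": 1}')['a']
1
>>> json_decode('{a: 1}')['a']
1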
The only problem I see when fixing lazy JSON is that if the JSON is malformed, the error raised by the second json.loads won't reference the line and column from the original string, but from the modified one.
As a final note I just want to point out that it would be straightforward to update any of the methods to accept a file object instead of a string.
BONUS: Apart from this, people usually like to include C/C++ comments when JSON is used for configuration files. In this case, you can either remove the comments using a regular expression, or use the extended version below and fix the JSON string in one pass:
import tokenize
import token
from StringIO import StringIO

def fixLazyJsonWithComments (in_text):
    """ Same as fixLazyJson but removing comments as well
    """
    result = []
    tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

    sline_comment = False
    mline_comment = False
    last_token = ''

    for tokid, tokval, _, _, _ in tokengen:

        # ignore single line and multi line comments
        if sline_comment:
            if (tokid == token.NEWLINE) or (tokid == tokenize.NL):
                sline_comment = False
            continue

        # ignore multi line comments
        if mline_comment:
            if (last_token == '*') and (tokval == '/'):
                mline_comment = False
            last_token = tokval
            continue

        # fix unquoted strings
        if (tokid == token.NAME):
            if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
                tokid = token.STRING
                tokval = u'"%s"' % tokval

        # fix single-quoted strings
        elif (tokid == token.STRING):
            if tokval.startswith ("'"):
                tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

        # remove invalid commas
        elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
            if (len(result) > 0) and (result[-1][1] == ','):
                result.pop()

        # detect single-line comments
        elif tokval == "//":
            sline_comment = True
            continue

        # detect multiline comments
        elif (last_token == '/') and (tokval == '*'):
            result.pop() # remove previous token
            mline_comment = True
            continue

        result.append((tokid, tokval))
        last_token = tokval

    return tokenize.untokenize(result)
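A quick sanity check (mine, not from the original answer), mixing both comment styles with lazy keys, single quotes, and a trailing comma:

>>> s = '''{
...   key1: 'value', // a single-line comment
...   /* a multi-line
...      comment */
...   key2: 2,
... }'''
>>> import json
>>> json.loads(fixLazyJsonWithComments(s))['key2']
2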
Answered by tzot
In a similar case, I have used ast.literal_eval. AFAIK, the only case where this won't work is when the constant null (corresponding to Python None) appears in the JSON.
Given that you know about the null/None predicament, you can:
import ast
decoded_object = ast.literal_eval(json_encoded_text)
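For illustration (my examples, not tzot's): single-quoted pairs parse directly, but null has to be substituted first, and unquoted keys remain out of reach since they are not Python literals:

>>> import ast
>>> ast.literal_eval("{'key1': 1, 'key2': 'two'}")['key2']
'two'
>>> ast.literal_eval("{'maybe': null}".replace('null', 'None'))['maybe'] is None  # crude: also hits 'null' inside strings
True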
Answered by Stan
In addition to Ned's and cheeseinvert's suggestions, adding (?!/) should avoid the mentioned problem with URLs:
j = re.sub(r"{\s*'?(\w)", r'{"', j)
j = re.sub(r",\s*'?(\w)", r',"', j)
j = re.sub(r"(\w)'?\s*:(?!/)", r'":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':""', j)
j = re.sub(r",\s*]", "]", j)

