How do I automatically fix an invalid JSON string in Python?
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/18514910/
Asked by Anton Barycheuski
From the 2gis API I got the following JSON string.
{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva, 172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}
But Python doesn't recognize it:
json.loads(firm_str)
Expecting , delimiter: line 1 column 3646 (char 3645)
It looks like a problem with quotes in: "title": "Center "ADVANCE""
How can I fix it automatically in Python?
Accepted answer by tobias_k
The answer by @Michael gave me an idea... not a very pretty idea, but it seems to work, at least on your example: try to parse the JSON string, and if it fails, look for the character where it failed in the exception string (see note 1) and replace that character.
import json
import re

# s holds the invalid JSON string
while True:
    try:
        result = json.loads(s)   # try to parse...
        break                    # parsing worked -> exit loop
    except ValueError as e:
        # "Expecting , delimiter: line 34 column 54 (char 1158)"
        # position of unexpected character after '"'
        unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
        # position of unescaped '"' before that
        unesc = s.rfind(r'"', 0, unexp)
        s = s[:unesc] + r'\"' + s[unesc+1:]
        # position of corresponding closing '"' (+2 for inserted '\')
        closg = s.find(r'"', unesc + 2)
        s = s[:closg] + r'\"' + s[closg+1:]
print(result)
You may want to add some additional checks to prevent this from ending in an infinite loop (e.g., at most as many repetitions as there are characters in the string). Also, this will still not work if an incorrect " is actually followed by a comma, as pointed out by @gnibbler.
Update: This seems to work pretty well now (though still not perfect), even if the unescaped " is followed by a comma or closing bracket, as in this case the parser will likely complain about a syntax error after that (expected property name, etc.), which traces back to the last ". It also automatically escapes the corresponding closing " (assuming there is one).
1) The exception's str is "Expecting , delimiter: line XXX column YYY (char ZZZ)", where ZZZ is the position in the string where the error occurred. Note, though, that this message may depend on the version of Python, the json module, the OS, or the locale, and thus this solution may have to be adapted accordingly.
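On Python 3.5 and later, json.loads raises json.JSONDecodeError, which exposes the failing index directly as its pos attribute, so the message does not have to be parsed with a regular expression. The following is a minimal sketch of the same repair loop under that assumption, with a bound on the number of attempts. The function name repair_quotes and the max_tries parameter are made up for illustration, and the heuristic has the same blind spots as the loop above.

```python
import json


def repair_quotes(s, max_tries=None):
    """Escape stray double quotes located via JSONDecodeError.pos.

    Hypothetical helper, Python 3.5+ only. Returns the parsed object.
    """
    # Bound the number of attempts so that input which cannot be
    # repaired this way does not loop forever.
    if max_tries is None:
        max_tries = len(s)
    for _ in range(max_tries):
        try:
            return json.loads(s)
        except json.JSONDecodeError as e:
            # e.pos is the index where parsing failed; the stray quote
            # is assumed to be the last '"' before that index.
            unesc = s.rfind('"', 0, e.pos)
            if unesc == -1:
                raise
            s = s[:unesc] + '\\"' + s[unesc + 1:]
            # Escape the presumed matching closing quote as well
            # (+2 skips the two characters just inserted).
            closg = s.find('"', unesc + 2)
            if closg == -1:
                raise
            s = s[:closg] + '\\"' + s[closg + 1:]
    raise ValueError("could not repair JSON within %d attempts" % max_tries)


bad = '{"title": "Center "ADVANCE"", "text": "Business.English."}'
print(repair_quotes(bad)["title"])
```

Raising after max_tries attempts, rather than returning partial data, keeps a failed repair visible to the caller.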
Answer by atorres757
If this is exactly what the API is returning then there is a problem with their API. This is invalid JSON. Especially around this area:
"ads": {
    "sponsored_article": {
        "title": "Образовательный центр "ADVANCE"", <-- here
        "text": "Бизнес.Риторика.Английский язык.Подготовка к школе.Подготовка к ЕГЭ."
    },
    "warning": null
}
The double quotes around ADVANCE are not escaped. You can tell by using something like http://jsonlint.com/ to validate it.
The problem is that the " characters are not escaped; if this is what you are getting, the data is bad at the source. They need to fix it.
Parse error on line 4:
...азовательный центр "ADVANCE"",
-----------------------^
Expecting '}', ':', ',', ']'
This fixes the problem:
"title": "Образовательный центр \"ADVANCE\"",
Answer by Michael Foukarakis
You need to escape double quotes in JSON strings, as follows:
"title": "Образовательный центр \"ADVANCE\"",
To fix it programmatically, the simplest way would be to modify your JSON parser so you have some context for the error, then attempt to repair it.
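For comparison, this is what correct escaping looks like when the JSON is produced by a serializer instead of by string concatenation. A small sketch using Python's standard json module; the record dictionary is just sample data modelled on the question.

```python
import json

# Sample data modelled on the question; the title contains double quotes.
record = {"title": 'Center "ADVANCE"', "text": "Business.English."}

# json.dumps escapes the inner quotes, so the output is valid JSON.
encoded = json.dumps(record)
print(encoded)
# {"title": "Center \"ADVANCE\"", "text": "Business.English."}

# The encoded string parses back to the original data.
assert json.loads(encoded) == record
```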
Answer by Paolo
The only real and definitive solution is to ask 2gis to fix their API.
In the meantime, it is possible to fix the badly encoded JSON by escaping the double quotes inside strings. If every key-value pair is followed by a newline (as seems to be the case in the posted data), the following function will do the job:
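def fixjson(badjson):
    s = badjson
    idx = 0
    while True:
        try:
            # the value starts right after '": "'
            start = s.index('": "', idx) + 4
            # the value ends at the next '",\n' or '"\n', whichever comes first
            end1 = s.index('",\n', idx)
            end2 = s.index('"\n', idx)
            end = min(end1, end2)
            content = s[start:end]
            # escape the double quotes inside the value
            content = content.replace('"', '\\"')
            s = s[:start] + content + s[end:]
            idx = start + len(content) + 6
        except ValueError:
            # str.index found no further match: nothing left to fix
            return s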
Please note the assumptions made:
The function attempts to escape double quote characters inside value strings belonging to key-value pairs.
It is assumed that the text to be escaped begins after the sequence
": "
and ends before the sequence
",\n
or
"\n
Passing the posted JSON to the function results in this returned value:
{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva, 172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center \"ADVANCE\"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}
Keep in mind you can easily customize the function if your needs are not fully satisfied.
Answer by theBuzzyCoder
The above idea is good, but I had a problem with it: my JSON string contained only one additional double quote. So I made a fix to the code given above.
The jsonStr was
{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva, 172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}
The fix is as follows:
import json, re

def fixJSON(jsonStr):
    # Remove all backslashes from the JSON string.
    jsonStr = re.sub(r'\\', '', jsonStr)
    try:
        return json.loads(jsonStr)
    except ValueError:
        while True:
            # Search the JSON string for a '"' that sits between
            # word characters or other quotes.
            b = re.search(r'[\w|"]\s?(")\s?[\w|"]', jsonStr)
            # If we don't find any, we leave the loop.
            if not b:
                break
            # Get the location of that '"'.
            s, e = b.span(1)
            c = jsonStr[s:e]
            # Replace the '"' with a single quote.
            c = c.replace('"', "'")
            jsonStr = jsonStr[:s] + c + jsonStr[e:]
        return json.loads(jsonStr)
This code also works for the JSON string mentioned in the problem statement.
Or you can also do this:
import json, re

def fixJSON(jsonStr):
    # Remove all backslashes, then replace every structural " (one next
    # to a brace, bracket, colon, or comma) with a backtick placeholder.
    # Note: a structural " separated from its brace, bracket, colon, or
    # comma by whitespace is not protected by these patterns, so this
    # expects compact JSON.
    jsonStr = re.sub(r'\\', '', jsonStr)
    jsonStr = re.sub(r'{"', '{`', jsonStr)
    jsonStr = re.sub(r'"}', '`}', jsonStr)
    jsonStr = re.sub(r'":"', '`:`', jsonStr)
    jsonStr = re.sub(r'":', '`:', jsonStr)
    jsonStr = re.sub(r'","', '`,`', jsonStr)
    jsonStr = re.sub(r'",', '`,', jsonStr)
    jsonStr = re.sub(r',"', ',`', jsonStr)
    jsonStr = re.sub(r'\["', '[`', jsonStr)
    jsonStr = re.sub(r'"\]', '`]', jsonStr)
    # Replace all the remaining, unwanted " with a space.
    jsonStr = re.sub(r'"', ' ', jsonStr)
    # Put back all the " where they are supposed to be.
    jsonStr = re.sub(r'`', '"', jsonStr)
    return json.loads(jsonStr)
Answer by Frost
In the source of https://fix-json.com I found a solution, but it's very dirty and looks like a hack. Just adapt it to Python:
jsString.match(/:.*"(.*)"/gi).forEach(function(element){
    var filtered = element.replace(/(^:\s*"|"(,)?$)/gi, '').trim();
    jsString = jsString.replace(filtered, filtered.replace(/(\\*)\"/gi, "\\\""));
});
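One possible Python adaptation of the snippet above, as a rough sketch and not a faithful port: it assumes each key-value pair sits on its own line, because the pattern runs from the first colon on a line to the last double quote on that line. The function name escape_inner_quotes is made up for illustration.

```python
import json
import re


def escape_inner_quotes(text):
    """Escape double quotes inside string values, one line at a time.

    Hypothetical helper; assumes one key-value pair per line.
    """
    # Each match runs from the first ':' on a line to the last '"' on it.
    for match in re.finditer(r':.*"(.*)"', text):
        element = match.group(0)
        # Strip the leading ': "' and the trailing '"' (or '",').
        filtered = re.sub(r'^:\s*"|",?$', '', element).strip()
        if '"' in filtered:
            # Turn every quote (and any backslashes already in front
            # of it) into a single escaped quote.
            escaped = re.sub(r'\\*"', lambda m: '\\"', filtered)
            # Like String.replace with a string pattern in JavaScript,
            # replace only the first occurrence.
            text = text.replace(filtered, escaped, 1)
    return text


bad = '{\n"title": "Center "ADVANCE"",\n"text": "Business.English."\n}'
fixed = escape_inner_quotes(bad)
print(json.loads(fixed)["title"])
```

Values that legitimately contain a colon followed by quotes, or several pairs on one line, will confuse this heuristic just as they would the JavaScript original.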
Answer by madjardi
It's ugly and far from perfect, but it works for me:
import json
import re
from json import JSONDecodeError

def get_json_info(info_row: str, type) -> dict:
    try:
        info = json.loads(info_row)
    except JSONDecodeError:
        # Fall back to splitting the row into key/value pairs by hand.
        data = {}
        try:
            for s in info_row.split('","'):
                if not s:
                    continue
                key, val = s.split(":", maxsplit=1)
                key = key.strip().lstrip("{").strip('"')
                val: str = re.sub('"', '\"', val.lstrip('"').strip('\"}'))
                data[key] = val
        except ValueError:
            print("ERROR:", info_row)
        info = data
    return info
Answer by tink
I made a jsonfixer to solve problems like this.
It's a Python package (2.7) (a half-finished command-line tool).
See https://github.com/half-pie/half-json
from half_json.core import JSONFixer
f = JSONFixer(max_try=100)
new_s = s.replace('\n', '')
result = f.fix(new_s)
d = json.loads(result.line)
# {u'name': u'ATB', u'modification_time': u'2013-08-09 20:04:36 07', u'city_id': u'3237585002430511', u'see_also': [{u'hash': u'5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e', u'ads': {u'warning': None, u'sponsored_article': {u'ADVANCE': u', ', u'text': u'Business.English.', u'title': u'Center '}}, u'lon': 38.973110606808, u'lat': 45.029031222211, u'id': u'3237491513434577', u'name': u'Advance'}], u'response_code': u'200', u'lon': u'38.969916127827', u'firm_group': {u'count': u'1', u'id': u'3237499103085728'}, u'create_time': u'2008-07-22 10:02:04 07', u'city_name': u'Krasnodar', u'address': u'Turgeneva, 172/1', u'lat': u'45.069889625267', u'id': u'3237490513229753', u'api_version': u'1.3', u'page_url': None}
And the test case in https://github.com/half-pie/half-json/blob/master/tests/test_cases.py#L76-L80:
line = '{"title": "Center "ADVANCE"", "text": "Business.English."}'
ok, newline, _ = JSONFixer().fix(line)
self.assertTrue(ok)
self.assertEqual('{"title": "Center ","ADVANCE":", ","text": "Business.English."}', newline)
Answer by shekhar chander
Fix #1
If you fetched it from some website, make sure you are still using the string exactly as received. In my case, I was doing .replace('\\"','"'). Because of this, the data was no longer valid JSON. If you did something like that too, fix that first.
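A short sketch of why that replace call breaks things, using sample data modelled on the question: stripping the backslashes un-escapes the inner quotes, so a string that parsed before no longer does.

```python
import json

# A valid JSON document whose value contains escaped double quotes.
valid = json.dumps({"title": 'Center "ADVANCE"'})
assert json.loads(valid)["title"] == 'Center "ADVANCE"'

# Removing the backslashes leaves bare quotes inside the string value.
broken = valid.replace('\\"', '"')

try:
    json.loads(broken)
    parsed = True
except json.JSONDecodeError:
    parsed = False

print(parsed)  # False
```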
Fix #2
Try adding some character in all the places instead of the key name. It will be fine.