Python & JSON: ValueError: Unterminated string starting at:

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/26541280/
Asked by Gabe Spradlin
I have read multiple StackOverflow articles on this and most of the top 10 Google results. Where my issue deviates is that I am using one script in Python to create my JSON files, and the next script, run not 10 minutes later, can't read that very file.
Short version: I generate leads for my online business, and I am attempting to learn Python in order to have better analytics on these leads. I am scouring 2 years' worth of leads with the intent of retaining the useful data and dropping anything personal - email addresses, names, etc. - while also saving the 30,000+ leads into a few dozen files for easy access.
So my first script opens every single individual lead file - 30,000+ of them - and determines the date it was captured based on a timestamp in the file. Then it saves that lead under the appropriate key in a dict. When all the data has been aggregated into this dict, text files are written out using json.dumps.
The dict's structure is:
addData['lead']['July_2013'] = { ... }
where the 'lead' key can be lead, partial, and a few others, and the 'July_2013' key is obviously a date-based key that can be any combination of the full month name and 2013 or 2014, going back to 'February_2013'.
The full error is this:
ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)
But I've manually looked at the file and my IDE says there are only 76,655 chars in the file. So how did it get to 9997846?
The file that fails is the 8th to be read; the other 7 and all other files that come after it read in via json.loads just fine.
Python says there is an unterminated string, so I looked at the end of the JSON in the file that fails and it appears to be fine. I've seen some mention about newlines needing to be \n in JSON, but this string is all one line. I've seen mention of \ vs \\, but in a quick look over the whole file I didn't see any backslashes. Other files do have \\ and they read in fine. And, these files were all created by json.dumps.
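As a quick sanity check on the escaping question (this snippet is illustrative and not part of the original post), json.dumps escapes newlines and backslashes inside strings, so a file produced by json.dumps should not contain a bare, unescaped backslash:

import json

# json.dumps escapes special characters inside strings when it writes JSON
encoded = json.dumps({'note': 'line1\nline2', 'path': 'C:\\temp'})
print encoded                      # backslashes and newlines appear escaped in the output
print json.loads(encoded)['note']  # round-trips back to a string containing a real newline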
I can't post the file because it still has personal info in it. Manually attempting to validate the JSON of a 76,000 char file isn't really viable.
Thoughts on how to debug this would be appreciated. In the meantime I am going to try to rebuild the files and see if this wasn't just a one-off bug, but that takes a while.
- Python 2.7 via Spyder & Anaconda
- Windows 7 Pro
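One way to narrow an error like this down (a debugging sketch added for illustration, not from the original question; the file name used here comes from the answer below) is to read the raw text yourself and compare its length with the offset json.loads reports. If the two disagree wildly, the string handed to the parser is not the file you think it is:

import json

aggFile = 'July_2014.cd.lead.agg'   # the failing aggregate file
raw = open(aggFile).read()
print "characters handed to the parser:", len(raw)

try:
    data = json.loads(raw)
except ValueError as e:
    # In Python 2.7 the offset only appears in the message text,
    # e.g. "Unterminated string starting at: line 1 column N (char N)"
    print "JSON error:", e
    print "tail of the string:", repr(raw[-80:])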
--- Edit --- Per request, I am posting the write code here:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s
import os
import json
import copy
from datetime import datetime
import time
global leadDir
global archiveDir
global aggLeads
def aggregate_individual_lead_files():
    """
    """
    # Get the aggLead global and
    global aggLeads

    # Get all the Files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]
    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the Base Filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)

            #print "File: ", lead

            # Get Lead Data
            leadData = json.loads(f.file_get_contents(lead))

            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

    print "Aggregate Top Lvl Keys: ", aggLeads.keys()
    print "Aggregate Next Lvl Keys: "
    for key in aggLeads:
        print "{}: ".format(key)

        for arcDate in aggLeads[key].keys():
            print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))
        # raw_input("Press Enter to continue...")


def agg_data(leadData, agg, fname=None):
    """
    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")

    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """
    #print "File: {} added to {}".format(fname, arcDate)

    return agg


def archive_lead(fname, arcDate):
    # Archive Path
    newArcPath = archiveDir + arcDate + '//'
    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)


def reformat_old_agg_data():
    """
    """
    # Get the aggLead global and
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))
        for uniqId in tmp:
            leadData = tmp[uniqId]
            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData


def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'

            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())

            f.file_put_contents(arcFile, json.dumps(arcData))


def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}

    # Aggregate all the old individual file
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()


if __name__ == "__main__":
    main()
Here is the read code:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s
import os
import json
import copy
from datetime import datetime
import time
global leadDir
global fields
global fieldTimes
global versions
def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile

    for ts in tmp:
        try:
            tmpTs = float(ts)
        except:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]

        for field in leadData:
            if field not in fields:
                fields[field] = []
            fields[field].append(float(ts))


def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")
        print "Version: ", ver
        versions[ver] = []

        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)


def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')

    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()

    print "Versions: ", versions


if __name__ == "__main__":
    main()
Answered by Gabe Spradlin
So I figured it out... I post this answer just in case someone else makes the same error.
First, I found a workaround, but I wasn't sure why it worked. From my original code, here is my file_get_contents function:
def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)
    else:
        return open(fname).read(maxFileRead)
I used it via:
tmp = json.loads(f.file_get_contents(aggFile))
This failed, over and over and over again. However, as I was attempting to get Python to at least give me the JSON string to put through a JSON validator, I came across mention of json.load vs json.loads. So I tried this instead:
a = open('D://Server Data//eagle805//emmetrics//forms//leads\July_2014.cd.lead.agg')
b = json.load(a)
While I haven't tested this output in my overall code, this code chunk does in fact read in the file, decode the JSON, and will even display the data without crashing Spyder. The variable explorer in Spyder shows that b is a dict of size 1465, and that is exactly how many records it should have. The portion of the displayed text from the end of the dict all looks good. So overall I have a reasonably high level of confidence that the data was parsed correctly.
When I wrote the file_get_contents function I saw several recommendations that I always provide a max number of bytes to read, so as to prevent Python from hanging on a bad return. The value of maxFileRead was 1E7. When I manually forced maxFileRead to be 1E9, everything worked fine. It turns out the file is just under 1.2E7 bytes. So the resulting string from reading the file was not the full string in the file, and as a result it was invalid JSON.
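For reference, a sketch of what avoiding the truncation could look like (this is an illustration of the idea, not the exact change made in the post; s.stripos and maxUrlRead are the helpers from the function above): read local files in full instead of capping the read at maxFileRead:

def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)
    else:
        # read() with no size argument returns the whole file, so the JSON
        # string can never be silently cut off mid-record.
        with open(fname) as fh:
            return fh.read()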
Normally I would think this is a bug, but clearly when opening and reading a file you need to be able to read just a chunk at a time for memory management. So I got bit by my own shortsightedness with regard to the maxFileRead value. The error message was correct but sent me off on a wild goose chase.
Hopefully this could save someone else some time.
Answered by Samantha
I got the same problem. As it turned out, the last line of the file was incomplete, probably because the download was halted abruptly: I decided there was enough data and simply stopped the process from the terminal.
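If the data is line-delimited JSON (one object per line), a defensive sketch along these lines - the file name dump.jsonl is hypothetical - salvages everything except the truncated final line:

import json

records = []
with open('dump.jsonl') as fh:          # hypothetical line-delimited JSON dump
    for lineno, line in enumerate(fh, 1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except ValueError:
            # typically the incomplete last line of an interrupted download
            print "skipping malformed line {}".format(lineno)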
Answered by ssi-anik
If someone is here just like I am, and you're handling JSON from form requests, then check whether a Content-Length header is set. I was getting this error because of that header: I pretty-printed (beautified) the JSON and found it had become quite large, which raised this error.
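As a rough illustration of why the declared body length matters (this WSGI-style sketch is an assumption about the setup, not code from the answer): if the request body is read by hand, it has to be read out to the full Content-Length, otherwise json.loads sees a truncated, unterminated string:

import json

def application(environ, start_response):
    # Read exactly the number of bytes the client declared; a short read
    # truncates the JSON payload and triggers "Unterminated string".
    length = int(environ.get('CONTENT_LENGTH') or 0)
    body = environ['wsgi.input'].read(length)
    try:
        json.loads(body)
        status, out = '200 OK', json.dumps({'ok': True})
    except ValueError:
        status, out = '400 Bad Request', json.dumps({'error': 'truncated or invalid JSON'})
    start_response(status, [('Content-Type', 'application/json')])
    return [out]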

