Python: speed up writing to files
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4961589/
Speed up writing to files
Asked by chmullig
I've profiled some legacy code I've inherited with cProfile. There were a bunch of changes I've already made that have helped (like using simplejson's C extensions!).
Basically this script is exporting data from one system to an ASCII fixed-width file. Each row is a record, and it has many values. Each line is 7158 characters and contains a ton of spaces. The total record count is 1.5 million records. Each row is generated one at a time, and takes a while (5-10 rows a second).
As each row is generated it's written to disk as simply as possible. The profiling indicates that about 19-20% of the total time is spent in file.write(). For a test case of 1,500 rows that's 20 seconds. I'd like to reduce that number.
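For reference, a minimal sketch of how that kind of measurement can be taken with cProfile; export_rows, the dummy 'x' rows, and the file names are hypothetical stand-ins for the real script:

import cProfile
import pstats

def export_rows(path, rows):
    # hypothetical stand-in for the real export loop: one fd.write() per record
    with open(path, 'w') as fd:
        for row in rows:
            fd.write(row)

rows = ['x' * 7157 + '\n' for _ in range(1500)]  # 1,500 fixed-width lines, as in the test case
cProfile.run("export_rows('test_export.dat', rows)", 'export.prof')
pstats.Stats('export.prof').sort_stats('cumulative').print_stats(10)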
Now it seems the next win will be reducing the amount of time spent writing to disk. I'd like to reduce it, if possible. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once.
import sys
from datetime import datetime

fd = open(data_file, 'w')
for c, (recordid, values) in enumerate(generatevalues()):
    row = prep_row(recordid, values)
    fd.write(row)
    if c % 117 == 0:
        if limit > 0 and c >= limit:
            break
        # progress indicator, printed once every 117 rows
        sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
        sys.stdout.flush()
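As a point of reference, a minimal sketch of asking Python for a larger write buffer so that most fd.write() calls only touch memory; it reuses the question's data_file, generatevalues() and prep_row() placeholders, and the 1 MiB buffer size is an arbitrary choice:

# a 1 MiB buffer lets most fd.write() calls return after a memory copy;
# data reaches the OS in large chunks instead of one small write per record
fd = open(data_file, 'w', buffering=1024 * 1024)
for recordid, values in generatevalues():
    fd.write(prep_row(recordid, values))
fd.close()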
My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:
rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    rows.append(prep_row(recordid, values))
    if c % 117 == 0:
        fd.write('\n'.join(rows))
        rows = []
My second thought would be to use another thread, but that makes me want to die inside.
Accepted answer by chmullig
Batching the writes into groups of 500 did indeed speed up the writes significantly. For this test case, writing rows individually took 21.051 seconds of I/O, writing in batches of 117 took 5.685 seconds for the same number of rows, and batches of 500 took only 0.266 seconds in total.
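A sketch of what that batched version might look like, with a batch size of 500 per the timings above; prep_row() and generatevalues() are the question's own placeholders, and it assumes prep_row() already ends each record with a newline (as the one-write-per-row version implies), which is why ''.join is used:

batch = []
with open(data_file, 'w') as fd:
    for recordid, values in generatevalues():
        batch.append(prep_row(recordid, values))
        if len(batch) == 500:
            fd.write(''.join(batch))  # one write() call per 500 records
            batch = []
    if batch:
        fd.write(''.join(batch))      # flush the final partial batch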
Answer by Winston Ewert
Actually, your problem is not that file.write() takes 20% of your time. It's that 80% of the time you aren't in file.write()!
Writing to the disk is slow. There is really nothing you can do about it. It simply takes a very large amount of time to write stuff out to disk. There is almost nothing you can do to speed it up.
What you want is for that I/O time to be the biggest part of the program, so that your speed is limited by the speed of the hard disk, not by your processing time. The ideal is for file.write() to have 100% usage!
Answer by tsg
You can use mmap in Python, which might help. But I suspect you made a mistake while profiling, because 7k * 1500 in 20 seconds is about 0.5 MB/s. Do a test in which you write random lines of the same length, and you will see it's much faster than that.
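A minimal sketch of that sanity check, writing 1,500 random 7,158-character lines and timing the result (the file name is arbitrary):

import random
import string
import time

# 1,500 lines of 7,157 random characters plus a newline, matching the record width
lines = [''.join(random.choices(string.ascii_letters + ' ', k=7157)) + '\n'
         for _ in range(1500)]

start = time.time()
with open('write_bench.dat', 'w') as fd:
    for line in lines:
        fd.write(line)
elapsed = time.time() - start

total = sum(len(line) for line in lines)
print('%d bytes in %.3f s = %.1f MB/s' % (total, elapsed, total / elapsed / 1e6))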

