Fastest way to write huge data to a file in Python

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/27384093/

Time: 2020-08-19 01:44:32  Source: igfitidea

Fastest way to write huge data in file

Tags: python, performance, file

Asked by ajknzhol

I am trying to generate random real numbers, integers, alphanumeric strings, and alphabetic strings, and then write them to a file until the file size reaches 10 MB.

The code is as follows.

import string
import random
import time
import sys


class Generator():
    def __init__(self):
        self.generate_alphabetical_strings()
        self.generate_integers()
        self.generate_alphanumeric()
        self.generate_real_numbers()

    def generate_alphabetical_strings(self):
        return ''.join(random.choice(string.ascii_lowercase) for i in range(12))

    def generate_integers(self):
        return ''.join(random.choice(string.digits) for i in range(12))

    def generate_alphanumeric(self):
        return ''.join(random.choice(self.generate_alphabetical_strings() +
                                     self.generate_integers()) for i in range(12))

    def _insert_dot(self, string, index):
        return string[:index].__add__('.').__add__(string[index:])


    def generate_real_numbers(self):
        rand_int_string = ''.join(random.choice(self.generate_integers()) for i in range(12))
        return self._insert_dot(rand_int_string, random.randint(0, 11))


from time import process_time
import os

a = Generator()

t = process_time()
inp = open("test.txt", "w")
lt = 10 * 1000 * 1000
count = 0
while count <= lt:
    inp.write(a.generate_alphanumeric())
    count += 39
inp.close()

elapsed_time = process_time() - t
print(elapsed_time)

It takes around 225.953125 seconds to complete. How can I improve the speed of this program? Please provide some code insights.

Accepted answer by Dr. Jan-Philip Gehrcke

Two major reasons for observed "slowness":

  • Your while loop is slow; it has about a million iterations.
  • You do not make proper use of I/O buffering. Do not make so many system calls. Currently, you are calling write() about one million times.

Create your data in a Python data structure first and call write() only once.

This is faster:

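# Uses the random, string, and time imports from the question's code.
t0 = time.time()
# Build all 10**7 characters in memory, then write them with a single call.
with open("bla.txt", "w") as f:
    f.write(''.join(random.choice(string.ascii_lowercase) for i in range(10**7)))
d = time.time() - t0
print("duration: %.2f s." % d)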

Output: duration: 7.30 s.

Now the program spends most of its time generating the data, i.e. in random stuff. You can easily see that by replacing random.choice(string.ascii_lowercase) with e.g. "a". Then the measured time drops to below one second on my machine.
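
One possible way to reduce the generation cost, assuming Python 3.6 or later, is random.choices, which draws many items in a single call instead of one random.choice call per character. The sketch below only compares the two approaches; the names per_char and bulk and the sample size N are made up for this illustration, and the actual speedup will depend on the machine and Python version.

```python
import random
import string
import timeit

N = 100_000  # sample size chosen for this sketch, not taken from the original


def per_char():
    # One random.choice call per character, as in the answer above.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(N))


def bulk():
    # random.choices (Python 3.6+) draws all N characters in a single call.
    return ''.join(random.choices(string.ascii_lowercase, k=N))


if __name__ == "__main__":
    # Time five runs of each variant; absolute numbers vary by machine.
    print("per_char: %.3f s" % timeit.timeit(per_char, number=5))
    print("bulk:     %.3f s" % timeit.timeit(bulk, number=5))
```

Both functions return a string of N lowercase letters, so either can be passed to a single write() call.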

And if you want to get even closer to seeing how fast your machine really is when writing to disk, use Python's fastest (?) way to generate largish data before writing it to disk:

>>> t0=time.time(); chunk=b"a"*10**7; f=open("bla.txt", "wb"); n=f.write(chunk); f.close(); d=time.time()-t0; print("duration: %.2f s." % d)
duration: 0.02 s.
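
If holding the whole payload in memory is not desirable, a middle ground is to write a handful of large chunks rather than a million tiny strings. The following is a minimal sketch under assumptions: the function name write_random_file, the 1 MB chunk size, and the use of random.choices (Python 3.6+) are illustrative choices, not part of the original answer.

```python
import random
import string


def write_random_file(path, target_bytes=10 * 1000 * 1000, chunk_chars=1000 * 1000):
    """Write random lowercase ASCII text to path until target_bytes are written."""
    written = 0
    # ASCII text means one character is exactly one byte on disk.
    with open(path, "w", encoding="ascii", newline="") as f:
        while written < target_bytes:
            # The last chunk may be shorter so the file ends at the target size.
            n = min(chunk_chars, target_bytes - written)
            f.write(''.join(random.choices(string.ascii_lowercase, k=n)))
            written += n
    return written
```

Calling write_random_file("test.txt") with the defaults issues ten write() calls in total, and the return value is the number of bytes written.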

Answer by Aaron Digulla

You literally create billions of objects which you then quickly throw away. In this case, it's probably better to write the strings directly into the file instead of concatenating them with ''.join().
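
A rough sketch of that idea, with a hypothetical function name write_chars_directly: the characters are handed straight to the file object through writelines, and the file's own buffer batches them before they reach the disk. Whether this beats a single ''.join() is not guaranteed, so it is worth timing both on your own machine.

```python
import random
import string


def write_chars_directly(path, n_chars):
    """Stream n_chars random lowercase letters into path without joining them."""
    with open(path, "w", encoding="ascii", newline="") as f:
        # writelines consumes the generator item by item; the buffered file
        # object collects the small strings, so no large string is built.
        f.writelines(random.choice(string.ascii_lowercase) for _ in range(n_chars))
```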

Answer by debiatan

The while loop under main calls generate_alphanumeric, which chooses several characters out of freshly generated random strings composed of twelve ASCII letters and twelve digits. That is basically the same as choosing either a random letter or a random digit, twelve times. That's your main bottleneck. This version will make your code one order of magnitude faster:

def generate_alphanumeric(self):
    res = ''
    for i in range(12):
        if random.randrange(2):
            res += random.choice(string.ascii_lowercase)
        else:
            res += random.choice(string.digits)
    return res

I'm sure it can be improved upon. I suggest you take your profiler for a spin.
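
For taking the profiler for a spin, the standard library's cProfile and pstats modules are one option. The sketch below is only an illustration: generate_alphanumeric is copied from the version above as a plain function, while build, profile_report, and the workload size are hypothetical names and values chosen for this example.

```python
import cProfile
import io
import pstats
import random
import string


def generate_alphanumeric():
    # Same logic as the version above, written as a standalone function.
    res = ''
    for i in range(12):
        if random.randrange(2):
            res += random.choice(string.ascii_lowercase)
        else:
            res += random.choice(string.digits)
    return res


def build(n_strings=2000):
    # Hypothetical workload: concatenate n_strings generated strings.
    return ''.join(generate_alphanumeric() for _ in range(n_strings))


def profile_report(func, top=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(build))
```

The report lists the functions with the largest cumulative time first, which shows how much of the run is spent inside the random module compared with string building.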