How to write a memory-efficient Python program?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, link the original, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1659659/



python, memory, memory-management

Asked by Hyman

It's said that Python automatically manages memory. I'm confused because I have a Python program that consistently uses more than 2GB of memory.


It's a simple multi-threaded binary data downloader and unpacker.


def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]  # unpack returns a 1-tuple; take the first element
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)

class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()

There are 15 threads running. Each request gets 15MB of data, unpacks it, and saves it to a local text file. How could this program consume more than 2GB of memory? Do I need to do any memory recycling work in this case? How can I see how much memory each object or function uses?


I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient way.


Edit: Here is the output of "cat /proc/meminfo":


MemTotal:        7975216 kB
MemFree:          732368 kB
Buffers:           38032 kB
Cached:          4365664 kB
SwapCached:        14016 kB
Active:          2182264 kB
Inactive:        4836612 kB
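
As an aside on the last question above (how to see what this particular process and its objects are using, rather than the system-wide numbers from /proc/meminfo): a minimal sketch using only the standard library, assuming Linux and CPython 2.x to match the code in the question.

import resource
import sys

# Peak resident set size of this process; on Linux the value is in kilobytes.
print "peak RSS: %d kB" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Shallow size of a single object (references it holds are not followed).
blob = "x" * (15 * 1024 * 1024)      # stand-in for the 15MB response body
print "size of blob: %d bytes" % sys.getsizeof(blob)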

Answered by tzot

Like others have said, you need at least the following two changes:


  1. Do not create a huge list of integers with range()

    # use xrange
    for i in xrange(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)
    
  2. Do not create a huge string as the full file body to be written at once

    # use writelines
    f = open(filename, 'w')
    f.writelines((datum + os.linesep) for datum in data)
    f.close()
    

Even better, you could write the file as:


    items = GetData(url)
    f = open(filename, 'w')
    for item in items:
        f.write(';'.join(item) + os.linesep)
    f.close()

Answered by Lennart Regebro

The major culprit here, as mentioned above, is the range() call. It will create a list with 15 million members, and that will eat up 200MB of your memory; with 15 threads, that's 3GB.


But also don't read the whole 15MB file into data; read bit by bit from the response instead. Sticking those 15MB into a variable will use up 15MB more memory than reading bit by bit from the response.


You might want to consider simply extracting data until you run out of input data, and comparing the count of records you extracted with what the first bytes said it should be. Then you need neither range() nor xrange(). Seems more pythonic to me. :)

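A minimal sketch of that approach, reading fixed-size records directly from the response instead of buffering the whole body first. The record length and struct format here are made-up placeholders, since the real layout is elided in the question:

import struct
import urllib2

RECORD_SIZE = 16        # placeholder: length of one fixed-size record
RECORD_FMT = "!IIQ"     # placeholder: layout of the fields in one record

def GetData(url):
    response = urllib2.urlopen(urllib2.Request(url))
    count = struct.unpack("!I", response.read(4))[0]
    seen = 0
    while True:
        chunk = response.read(RECORD_SIZE)
        if len(chunk) < RECORD_SIZE:
            break                     # ran out of input data
        seen += 1
        yield struct.unpack(RECORD_FMT, chunk)
    response.close()
    if seen != count:
        raise ValueError("expected %d records, got %d" % (count, seen))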

Answered by MarkR

Consider using xrange() instead of range(), I believe that xrange is a generator whereas range() expands the whole list.

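A quick way to see the difference (sys.getsizeof reports only the shallow size of the object itself; the million int objects inside the real list are not even counted here):

import sys

print sys.getsizeof(range(10 ** 6))     # a full list object with a million entries
print sys.getsizeof(xrange(10 ** 6))    # a tiny constant-size lazy object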

I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.


Currently you keep both in memory at the same time, and that is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.


Also, the final line


    f.write(os.linesep.join(data))

may actually mean you've temporarily got a third copy in memory (a big string containing the entire output file).


So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.


Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.

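A rough sketch of that batching idea, reusing GetData as the generator from the question; the batch size of 100 is just the example figure from above, the helper name write_records is made up, and the fields are assumed to already be strings (as the original ';'.join(item) implies):

import os

def write_records(url, filename, batch_size=100):
    f = open(filename, 'w')
    batch = []
    for item in GetData(url):              # one record at a time from the generator
        batch.append(';'.join(item))
        if len(batch) >= batch_size:
            f.write(os.linesep.join(batch) + os.linesep)
            batch = []                     # the written batch can now be reclaimed
    if batch:
        f.write(os.linesep.join(batch) + os.linesep)   # flush the remainder
    f.close()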

Likewise, reading the response could be done in chunks. As they're fixed records this should be reasonably easy.


Answered by Caleb Hattingh

The last line should surely be f.close()? Those trailing parens are kinda important.


Answered by PaulMcG

You could do more of your work in compiled C code if you convert this to a list comprehension:


data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))

to:


data = [';'.join(items) for items in GetData(url)]

This is actually slightly different from your original code. In your version, GetData returns a 3-tuple, which comes back in items. You then iterate over this triplet, and append ';'.join(item) for each item in it. This means that you get 3 entries added to data for every triplet read from GetData, each one ';'.join'ed. If the items are just strings, then ';'.join will give you back a string with every other character a ';' - that is, ';'.join("ABC") will give back "A;B;C". I think what you actually wanted was to have each triplet saved back to the data list as the 3 values of the triplet, separated by semicolons. That is what my version generates.


This may also help somewhat with your original memory problem, as you are no longer creating as many Python values. Remember that a variable in Python has much more overhead than one in a language like C. Since each value is itself an object, and you add the overhead of each name reference to that object, you can easily expand the theoretical storage requirement several-fold. In your case, reading 15MB x 15 = 225MB, plus the overhead of each item of each triple being stored as a string entry in your data list, could quickly grow to the 2GB size you observed. At minimum, my version of your data list will have only 1/3 the entries in it, plus the separate item references are skipped, plus the iteration is done in compiled code.

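To get a feel for that per-object overhead, sys.getsizeof shows the shallow size of each piece; a small illustration with made-up field values:

import sys

fields = ("FIELD1", "FIELD2", "FIELD3")      # a hypothetical unpacked triple
joined = ';'.join(fields)

print sys.getsizeof(joined)                  # one joined string
print sum(sys.getsizeof(f) for f in fields)  # three separate string objects
print sys.getsizeof(fields)                  # plus the tuple holding the references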

Answered by vy32

You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each line as it is read. This will make the remote servers wait for you, of course, but that's okay.


Python is just not very memory efficient. It wasn't built for that.


Answered by Denis Otkidach

There are 2 obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - these two will take about 500MB), and probably there are other places in the skipped code. Both are easy to make memory efficient: use response.read(4) instead of reading the whole response at once, and do it the same way in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Then change data.append(...) in MyThread.run() to:


if not first:
    f.write(os.linesep)
f.write(';'.join(item))

These changes will save you a lot of memory.

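For context, with both changes applied, the loop in MyThread.run() might look roughly like this (url and filename still come from the parts of the code that were skipped in the question):

def run(self):
    # GET URL FOR EACH REQUEST, as in the original
    f = open(filename, 'w')
    first = True
    for item in GetData(url):
        if not first:
            f.write(os.linesep)
        f.write(';'.join(item))
        first = False
    f.close()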

Answered by Tarnay Kálmán

Make sure you delete the threads after they are stopped (using del).
