postgresql psycopg2 memory leak after a large query

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/17199113/

psycopg2 leaking memory after large query

Tags: python, postgresql, psycopg2

Asked by Adam Berlinsky-Schine

I'm running a large query in a python script against my postgres database using psycopg2 (I upgraded to version 2.5). After the query is finished, I close the cursor and connection, and even run gc, but the process still consumes a ton of memory (7.3gb to be exact). Am I missing a cleanup step?

import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()  # the entire result set is loaded into memory here
del rows
cursor.close()
conn.close()
import gc
gc.collect()

Answered by joeblog

I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.

Instead of

cursor = conn.cursor()

write

cursor = conn.cursor(name="my_cursor_name")

or simpler yet

cursor = conn.cursor("my_cursor_name")

The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors

I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include "DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.

The suggestion above of using fetchone or fetchmany doesn't resolve the problem on its own since, if you leave the name parameter unset, psycopg will by default attempt to load the entire query result into RAM. The only other thing you may need to do (besides declaring a name parameter) is change the cursor.itersize attribute from the default 2000 down to, say, 1000 if you still have too little memory.
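
Putting the answer together, here is a minimal sketch (the connection string and query are placeholders from the question, and handle is a hypothetical per-row function):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")

# Naming the cursor makes psycopg2 create a server-side cursor and issue
# DECLARE/FETCH under the hood instead of pulling everything client-side.
cursor = conn.cursor(name="my_cursor_name")
cursor.itersize = 1000  # rows fetched per round trip (default is 2000)

cursor.execute("""large query""")
for row in cursor:  # transparently fetches itersize rows at a time
    handle(row)     # hypothetical per-row processing

cursor.close()
conn.close()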

Answered by Craig Ringer

Please see the answer by @joeblog above for the better solution.

First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
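
A minimal sketch of that chunked approach (process is a hypothetical per-row function; note the caveat in @joeblog's answer that, without a named cursor, the full result set still lands in client memory):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")

while True:
    rows = cursor.fetchmany(size=2000)  # work through 2000 rows at a time
    if not rows:
        break
    for row in rows:
        process(row)  # hypothetical per-row processing

cursor.close()
conn.close()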

Now, the explanation of why the memory isn't freed, and why that isn't a memory leak in the formally correct sense of the term.

Most processes don't release memory back to the OS when it's freed, they just make it available for re-use elsewhere in the program.

Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.

What usually ends up happening, unless extra caution is exercised by the program, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.

The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.

It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.

So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it'll re-use the previously freed memory from the last big allocation.
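
A quick way to observe that on Linux (a sketch; resource.getrusage reports the peak resident set size, in kilobytes on Linux):

import resource

def peak_rss_mb():
    # highest resident set size seen so far (Linux reports it in KB)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

big = [0] * 50_000_000   # roughly 400 MB of list storage
print(peak_rss_mb())
del big                  # handed back to the allocator, not necessarily the OS
big = [0] * 50_000_000   # the second allocation re-uses that memory...
print(peak_rss_mb())     # ...so the peak barely moves instead of doubling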

Answered by Le Droid

Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.

import psycopg2
import sys

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')  # named cursor -> server-side cursor
curPG.itersize = 100000  # rows fetched at one time from the server

curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS for a named cursor!
cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # to see the progression
conPG.commit()  # also closes the cursor
conPG.close()

As you will see, the dots appear rapidly in groups, then pause while the next buffer of rows (itersize) is fetched from the server, so you don't need fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM, as it uses a temporary table.
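
To reproduce that measurement yourself (assuming GNU time is installed, as on most Linux systems, and with myscript.py standing in for your file), run:

/usr/bin/time -v python myscript.py

and look for the "Maximum resident set size" line in the report.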