
Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17933344/

Date: 2020-08-19 09:31:52 | Source: igfitidea

python postgres can I fetchall() 1 million rows?

Tags: python, postgresql, fetchall

Asked by Medya Gh

I am using the psycopg2 module in Python to read from a Postgres database. I need to perform an operation on every row of a column that has more than 1 million rows.

I would like to know: would cur.fetchall() fail or cause my server to go down? (My RAM might not be big enough to hold all that data.)

q = "SELECT names FROM myTable;"
cur.execute(q)
rows = cur.fetchall()
for row in rows:
    doSomething(row)

What is a smarter way to do this?

Accepted answer by Burhan Khalid

fetchall() fetches every remaining row at once (it is fetchmany() that is bounded by the arraysize limit), so to prevent a massive hit on your client's memory you can either fetch rows in manageable batches, or simply step through the cursor until it is exhausted:

row = cur.fetchone()
while row:
    # do something with row
    row = cur.fetchone()
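The batch variant this answer mentions can be sketched with fetchmany(); the helper name and the batch size of 1000 below are my own choices, not from the original answer:

```python
def iter_in_batches(cur, size=1000):
    """Drain any DB-API cursor with fetchmany(size),
    keeping at most `size` rows in memory at a time."""
    while True:
        rows = cur.fetchmany(size)
        if not rows:  # empty list means the result set is exhausted
            return
        yield from rows

# usage (cur is an open psycopg2 cursor, as in the question):
# cur.execute("SELECT names FROM myTable")
# for row in iter_in_batches(cur, size=1000):
#     doSomething(row)
```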

Answered by alecxe

Consider using a server-side cursor:

When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned a huge amount of data, a proportionally large amount of memory will be allocated by the client.

If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.


Here's an example:


cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
    cursor.execute("FETCH 1000 FROM super_cursor")
    rows = cursor.fetchall()

    if not rows:
        break

    for row in rows:
        doSomething(row)

Answered by linqu

The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:


row = cursor.fetchone()


However, I noticed a significant slowdown when fetching rows one by one. I access an external database over an internet connection, which might be a reason for it.

Having a server-side cursor and fetching rows in batches proved to be the most performant solution. You can change the SQL statements (as in alecxe's answer), but there is also a pure Python approach using the feature provided by psycopg2:

cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)

while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break

    for row in rows:
        # do something with row
        pass

You can find more about server-side cursors in the psycopg2 wiki.

Answered by Le Droid

Here is the code to use for a simple server-side cursor with the speed of fetchmany-style batch management.

The principle is to use a named cursor in Psycopg2 and give it a good itersize so that it loads many rows at once, as fetchmany would, but with a single for rec in cursor loop that does the batched fetching implicitly.

With this code I made queries over 150 million rows from a multi-billion-row table within 1 hour, using 200 MB of RAM.
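The code itself is missing from this copy of the answer. A minimal sketch of the named-cursor-plus-itersize approach it describes might look like the following; the function name, cursor name, and batch size are my own assumptions, not the author's original code:

```python
def stream_query(conn, sql, batch=20000, name='stream_cursor'):
    """Iterate a query through a named (server-side) psycopg2 cursor.

    Passing name= to conn.cursor() makes psycopg2 declare the cursor on
    the server; itersize controls how many rows each network round trip
    fetches, and `for rec in cursor` then fetches batches implicitly.
    """
    with conn.cursor(name=name) as cur:
        cur.itersize = batch
        cur.execute(sql)
        yield from cur

# usage (conn is an open psycopg2 connection):
# for rec in stream_query(conn, "SELECT names FROM myTable"):
#     doSomething(rec)
```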

Answered by parity3

I'm not sure a named cursor is a good fit here, since there is no need to scroll forward/backward interactively. I could be wrong, though.

The fetchmany loop is tedious, but I think it's the best solution here. To make life easier, you can use the following:

from functools import partial
from itertools import chain

# chain.from_iterable was added in Python 2.6
from_iterable = chain.from_iterable

# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
    if parms is None:
        curs.execute(sql)
    else:
        curs.execute(sql, parms)
    # fetchmany() returns [] once the result set is exhausted,
    # which matches the sentinel and stops the iterator
    chunks_until_empty = iter(partial(curs.fetchmany, chunksize), [])
    return from_iterable(chunks_until_empty)

# example scenario
for row in run_and_iterate(cur, 'SELECT * FROM waffles_table WHERE num_waffles > %s', (10,)):
    print('lots of waffles: %s' % (row,))