python postgres 我可以 fetchall() 100 万行吗?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17933344/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
python postgres can I fetchall() 1 million rows?
提问by Medya Gh
I am using psycopg2module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.
我在 python 中使用psycopg2模块从 postgres 数据库中读取数据,我需要对超过 100 万行的列中的所有行进行一些操作。
I would like to know would cur.fetchall()
fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
我想知道会cur.fetchall()
失败还是导致我的服务器宕机?(因为我的 RAM 可能没有那么大,无法容纳所有数据)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
doSomething(row)
what is the smarter way to do this?
什么是更聪明的方法来做到这一点?
采纳答案by Burhan Khalid
fetchall()
fetches up to the arraysize
limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:
fetchall()
获取到arraysize
限制,因此为了防止对数据库的大量命中,您可以以可管理的批次获取行,或者简单地遍历游标直到其耗尽:
row = cur.fetchone()
while row:
# do something with row
row = cur.fetchone()
回答by alecxe
Consider using server side cursor:
考虑使用服务器端游标:
When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned an huge amount of data, a proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.
当执行数据库查询时,Psycopg 游标通常会获取后端返回的所有记录,并将它们传输到客户端进程。如果查询返回大量数据,客户端将按比例分配大量内存。
如果数据集太大而无法在客户端实际处理,则可以创建服务器端游标。使用这种游标,可以仅将受控数量的数据传输到客户端,以便可以检查大型数据集而无需将其完全保留在内存中。
Here's an example:
下面是一个例子:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)
回答by linqu
The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
Burhan 指出的解决方案是通过仅获取单行来减少大型数据集的内存使用:
row = cursor.fetchone()
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.
但是,我注意到逐行获取行的速度明显减慢。我通过互联网连接访问外部数据库,这可能是一个原因。
Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:
事实证明,使用服务器端游标并获取大量行是最高效的解决方案。您可以更改 sql 语句(如 alecxe 的答案),但也有使用 psycopg2 提供的功能的纯 python 方法:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
rows = cursor.fetchmany(5000)
if not rows:
break
for row in rows:
# do something with row
pass
you find more about server side cursors in the psycopg2 wiki
您可以在psycopg2 wiki 中找到有关服务器端游标的更多信息
回答by Le Droid
Here is the code to use for simple server side cursorwith the speed of fetchmany
management.
这是用于具有管理速度的简单服务器端光标的代码fetchmany
。
The principle is to use named cursorin Psycopg2 and give it a good itersize
to load many rows at once like fetchmany
would do but with a single loop of for rec in cursor
that does an implicit fetchnone()
.
原则是在 Psycopg2 中使用命名游标,并让它像往常itersize
一样一次加载多行,fetchmany
但使用单个循环for rec in cursor
执行隐式fetchnone()
.
With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.
使用此代码,我在 1 小时内和 200 兆内存中查询了数十亿行表中的 1.5 亿行。
回答by parity3
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
不确定命名光标是否适合而无需交互地向前/向后滚动?我在这里可能是错的。
The fetchmany
loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
该fetchmany
循环是乏味的,但我认为这是最好的解决方案。为了让生活更轻松,您可以使用以下内容:
from functools import partial
from itertools import chain
# from_iterable added >= python 2.7
from_iterable = chain.from_iterable
# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
if parms is None:
curs.execute(sql)
else:
curs.execute(sql, parms)
chunks_until_empty = iter(partial(fetchmany, chunksize), [])
return from_iterable(chunks_until_empty)
# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
print 'lots of waffles: %s' % (row,)