Python 如何在不耗尽内存的情况下从 sql 查询创建大型 Pandas 数据框?
声明:本页面是 Stack Overflow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):Stack Overflow
原文地址: http://stackoverflow.com/questions/18107953/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to create a large pandas dataframe from an sql query without running out of memory?
提问 by slizb
I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
我在从 MS SQL Server 数据库查询一张超过 500 万条记录的表时遇到了问题。我想选出所有记录,但当把过多数据读入内存时,我的代码似乎就会失败。
This works:
这有效:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
...但这不起作用:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
它返回此错误:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe from a CSV file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
我在这里读到,从 CSV 文件创建 dataframe 时也存在类似的问题,解决方法是使用 iterator 和 chunksize 参数,如下所示:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other method to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.
从 SQL 数据库查询时有没有类似的解决方案?如果没有,首选的变通办法是什么?我是否应该用其他方法来分块读取记录?我在这里读过一些关于在 pandas 中处理大型数据集的讨论,但只是执行一个 SELECT * 查询就要做这么多工作。肯定还有更简单的办法。
采纳答案 by ThePhysicist
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
更新:请务必查看下面的答案,因为 Pandas 现在内置了对分块加载的支持。
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
您可以简单地尝试逐块读取输入表,然后从各个部分组装完整的数据帧,如下所示:
import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
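Note that the LIMIT/OFFSET syntax above is MySQL/PostgreSQL style; since the question is about MS SQL Server, the same loop written with SQL Server's OFFSET ... FETCH pagination might look roughly like this (a sketch only, assuming the table has a sortable ID column and cnxn is the connection from the question):
注意,上面的 LIMIT/OFFSET 是 MySQL/PostgreSQL 风格的语法;由于问题针对的是 MS SQL Server,用 SQL Server 的 OFFSET ... FETCH 分页改写的同一循环大致如下(仅为示意,假设表中有可排序的 ID 列,cnxn 为题目中的连接):
# a sketch only: SQL Server (2012+) pagination with OFFSET ... FETCH
chunk_size = 10000
offset = 0
dfs = []
while True:
    sql = ("SELECT * FROM MyTable ORDER BY ID "
           "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY") % (offset, chunk_size)
    chunk = psql.read_frame(sql, cnxn)
    dfs.append(chunk)
    offset += chunk_size
    if len(chunk) < chunk_size:
        break
full_df = pd.concat(dfs)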
It might also be possible that the whole dataframe is simply too large to fit in memory; in that case you will have no other option than to restrict the number of rows or columns you're selecting.
也有可能整个数据框太大而无法放入内存,在这种情况下,您别无选择,只能限制您选择的行数或列数。
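If that is the case, the usual approach is to push the restriction into the query itself, for example by selecting only the columns you need or filtering rows in SQL. A minimal sketch (the column names are made up):
如果是这种情况,通常的做法是把限制写进查询本身,例如只选取需要的列,或在 SQL 中过滤行。下面是一个简单的示意(列名为虚构):
# a sketch with made-up column names: fetch only the columns and rows you actually need
sql = "SELECT ID, ColumnA, ColumnB FROM MyTable WHERE ColumnA IS NOT NULL"
data = psql.read_frame(sql, cnxn)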
回答 by Kamil Sindi
As mentioned in a comment, starting from pandas 0.15 you have a chunksize option in read_sql to read and process the query chunk by chunk:
正如评论中提到的,从 pandas 0.15 开始,read_sql 提供了 chunksize 参数,可以逐块读取并处理查询结果:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
参考:http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
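The example above only prints each chunk; in practice you would usually reduce each chunk as it arrives and keep only the small results. A sketch, assuming a SQLAlchemy engine named engine and a hypothetical Category column:
上面的例子只是打印每个块;实际使用时通常会在块到达时就对其做归约,只保留较小的结果。下面是一个示意,假设已有名为 engine 的 SQLAlchemy engine 和一个虚构的 Category 列:
# a sketch: aggregate each chunk as it arrives so only small results stay in memory
import pandas as pd

sql = "SELECT * FROM My_Table"
results = []
for chunk in pd.read_sql_query(sql, engine, chunksize=100000):
    # "Category" is a hypothetical column used for illustration
    results.append(chunk.groupby("Category").size())
counts = pd.concat(results).groupby(level=0).sum()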
回答 by flying_fluid_four
Code solution and remarks.
代码解决方案和备注。
import pandas as pd

# Create an empty list to collect the chunks
dfl = []
# Create an empty dataframe
dfs = pd.DataFrame()
# Start chunking (query is the SQL string, conct the database connection)
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):
    # Append each data chunk from the SQL result set to the list
    dfl.append(chunk)
# Concatenate the chunks from the list into one dataframe
dfs = pd.concat(dfl, ignore_index=True)
However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list keeps growing and occupying that memory, so there is no net gain in free RAM.
但是,我的内存分析告诉我,即使每个块提取完后内存被释放,列表也会越来越大并占用这些内存,因此可用内存实际上并没有净增加。
Would love to hear what the author / others have to say.
很想听听作者/其他人怎么说。
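If you do not actually need all of the raw rows in memory at once, one way around the growing list is to consume each chunk and then discard it, for example by streaming the chunks to disk. A sketch (the output file name is made up; query and conct are as in the snippet above):
如果并不需要把所有原始行同时放在内存里,绕开列表不断增长的一个办法是处理完每个块就把它丢弃,例如把块流式写入磁盘。示意如下(输出文件名为虚构,query 和 conct 同上面的代码):
# a sketch: stream each chunk straight to a CSV file instead of accumulating it in a list
first = True
for chunk in pd.read_sql(query, con=conct, chunksize=100000):
    chunk.to_csv("mytable_dump.csv", mode="w" if first else "a", header=first, index=False)
    first = False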
回答 by Dharsanborn98
If you want to limit the number of rows in output, just use:
如果要限制输出中的行数,只需使用:
data = psql.read_frame(sql, cnxn, chunksize=1000000).__next__()
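read_frame has since been removed from pandas; with the current API the same "first chunk only" idea would look roughly like this, assuming a SQLAlchemy engine named engine:
read_frame 后来已从 pandas 中移除;用当前的 API,同样的"只取第一个块"的思路大致如下(假设已有名为 engine 的 SQLAlchemy engine):
# a sketch with the current pandas API: take only the first chunk of the result set
import pandas as pd
data = next(pd.read_sql_query(sql, engine, chunksize=1000000))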