Python Pandas SQL 块大小

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/31837979/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 10:37:53  来源:igfitidea点击:

Pandas SQL chunksize

pythonsql-serverpandaschunks

提问by Nitin Kumar

This is more of a question on understanding than programming. I am quite new to Pandas and SQL. I am using pandas to read data from SQL with some specific chunksize. When I run a sql query e.g. import pandas as pd

这更多是关于理解而不是编程的问题。我对 Pandas 和 SQL 很陌生。我正在使用 Pandas 从具有特定块大小的 SQL 读取数据。当我运行 sql 查询时,例如 import pandas as pd

df = pd.read_sql_query('select name, birthdate from table1', chunksize = 1000)

What I do not understand is when I do not give a chunksize, data is stored in the memory and I can see the memory growing however, when I give a chunksize the memory usage is not that high.

我不明白的是,当我不给块大小时,数据存储在内存中,我可以看到内存在增长,但是当我给块大小时,内存使用率并没有那么高。

I have is that this df now contains a number of arrays which I can access as

我有的是,这个 df 现在包含许多我可以访问的数组

for df_array in df:
    print df.head(5)

What I do not understand here is if the entire result of the SQL statement is kept in memory i.e. df is an object carrying multiple arrays or if these are like pointers pointing towards a temp table created by SQL query.

我在这里不明白的是,SQL 语句的整个结果是否保存在内存中,即 df 是一个携带多个数组的对象,或者这些是否类似于指向 SQL 查询创建的临时表的指针。

I would be very glad to develop some understanding about how this process is actually working.

我很高兴能够对这个过程的实际运作方式有所了解。

回答by prusya

Let's consider two options and what happens in both cases:

让我们考虑两种选择以及两种情况下会发生什么:

  1. chunksize is None(default value):
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize is None
    • pandas tells database that it wants to receive all rows of the result table at once
    • database returns all rows of the result table
    • pandas stores the result table in memory and wraps it into a data frame
    • now you can use the data frame
  2. chunksize in not None:
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize has some value
    • pandas creates a query iterator(usual 'while True' loop which breaks when database says that there is no more data left) and iterates over it each time you want the next chunk of the result table
    • pandas tells database that it wants to receive chunksize rows
    • database returns the next chunksize rows from the result table
    • pandas stores the next chunksize rows in memory and wraps it into a data frame
    • now you can use the data frame
  1. 块大小为无(默认值):
    • pandas 将查询传递给数据库
    • 数据库执行查询
    • pandas 检查并看到 chunksize 为 None
    • pandas 告诉数据库它想一次接收结果表的所有行
    • 数据库返回结果表的所有行
    • pandas 将结果表存储在内存中并包装成数据框
    • 现在您可以使用数据框
  2. 块大小不是无:
    • pandas 将查询传递给数据库
    • 数据库执行查询
    • pandas 检查并看到 chunksize 有一些值
    • pandas 创建一个查询迭代器(通常的 'while True' 循环会在数据库说没有更多数据时中断)并在每次您想要结果表的下一个块时对其进行迭代
    • pandas 告诉数据库它想要接收大块大小的行
    • 数据库从结果表中返回下一个块大小的行
    • pandas 将下一个 chunksize 行存储在内存中并将其包装成一个数据框
    • 现在您可以使用数据框

For more details you can see pandas\io\sql.pymodule, it is well documented

有关更多详细信息,您可以查看pandas\io\sql.py模块,它有据可查

回答by joris

When you do not provide a chunksize, the full result of the query is put in a dataframe at once.

当您不提供 时chunksize,查询的完整结果会立即放入数据框中。

When you do provide a chunksize, the return value of read_sql_queryis an iterator of multiple dataframes. This means that you can iterate through this like:

当您提供 a 时chunksize, 的返回值read_sql_query是多个数据帧的迭代器。这意味着您可以像这样迭代:

for df in result:
    print df

and in each step dfis a dataframe (not an array!) that holds the data of a part of the query. See the docs on this: http://pandas.pydata.org/pandas-docs/stable/io.html#querying

并且在每个步骤中df都有一个数据框(不是数组!),其中包含查询的一部分的数据。请参阅关于此的文档:http: //pandas.pydata.org/pandas-docs/stable/io.html#querying

To answer your question regarding memory, you have to know that there are two steps in retrieving the data from the database: executeand fetch.
First the query is executed (result = con.execute()) and then the data are fetched from this result set as a list of tuples (data = result.fetch()). When fetching you can specify how many rows at once you want to fetch. And this is what pandas does when you provide a chunksize.
But, many database drivers already put all data into memory in the execute step, and not only when fetching the data. So in that regard, it should not matter much for the memory. Apart from the fact the copying of the data into a DataFrame only happens in different steps while iterating with chunksize.

要回答有关内存的问题,您必须知道从数据库检索数据有两个步骤:executefetch
首先执行查询 ( result = con.execute()),然后从该结果集中获取数据作为元组列表 ( data = result.fetch())。获取时,您可以指定一次要获取多少行。这就是 Pandas 在您提供chunksize.
但是,许多数据库驱动程序已经在执行步骤中将所有数据放入内存中,而不仅仅是在获取数据时。所以在这方面,内存应该没有太大关系。除了将数据复制到 DataFrame 仅在使用chunksize.