pandas: Pull large amounts of data from a remote server into a DataFrame
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/25633830/
Pull large amounts of data from a remote server, into a DataFrame
Asked by Marius Butuc
To give as much context as I can / is needed, I'm trying to pull some data stored on a remote postgres server (heroku) into a pandas DataFrame, using psycopg2 to connect.
I'm interested in two specific tables, users and events, and the connection works fine, because when pulling down the user data
import pandas.io.sql as sql
# [...]
users = sql.read_sql("SELECT * FROM users", conn)
after waiting a few seconds, the DataFrame is returned as expected.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 67458 entries, 0 to 67457
Data columns (total 35 columns): [...]
Yet when trying to pull the bigger, heavier events data straight from ipython, after a long time, it just crashes:
In [11]: events = sql.read_sql("SELECT * FROM events", conn)
vagrant@data-science-toolbox:~$
and when trying from an iPython notebook I get the Dead kernel error
The kernel has died, would you like to restart it? If you do not restart the kernel, you will be able to save the notebook, but running code will not work until the notebook is reopened.
Update #1:
To get a better idea of the size of the events table I'm trying to pull in, here are the number of records and the number of attributes for each:
In [11]: sql.read_sql("SELECT count(*) FROM events", conn)
Out[11]:
count
0 2711453
In [12]: len(sql.read_sql("SELECT * FROM events LIMIT 1", conn).columns)
Out[12]: 18
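As a rough back-of-the-envelope lower bound (assuming, purely for illustration, that every one of the 18 columns were a fixed 8-byte value; pandas object/string columns typically take several times more than this):
rows, cols, bytes_per_value = 2711453, 18, 8
print(rows * cols * bytes_per_value / 1024.0 ** 2)  # ~372 MB, an absolute lower bound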
Update #2:
Memory is definitely a bottleneck for the current implementation of read_sql: when pulling down the events and trying to run another instance of iPython the result is
vagrant@data-science-toolbox:~$ sudo ipython
-bash: fork: Cannot allocate memory
Update #3:
I first tried with a read_sql_chunked implementation that would just return the array of partial DataFrames:
def read_sql_chunked(query, conn, nrows, chunksize=1000):
    start = 0
    dfs = []
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), conn)
        start += chunksize
        dfs.append(df)
        print "Events added: %s to %s of %s" % (start-chunksize, start, nrows)
    # print "concatenating dfs"
    return dfs
event_dfs = read_sql_chunked("SELECT * FROM events", conn, events_count, 100000)
and that works well, but when trying to concatenate the DataFrames, the kernel dies again.
And this is after giving the VM 2GB of RAM.
Based on Andy's explanation of the read_sql vs. read_csv difference in implementation and performance, the next thing I tried was to append the records into a CSV and then read them all into a DataFrame:
event_dfs[0].to_csv(path+'new_events.csv', encoding='utf-8')
for df in event_dfs[1:]:
df.to_csv(path+'new_events.csv', mode='a', header=False, encoding='utf-8')
Again, the writing to CSV completes successfully – a 657MB file – but reading from the CSV never completes.
How can one approximate how much RAM would be sufficient to read, say, a 657MB CSV file, since 2GB seems not to be enough?
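One way I could imagine approximating it (a sketch, assuming a pandas recent enough to support nrows in read_csv and DataFrame.memory_usage with deep=True; the 10,000-row sample size is just an illustrative choice) is to parse a sample of the CSV, measure its in-memory footprint, and extrapolate:
import pandas as pd
sample = pd.read_csv(path + 'new_events.csv', nrows=10000, encoding='utf-8')
bytes_per_row = sample.memory_usage(deep=True).sum() / float(len(sample))
total_rows = 2711453  # from the earlier count(*) query
print("estimated in-memory size: ~%.0f MB" % (bytes_per_row * total_rows / 1024 ** 2))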
Feels like I'm missing some fundamental understanding of either DataFrames or psycopg2, but I'm stuck; I can't even pinpoint the bottleneck or where to optimize.
What's the proper strategy to pull larger amounts of data from a remote (postgres) server?
Accepted answer by Andy Hayden
I suspect there's a couple of (related) things at play here causing slowness:
- read_sql is written in Python, so it's a little slow (especially compared to read_csv, which is written in Cython and carefully implemented for speed!), and it relies on SQLAlchemy rather than some (potentially much faster) C-DBAPI. The impetus to move to SQLAlchemy was to make that move easier in the future (as well as cross-sql-platform support).
- You may be running out of memory because too many Python objects are in memory (this is related to not using a C-DBAPI), but it could potentially be addressed...
I think the immediate solution is a chunk-based approach (and there is a feature request to have this work natively in pandas read_sql and read_sql_table).
EDIT: As of Pandas v0.16.2 this chunk-based approach is natively implemented in read_sql.
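For readers on a pandas new enough to have it, a minimal sketch of that built-in chunking (the chunk size of 100000 is just an illustrative value; conn is the existing connection from the question):
import pandas as pd
total_rows = 0
for chunk in pd.read_sql("SELECT * FROM events", conn, chunksize=100000):
    # each chunk is a DataFrame of at most 100000 rows; work on it here
    total_rows += len(chunk)
print(total_rows)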
Since you're using postgres you have access to the LIMIT and OFFSET queries, which makes chunking quite easy. (Am I right in thinking these aren't available in all sql languages?)
First, get the number of rows (or an estimate) in your table:
nrows = con.execute('SELECT count(*) FROM users').fetchone()[0] # also works with an sqlalchemy engine
Use this to iterate through the table (for debugging you could add some print statements to confirm that it was working/not crashed!) and then combine the result:
def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0  # OFFSET is zero-based
    dfs = []  # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance to the next chunk
    return pd.concat(dfs, ignore_index=True)
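Putting the two pieces together might look like this (the 100000 chunk size is just an illustrative choice):
nrows = con.execute('SELECT count(*) FROM events').fetchone()[0]
events = read_sql_chunked("SELECT * FROM events", con, nrows, chunksize=100000)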
Note: this assumes that the database fits in memory! If it doesn't you'll need to work on each chunk (mapreduce style)... or invest in more memory!
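A sketch of that per-chunk (map/reduce style) idea, aggregating each chunk instead of keeping it, where the event_type column name is purely hypothetical:
def aggregate_sql_chunked(query, con, nrows, chunksize=1000):
    totals = None
    start = 0
    while start < nrows:
        chunk = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        counts = chunk['event_type'].value_counts()  # hypothetical column
        totals = counts if totals is None else totals.add(counts, fill_value=0)
        start += chunksize
    return totals  # only the running aggregate is ever held in memory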
Answered by Ch HaXam
try to use pandas:
import mysql.connector
import pandas as pd
mysql_cn = mysql.connector.connect(host='localhost', port=123, user='xyz', passwd='****', db='xy_db')
data = pd.read_sql('SELECT * FROM table;', con=mysql_cn)
mysql_cn.close()
It worked for me.
Answered by Hillary Murefu
Here is a basic cursor example that might be of help:
import psycopg2
# note that we have to import the Psycopg2 extras library!
import psycopg2.extras
import sys

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    ### print the connection string we will use to connect
    conn = psycopg2.connect(conn_string)
    ### HERE IS THE IMPORTANT PART, by specifying a name for the cursor
    ### psycopg2 creates a server-side cursor, which prevents all of the
    ### records from being downloaded at once from the server.
    cursor = conn.cursor('cursor_unique_name', cursor_factory=psycopg2.extras.DictCursor)
    cursor.execute('SELECT * FROM my_table LIMIT 1000')
    ### Because cursor objects are iterable we can just call 'for - in' on
    ### the cursor object and the cursor will automatically advance itself
    ### each iteration.
    ### This loop should run 1000 times, assuming there are at least 1000
    ### records in 'my_table'
    row_count = 0
    for row in cursor:
        row_count += 1
        print "row: %s %s\n" % (row_count, row)

if __name__ == "__main__":
    main()
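To tie this back to the question, a sketch of building a DataFrame from such a server-side cursor in batches (assuming the conn_string above; the 50000 batch size is just an illustrative choice, and the final concat still assumes the combined result fits in memory, as in the accepted answer):
import pandas as pd
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor('events_cursor')  # naming the cursor makes it server-side
cursor.execute('SELECT * FROM events')
dfs = []
columns = None
while True:
    rows = cursor.fetchmany(size=50000)  # pull one batch at a time
    if not rows:
        break
    if columns is None:
        # description is populated once rows have been fetched
        columns = [col[0] for col in cursor.description]
    dfs.append(pd.DataFrame(rows, columns=columns))
events = pd.concat(dfs, ignore_index=True)
cursor.close()
conn.close()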

