Fastest way to load numeric data into python/pandas/numpy array from MySQL

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22173872/

python, mysql, numpy, pandas, mysql-python

Asked by Fermion Portal

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;
conn = database(...);
c=fetch(exec(conn,'select x,y from TABLENAME'));
cell2mat(c.data);
toc

Elapsed time is ~1 second.

Doing the same in Python, using the several examples found here (I have tried them all, i.e. using pandas read_frame, frame_query and the __processCursor function): How to convert SQL Query result to PANDAS Data Structure?

Reference python code:

import pyodbc
import pandas.io.sql as psql
import pandas
connection_info = "DRIVER={MySQL ODBC 3.51 \
Driver};SERVER=;DATABASE=;USER=;PASSWORD=;OPTION=3;"
cnxn = pyodbc.connect(connection_info)
cursor = cnxn.cursor()
sql = "select x,y from TABLENAME"
#cursor.execute(sql)
#dataframe = __processCursor(cursor, dataframe=True)
#df = psql.frame_query(sql, cnxn, coerce_float=False)
df = psql.read_frame(sql, cnxn)
cnxn.close()

Takes ~6 seconds. The profiler says all the time was spent in read_frame. I was wondering if anyone could give me some hints on how this could be sped up to at least match the MATLAB code, and whether that is possible at all in Python.

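For reference, a minimal sketch of how such a profile can be produced (cProfile is in the standard library; sql and cnxn are the variables from the snippet above):

import cProfile

# Profile just the DataFrame load and sort by cumulative time; this is where
# read_frame (and the per-element fetching underneath it) shows up on top.
cProfile.run("psql.read_frame(sql, cnxn)", sort="cumtime")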

EDIT:

The bottleneck seems to be inside cursor.execute() (in the pymysql library) or cursor.fetchall() (in the pyodbc library). The slowest part is reading the returned MySQL data element by element (row by row, column by column) and converting it to the data type that the same library inferred earlier.

So far I have managed to speed this up to close to MATLAB with this really dirty workaround:

import pymysql
import numpy

conn = pymysql.connect(host='', port=, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rez = cursor.fetchall()
# `rez` is a flat list of strings here (see the modified packet reader below),
# so convert everything to float and restore the two-column (x, y) shape.
resarray = numpy.array(list(map(float, rez)))
finalres = resarray.reshape((resarray.size // 2, 2))

The cursor.execute above IS NOT the stock pymysql execute! I have modified it inside the file "connections.py". First, in the function _read_rowdata_packet, the line:

rows.append(self._read_row_from_packet(packet))

is substituted with

self._read_string_from_packet(rows,packet)

Here _read_string_from_packet is a simplified version of _read_row_from_packet with the code:

def _read_string_from_packet(self, rows, packet):
    # Simplified row reader: append each field's raw string to `rows`
    # without doing any per-element type conversion.
    for field in self.fields:
        data = packet.read_length_coded_string()
        rows.append(data)

This is an uber-dirty solution which brings the time down from 6 seconds to 2.5 seconds. I was wondering whether all of this could somehow be avoided by using a different library or passing some parameters?

Hence the solution would be to bulk-read the entire MySQL reply into a list of strings and then bulk-convert it to numeric data types, instead of doing that element by element. Does something like that already exist in Python?

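For illustration, a minimal sketch of that bulk approach, assuming rez is the flat list of numeric strings produced by the modified reader above (NumPy itself then does the string-to-float conversion in one vectorized call):

import numpy

# Bulk conversion: parse the whole list of strings at once instead of
# converting element by element in Python.
flat = numpy.array(rez, dtype=float)
# The two columns (x, y) were flattened row by row, so restore the shape.
finalres = flat.reshape(-1, 2)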

Accepted answer by Fermion Portal

The "problem" seems to have been the type conversion which occurs from MySQL's decimal type to python's decimal.Decimal that MySQLdb, pymysql and pyodbc does on the data. By changing the converters.py file (at the very last lines) in MySQLdb to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

instead of decimal.Decimal, the problem seems to be completely solved, and now the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(),dtype=float)
print(time.time()-t)

Runs in less than a second! Funnily enough, decimal.Decimal never showed up as the problem in the profiler.

A similar solution should work for the pymysql package; a rough sketch follows below. pyodbc is trickier: it is written entirely in C++, so you would have to recompile the entire package.

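A hedged sketch of what the pymysql variant might look like, using its conv argument instead of editing library files (untested here; connection parameters are placeholders):

import pymysql
from pymysql import converters
from pymysql.constants import FIELD_TYPE

# Copy pymysql's default conversion map and have DECIMAL columns come back
# as float instead of decimal.Decimal.
conv = converters.conversions.copy()
conv[FIELD_TYPE.DECIMAL] = float
conv[FIELD_TYPE.NEWDECIMAL] = float

conn = pymysql.connect(host='', user='', passwd='', db='', conv=conv)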

UPDATE

Here is a solution that does not require modifying the MySQLdb source code: Python MySQLdb returns datetime.date and decimal. The solution for loading numeric data into pandas is then:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='',user='',passwd='',db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)

Beats MATLAB by a factor of ~4 when loading a 200k x 9 table!

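As a side note for newer library versions (a hedged sketch, with placeholder connection parameters): psql.read_frame was later removed from pandas, and MySQLdb.connect() accepts a conv dictionary, so roughly the same effect can be had without mutating the module-level conversions:

import MySQLdb
import pandas as pd
from MySQLdb.constants import FIELD_TYPE
from MySQLdb.converters import conversions

# Per-connection converter map: DECIMAL columns come back as float.
conv = conversions.copy()
conv[FIELD_TYPE.DECIMAL] = float
conv[FIELD_TYPE.NEWDECIMAL] = float

conn = MySQLdb.connect(host='', user='', passwd='', db='', conv=conv)
# Newer pandas warns that it prefers an SQLAlchemy connectable, but the
# query still runs with a plain DBAPI connection.
df = pd.read_sql("select * from NUMERICTABLE", conn)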

Answered by Michael Koenig

Also check out this way of doing things using the turbodbc package. To transform your result set into an OrderedDict of NumPy arrays, just do this:

import turbodbc
connection = turbodbc.connect(dsn="My data source name")
cursor = connection.cursor()
cursor.execute("SELECT 42")
results = cursor.fetchallnumpy()

Transforming these results into a dataset should require a few additional milliseconds. I don't know the speedup for MySQL, but I have seen a factor of 10 for other databases.

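For example, one cheap way to do that transformation (a sketch assuming the results OrderedDict from the snippet above):

import pandas as pd

# fetchallnumpy() returns column-name -> NumPy array; wrapping that dict
# in a DataFrame only adds a few milliseconds on top of the fetch.
df = pd.DataFrame(results)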

The speedup is mainly achieved by using bulk operations instead of row-wise operations.
