Python 为什么通过 ORM 加载 SQLAlchemy 对象比通过原始 MySQLdb 游标加载行慢 5-8 倍？

Question

提问by Herbert

I noticed that SQLAlchemy was slow fetching (and ORMing) some data, which was rather fast to fetch using bare bone SQL. First off, I created a database with a million records:

我注意到 SQLAlchemy 获取（和 ORM）一些数据的速度很慢，而使用裸骨 SQL 获取数据相当快。首先，我创建了一个包含一百万条记录的数据库：

mysql> use foo
mysql> describe Foo;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | NO   | PRI | NULL    |       |
| A     | int(11) | NO   |     | NULL    |       |
| B     | int(11) | NO   |     | NULL    |       |
| C     | int(11) | NO   |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
mysql> SELECT COUNT(*) FROM Foo;
+----------+
| COUNT(*) |
+----------+
|  1000000 |
+----------+
mysql>

As a crude test, querying all Foo's takes approximately 2 seconds:

作为粗略的测试，查询所有 Foo 大约需要 2 秒：

herbert@dev0 ~ $ date; echo 'use foo; select * from Foo;' | mysql -uroot -pxxx > /dev/null; date
zo apr 20 18:48:49 CEST 2014
zo apr 20 18:48:51 CEST 2014

If I do this in python using MySQLdb this takes a approximately 3 seconds, including the construction of Foo objects:

如果我在 python 中使用 MySQLdb 执行此操作，这大约需要 3 秒，包括 Foo 对象的构造：

herbert@dev0 ~ $ python BareORM.py 
query execution time:  0:00:02.198986
total time:  0:00:03.403084

Which is the output of:

这是输出：

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
import sys
import time
import datetime

class Foo:
    def __init__(self, a, b, c):
        self.a=a; self.b=b; self.c=c;

try:
    start = datetime.datetime.now()
    con = MySQLdb.connect('localhost', 'root', 'xxx', 'foo')
    cur = con.cursor();

    cur.execute("""SELECT * FROM Foo LIMIT 1000000""")
    print "query execution time: ", datetime.datetime.now()-start
    foos = [];
    for elem in cur:
        foos.append(Foo(elem[1], elem[2], elem[3]))
    con.commit()

except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

finally:
    if con: con.close()
    print "total time: ",  datetime.datetime.now()-start

However, using SQLAlchemy to reduce boilerplate code, it needed approximately 25 seconds to do the same job:

然而，使用 SQLAlchemy 来减少样板代码，它需要大约 25 秒才能完成相同的工作：

herbert@dev0 ~ $ python AlchemyORM.py 
total time:  0:00:24.649279

Using this code:

使用此代码：

import sqlalchemy
import datetime
import MySQLdb

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'Foo'
    id = Column(Integer, primary_key=True)
    A  = Column(Integer(unsigned=False), nullable=False)
    B  = Column(Integer(unsigned=False), nullable=False)
    C  = Column(Integer(unsigned=False), nullable=False)

engine  = create_engine('mysql+mysqldb://root:xxx@localhost/foo')
Session = sessionmaker(bind=engine)
session = Session()
start = datetime.datetime.now()
foos  = session.query(Foo).limit(1000000).all()
print "total time: ", datetime.datetime.now()-start

Why does SQLAlchemy operate ~10x slower than the bare SQL solution, assuming that SQLAlchemy should do approximately the same thing? Can I speed things up somehow?

为什么 SQLAlchemy 的运行速度比裸 SQL 解决方案慢 10 倍，假设 SQLAlchemy 应该做大致相同的事情？我可以以某种方式加快速度吗？

This is a minimal working example of a more complicated query, which joins several tables using eager loading. I was considering just doing simple queries on a single table, and then using dictionaries to create id->object maps and collate one-to-N relations. But before doing so, I want to be sure that SQLAlchemy is unable to perform better, because writing your own ORM is a bad idea from a software design point of view. Imho, a 2x slowdown would be acceptable (maybe).

这是一个更复杂的查询的最小工作示例，它使用预先加载连接多个表。我正在考虑只对单个表进行简单的查询，然后使用字典来创建 id->object 映射并整理一对 N 关系。但在这样做之前，我想确定 SQLAlchemy 无法表现得更好，因为从软件设计的角度来看，编写自己的 ORM 是一个坏主意。恕我直言，2 倍的减速是可以接受的（也许）。

If you know about other (faster) python-SQL ORM's, or maybe BigTable-alike solutions (that already are the ORM), feel free to mention them as a comment.

如果您了解其他（更快的）python-SQL ORM，或者类似 BigTable 的解决方案（已经是 ORM），请随时将它们作为评论提及。

EDIT:Also tried this with Peewee, which resulted in ~15 s.

编辑：也用 Peewee 试过这个，结果是大约 15 秒。

from peewee import *
import datetime;
database = MySQLDatabase("foo", host="localhost", port=3306, user="root", passwd="xxx")

class Foo(Model):
        id = IntegerField()
        A  = IntegerField()
        B  = IntegerField()
        C  = IntegerField()

        class Meta:
                db_table = 'Foo'
                database = database

start = datetime.datetime.now()
foos = Foo.select()
cnt=0;
for i in foos: cnt=cnt+1
print "total time: ", datetime.datetime.now() - start

EDIT:As a response to Matthias I tried to do the same thing in Java with Hibernate, the result is approximately 8 to 10 seconds, not exactly fast, but a lot faster than 25 seconds. The code, starting with some classes and ending with some configuration:

编辑：作为对 Matthias 的回应，我尝试使用 Hibernate 在 Java 中做同样的事情，结果大约是 8 到 10 秒，不是很快，但比 25 秒快很多。代码，从一些类开始，以一些配置结束：

package herbert.hibernateorm;

import java.util.List;

import org.hibernate.Session; 
import org.hibernate.Transaction;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class App {
   public static void main(String[] args) throws Exception {
      SessionFactory factory = new Configuration().configure().buildSessionFactory();
      Session session = factory.openSession();
      Transaction tx = session.beginTransaction();
      long start = System.currentTimeMillis();
      List foos = session.createQuery("FROM Foo").list(); 
      System.out.println(foos.size());
      System.out.printf("total time: %d\n", System.currentTimeMillis() - start);
      session.close();
   }
}

package herbert.hibernateorm;

public class Foo {
    private int id, a, b, c;
    public Foo() {}
    public Foo(int A, int B, int C) { this.a=A; this.b=B; this.c=C; }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public int getB() { return b; }
    public void setB(int b) { this.b = b; }
    public int getC() { return c; }
    public void setC(int c) { this.c = c; }
}

The configuration (hibernate.cfg.xml and hibernate.hbm.xml respectively)

配置（分别为 hibernate.cfg.xml 和 hibernate.hbm.xml）

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-configuration PUBLIC "-//Hibernate/Hibernate Configuration DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
  <session-factory>
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
    <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
    <property name="hibernate.connection.url">jdbc:mysql://localhost:3306/foo?zeroDateTimeBehavior=convertToNull</property>
    <property name="hibernate.connection.username">root</property>
    <property name="hibernate.connection.password">xxx</property>
    <mapping resource="hibernate.hbm.xml"/>
  </session-factory>
</hibernate-configuration>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="herbert.hibernateorm.Foo" table="Foo" catalog="foo">
        <id name="id" type="int">
            <column name="id" />
            <generator class="assigned" />
        </id>
        <property name="a" type="int">
            <column name="A" not-null="true" />
        </property>
        <property name="b" type="int">
            <column name="B" not-null="true" />
        </property>
        <property name="c" type="int">
            <column name="C" not-null="true" />
        </property>
    </class>
</hibernate-mapping>

And finally the pom file to run it all in maven:

最后是在 maven 中运行它的 pom 文件：

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>herbert</groupId>
    <artifactId>hibernateORM</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>hibernateORM</name>
    <url>http://maven.apache.org</url>
    <repositories>
        <repository>
            <id>unknown-jars-temp-repo</id>
            <name>A temporary repository created by NetBeans for libraries and jars it could not identify. Please replace the dependencies in this repository with correct ones and delete this repository.</name>
            <url>file:${project.basedir}/lib</url>
        </repository>
    </repositories>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.21</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-core</artifactId>
            <version>4.0.1.Final</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-entitymanager</artifactId>
            <version>4.0.1.Final</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate.common</groupId>
            <artifactId>hibernate-commons-annotations</artifactId>
            <version>4.0.1.Final</version>
        </dependency>   
        <dependency>
            <groupId>nz.ac.waikato.cms.weka</groupId>
            <artifactId>weka-dev</artifactId>
            <version>3.7.10</version>
        </dependency>
        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>commons-net</groupId>
            <artifactId>commons-net</artifactId>
            <version>3.1</version>
            <classifier>examples</classifier>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>maven</groupId>
            <artifactId>maven-jetty-plugin</artifactId>
            <version>1.1</version>
            <type>plugin</type>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
                <groupId>com.kenai.nbpwr</groupId>
                <artifactId>org-slf4j-jdk14</artifactId>
                <version>1.6.1-201106101300</version>
                <type>nbm</type>
        </dependency>

    </dependencies>
</project>

Answer 1

回答by Matthias Urlichs

SQLAlchemy is complicated. It has to deal with converting types to Python which the underlying database does not support natively, tables with inheritance, JOINs, caching the objects, maintaining consistency, translated rows, partial results, and whatnot. Check out sqlalchemy/orm/loading.py:instance_processor-- it's insane.

SQLAlchemy 很复杂。它必须处理将底层数据库本身不支持的类型转换为 Python、具有继承性的表、JOIN、缓存对象、保持一致性、转换行、部分结果等等。看看sqlalchemy/orm/loading.py:instance_processor——这太疯狂了。

The solution would be to piece together and compile Python code to process the results of a specific query, like Jinja2 does for templates. So far, nobody has done this work, possibly because the common case is a couple of rows (where this kind of optimization would be pessimal) and people who need to process bulk data do that by hand, like you did.

解决方案是拼凑并编译 Python 代码来处理特定查询的结果，就像 Jinja2 对模板所做的那样。到目前为止，没有人做过这项工作，可能是因为常见的情况是几行（这种优化会很糟糕），而需要处理大量数据的人会像您一样手动完成这项工作。

Answer 2

回答by Herbert

This is not an answer to my question, but may help the general public with speed issues on large data sets. I found that selecting a million records can typically be done in about 3 seconds, however JOINS may slow down the process. In this case that one has approximately 150k Foo's which has a 1-many relation to 1M Bars, then selecting those using a JOIN may be slow as each Foo is returned approximately 6.5 times. I found that selecting both tables seperately and joining them using dicts in python is approximately 3 times faster than SQLAlchemy (approx 25 sec) and 2 times faster than 'bare' python code using joins(approx 17 sec). The code took 8 sec in my use case. Selecting 1M records without relations, like the Bar-example above, took 3 seconds. I used this code:

这不是我的问题的答案，但可以帮助公众解决大型数据集的速度问题。我发现选择一百万条记录通常可以在大约 3 秒内完成，但是 JOINS 可能会减慢该过程。在这种情况下，一个大约有 150k Foo's 与 1M Bars 有 1-many 关系，然后使用 JOIN 选择那些可能会很慢，因为每个 Foo 返回大约 6.5 次。我发现分别选择两个表并在 python 中使用字典连接它们比 SQLAlchemy 快大约 3 倍（大约 25 秒），比使用连接的“裸”python 代码快 2 倍（大约 17 秒）。在我的用例中，代码用了 8 秒。选择 1M 没有关系的记录，就像上面的 Bar-example 一样，需要 3 秒。我使用了这个代码：

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
import sys
import time
import datetime
import inspect
from operator import itemgetter, attrgetter

# fetch all objects of class Class, where the fields are determined as the
# arguments of the __init__ constructor (not flexible, but fairly simple ;))
def fetch(Class, cursor, tablename, ids=["id"], where=None):
    arguments = inspect.getargspec(Class.__init__).args; del arguments[0];
    fields = ", ".join(["`" + tablename + "`.`" + column + "`" for column in arguments])
    sql = "SELECT " + fields + " FROM `" + tablename + "`"
    if where != None: sql = sql + " WHERE " + where
    sql=sql+";"
    getId = itemgetter(*[arguments.index(x) for x in ids])
    elements = dict()

    cursor.execute(sql)
    for record in cursor:
        elements[getId(record)] = Class(*record)
    return elements

# attach the objects in dict2 to dict1, given a 1-many relation between both
def merge(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1: setattr(dict1[d], fieldname, list())
    for d in dict2:
        dd = dict2[d]
        getattr(dict1[idExtractor(dd)], fieldname).append(dd)

# attach dict2 objects to dict1 objects, given a 1-1 relation
def attach(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1: dd=dict1[d]; setattr(dd, fieldname, dict2[idExtractor(dd)])

It helped me speed up my querying, however I am more than happy to hear from the experts about possible improvements to this approach.

它帮助我加快了查询速度，但是我很高兴听到专家关于这种方法可能的改进。

Answer 3

回答by zzzeek

Here is the SQLAlchemy version of your MySQL script that performs in four seconds, compared to three for MySQLdb:

这是 MySQL 脚本的 SQLAlchemy 版本，它在四秒内执行，而 MySQLdb 为三秒：

from sqlalchemy import Integer, Column, create_engine, MetaData, Table
import datetime

metadata = MetaData()

foo = Table(
    'foo', metadata,
    Column('id', Integer, primary_key=True),
    Column('a', Integer(), nullable=False),
    Column('b', Integer(), nullable=False),
    Column('c', Integer(), nullable=False),
)


class Foo(object):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)
start = datetime.datetime.now()

with engine.connect() as conn:
    foos = [
        Foo(row['a'], row['b'], row['c'])
        for row in
        conn.execute(foo.select().limit(1000000)).fetchall()
    ]


print "total time: ", datetime.datetime.now() - start

runtime:

运行：

total time:  0:00:04.706010

Here is a script that uses the ORM to load object rows fully; by avoiding the creation of a fixed list with all 1M objects at once using yield per, this runs in 13 secondswith SQLAlchemy master (18 seconds with rel 0.9):

这是一个使用 ORM 完全加载对象行的脚本；通过避免使用 yield per 一次创建包含所有 1M 个对象的固定列表，SQLAlchemy master运行时间为13 秒（rel 0.9 为 18 秒）：

import time
from sqlalchemy import Integer, Column, create_engine, Table
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Foo(Base):
    __table__ = Table(
        'foo', Base.metadata,
        Column('id', Integer, primary_key=True),
        Column('a', Integer(), nullable=False),
        Column('b', Integer(), nullable=False),
        Column('c', Integer(), nullable=False),
    )


engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)

sess = Session(engine)

now = time.time()

# avoid using all() so that we don't have the overhead of building
# a large list of full objects in memory
for obj in sess.query(Foo).yield_per(100).limit(1000000):
    pass

print("Total time: %d" % (time.time() - now))

We can then split the difference between these two approaches, and load just individual columns with the ORM:

然后我们可以拆分这两种方法之间的差异，并使用 ORM 仅加载单个列：

for obj in sess.query(Foo.id, Foo.a, Foo.b, Foo.c).yield_per(100).limit(1000000):
    pass

The above again runs in 4 seconds.

上面再次运行4 秒。

The comparison of SQLAlchemy Core is the more apt comparison to a raw MySQLdb cursor. If you use the ORM but query for individual columns, it's about four seconds in most recent versions.

SQLAlchemy Core 的比较是与原始 MySQLdb 游标的更贴切的比较。如果您使用 ORM 但查询单个列，则在最新版本中大约需要 4 秒。

At the ORM level, the speed issues are because creating objects in Python is slow, and the SQLAlchemy ORM applies a large amount of bookkeeping to these objects as it fetches them, which is necessary in order for it to fulfill its usage contract, including unit of work, identity map, eager loading, collections, etc.

在 ORM 级别，速度问题是因为在 Python 中创建对象很慢，并且 SQLAlchemy ORM 在获取这些对象时对这些对象应用了大量簿记，这是它履行其使用合同所必需的，包括单元工作、身份地图、急切加载、收藏等。

To speed up the query dramatically, fetch individual columns instead of full objects. See the techniques at http://docs.sqlalchemy.org/en/latest/faq/performance.html#result-fetching-slowness-ormwhich describe this.

要显着加快查询速度，请获取单个列而不是完整对象。请参阅http://docs.sqlalchemy.org/en/latest/faq/performance.html#result-fetching-slowness-orm中描述这一点的技术。

For your comparison to PeeWee, PW is a much simpler system with a lot less features, including that it doesn't do anything with identity maps. Even with PeeWee, about as simple of an ORM as is feasible, it still takes 15 seconds, which is evidence that cPython is really really slowcompared to the raw MySQLdb fetch which is in straight C.

为了与 PeeWee 进行比较，PW 是一个更简单的系统，功能少得多，包括它对身份映射没有任何作用。即使使用 PeeWee，尽可能简单的 ORM，它仍然需要15 秒，这证明与直接使用 C 语言的原始 MySQLdb 获取相比，cPython 确实非常慢。

For comparison to Java, the Java VM is way way way faster than cPython. Hibernate is ridiculouslycomplicated, yet the Java VM is extremely fast due to the JIT and even all that complexity ends up running faster. If you want to compare Python to Java, use Pypy.

与 Java 相比，Java VM比 cPython 快得多。Hibernate复杂得离谱，但由于 JIT，Java VM 非常快，甚至所有这些复杂性最终都会运行得更快。如果要将 Python 与 Java 进行比较，请使用 Pypy。

Python 为什么通过 ORM 加载 SQLAlchemy 对象比通过原始 MySQLdb 游标加载行慢 5-8 倍？

提问by Herbert

回答by Matthias Urlichs

回答by Herbert

回答by zzzeek

相关推荐

最近更新

标签

Python 为什么通过 ORM 加载 SQLAlchemy 对象比通过原始 MySQLdb 游标加载行慢 5-8 倍？

提问by Herbert

回答by Matthias Urlichs

回答by Herbert

回答by zzzeek

相关推荐

Python 熊猫打印所有 dtypes

Python SQLAlchemy 十进制精度

小于 400 万的偶数斐波那契数的总和 - Python

Python pip install: 请检查该目录的权限和所有者

相关推荐

最近更新

标签