postgresql SQLAlchemy“默认”与“server_default”性能

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/52431208/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-21 02:43:56  来源:igfitidea点击:

SQLAlchemy "default" vs "server_default" performance

pythonsqlpostgresqlperformancesqlalchemy

提问by benvc

Is there a performance advantage (or disadvantage) when using defaultinstead of server_defaultfor mapping table column default values when using SQLAlchemy with PostgreSQL?

在将 SQLAlchemy 与 PostgreSQL 一起使用时,使用default而不是server_default映射表列默认值是否有性能优势(或劣势)?

My understanding is that defaultrenders the expression in the INSERT(usually) and that server_defaultplaces the expression in the CREATE TABLEstatement. Seems like server_defaultis analogous to typical handling of defaults directly in the db such as:

我的理解是defaultINSERT(通常)中呈现表达式server_default并将表达式置于CREATE TABLE语句中。似乎server_default类似于直接在 db 中对默认值的典型处理,例如:

CREATE TABLE example (
    id serial PRIMARY KEY,
    updated timestamptz DEFAULT now()
);

...but it is not clear to me if it is more efficient to handle defaults on INSERTor via table creation.

...但我不清楚在INSERT表创建时或通过表创建处理默认值是否更有效。

Would there be any performance improvement or degradation for row inserts if each of the defaultparameters in the example below were changed to server_default?

如果default将下面示例中的每个参数更改为,行插入的性能是否会有所提高或降低server_default

from uuid import uuid4
from sqlalchemy import Column, Boolean, DateTime, Integer
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class Item(Base):
    __tablename__ = 'item'

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    count = Column(Integer, nullable=False, default=0)
    flag = Column(Boolean, nullable=False, default=False)
    updated = Column(DateTime(timezone=True), nullable=False, default=func.now())

NOTE: The best explanation I found so far for when to use defaultinstead of server_defaultdoes not address performance (see Mike Bayer's SO answer on the subject). My oversimplified summary of that explanation is that defaultis preferred over server_defaultwhen...

注意:到目前为止,我找到的关于何时使用default而不是使用的最佳解释server_default并没有解决性能问题(请参阅 Mike Bayer 关于该主题的 SO 回答)。我对该解释的过于简单化的总结是,default它比server_default什么时候更受欢迎。

  • The db can't handle the expression you need or want to use for the default value.
  • You can't or don't want to modify the schema directly.
  • db 无法处理您需要或想要用于默认值的表达式。
  • 您不能或不想直接修改架构。

...so the question remains as to whether performance should be considered when choosing between defaultand server_default?

...所以问题仍然是在default和之间进行选择时是否应该考虑性能server_default

回答by Martijn Pieters

It is impossible to give you a 'this is faster' answer, because performance per default value expressioncan vary widely, both on the server and in Python. A function to retrieve the current time behaves differently from a scalar default value.

不可能给你一个“这更快”的答案,因为每个默认值表达式的性能可能会有很大差异,无论是在服务器上还是在 Python 中。检索当前时间的函数的行为与标量默认值不同。

Next, you must realise that defaults can be provided in fivedifferent ways:

接下来,您必须意识到可以通过五种不同的方式提供默认值:

  • Client-side scalar defaults. A fixed value, such a 0or True. The value is used in an INSERTstatement.
  • Client-side Python function. Called each time a default is needed, produces the value to insert, used the same way as a scalar default from there on out. These can be context sensitive(have access to the current execution context with values to be inserted).
  • Client-side SQL expression; this generates an extra piece of SQL expressionthat is then used in the query and executed on the server to produce a value.
  • Server-side DLL expressionare SQL expressions that are then stored in the table definition, so are part of the schema. The server uses these to fill a value for any columns omitted from INSERTstatements, or when a column value is set to DEFAULTin an INSERTor UPDATEstatement.
  • Server-side implicit defaults or triggers, where other DLL such as triggers or specific database features provide a default value for columns.
  • 客户端标量默认值。一个固定值,例如 a0True。该值在INSERT语句中使用。
  • 客户端 Python 函数。每次需要默认值时调用,生成要插入的值,从那时起使用与标量默认值相同的方式。这些可以是上下文敏感的(可以使用要插入的值访问当前执行上下文)。
  • 客户端SQL表达式;这会生成一个额外的SQL 表达式,然后在查询中使用它并在服务器上执行以生成一个值。
  • 服务器端 DLL 表达式是 SQL 表达式,然后存储在表定义中,因此是架构的一部分。服务器使用这些来填充INSERT语句中省略的任何列的值,或者当DEFAULTINSERTorUPDATE语句中将列值设置为时。
  • 服务器端隐式默认值或触发器,其中其他 DLL(例如触发器或特定数据库功能)为列提供默认值。

Note that when it comes to a SQL expression determining the default value, be that a client-side SQL expression, a server-side DLL expression, or a trigger, it makes very little difference to a database where the default value expression is coming from. The query executor will need to know how to produce values for a given column, once that's parsed out of the DML statement or the schema definition, the server still has to execute the expression for each row.

请注意,当涉及确定默认值的 SQL 表达式时,无论是客户端 SQL 表达式、服务器端 DLL 表达式还是触发器,它与默认值表达式来自的数据库几乎没有区别. 查询执行器需要知道如何为给定的列生成值,一旦从 DML 语句或模式定义中解析出来,服务器仍然必须为每一行执行表达式。

Choosing between these options is rarely going to be based on performance alone, performance should at mostbe but one of multiple aspects you consider. There are many factors involved here:

在这些选项之间进行选择很少会仅基于性能,性能最多应该只是您考虑的多个方面之一。这里涉及到很多因素:

  • defaultwith a scalar or Python function directly produces a Python default value, then sends the new value to the server when inserting. Python code can access the default value before the data is inserted into the database.
  • A client-side SQL expression, a server_defaultvalue, and server-side implicit defaults and triggers all have the server generate the default, which then must be fetched by the client if you want to be able to access it in the same SQLAlchemy session. You can't access the value until the object has been inserted into the database.

    Depending on the exact query and database support, SQLAlchemy may have to make extraSQL queries to either generate a default before the INSERTstatement or run a separate SELECTafterwards to fetch the defaults that have been inserted. You can control when this happens (directly when inserting or on first access after flushing, with the eager_defaultsmapper configuration).

  • If you have multiple clients on different platforms accessing the same database, a server_defaultor other default attached to the schema (such as a trigger) ensures that all clients will use the same defaults, regardless, while defaults implemented in Python can't be accessed by other platforms.
  • default使用标量或 Python 函数直接生成 Python 默认值,然后在插入时将新值发送到服务器。在将数据插入数据库之前,Python 代码可以访问默认值。
  • 客户端 SQL 表达式、server_default值和服务器端隐式默认值和触发器都让服务器生成默认值,如果您希望能够在同一个 SQLAlchemy 会话中访问它,那么必须由客户端获取。在将对象插入数据库之前,您无法访问该值。

    根据确切的查询和数据库支持,SQLAlchemy 可能需要进行额外的SQL 查询以在INSERT语句之前生成默认值或在语句SELECT之后运行单独的以获取已插入的默认值。您可以控制何时发生这种情况(直接在插入时或在刷新后首次访问时,使用eager_defaults映射器配置)。

  • 如果您在不同平台上有多个客户端访问同一个数据库,则server_default附加到模式的一个或其他默认值(例如触发器)可确保所有客户端将使用相同的默认值,无论如何,而在 Python 中实现的默认值无法通过其他平台。

When using PostgreSQL, SQLAlchemy can make use of the RETURNINGclause for DML statements, which gives a client access to server-side generated defaults in a single step.

使用 PostgreSQL 时,SQLAlchemy 可以使用DML 语句RETURNING子句,这让客户端只需一步即可访问服务器端生成的默认值。

So when using a server_defaultcolumn default that calculates a new value for each row (not a scalar value), you save a small amount of Python-side time, and save a small amount of network bandwidth as you are not sending data for that column over to the database. The database couldbe faster creating that same value, or it could be slower; it largely depends on the type of operation. If you need to have access to the generated default value from Python, in the same transaction, you do then have to wait for a return stream of data, parsed out by SQLAlchemy. All these details canbecome insignificant compared to everything else that happens around inserting or updating rows, however.

因此,当使用server_default为每行计算一个新值(不是标量值)的列默认值时,您可以节省少量 Python 端时间,并节省少量网络带宽,因为您不会为该列发送数据到数据库。数据库可以更快地创建相同的值,也可以更慢;这在很大程度上取决于操作的类型。如果您需要从 Python 访问生成的默认值,则在同一事务中,您必须等待由 SQLAlchemy 解析出的返回数据流。然而,与围绕插入或更新行发生的所有其他事情相比,所有这些细节都可能变得微不足道。

Do understand that a ORM is not suitable to be used for high-performance bulk row inserts or updates; quoting from the SQAlchemy PerformanceFAQ entry:

请了解 ORM不适合用于高性能批量行插入或更新;引用自SQAlchemy PerformanceFAQ 条目

The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple “inserts” of data. It includes that attributes which are assigned on objects are received using an attribute instrumentation system which tracks changes on objects as they are made, includes that all rows inserted are tracked in an identity map which has the effect that for each row SQLAlchemy must retrieve its “last inserted id” if not already given, and also involves that rows to be inserted are scanned and sorted for dependencies as needed. Objects are also subject to a fair degree of bookkeeping in order to keep all of this running, which for a very large number of rows at once can create an inordinate amount of time spent with large data structures, hence it's best to chunk these.

Basically, unit of work is a large degree of automation in order to automate the task of persisting a complex object graph into a relational database with no explicit persistence code, and this automation has a price.

ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.

SQLAlchemy ORM 在将更改同步到数据库时使用工作单元模式。这种模式远远超出了简单的数据“插入”。它包括使用属性检测系统接收分配给对象的属性,该系统在对象创建时跟踪对象的更改,包括在身份映射中跟踪插入的所有行,该映射具有对于每一行 SQLAlchemy 必须检索其“最后插入的 id”(如果尚未给出),并且还涉及根据需要扫描和排序要插入的行以获得依赖关系。对象也受到一定程度的簿记以保持所有这些运行,这对于大量的行来说可能会导致花费在大型数据结构上的时间过长,因此最好将这些分块。

基本上,工作单元是很大程度的自动化,以便在没有显式持久性代码的情况下将复杂对象图持久化到关系数据库中的任务自动化,并且这种自动化是有代价的。

ORM 基本上不用于高性能批量插入——这就是 SQLAlchemy 提供 Core 以及 ORM 作为一流组件的全部原因。

Because an ORM like SQLAlchemy comes with a hefty overhead price, any performance differences between a server-side or Python-side default quickly disappears in the noise of ORM operations.

因为像 SQLAlchemy 这样的 ORM 带来了高昂的开销,服务器端或 Python 端默认值之间的任何性能差异都会在 ORM 操作的噪音中迅速消失。

So if you are concerned about performance for large-quantity insert or update operations, you would want to use bulk operationsfor those, and enable the psycopg2batch execution helpersto really get a speed boost. When using these bulk operations, I'd expect server-side defaults to improve performance just by saving bandwidth moving row data from Python to the server, but how much depends on the exact nature of the default values.

因此,如果您担心大量插入或更新操作的性能,您可能希望对这些操作使用批量操作,并启用psycopg2批处理执行助手以真正提高速度。使用这些批量操作时,我希望服务器端默认值仅通过节省将行数据从 Python 移动到服务器的带宽来提高性能,但多少取决于默认值的确切性质。

If ORM insert and update performance outside of bulk operations is a big issue for you, you need to test your specific options. I'd start with the SQLAlchemy examples.performancepackageand add your own test suiteusing two models that differ only in a single server_defaultand defaultconfiguration.

如果批量操作之外的 ORM 插入和更新性能对您来说是一个大问题,您需要测试您的特定选项。我将从SQLAlchemyexamples.performance开始,并使用两个模型添加您自己的测试套件,这些模型仅在单个server_defaultdefault配置上有所不同。

回答by u7768490

There's something else important rather than just comparing the performance of the two

除了比较两者的性能之外,还有其他重要的事情

If you needed to add a new Column create_at (Not Null)to an existing Table Userwith some data in it, defaultwill not work.

如果您需要将新列添加create_at (Not Null)User包含一些数据的现有表中,default将不起作用。

If used default, during upgrading the database, the error will occur saying cannot insert Null value to existing data in the table. And this will cause significant troubles if you want to maintain your data, even just for testing.

如果使用default,在升级数据库时,会出现无法向表中现有数据插入Null值的错误。如果你想维护你的数据,即使只是为了测试,这也会带来很大的麻烦。

And when used server_default, during upgrading the DB, database will insert the current DateTime value to all previous existing testing data.

并且在使用时server_default,在升级数据库时,数据库会将当前的 DateTime 值插入到所有先前存在的测试数据中。

So in this case, only server_defaultwill work.

所以在这种情况下,只会server_default起作用。