
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute the original StackOverflow authors (not this site). Original question: http://stackoverflow.com/questions/7019831/


Bulk/batch update/upsert in PostgreSQL

Tags: sql, database, postgresql, insert, sql-update

Asked by julx

I'm writing a Django-ORM enhancement that attempts to cache models and postpone saving them until the end of the transaction. It's almost all done; however, I came across an unexpected difficulty in SQL syntax.


I'm not much of a DBA, but from what I understand, databases don't really work efficiently with many small queries; a few bigger queries are much better. For example, it's better to use large batch inserts (say, 100 rows at once) instead of 100 one-liners.


Now, from what I can see, SQL doesn't really supply any statement to perform a batch update on a table. The term seems to be confusing, so I'll explain what I mean by that: I have an array of arbitrary data, each entry describing a single row in a table. I'd like to update certain rows in the table, each using data from its corresponding entry in the array. The idea is very similar to a batch insert.


For example: my table could have two columns, "id" and "some_col". The array describing the data for a batch update consists of three entries: (1, 'first updated'), (2, 'second updated'), and (3, 'third updated'). Before the update the table contains the rows (1, 'first'), (2, 'second'), (3, 'third').


I came across this post:


Why are batch inserts/updates faster? How do batch updates work?


which seems to do what I want; however, I can't really figure out the syntax at the end.


I could also delete all the rows that require updating and reinsert them using a batch insert; however, I find it hard to believe that this would actually perform any better.

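For reference, a minimal sketch of that delete-and-reinsert approach, using the id/some_col example table from above (whether it actually beats an in-place update depends on indexes, triggers and foreign keys):

BEGIN;
DELETE FROM "table" WHERE id IN (1, 2, 3);
INSERT INTO "table" (id, some_col)
  VALUES (1, 'first updated'), (2, 'second updated'), (3, 'third updated');
COMMIT;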

I work with PostgreSQL 8.4, so some stored procedures are also possible here. However, as I plan to open-source the project eventually, more portable ideas or ways to do the same thing on a different RDBMS are most welcome.


Follow-up question: how to do a batch "insert-or-update"/"upsert" statement?


Test results


I've performed 100 passes of 10 insert operations each, spread over 4 different tables (so 1000 inserts in total). I tested on Django 1.3 with a PostgreSQL 8.4 backend.


These are the results:


  • All operations done through the Django ORM - each pass ~2.45 seconds.
  • The same operations, but done without the Django ORM - each pass ~1.48 seconds.
  • Only insert operations, without querying the database for sequence values - ~0.72 seconds.
  • Only insert operations, executed in blocks of 10 (100 blocks in total) - ~0.19 seconds.
  • Only insert operations, one big execution block - ~0.13 seconds.
  • Only insert operations, about 250 statements per block - ~0.12 seconds.

Conclusion: execute as many operations as possible in a single connection.execute(). Django itself introduces a substantial overhead.

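For illustration, a minimal sketch of the "one big execution block" variant, assuming Django's raw cursor and a made-up table myapp_item(id, some_col); a single multi-row INSERT is built with placeholders so the driver still handles quoting:

from django.db import connection

rows = [(1, 'first'), (2, 'second'), (3, 'third')]

# one INSERT statement with a VALUES tuple per row, sent in a single round trip
placeholders = ", ".join(["(%s, %s)"] * len(rows))
params = [value for row in rows for value in row]

cursor = connection.cursor()
cursor.execute(
    "INSERT INTO myapp_item (id, some_col) VALUES " + placeholders,
    params,
)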

Disclaimer: I didn't introduce any indices apart from default primary key indices, so insert operations could possibly run faster because of that.


Accepted answer by atrain

I've used 3 strategies for batch transactional work:


  1. Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts this way, and it was quite efficient (done against Postgres). A Python sketch of this follows at the end of this answer.
  2. JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
  3. Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, then when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.

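The first strategy can also be done from Python. A hedged sketch using psycopg2 (an assumption, since the question is Django-based; table and column names are invented for the example): mogrify() binds each statement's parameters safely, and the resulting strings are joined with semicolons and submitted in one execute() call:

import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

updates = [(1, 'first updated'), (2, 'second updated'), (3, 'third updated')]

# bind each statement's parameters first, then submit everything in one shot
statements = [
    cur.mogrify("UPDATE mytable SET some_col = %s WHERE id = %s", (val, pk)).decode()
    for pk, val in updates
]
cur.execute("; ".join(statements))
conn.commit()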

Answer by hagello

Bulk insert


You can modify the bulk insert of three columns by Ketema:


INSERT INTO "table" (col1, col2, col3)
  VALUES (11, 12, 13) , (21, 22, 23) , (31, 32, 33);

It becomes:


INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(array[11,21,31]), 
          unnest(array[12,22,32]), 
          unnest(array[13,23,33]))

Replacing the values with placeholders:


INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(?), unnest(?), unnest(?))

You have to pass arrays or lists as arguments to this query. This means you can do huge bulk inserts without doing string concatenation (and all its hassles and dangers: SQL injection and quoting hell).

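From Python this could look roughly as follows, assuming psycopg2, which adapts Python lists to PostgreSQL arrays. The rows are produced with SELECT unnest(...) rather than VALUES, because recent PostgreSQL versions reject set-returning functions inside a VALUES list:

import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

# each list is sent as one PostgreSQL array and becomes one column of the insert
cur.execute(
    'INSERT INTO "table" (col1, col2, col3) '
    'SELECT unnest(%s), unnest(%s), unnest(%s)',
    ([11, 21, 31], [12, 22, 32], [13, 23, 33]),
)
conn.commit()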

Bulk update


PostgreSQL has added the FROM extension to UPDATE. You can use it in this way:


update "table" 
  set value = data_table.new_value
  from 
    (select unnest(?) as key, unnest(?) as new_value) as data_table
  where "table".key = data_table.key;

The manual is missing a good explanation, but there is an example on the postgresql-admin mailing list. I tried to elaborate on it:


create table tmp
(
  id serial not null primary key,
  name text,
  age integer
);

insert into tmp (name,age) 
values ('keith', 43),('leslie', 40),('bexley', 19),('casey', 6);

update tmp set age = data_table.age
from
(select unnest(array['keith', 'leslie', 'bexley', 'casey']) as name, 
        unnest(array[44, 50, 10, 12]) as age) as data_table
where tmp.name = data_table.name;

There are also other posts on StackExchange explaining UPDATE ... FROM ... using a VALUES clause instead of a subquery. They might be easier to read, but are restricted to a fixed number of rows.

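For completeness, a sketch of that VALUES-based form against the same tmp table:

update tmp set age = data_table.age
from (values ('keith', 44),
             ('leslie', 50),
             ('bexley', 10),
             ('casey', 12)) as data_table(name, age)
where tmp.name = data_table.name;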

Answer by Ketema

Bulk inserts can be done as such:


INSERT INTO "table" ( col1, col2, col3)
  VALUES ( 1, 2, 3 ) , ( 3, 4, 5 ) , ( 6, 7, 8 );

This will insert 3 rows.


Multiple updating is defined by the SQL standard, but not implemented in PostgreSQL.


Quote:


"According to the standard, the column-list syntax should allow a list of columns to be assigned from a single row-valued expression, such as a sub-select:

UPDATE accounts SET (contact_last_name, contact_first_name) = (SELECT last_name, first_name FROM salesmen WHERE salesmen.id = accounts.sales_id);"


Reference: http://www.postgresql.org/docs/9.0/static/sql-update.html


Answer by nogus

It is pretty fast to populate a recordset from JSON (PostgreSQL 9.3+):


import json

# each tuple becomes one element of a JSON array: [id, value]
big_list_of_tuples = [
    (1, "123.45"),
    ...
    (100000, "678.90"),
]

# the whole list is sent as a single JSON parameter and unpacked server-side
connection.execute("""
    UPDATE mytable
    SET myvalue = Q.myvalue
    FROM (
        SELECT (value->>0)::integer AS id, (value->>1)::decimal AS myvalue
        FROM json_array_elements(%s)
    ) Q
    WHERE mytable.id = Q.id
    """,
    [json.dumps(big_list_of_tuples)]
)

Answer by aliasmrchips

Turn off autocommit and just do one commit at the end. In plain SQL, this means issuing BEGIN at the start and COMMIT at the end. You would need to create a function in order to do an actual upsert.

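A minimal sketch of such a function, modeled on the merge_db example in the PostgreSQL documentation and assuming the id/some_col table from the question (on PostgreSQL 9.5+ a single INSERT ... ON CONFLICT ... DO UPDATE does this without a function):

CREATE OR REPLACE FUNCTION upsert_row(k integer, v text) RETURNS void AS $$
BEGIN
    LOOP
        -- try to update an existing row first
        UPDATE "table" SET some_col = v WHERE id = k;
        IF FOUND THEN
            RETURN;
        END IF;
        -- no row was updated: try to insert; if a concurrent transaction
        -- inserted the same key first, retry the update
        BEGIN
            INSERT INTO "table" (id, some_col) VALUES (k, v);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- loop and try the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;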