Django bulk_create with ignore rows that cause IntegrityError?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12451053/

Tags: django, postgresql, bulk-load

Asked by Meitham

I am using bulk_create to load thousands of rows into a PostgreSQL DB. Unfortunately some of the rows are causing IntegrityError and stopping the bulk_create process. I was wondering if there was a way to tell Django to ignore such rows and save as much of the batch as possible?

Answered by Cesar Canassa

This is now possible as of Django 2.2.

Django 2.2 adds a new ignore_conflicts option to the bulk_create method. From the documentation:

On databases that support it (all except PostgreSQL < 9.5 and Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. Enabling this parameter disables setting the primary key on each model instance (if the database normally supports it).

Example:

Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)

Answered by Craig Ringer

(Note: I don't use Django, so there may be more suitable framework-specific answers)

It is not possible for Django to do this by simply ignoring INSERT failures, because PostgreSQL aborts the whole transaction on the first error.

Django would need one of these approaches:

  1. INSERT each row in a separate transaction and ignore errors (very slow);
  2. Create a SAVEPOINT before each insert (can have scaling problems; see the sketch after this list);
  3. Use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
  4. Bulk-insert or (better) COPY the data into a TEMPORARY table, then merge that into the main table server-side.
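
For approach (2), here is a minimal Django sketch; the helper name is made up. Nested transaction.atomic() blocks are implemented as savepoints, so an IntegrityError rolls back only the failed insert rather than the whole batch:

from django.db import IntegrityError, transaction

def insert_with_savepoints(objs):
    # Each nested atomic() is a SAVEPOINT: a failed INSERT rolls back
    # to the savepoint, leaving the outer transaction usable.
    inserted = []
    with transaction.atomic():
        for obj in objs:
            try:
                with transaction.atomic():
                    obj.save()
                inserted.append(obj)
            except IntegrityError:
                continue
    return inserted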

The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.

Personally, I'd take (4): I'd bulk-insert into a new separate table, probably UNLOGGED or TEMPORARY, then I'd run some manual SQL to:

LOCK TABLE realtable IN EXCLUSIVE MODE;

INSERT INTO realtable 
SELECT * FROM temptable WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);

The LOCK TABLE ... IN EXCLUSIVE MODE prevents a concurrent insert that creates a row from causing a conflict with an insert done by the above statement and failing. It does not prevent concurrent SELECTs, only SELECT ... FOR UPDATE, INSERT, UPDATE and DELETE, so reads from the table carry on as normal.

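Putting approach (4) together end-to-end, here is a hedged sketch using psycopg2 directly; the connection string, table and column names are illustrative assumptions:

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection details
with conn, conn.cursor() as cur:
    # Stage the data in a TEMPORARY table shaped like the target table.
    cur.execute("CREATE TEMPORARY TABLE temptable (LIKE realtable INCLUDING ALL)")
    buf = io.StringIO("1\tfirst\n2\tsecond\n")  # tab-separated sample rows
    cur.copy_expert("COPY temptable (id, name) FROM STDIN", buf)
    # Lock out concurrent writers, then merge, skipping duplicate ids.
    cur.execute("LOCK TABLE realtable IN EXCLUSIVE MODE")
    cur.execute("""
        INSERT INTO realtable
        SELECT * FROM temptable t
        WHERE NOT EXISTS (SELECT 1 FROM realtable r WHERE r.id = t.id)
    """)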

If you can't afford to block concurrent writes for too long, you could instead use a writable CTE to copy ranges of rows from temptable into realtable, retrying each block if it failed.

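A hedged sketch of that writable-CTE idea, draining up to 1000 staged rows per statement (same illustrative tables as above); run it repeatedly until temptable is empty, retrying a block if it fails:

WITH batch AS (
    DELETE FROM temptable
    WHERE id IN (SELECT id FROM temptable LIMIT 1000)
    RETURNING *
)
INSERT INTO realtable
SELECT * FROM batch
WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE realtable.id = batch.id
);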

Answered by Ivan

One quick-and-dirty workaround for this that doesn't involve manual SQL and temporary tables is to just attempt to bulk insert the data. If it fails, revert to serial insertion.

objs = [Event(...), Event(...), Event(...)]  # model instances to insert

try:
    Event.objects.bulk_create(objs)

except IntegrityError:
    # Fall back to saving one by one, skipping the offending rows.
    # (Under autocommit each save() is its own transaction; inside an
    # outer atomic() block you would need a savepoint around each save.)
    for obj in objs:
        try:
            obj.save()
        except IntegrityError:
            continue

If you have lots and lots of errors this may not be so efficient (you'll spend more time serially inserting than doing so in bulk), but I'm working through a high-cardinality dataset with few duplicates so this solves most of my problems.

Answered by Aryeh Leib Taurog

Or 5. Divide and conquer

I didn't test or benchmark this thoroughly, but it performs pretty well for me. YMMV, depending in particular on how many errors you expect to get in a bulk operation.

def psql_copy(records):
    # pg.copy_bin_values, connection and logger are helpers/globals
    # from the author's own codebase.
    count = len(records)
    if count < 1:
        return True
    try:
        pg.copy_bin_values(records)
        return True
    except IntegrityError:
        if count == 1:
            # found culprit!
            msg = "Integrity error copying record:\n%r"
            logger.error(msg % records[0], exc_info=True)
            return False
    finally:
        connection.commit()

    # There was an integrity error but we had more than one record.
    # Divide and conquer.
    mid = count // 2  # integer division so the slice indices stay ints
    return psql_copy(records[:mid]) and psql_copy(records[mid:])
    # or just return False
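
Since psql_copy depends on the author's own copy helper, here is a hedged, self-contained variant of the same divide-and-conquer idea built on bulk_create; the function name is made up:

from django.db import IntegrityError, transaction

def bulk_create_bisect(model, objs):
    # Returns the number of rows inserted; rows violating a constraint
    # are isolated by bisection and skipped.
    if not objs:
        return 0
    try:
        with transaction.atomic():  # savepoint, so a failure is recoverable
            model.objects.bulk_create(objs)
        return len(objs)
    except IntegrityError:
        if len(objs) == 1:
            return 0  # found the culprit; skip it
    mid = len(objs) // 2
    return bulk_create_bisect(model, objs[:mid]) + bulk_create_bisect(model, objs[mid:])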

Answered by Noortheen Raja

Even in Django 1.11 there is no way to do this. I found a better option than using raw SQL: django-query-builder. It has an upsert method.

from querybuilder.query import Query
q = Query().from_table(YourModel)
# replace with your real objects
rows = [YourModel() for i in range(10)] 
q.upsert(rows, ['unique_fld1', 'unique_fld2'], ['fld1_to_update', 'fld2_to_update'])

Note: the library only supports PostgreSQL.

Here is a gist that I use for bulk insert; it supports ignoring IntegrityErrors and returns the records inserted.

Answered by Sencer H.

A late answer for pre-Django 2.2 projects:

I ran into this situation recently, and I found my way out with a secondary list to check for uniqueness.

In my case, the model has a unique_together constraint, and bulk_create throws an IntegrityError because the list passed to bulk_create contains duplicate data.

So I decided to keep a check list alongside the list of objects for bulk_create. Here is the sample code; the unique keys are owner and brand, where owner is a User instance and brand is a string:

create_list = []
create_list_check = []
for brand in brands:
    if (owner.id, brand) not in create_list_check:
        create_list_check.append((owner.id, brand))
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)
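
A small variant on the same idea, assuming the same model and fields: a set gives O(1) membership tests, which matters when brands is large:

seen = set()
create_list = []
for brand in brands:
    key = (owner.id, brand)
    if key not in seen:  # set lookup is O(1) vs O(n) for a list
        seen.add(key)
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)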