
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/9154264/

Date: 2020-09-08 08:27:21  Source: igfitidea

What is the recommended way to delete a large number of items from DynamoDB?

database, nosql, amazon-web-services, cloud, amazon-dynamodb

Asked by Tyler

I'm writing a simple logging service in DynamoDB.


I have a logs table that is keyed by a user_id hash and a timestamp (Unix epoch int) range.


When a user of the service terminates their account, I need to delete all items in the table, regardless of the range value.


What is the recommended way of doing this sort of operation (Keeping in mind there could be millions of items to delete)?


My options, as far as I can see are:


A: Perform a Scan operation, calling delete on each returned item, until no items are left


B: Perform a BatchGet operation, again calling delete on each item until none are left


Both of these look terrible to me as they will take a long time.


What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me.


Accepted answer by Steffen Opel

What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me.


An understandable request indeed; I can imagine advanced operations like these might get added over time by the AWS team (they have a history of starting with a limited feature set first and evaluating extensions based on customer feedback), but here is what you should do to at least avoid the cost of a full scan:


  1. Use Query rather than Scan to retrieve all items for user_id - this works regardless of the combined hash/range primary key in use, because HashKeyValue and RangeKeyCondition are separate parameters in this API and the former only targets the attribute value of the hash component of the composite primary key.

    • Please note that you'll have to deal with the query API paging here as usual, see the ExclusiveStartKey parameter:

      Primary key of the item from which to continue an earlier query. An earlier query might provide this value as the LastEvaluatedKey if that query operation was interrupted before completing the query; either because of the result set size or the Limit parameter. The LastEvaluatedKey can be passed back in a new query request to continue the operation from that point.

  2. Loop over all returned items and invoke DeleteItem on each as usual

    • Update: Most likely BatchWriteItem is more appropriate for a use case like this (see below for details).
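The Query-and-delete loop described in the steps above can be sketched as follows; `fake_query` and `fake_delete_item` are in-memory stand-ins for the real DynamoDB Query and DeleteItem calls (including the ExclusiveStartKey/LastEvaluatedKey paging handshake), not actual SDK methods:

```python
# In-memory stand-in for the logs table: a set of (user_id, timestamp) keys.
TABLE = {("user-1", ts) for ts in range(7)} | {("user-2", 0)}

def fake_query(user_id, exclusive_start_key=None, limit=3):
    """Return up to `limit` keys for user_id plus a LastEvaluatedKey, like Query."""
    keys = sorted(k for k in TABLE if k[0] == user_id)
    if exclusive_start_key is not None:
        keys = [k for k in keys if k > exclusive_start_key]
    page = keys[:limit]
    last_evaluated_key = page[-1] if len(page) == limit else None
    return page, last_evaluated_key

def fake_delete_item(key):
    TABLE.discard(key)

def delete_all_for_user(user_id):
    start_key = None
    while True:
        # Each page continues from the previous LastEvaluatedKey.
        page, start_key = fake_query(user_id, exclusive_start_key=start_key)
        for key in page:
            fake_delete_item(key)
        if start_key is None:  # no more pages
            break

delete_all_for_user("user-1")
```

Only the paging and deletion control flow carry over to the real API; with an SDK you would pass the returned LastEvaluatedKey as the next request's ExclusiveStartKey in exactly this shape.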


Update


As highlighted by ivant, the BatchWriteItem operation enables you to put or delete several items across multiple tables in a single API call [emphasis mine]:


To upload one item, you can use the PutItem API, and to delete one item, you can use the DeleteItem API. However, when you want to upload or delete large amounts of data, such as uploading large amounts of data from Amazon Elastic MapReduce (EMR) or migrating data from another database into Amazon DynamoDB, this API offers an efficient alternative.


Please note that this still has some relevant limitations, most notably:


  • Maximum operations in a single request— You can specify a total of up to 25 put or delete operations; however, the total request size cannot exceed 1 MB (the HTTP payload).

  • Not an atomic operation— Individual operations specified in a BatchWriteItem are atomic; however BatchWriteItem as a whole is a "best-effort" operation and not an atomic operation. That is, in a BatchWriteItem request, some operations might succeed and others might fail. [...]



Nevertheless this obviously offers a potentially significant gain for use cases like the one at hand.

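Given the 25-operation cap noted above, a bulk delete has to be chunked client-side, and any UnprocessedItems returned by BatchWriteItem must be retried. A minimal sketch of that control flow, with `batch_write` as a hypothetical stand-in for the real call:

```python
def chunks(seq, size=25):
    """Split `seq` into lists of at most `size` items (the BatchWriteItem cap)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def delete_in_batches(keys, batch_write):
    """Send deletes 25 at a time; retry whatever comes back unprocessed."""
    for batch in chunks(keys):
        pending = batch
        while pending:  # UnprocessedItems loop
            pending = batch_write(pending)

# Fake backend that records the batch sizes it was asked to process.
calls = []
def fake_batch_write(batch):
    calls.append(len(batch))
    return []  # pretend everything was processed

delete_in_batches(list(range(60)), fake_batch_write)
```

A real implementation would also back off exponentially between retries, since repeated UnprocessedItems usually signals throttling.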

Answer by jonathan

According to the DynamoDB documentation you could just delete the full table.


See below:


"Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations"


If you wish to delete only a subset of your data, then you could make separate tables for each month, year or similar. This way you could remove "last month" and keep the rest of your data intact.

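The per-month table idea can be as simple as deriving a table name from each log's timestamp; the `logs_YYYY_MM` naming scheme here is purely illustrative, not an AWS convention:

```python
from datetime import datetime, timezone

def table_for(epoch_seconds):
    """Route a log entry to a monthly table based on its Unix timestamp."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"logs_{t.year:04d}_{t.month:02d}"

print(table_for(1599553641))  # 2020-09-08 UTC -> "logs_2020_09"
```

Dropping "last month" then becomes a single DeleteTable call on the oldest table, with no per-item write cost.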

This is how you delete a table in Java using the AWS SDK:


DeleteTableRequest deleteTableRequest = new DeleteTableRequest()
  .withTableName(tableName);
DeleteTableResult result = client.deleteTable(deleteTableRequest);

Answer by Lukas Liesis

If you want to delete items after some time, e.g. after a month, just use the Time To Live option. It will not consume write units.


In your case, I would set a TTL attribute for when logs should expire and leave them in place after a user is deleted; TTL would make sure the logs are removed eventually.


When Time To Live is enabled on a table, a background job checks the TTL attribute of items to see if they are expired.

DynamoDB typically deletes expired items within 48 hours of expiration. The exact duration within which an item truly gets deleted after expiration is specific to the nature of the workload and the size of the table. Items that have expired and not been deleted will still show up in reads, queries, and scans. These items can still be updated and successful updates to change or remove the expiration attribute will be honored.

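Concretely, TTL works off a numeric item attribute holding a Unix-epoch expiry time. A sketch of stamping each log item with a 30-day expiry; the attribute name `expires_at` is an assumption here (it is whatever attribute you configure TTL to read):

```python
import time

RETENTION_SECONDS = 30 * 24 * 60 * 60  # keep logs for 30 days

def with_ttl(item, now=None):
    """Return a copy of the item with an epoch-seconds expiry attribute added."""
    now = time.time() if now is None else now
    return {**item, "expires_at": int(now + RETENTION_SECONDS)}

item = with_ttl({"user_id": "u1", "timestamp": 1599553641}, now=1599553641)
```

Writing the attribute costs nothing extra; the background TTL job does the deletes for free, within the delay described above.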

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html


Answer by Iman Sedighi

The answer to this question depends on the number of items, their size, and your budget. Depending on that, we have the following 3 cases:


1- The number of items and the size of items in the table are not very large. Then, as Steffen Opel said, you can use Query rather than Scan to retrieve all items for user_id, then loop over all returned items and use either DeleteItem or BatchWriteItem. But keep in mind you may burn a lot of throughput capacity here. For example, consider a situation where you need to delete 1,000 items from a DynamoDB table. Assume that each item is 1 KB in size, resulting in around 1 MB of data. This bulk-deleting task will require a total of 2,000 write capacity units for query and delete. To perform this data load within 10 seconds (which is not even considered fast in some applications), you would need to set the provisioned write throughput of the table to 200 write capacity units. As you can see, this approach is doable if it is for a small number of items or small items.

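The throughput arithmetic above can be spelled out as a rough sizing formula: a DeleteItem on an item up to 1 KB consumes 1 write capacity unit, so finishing N deletes in T seconds needs roughly N/T WCU. (The answer's figure of 200 also budgets capacity for the preceding Query; the sketch below covers the delete side only.)

```python
import math

def wcu_for_bulk_delete(num_items, item_kb, seconds):
    """Estimate provisioned WCU needed to delete num_items in the given time."""
    wcu_per_delete = math.ceil(item_kb)   # write units are billed per 1 KB written
    total_wcu = num_items * wcu_per_delete
    return math.ceil(total_wcu / seconds)

print(wcu_for_bulk_delete(1000, 1, 10))  # -> 100 WCU for the deletes alone
```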

2- There are a lot of items, or very large items, in the table, and you can store them according to time in different tables. Then, as jonathan said, you can just delete the table. This is much better, but I don't think it matches your case: since you want to delete all of a user's data regardless of when the logs were created, you can't delete one particular table. If you want a separate table for each user, then I guess that with a high number of users it becomes too expensive and impractical for your case.


3- If you have a lot of data, you can't divide your hot and cold data into different tables, and you need to do large-scale deletes frequently, then unfortunately DynamoDB is not a good option for you at all. It may become expensive or very slow (depending on your budget). In these cases, I recommend finding another database for your data.


Answer by Shraavan Hebbar

There is no option to truncate DynamoDB tables; we have to drop the table and create it again. DynamoDB charges are based on ReadCapacityUnits and WriteCapacityUnits. If we delete all items using the BatchWriteItem function, it will consume WriteCapacityUnits, so it is better to delete only specific records, or to delete the table and start again.


Answer by Mohammad

My approach to deleting all rows from a table in DynamoDB is simply to pull all rows out of the table using DynamoDB's ScanAsync, then feed the result list to DynamoDB's AddDeleteItems. The C# code below works fine for me.


    public async Task DeleteAllReadModelEntitiesInTable()
    {
        List<ReadModelEntity> readModels;

        // Scan with no conditions returns every item in the table.
        var conditions = new List<ScanCondition>();
        readModels = await _context.ScanAsync<ReadModelEntity>(conditions).GetRemainingAsync();

        // Queue all items for deletion in one batch write.
        var batchWork = _context.CreateBatchWrite<ReadModelEntity>();
        batchWork.AddDeleteItems(readModels);
        await batchWork.ExecuteAsync();
    }

Note: Deleting the table and then recreating it from the web console may cause problems if the table was created via YAML/CloudFormation.
