Python: complete scan of DynamoDB with boto3
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/36780856/
Complete scan of dynamoDb with boto3
Asked by CJ_Spaz
My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process that loops through the table, but I'm not sure how to make each batch start where the previous one left off.
Is there some way to filter my scan? From what I've read, filtering occurs after loading, and loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.
Any assistance would be appreciated.
import boto3
dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)
table = dynamodb.Table('widgetsTableName')
data = table.scan()
Answered by Tay B
I think the Amazon DynamoDB documentation regarding table scanning answers your question.
In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:
import boto3
dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)
table = dynamodb.Table('widgetsTableName')
response = table.scan()
data = response['Items']
# Each response holds at most 1 MB; LastEvaluatedKey is present while more data remains.
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
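On the filtering part of the question: a FilterExpression is applied after each page of up to 1 MB has been read, so a filtered page can come back with few or no items while still carrying a LastEvaluatedKey. The same loop therefore works unchanged with a filter. A minimal sketch, assuming `table` is a boto3 Table resource; the helper name `scan_all` and the attribute name in the usage comment are made up for illustration:

```python
def scan_all(table, **scan_kwargs):
    """Collect the items from every page of a scan.

    `table` is assumed to be a boto3 DynamoDB Table resource (anything with a
    compatible scan() method works). Extra keyword arguments such as
    FilterExpression are passed to every scan() call unchanged.
    """
    items = []
    response = table.scan(**scan_kwargs)
    items.extend(response['Items'])
    # Keep going while DynamoDB reports more data, even if a filtered page
    # came back empty.
    while 'LastEvaluatedKey' in response:
        response = table.scan(
            ExclusiveStartKey=response['LastEvaluatedKey'], **scan_kwargs
        )
        items.extend(response['Items'])
    return items


# Hypothetical usage ('price' is an invented attribute name):
#   from boto3.dynamodb.conditions import Attr
#   cheap = scan_all(table, FilterExpression=Attr('price').lt(10))
```

Note that the filter reduces what is returned, not what is read: the whole table is still scanned.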
Answered by Jordon Phillips
boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:
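import boto3
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
# TableName is required by scan; the name from the question is used here.
for page in paginator.paginate(TableName='widgetsTableName'):
    # do something with page['Items']
    pass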
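The paginator also accepts a PaginationConfig argument (for example MaxItems and PageSize) to bound how much is fetched. A rough sketch, assuming a low-level client; the function name `iter_scan_items` and its parameters are illustrative and not part of boto3:

```python
def iter_scan_items(client, table_name, page_size=None, max_items=None):
    """Yield items one at a time from a paginated scan.

    `client` is assumed to be a boto3 DynamoDB client. Items come back in
    DynamoDB's typed format, e.g. {'id': {'S': 'abc'}}.
    """
    config = {}
    if page_size is not None:
        config['PageSize'] = page_size   # upper bound on items per request
    if max_items is not None:
        config['MaxItems'] = max_items   # stop after this many items overall
    paginator = client.get_paginator('scan')
    for page in paginator.paginate(TableName=table_name, PaginationConfig=config):
        for item in page['Items']:
            yield item
```

Because this is a generator, pages are requested only as the caller consumes items, so the full 220 MB never has to sit in memory at once.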
Answered by Abe Voelker
Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:
import boto3
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_parameters = {
    'TableName': 'foo',
    'FilterExpression': 'bar > :x AND bar < :y',
    'ExpressionAttributeValues': {
        ':x': {'S': '2017-01-31T01:35'},
        ':y': {'S': '2017-01-31T02:08'},
    }
}
page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    # do something with page['Items']
    pass
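If the attribute being filtered happens to be a DynamoDB reserved word, the expression needs a placeholder declared in ExpressionAttributeNames. A small sketch that builds the same kind of parameters that way; `range_filter_params` is a hypothetical helper, and string-typed bounds are assumed:

```python
def range_filter_params(table_name, attribute, low, high):
    """Build scan parameters selecting items with low < attribute < high.

    Assumes string-typed values ('S'), like the timestamps in the answer.
    '#attr' is a placeholder, so `attribute` may be a reserved word.
    """
    return {
        'TableName': table_name,
        'FilterExpression': '#attr > :low AND #attr < :high',
        'ExpressionAttributeNames': {'#attr': attribute},
        'ExpressionAttributeValues': {
            ':low': {'S': low},
            ':high': {'S': high},
        },
    }


params = range_filter_params('foo', 'bar', '2017-01-31T01:35', '2017-01-31T02:08')
# The result can be passed straight to the paginator:
#   page_iterator = paginator.paginate(**params)
```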
Answered by Vincent
Code for stripping the DynamoDB type descriptors from the results, as @kungphu mentioned.
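import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector
client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
# Note: _service_model is a private attribute of the client.
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer=TypeDeserializer())
# paginate() must also be given the query parameters (TableName, key condition, ...).
for page in paginator.paginate():
    # Converts the typed attribute values in the page to plain Python values, in place.
    trans.inject_attribute_value_output(page, service_model)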
Answered by Richard
DynamoDB limits the scan method to 1 MB of data per call.
Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan
Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:
from boto3 import resource
_dynamo = resource('dynamodb')
_database = _dynamo.Table('Database')
items = []
last_evaluated_key = None
while True:
    if last_evaluated_key:
        response = _database.scan(ExclusiveStartKey=last_evaluated_key)
    else:
        response = _database.scan()
    items.extend(response['Items'])  # accumulate this page's items
    last_evaluated_key = response.get('LastEvaluatedKey')
    if not last_evaluated_key:
        break
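The LastEvaluatedKey can also be handed back to the caller, so a scan can be processed in bounded batches and picked up again where the previous batch stopped, which is what the question asks for. A sketch under the assumption that `table` is a boto3 Table resource; `scan_batch` and `max_pages` are invented names:

```python
def scan_batch(table, start_key=None, max_pages=10):
    """Read up to `max_pages` pages, starting after `start_key`.

    Returns (items, next_key). next_key is None once the table is exhausted;
    otherwise pass it back in as start_key to continue where this call stopped.
    """
    items = []
    key = start_key
    for _ in range(max_pages):
        kwargs = {'ExclusiveStartKey': key} if key else {}
        response = table.scan(**kwargs)
        items.extend(response['Items'])
        key = response.get('LastEvaluatedKey')
        if not key:
            break
    return items, key


# Hypothetical usage:
#   items, key = scan_batch(table)
#   while key:
#       more, key = scan_batch(table, start_key=key)
```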
Answered by CJ_Spaz
Turns out that Boto3 captures the "LastEvaluatedKey" as part of the returned response. This can be used as the start point for a scan:
data = table.scan(
    ExclusiveStartKey=data['LastEvaluatedKey']
)
I plan on building a loop around this that runs until the response no longer contains a LastEvaluatedKey.
Answered by Dan Hook
I had some problems with Vincent's answer: the transformation was also being applied to the LastEvaluatedKey, which messed up the pagination. Solved as follows:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client._service_model.operation_model('Scan')
trans = TransformationInjector(deserializer=TypeDeserializer())
operation_parameters = {
    'TableName': 'tablename',
}
items = []
for page in paginator.paginate(**operation_parameters):
    # Save the untransformed key so the paginator can still use it.
    has_last_key = 'LastEvaluatedKey' in page
    if has_last_key:
        last_key = page['LastEvaluatedKey'].copy()
    trans.inject_attribute_value_output(page, operation_model)
    if has_last_key:
        page['LastEvaluatedKey'] = last_key
    items.extend(page['Items'])
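Another way to sidestep the problem is to deserialize only the items and never touch the rest of the page. A sketch, assuming the converter is passed in; with boto3 that would typically be `TypeDeserializer().deserialize`, which turns a typed value such as {'S': 'abc'} into a plain Python value. `plain_items` is a made-up helper name:

```python
def plain_items(page, deserialize):
    """Return the page's items with every attribute value converted.

    `deserialize` is any callable mapping one DynamoDB-typed value to a Python
    value. page['LastEvaluatedKey'] is left alone, so pagination is unaffected.
    """
    return [
        {name: deserialize(value) for name, value in item.items()}
        for item in page.get('Items', [])
    ]


# Hypothetical usage with a scan paginator:
#   deserialize = TypeDeserializer().deserialize
#   for page in paginator.paginate(TableName='tablename'):
#       items.extend(plain_items(page, deserialize))
```

This variant also avoids the private `_service_model` attribute, at the cost of building a new list per page.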
Answered by YitzikC
The two approaches suggested above both have problems: either you write lengthy and repetitive code that handles paging explicitly in a loop, or you use Boto paginators with the low-level client and forgo the advantages of higher-level Boto objects.
A solution that uses Python functional code to provide a high-level abstraction allows the higher-level Boto methods to be used while hiding the complexity of AWS paging:
import itertools
import typing

def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
    """A wrapper for functions using AWS paging, that returns a generator which yields a sequence of items for
    every response.

    Args:
        function_returning_response: A function (or callable) that returns an AWS response with 'Items' and
            optionally 'LastEvaluatedKey'. This could be a bound method of an object.

    Returns:
        A generator which yields the 'Items' field of the result for every response
    """
    response = function_returning_response(*args, **kwargs)
    yield response["Items"]
    while "LastEvaluatedKey" in response:
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
        response = function_returning_response(*args, **kwargs)
        yield response["Items"]

def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
    """A wrapper for functions using AWS paging, that returns an iterator of all the items in the responses.
    Items are yielded to the caller as soon as they are received.

    Args:
        function_returning_response: A function (or callable) that returns an AWS response with 'Items' and
            optionally 'LastEvaluatedKey'. This could be a bound method of an object.

    Returns:
        An iterator which yields one response item at a time
    """
    return itertools.chain.from_iterable(iterate_result_pages(function_returning_response, *args, **kwargs))

# Example, assuming 'table' is a Boto DynamoDB table object:
all_items = list(iterate_paged_results(table.scan, ProjectionExpression='my_field'))