Python: complete scan of DynamoDB using boto3
Original URL: http://stackoverflow.com/questions/36780856/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Complete scan of dynamoDb with boto3
Asked by CJ_Spaz
My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process that loops through the table, but I'm not sure how to make each batch start where the previous one left off.
Is there some way to filter my scan? From what I've read, filtering occurs after loading, and loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.
Any assistance would be appreciated.
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)
table = dynamodb.Table('widgetsTableName')
data = table.scan()
Answered by Tay B
I think the Amazon DynamoDB documentation regarding table scanning answers your question.
In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)
table = dynamodb.Table('widgetsTableName')

response = table.scan()
data = response['Items']

# Keep scanning from the last evaluated key until no key is returned.
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
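To try the loop above without touching AWS, one option is a small in-memory stand-in for the table. The sketch below is only an illustration: `FakeTable` and `scan_all` are hypothetical names (not part of boto3), and the fake imitates nothing beyond the `Items` / `LastEvaluatedKey` / `ExclusiveStartKey` part of the `scan` contract.

```python
class FakeTable:
    """Hypothetical stand-in for a boto3 Table; it only imitates scan paging."""

    def __init__(self, items, page_size):
        self._items = list(items)
        self._page_size = page_size

    def scan(self, ExclusiveStartKey=None, **kwargs):
        # The fake "key" is simply the index of the last item already returned.
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["index"] + 1
        end = start + self._page_size
        response = {"Items": self._items[start:end]}
        if end < len(self._items):
            # Like DynamoDB, only include the key when more data remains.
            response["LastEvaluatedKey"] = {"index": end - 1}
        return response


def scan_all(table, **scan_kwargs):
    """Collect every item by following LastEvaluatedKey until it disappears."""
    response = table.scan(**scan_kwargs)
    items = list(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(
            ExclusiveStartKey=response["LastEvaluatedKey"], **scan_kwargs
        )
        items.extend(response["Items"])
    return items
```

Since `scan_all` only relies on the `scan` method, it should also accept a real `dynamodb.Table(...)` object, though that has not been exercised here.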
Answered by Jordon Phillips
boto3 offers paginators that handle all the pagination details for you. See the boto3 doc page for the scan paginator. Basically, you would use it like so:
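import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

# TableName is required by scan; the name here is the one from the question.
for page in paginator.paginate(TableName='widgetsTableName'):
    # do something with page['Items']
    pass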
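If you want to control how much is requested per call, the paginator accepts a `PaginationConfig` argument. Here is a minimal sketch, assuming a helper named `iter_scan_items` (my own name, not a boto3 function) that receives an already constructed client and flattens the pages into a stream of raw items:

```python
def iter_scan_items(client, table_name, page_size=None):
    """Yield raw (typed) items from every page of a scan paginator."""
    pagination_config = {}
    if page_size is not None:
        # PageSize caps how many items are evaluated per underlying request.
        pagination_config["PageSize"] = page_size
    paginator = client.get_paginator("scan")
    pages = paginator.paginate(
        TableName=table_name, PaginationConfig=pagination_config
    )
    for page in pages:
        # A page can legitimately contain no items, so default to [].
        yield from page.get("Items", [])
```

With a real client this would be used as `for item in iter_scan_items(boto3.client('dynamodb'), 'widgetsTableName', page_size=500): ...`. The items come back in the low-level typed format, e.g. `{'id': {'S': 'abc'}}`.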
Answered by Abe Voelker
Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:
import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

operation_parameters = {
    'TableName': 'foo',
    'FilterExpression': 'bar > :x AND bar < :y',
    'ExpressionAttributeValues': {
        ':x': {'S': '2017-01-31T01:35'},
        ':y': {'S': '2017-01-31T02:08'},
    }
}

page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    # do something with page['Items']
    pass
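As the question points out, the filter is applied after each page has been read, so a page may carry few or even zero items while the scan is still not finished. Each scan response reports both `Count` (items that passed the filter) and `ScannedCount` (items examined), which makes this visible. A rough sketch, with `collect_filtered` being a hypothetical helper name of mine:

```python
def collect_filtered(pages):
    """Gather items from scan pages and tally matched vs. examined counts."""
    items = []
    matched = 0
    scanned = 0
    for page in pages:
        # Do not stop on an empty page: only the iterator running out
        # means the scan is complete.
        items.extend(page.get("Items", []))
        matched += page.get("Count", 0)
        scanned += page.get("ScannedCount", 0)
    return items, matched, scanned
```

With the paginator from the answer this could be called as `collect_filtered(paginator.paginate(**operation_parameters))`. A large gap between the two totals suggests the scan reads far more data than it returns.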
Answered by Vincent
Code for stripping the DynamoDB type annotations from the results, as @kungphu mentioned.
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer=TypeDeserializer())

# Note: paginate() must be given the operation's parameters (at least TableName).
for page in paginator.paginate():
    # Converts the typed attribute values in the page to plain Python values, in place.
    trans.inject_attribute_value_output(page, service_model)
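For readers unsure what "format type" means here: the low-level client returns every attribute wrapped in a one-key dict naming its type, such as `{'S': 'abc'}` or `{'N': '3'}`. The sketch below is a hand-written illustration of that unwrapping for a few common types only. It is not boto3's implementation (`TypeDeserializer` is the real tool), and `plain_value` / `plain_item` are hypothetical names.

```python
from decimal import Decimal


def plain_value(attr):
    """Convert one typed attribute value such as {'S': 'x'} to a Python value."""
    type_tag, raw = next(iter(attr.items()))
    if type_tag == "S":
        return raw
    if type_tag == "N":
        return Decimal(raw)  # numbers travel as strings on the wire
    if type_tag == "BOOL":
        return raw
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [plain_value(v) for v in raw]
    if type_tag == "M":
        return {k: plain_value(v) for k, v in raw.items()}
    # Binary and set types are deliberately left out of this sketch.
    raise ValueError(f"type tag {type_tag!r} is not handled in this sketch")


def plain_item(item):
    """Unwrap every attribute of a single item."""
    return {name: plain_value(attr) for name, attr in item.items()}
```

For example, `plain_item({'id': {'S': 'a1'}, 'qty': {'N': '3'}})` gives a dict with a plain string and a `Decimal`.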
Answered by Richard
DynamoDB limits the scan method to 1 MB of data per call.
Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan
Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:
from boto3 import resource

_dynamo = resource('dynamodb')
_database = _dynamo.Table('Database')

last_evaluated_key = None
while True:
    if last_evaluated_key:
        response = _database.scan(ExclusiveStartKey=last_evaluated_key)
    else:
        response = _database.scan()
    # response['Items'] holds the records of this page; process them here.
    last_evaluated_key = response.get('LastEvaluatedKey')
    if not last_evaluated_key:
        break
Answered by CJ_Spaz
Turns out that Boto3 captures the "LastEvaluatedKey" as part of the returned response. This can be used as the start point for a scan:
data = table.scan(
    ExclusiveStartKey=data['LastEvaluatedKey']
)
I plan on building a loop around this that keeps passing the LastEvaluatedKey as the ExclusiveStartKey until the response no longer contains a LastEvaluatedKey.
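Since the original question was about making each batch start where the previous one left off, here is one way that could look as a single-page helper. It is a sketch under assumptions: `scan_page` is a hypothetical name, `table` is anything with a boto3-style `scan` method, and `Limit` is the standard scan parameter that caps the number of items evaluated per call.

```python
def scan_page(table, start_key=None, limit=None):
    """Fetch a single scan page and return (items, next_start_key)."""
    kwargs = {}
    if start_key is not None:
        kwargs["ExclusiveStartKey"] = start_key
    if limit is not None:
        kwargs["Limit"] = limit
    response = table.scan(**kwargs)
    # The next key is None once the scan has reached the end of the table.
    return response["Items"], response.get("LastEvaluatedKey")
```

The returned key can be stored (for example between runs of a batch job) and handed back as `start_key` to resume; a `None` key means there is nothing left to read.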
Answered by Dan Hook
I had some problems with Vincent's answer related to the transformation being applied to the LastEvaluatedKey and messing up the pagination. Solved as follows:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client._service_model.operation_model('Scan')
trans = TransformationInjector(deserializer=TypeDeserializer())

operation_parameters = {
    'TableName': 'tablename',
}

items = []
for page in paginator.paginate(**operation_parameters):
    # Save the untransformed key so the paginator can keep using it.
    has_last_key = 'LastEvaluatedKey' in page
    if has_last_key:
        last_key = page['LastEvaluatedKey'].copy()
    trans.inject_attribute_value_output(page, operation_model)
    if has_last_key:
        page['LastEvaluatedKey'] = last_key
    items.extend(page['Items'])
Answered by YitzikC
The two approaches suggested above both have drawbacks: you either write lengthy, repetitive code that handles paging explicitly in a loop, or you use Boto paginators with the low-level client and forgo the advantages of the higher-level Boto objects.
A solution using Python functional code to provide a high-level abstraction allows higher-level Boto methods to be used, while hiding the complexity of AWS paging:
import itertools
import typing


def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
    """A wrapper for functions using AWS paging, that returns a generator which yields a sequence of items for
    every response

    Args:
        function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
        This could be a bound method of an object.

    Returns:
        A generator which yields the 'Items' field of the result for every response
    """
    response = function_returning_response(*args, **kwargs)
    yield response["Items"]
    while "LastEvaluatedKey" in response:
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
        response = function_returning_response(*args, **kwargs)
        yield response["Items"]

    return


def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
    """A wrapper for functions using AWS paging, that returns an iterator of all the items in the responses.
    Items are yielded to the caller as soon as they are received.

    Args:
        function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
        This could be a bound method of an object.

    Returns:
        An iterator which yields one response item at a time
    """
    return itertools.chain.from_iterable(iterate_result_pages(function_returning_response, *args, **kwargs))

# Example, assuming 'table' is a Boto DynamoDB table object:
all_items = list(iterate_paged_results(table.scan, ProjectionExpression='my_field'))
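One practical benefit of the generator style is laziness: if you only need the first few items, later pages are never requested. The following is a compact sketch of the same idea rather than the answer's exact code; `iter_items`, `first_n` and the `fetch` parameter are names I made up for illustration, and `fetch` is assumed to be something like a bound `table.scan`.

```python
import itertools


def iter_items(fetch, **kwargs):
    """Yield items one by one, requesting the next page only when needed."""
    while True:
        response = fetch(**kwargs)
        yield from response["Items"]
        if "LastEvaluatedKey" not in response:
            return
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]


def first_n(fetch, n, **kwargs):
    """Return at most n items, stopping early without fetching further pages."""
    return list(itertools.islice(iter_items(fetch, **kwargs), n))
```

With a real table this might be called as `first_n(table.scan, 10, ProjectionExpression='my_field')`. Note that wrapping the iterator in `list(...)` without a limit, as in the example above, pulls the whole table into memory.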

