Can page size be set with dynamodb.ScanPages? - amazon-web-services

The documentation for working with DynamoDB scans, found here, makes reference to a page-size parameter for the AWS CLI.
Looking at the documentation for the Go AWS SDK, found here, there is a ScanPages function. There is an example of how to use it, but nowhere in the documentation is there a way to specify something like page-size as the AWS CLI has. I can't determine how the paging occurs, other than assuming that if the results exceed 1 MB, that counts as a page, based on the Go documentation and the general scan documentation.
I'm also aware of the Limit value that can be set on the ScanInput, but the documentation indicates that value would function as a page size only if every item processed matched the filter expression of the scan:
The maximum number of items to evaluate (not necessarily the number of matching items)
Is there a way to set something equivalent to page-size with the go SDK?

How does pagination work in DynamoDB?
DynamoDB paginates the results from Scan operations. With pagination,
the Scan results are divided into "pages" of data that are 1 MB in
size (or less). An application can process the first page of results,
then the second page, and so on.
So for each request, if there are more items in the result, you will always get a LastEvaluatedKey. You have to re-issue the scan request with this LastEvaluatedKey to get the complete result.
For example, if a scan matches 400 items and each request returns at most the upper limit of 100 items, you have to keep re-issuing the scan request until LastEvaluatedKey comes back empty. You would do something like the code below (documentation).
input := &dynamodb.ScanInput{
	// ... copy all parameters of the original ScanInput request here
}
for {
	output, err := dynamoClient.Scan(input)
	if err != nil {
		break // handle the error appropriately
	}
	// process output.Items ...
	if len(output.LastEvaluatedKey) == 0 {
		break // no more pages to fetch
	}
	input.ExclusiveStartKey = output.LastEvaluatedKey
}
What does page-size do in the AWS CLI?
The scan operation scans the whole DynamoDB table and returns results according to the filter. Ordinarily, the AWS CLI handles pagination automatically: it keeps re-issuing the scan request for us, and this request/response pattern continues until the final response.
The page-size parameter tells DynamoDB to scan only that many rows of the table at a time and to filter on those. If the complete table has not been scanned yet, or the result is more than 1 MB, the response includes a LastEvaluatedKey and the CLI re-issues the request.
Here is a sample request and response from the documentation.
aws dynamodb scan \
--table-name Movies \
--projection-expression "title" \
--filter-expression 'contains(info.genres,:gen)' \
--expression-attribute-values '{":gen":{"S":"Sci-Fi"}}' \
--page-size 100 \
--debug
b'{"Count":7,"Items":[{"title":{"S":"Monster on the Campus"}},{"title":{"S":"+1"}},
{"title":{"S":"100 Degrees Below Zero"}},{"title":{"S":"About Time"}},{"title":{"S":"After Earth"}},
{"title":{"S":"Age of Dinosaurs"}},{"title":{"S":"Cloudy with a Chance of Meatballs 2"}}],
"LastEvaluatedKey":{"year":{"N":"2013"},"title":{"S":"Curse of Chucky"}},"ScannedCount":100}'
We can clearly see ScannedCount:100 and the filtered count Count:7, so out of the 100 items scanned only 7 made it through the filter (documentation).
From the Limit documentation:
// The maximum number of items to evaluate (not necessarily the number of matching
// items). If DynamoDB processes the number of items up to the limit while processing
// the results, it stops the operation and returns the matching values up to
// that point, and a key in LastEvaluatedKey to apply in a subsequent operation,
// so that you can pick up where you left off.
So basically, page-size and Limit are the same thing: Limit caps the number of items scanned (not matched) in a single Scan request. To get the equivalent of page-size with the Go SDK, set Limit on the ScanInput you pass to ScanPages, as in the sketch below.
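A minimal, self-contained sketch of this with the Go SDK (aws-sdk-go v1); the table name and filter mirror the CLI example above, and are otherwise assumptions:
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	sess := session.Must(session.NewSession())
	dynamoClient := dynamodb.New(sess)

	// Limit plays the role of the CLI's --page-size: each underlying Scan
	// request evaluates at most this many items before returning a page.
	input := &dynamodb.ScanInput{
		TableName:            aws.String("Movies"),
		ProjectionExpression: aws.String("title"),
		FilterExpression:     aws.String("contains(info.genres, :gen)"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":gen": {S: aws.String("Sci-Fi")},
		},
		Limit: aws.Int64(100),
	}

	err := dynamoClient.ScanPages(input, func(page *dynamodb.ScanOutput, lastPage bool) bool {
		fmt.Printf("scanned %d items, %d matched the filter\n",
			aws.Int64Value(page.ScannedCount), aws.Int64Value(page.Count))
		return !lastPage // keep requesting pages until the last one
	})
	if err != nil {
		log.Fatal(err)
	}
}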

Related

DynamoDB: get list of items that were overwritten in batch write operation

I am trying to get the list of attributes that were overwritten in a DynamoDB batch operation, but the response doesn't include this information.
Is there any way to get the list of items that were overwritten when using batch write?
No: the API docs clearly list what information is returned, and the overwritten items are not part of it.
If you need the before state, you have to do individual PutItem requests with the ReturnValues parameter set to ALL_OLD - docs.
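For example, a rough sketch with the Go SDK (aws-sdk-go v1), assuming the same imports and dynamoClient as in the ScanPages sketch above; the table, key and attribute names here are made up for illustration:
out, err := dynamoClient.PutItem(&dynamodb.PutItemInput{
	TableName: aws.String("MyTable"), // hypothetical table
	Item: map[string]*dynamodb.AttributeValue{
		"id":   {S: aws.String("item-1")},
		"name": {S: aws.String("new value")},
	},
	ReturnValues: aws.String("ALL_OLD"), // ask for the item that was replaced, if any
})
if err != nil {
	log.Fatal(err)
}
// out.Attributes is empty for a fresh insert, or holds the old item on an overwrite.
fmt.Println(out.Attributes)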

How to rate limit scan to AWS DynamoDB for AWS CLI?

I have created the following query to query my table:
aws dynamodb scan --table-name TableName --scan-filter '{
"attributeName" : {
"AttributeValueList" : [ {"S" : "StringToQuery"}],
"ComparisonOperator" : "CONTAINS"
}
}'
This is causing a spike in read capacity for that table, which will probably lead to throttling of customer requests. I couldn't find any command line option to limit the rate in https://docs.aws.amazon.com/cli/latest/reference/dynamodb/scan.html, but I did find a java script with rate limit : https://aws.amazon.com/blogs/developer/rate-limited-scans-in-amazon-dynamodb/
Is there any way to do it from AWS CLI?
You can disable pagination and manually make the paginated calls with a bash loop. This way you can delay a certain amount based on the time the previous call took and the consumed read capacity.
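For anyone doing this from code rather than the CLI, here is a rough Go sketch of the same back-off idea (aws-sdk-go v1, reusing the imports and dynamoClient from the earlier sketch plus the time package; the table name and the pacing of 100 ms per consumed read unit are arbitrary assumptions):
input := &dynamodb.ScanInput{
	TableName:              aws.String("TableName"),
	ReturnConsumedCapacity: aws.String(dynamodb.ReturnConsumedCapacityTotal),
	Limit:                  aws.Int64(100),
}
for {
	out, err := dynamoClient.Scan(input)
	if err != nil {
		log.Fatal(err)
	}
	// process out.Items ...

	// Back off in proportion to the read capacity this page consumed,
	// leaving headroom for customer traffic.
	if out.ConsumedCapacity != nil && out.ConsumedCapacity.CapacityUnits != nil {
		time.Sleep(time.Duration(*out.ConsumedCapacity.CapacityUnits*100) * time.Millisecond)
	}
	if len(out.LastEvaluatedKey) == 0 {
		break
	}
	input.ExclusiveStartKey = out.LastEvaluatedKey
}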
Went ahead with creating a new index over a value which I knew was almost always "Y" (like isActive) and added a filter on top of the query. Since it was a new index, it didn't affect existing index capacity.
The answer by cementblocks would reduce the RCUs consumed too, but I needed a guarantee that customers would not be impacted.

Is it always necessary to check isTruncated in S3 ListObjects / ListObjectsV2 responses?

S3's ListObjects and ListObjectsV2 API responses both include an IsTruncated response element, which (according to the V1 API docs)
Specifies whether (true) or not (false) all of the results were returned. If the number of results exceeds that specified by MaxKeys, all of the results might not be returned.
According to the Listing Objects Keys section of the S3 documentation:
As buckets can contain a virtually unlimited number of keys, the complete results of a list query can be extremely large. To manage large result sets, the Amazon S3 API supports pagination to split them into multiple responses. Each list keys response returns a page of up to 1,000 keys with an indicator indicating if the response is truncated. You send a series of list keys requests until you have received all the keys. AWS SDK wrapper libraries provide the same pagination.
Clearly we need to check isTruncated if there's a possibility that the listing could match more than 1000 keys. Similarly, if we explicitly set MaxKeys then we definitely need to check isTruncated if there's ever the possibility that a listing could match more than MaxKeys keys.
However, do we need to check isTruncated if we never expect there to be more than min(1000, MaxKeys) matching keys?
I think that the weakest possible interpretation of the S3 API docs is that S3 will return at most min(1000, MaxKeys) keys per listing call but technically can return fewer keys even if more matching keys exist and would fit in the response. For example, if there are 10 matching keys and MaxKeys == 1000 then it would be technically valid for S3 to return, say, 3 keys in the first API response and 7 in the second. (Technically I suppose it could even return zero keys and set isTruncated = true, but that behavior seems unlikely.)
With these weak semantics I think we always need to check isTruncated, even if we're listing what we expect to be a very small number of keys. As a corollary, any code which doesn't check isTruncated is (most likely) buggy.
In the past, I've observed this listing semantic from other AWS APIs (including the EC2 Reserved Instance Marketplace API).
Is this a correct interpretation of the S3 API semantics? Or does S3 actually guarantee (but not document) stronger semantics (e.g. "if more than MaxKeys keys match the listing then the listing will contain exactly MaxKeys keys")?
I'm especially interested in answers which cite official AWS sources (such as AWS forum responses, SDK issues, etc).
In my experience it will always return the maximum number of values, which is as you state it: min(1000, MaxKeys)
So, if you know you will always have under 1000 results, you would not need to check isTruncated.
Mind you, it's fairly easy to construct a while loop to do so. (Probably easier than writing this question!)
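For reference, a minimal Go sketch (aws-sdk-go v1) of such a loop over ListObjectsV2, following IsTruncated and NextContinuationToken until the listing is complete; the bucket name is an assumption, and the usual aws, session, s3, fmt and log imports are taken as given:
svc := s3.New(session.Must(session.NewSession()))
input := &s3.ListObjectsV2Input{
	Bucket:  aws.String("my-bucket"), // assumed bucket name
	MaxKeys: aws.Int64(1000),
}
for {
	page, err := svc.ListObjectsV2(input)
	if err != nil {
		log.Fatal(err)
	}
	for _, obj := range page.Contents {
		fmt.Println(aws.StringValue(obj.Key))
	}
	// Continue only while S3 reports the listing as truncated.
	if !aws.BoolValue(page.IsTruncated) {
		break
	}
	input.ContinuationToken = page.NextContinuationToken
}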

How to change or get all keys using the maxkeys returned by the listbucket xml?

I am trying to list all my files in my public bucket using the url http://gameexperiencesurvey.s3.amazonaws.com/
You can visit the url to see the xml.
The XML contains an element called MaxKeys with value 1000 which is the maximum number of keys returned in the response body. What if I want to list all the keys that I have, how do I do that?
Also, what is the max limit for the number of keys and their size on a free AWS S3 account?
It is called S3 pagination. See: Iterating Through Multi-Page Results
Iterating Through Multi-Page Results
As buckets can contain a virtually unlimited number of keys, the
complete results of a list query can be extremely large. To manage
large result sets, the Amazon S3 API supports pagination to split them
into multiple responses. Each list keys response returns a page of up
to 1,000 keys with an indicator indicating if the response is
truncated. You send a series of list keys requests until you have
received all the keys. AWS SDK wrapper libraries provide the same
pagination.
You need to have sufficient privileges to list the object keys.
AWS Free Tier for S3
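With the Go SDK (aws-sdk-go v1), the "wrapper libraries" mentioned in the quote correspond to the paginated variants of the list call; a minimal sketch using the bucket from the question, with the usual aws, session, s3, fmt and log imports taken as given:
svc := s3.New(session.Must(session.NewSession()))
err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
	Bucket: aws.String("gameexperiencesurvey"),
}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
	for _, obj := range page.Contents {
		fmt.Println(aws.StringValue(obj.Key))
	}
	return true // keep paging until all keys have been listed
})
if err != nil {
	log.Fatal(err)
}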

What is the AWS DynamoDB write consistency?

I know that AWS DynamoDB reads can be eventually consistent or strongly consistent. And I read a document that says: The individual PutItem and DeleteItem operations specified in BatchWriteItem are atomic; however BatchWriteItem as a whole is not.
But I still don't understand whether the write behavior is synchronized or not.
If this is an awkward question, please tell me.
BatchWriteItem is a batch API, meaning it allows you to specify a number of different operations to be submitted to DynamoDB for execution in the same request. So when you submit a BatchWriteItem request you are asking DynamoDB to perform a number of PutItem or DeleteItem requests for you.
The claim that the individual PutItem and DeleteItem requests are atomic means that each of those is atomic with respect to other requests that may want to modify the same item (identified by its partition/sort keys): it's not possible for data corruption to occur within the item as a result of two PutItem requests executing at the same time, each modifying part of the item and leaving it in an inconsistent state.
But then, the claim is that the whole BatchWriteItem request is not atomic. That just means that the sequence of PutItem and/or DeleteItem requests is not guaranteed to be isolated, so other PutItem or DeleteItem requests - whether single or batched - could execute at the same time as the BatchWriteItem request and affect the state of the table(s) in between the individual PutItem/DeleteItem requests that make up the batch.
To illustrate the point (keeping in mind that DynamoDB rejects a batch that puts and deletes the same item, so the two operations here target different items), say you have a BatchWriteItem request that consists of the following two calls:
PutItem (partitionKey = 1000; name = 'Alpha'; value = 100)
PutItem (partitionKey = 2000; name = 'Beta'; value = 200)
And that at approximately the same time you've submitted this request there is another request that contains the following operation:
DeleteItem (partitionKey = 1000)
That delete may execute after the first put in your batch but before the second one. Each individual operation is still atomic, but a reader could observe a state where item 2000 has been written while item 1000 has already been deleted again - in other words, only part of your batch appears to have taken effect. This is one example of how the whole batch operation is not atomic.
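For concreteness, a hedged sketch (aws-sdk-go v1, assuming the imports and dynamoClient from the earlier sketches; "MyTable" and the attribute names are made up) of what such a batch looks like in code:
batch := &dynamodb.BatchWriteItemInput{
	RequestItems: map[string][]*dynamodb.WriteRequest{
		"MyTable": {
			{PutRequest: &dynamodb.PutRequest{Item: map[string]*dynamodb.AttributeValue{
				"partitionKey": {N: aws.String("1000")},
				"name":         {S: aws.String("Alpha")},
				"value":        {N: aws.String("100")},
			}}},
			{PutRequest: &dynamodb.PutRequest{Item: map[string]*dynamodb.AttributeValue{
				"partitionKey": {N: aws.String("2000")},
				"name":         {S: aws.String("Beta")},
				"value":        {N: aws.String("200")},
			}}},
		},
	},
}
if _, err := dynamoClient.BatchWriteItem(batch); err != nil {
	log.Fatal(err)
}
// Meanwhile, another client could issue the single delete:
// dynamoClient.DeleteItem(&dynamodb.DeleteItemInput{
//     TableName: aws.String("MyTable"),
//     Key:       map[string]*dynamodb.AttributeValue{"partitionKey": {N: aws.String("1000")}},
// })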