How to rate limit a scan of an AWS DynamoDB table from the AWS CLI?

I have created the following command to scan my table:
aws dynamodb scan --table-name TableName --scan-filter '{
    "attributeName": {
        "AttributeValueList": [{"S": "StringToQuery"}],
        "ComparisonOperator": "CONTAINS"
    }
}'
This is causing a spike in read capacity on that table, which will probably lead to throttling of customer requests. I couldn't find any command line option to limit the rate in https://docs.aws.amazon.com/cli/latest/reference/dynamodb/scan.html, but I did find a Java example with rate limiting: https://aws.amazon.com/blogs/developer/rate-limited-scans-in-amazon-dynamodb/
Is there any way to do it from AWS CLI?

You can disable the CLI's automatic pagination and make the paginated calls yourself in a bash loop. That way you can delay each call based on how long the previous call took and how much read capacity it consumed.
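One way to do the manual paging is with the CLI's --max-items/--starting-token options; a rough sketch, assuming jq is installed (the 100-item page size and the 1-second sleep are arbitrary placeholders to tune):
#!/usr/bin/env bash
TABLE=TableName
TOKEN=""
while : ; do
  # add your --scan-filter '...' here, as in the question
  ARGS=(--table-name "$TABLE" --page-size 100 --max-items 100 --return-consumed-capacity TOTAL)
  [ -n "$TOKEN" ] && ARGS+=(--starting-token "$TOKEN")
  RESP=$(aws dynamodb scan "${ARGS[@]}")
  echo "$RESP" | jq '.ConsumedCapacity'              # RCU used by this page
  TOKEN=$(echo "$RESP" | jq -r '.NextToken // empty')
  [ -z "$TOKEN" ] && break
  sleep 1                                            # crude delay between pages
done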

I went ahead and created a new index over an attribute that I knew was almost always "Y" (isActive), and added the filter on top of a Query against that index. Since it was a new index, it didn't consume the existing table's read capacity.
The answer by cementblocks would reduce the RCU consumed too, but I needed a guarantee that customers would not be impacted.
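For reference, a minimal CLI sketch of that approach (the isActive attribute, the index name, and the capacity numbers are only examples; ProvisionedThroughput is needed only for provisioned-mode tables):
# 1. Add a sparse GSI keyed on the attribute that is almost always "Y"
aws dynamodb update-table \
  --table-name TableName \
  --attribute-definitions AttributeName=isActive,AttributeType=S \
  --global-secondary-index-updates '[{
    "Create": {
      "IndexName": "isActive-index",
      "KeySchema": [{"AttributeName": "isActive", "KeyType": "HASH"}],
      "Projection": {"ProjectionType": "ALL"},
      "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}
    }
  }]'
# 2. Query the new index with the CONTAINS filter on top; the reads consume the
#    index's capacity rather than the base table's
aws dynamodb query \
  --table-name TableName \
  --index-name isActive-index \
  --key-condition-expression 'isActive = :y' \
  --filter-expression 'contains(attributeName, :s)' \
  --expression-attribute-values '{":y": {"S": "Y"}, ":s": {"S": "StringToQuery"}}'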

Related

DynamoDB not adding every item? Stops at 900 items?

I am uploading a CSV that has 15,000 items to DynamoDB via the 'import from S3' button on the DynamoDB console. However, when it is finished and I do a 'get live item count', it says 900 when it should be 15,000. Does anyone know why it is not adding everything to my new table?
You probably had a format error. Look in CloudWatch Logs for details; a CLI sketch for pulling them up follows the quote below.
https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/ says:
"During the import, DynamoDB might encounter errors while parsing your data. For each error, DynamoDB creates a log entry in Amazon CloudWatch Logs and keeps a count of the total number of errors encountered."

What is the concurrency error in the AWS StartQueryExecution operation and how do I solve it?

I am currently facing an issue in a project where the S3 buckets contain on average 50 tables, and after running the Glue job I see the following error. I don't think it is an issue of memory or worker nodes.
{
"Error":"States.TaskFailed",
"Cause":"{\"AllocatedCapacity\":5,\"Arguments\":{\"--quotes_col_list\":\"Null\",\"--processed_prefix\":\"processed/cat2/uber/\",\"--replicated_prefix\":\"replicated/cat2/uber/\",\"--table_folder\":\"SALES_ORDER_DOCUMENT_TYPE/\",\"--devops_prefix\":\"uber_processing/glue_configuration/rename_glue_file/replicated/uber/\",\"--tablename\":\"sales_order_document_type\",\"--companies\":\"uber\",\"--metadata_path\":\"cat2/cat2_metadata.csv\",\"--reject_prefix\":\"reject/cat2/uber/\"},\"Attempt\":0,\"CompletedOn\":1641759367801,\"ErrorMessage\":\"TooManyRequestsException: An error occurred (TooManyRequestsException) when calling the StartQueryExecution operation: You have exceeded the limit for the number of queries you can run concurrently. Please reduce the number of concurrent queries submitted by this account. Contact customer support to request a concurrent query limit increase.\",\"ExecutionTime\":51,\"GlueVersion\":\"2.0\",\"Id\":\"jr_b8haonpeno503no0n3020
\",\"JobName\":\"uber_job\",\"JobRunState\":\"FAILED\",\"LastModifiedOn\":1641759367801,\"LogGroupName\":\"/aws-glue/jobs\",\"MaxCapacity\":5.0,\"NumberOfWorkers\":5,\"PredecessorRuns\":[],\"StartedOn\":1641759312689,\"Timeout\":2880,\"WorkerType\":\"G.1X\"}"
}
When I checked the query function, it doesn't show me any query running in the Glue job.
response = athena_client.start_query_execution(
    QueryString='msck repair table ' + args['audit_table'],
    ResultConfiguration={
        'OutputLocation': args['athena_resultpath']
    }
)
Can someone help me with QueryString='msck repair table ' + args['audit_table']? What is the argument here?
You mentioned the word "concurrency" but didn't mention exactly what the error message is:
"ErrorMessage":"TooManyRequestsException: An error occurred (TooManyRequestsException) when calling the StartQueryExecution operation: You have exceeded the limit for the number of queries you can run concurrently
Athena has some built-in soft limits; its docs also mention:
A DML or DDL query quota includes both running and queued queries. For example, if you are using the default DML quota and your total of running and queued queries exceeds 25, query 26 will result in a TooManyRequestsException error.
You are simply going over the limit, so your query fails (specifically the DML query quota, I'm assuming). These soft limits are somewhat flexible and can be increased by submitting a request via the Service Quotas console.
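If you prefer the CLI, the same request can be made through Service Quotas; a sketch, assuming you first look up the quota code for the DML query concurrency limit (the quota code and desired value below are placeholders):
# Find the quota code for the DML query concurrency limit
aws service-quotas list-service-quotas --service-code athena \
  --query "Quotas[?contains(QuotaName, 'DML')].[QuotaName,QuotaCode,Value]" --output table
# Request an increase using the quota code returned above
aws service-quotas request-service-quota-increase \
  --service-code athena \
  --quota-code L-XXXXXXXX \
  --desired-value 50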

Can page size be set with dynamodb.ScanPages?

The documentation for working with dynamodb scans, found here, makes reference to a page-size parameter for the AWS CLI.
In looking at the documentation for the Go AWS SDK, found here, there is a function ScanPages. There is an example of how to use the function, but nowhere in the documentation is there a way to specify something like page-size as the AWS CLI has. I can't determine how the paging occurs, other than assuming that if the results exceed 1 MB then that counts as a page, based on the Go documentation and the general scan documentation.
I'm also aware of the Limit value that can be set on the ScanInput, but the documentation indicates that value would function as a page size only if every item processed matched the filter expression of the scan:
The maximum number of items to evaluate (not necessarily the number of matching items)
Is there a way to set something equivalent to page-size with the go SDK?
How does pagination work in DynamoDB?
DynamoDB paginates the results from Scan operations. With pagination, the Scan results are divided into "pages" of data that are 1 MB in size (or less). An application can process the first page of results, then the second page, and so on.
So for each request, if there are more items left in the result set, you will always get a LastEvaluatedKey back. You have to re-issue the scan request with this LastEvaluatedKey to get the complete result.
For example, if a scan has 400 matching items and each request returns at most the upper limit of 100 items, you will have to keep re-issuing the scan request until LastEvaluatedKey is returned empty. You would do something like below (see the documentation).
// Re-issue the Scan, carrying LastEvaluatedKey forward as ExclusiveStartKey,
// until LastEvaluatedKey comes back empty.
input := &dynamodb.ScanInput{
    TableName: aws.String("TableName"),
    // ... copy the other parameters of the original scan request here
}
for {
    output, err := dynamoClient.Scan(input)
    if err != nil {
        // handle the error
        break
    }
    // process output.Items
    if len(output.LastEvaluatedKey) == 0 {
        break
    }
    input.ExclusiveStartKey = output.LastEvaluatedKey
}
What does page-size do in the AWS CLI?
The Scan operation scans the whole DynamoDB table and returns results according to the filter. Ordinarily, the AWS CLI handles pagination automatically: it keeps re-issuing the scan request for us, and this request/response pattern continues until the final response.
The page-size option tells DynamoDB to evaluate only page-size rows of the table at a time and apply the filter to those. If the complete table has not been scanned yet, or the result is more than 1 MB, the response includes a LastEvaluatedKey and the CLI re-issues the request.
Here is a sample request response from documentation.
aws dynamodb scan \
--table-name Movies \
--projection-expression "title" \
--filter-expression 'contains(info.genres,:gen)' \
--expression-attribute-values '{":gen":{"S":"Sci-Fi"}}' \
--page-size 100 \
--debug
b'{"Count":7,"Items":[{"title":{"S":"Monster on the Campus"}},{"title":{"S":"+1"}},
{"title":{"S":"100 Degrees Below Zero"}},{"title":{"S":"About Time"}},{"title":{"S":"After Earth"}},
{"title":{"S":"Age of Dinosaurs"}},{"title":{"S":"Cloudy with a Chance of Meatballs 2"}}],
"LastEvaluatedKey":{"year":{"N":"2013"},"title":{"S":"Curse of Chucky"}},"ScannedCount":100}'
We can clearly see that ScannedCount is 100 while Count is 7, so out of 100 items scanned only 7 matched the filter (see the documentation).
From Limit's Documentation
// The maximum number of items to evaluate (not necessarily the number of matching
// items). If DynamoDB processes the number of items up to the limit while processing
// the results, it stops the operation and returns the matching values up to
// that point, and a key in LastEvaluatedKey to apply in a subsequent operation,
// so that you can pick up where you left off.
So basically, page-size and Limit are the same thing: both cap the number of items evaluated (not matched) in a single Scan request.

Is there a way to easily get only the log entries for a specific AWS Lambda execution?

Lambda obviously tracks executions, since you can see data points in the Lambda Monitoring tab.
Lambda also saves the logs in log groups; however, Lambda execution environments are reused if invocations happen within a short interval (say 5 minutes between them), so the output from multiple executions gets written to the same log stream.
This makes logs a lot harder to follow, especially due to other limitations (the CloudWatch web console is super slow and cumbersome to navigate, and aws logs get-log-events has a 1 MB / 10k event limit per call, which makes it cumbersome to use).
Is there some way to only get Lambda log entries for a specific Lambda execution?
You can filter by the RequestId. Most loggers will include this in the log, and it is automatically included in the START, END, and REPORT entries.
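For example, with the plain CLI (the log group name below is a placeholder for your function's log group; quoting the request ID makes filter-log-events match it as a single term):
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern '"5a89df1a-bd71-43dd-b8dd-a2989ab615b1"' \
  --query 'events[].message' --output text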
My current approach is to use CloudWatch Logs Insights to query for the specific logs that I'm looking for. Here is the sample query:
fields @timestamp, @message
| filter @requestId = '5a89df1a-bd71-43dd-b8dd-a2989ab615b1'
| sort @timestamp
| limit 10000

Why does playing with the AWS DynamoDB "Hello world" produce read/write alarms?

I've started to play with DynamoDB and I've created a "dynamo-test" table with a hash PK on userId and a couple more attributes (age, name). Read and write capacity is set to 5. I use Lambda and API Gateway with Node.js. Then I manually performed several API calls through API Gateway using a payload similar to:
{
    "userId": "222",
    "name": "Test",
    "age": 34
}
I've tried to insert the same item a couple of times (which didn't produce an error but silently succeeded). I also used the DynamoDB console and browsed the inserted items several times (currently there are only 2). I haven't tracked exactly how many times I did those actions, but it was all done manually. And then, after an hour, I noticed 2 alarms in CloudWatch:
INSUFFICIENT_DATA
dynamo-test-ReadCapacityUnitsLimit-BasicAlarm
ConsumedReadCapacityUnits >= 240 for 12 minutes
No notifications
And there is a similar alarm for "...WriteCapacityLimit...". Write capacity became OK after 2 minutes, but then went back again after 10 minutes. Anyway, I'm still reading and learning how to plan and monitor these capacities, but this hello-world example scared me a bit, as if I'd exceeded my table's capacity :) Please point me in the right direction if I'm missing some fundamental part!
It's just an "INSUFFICIENT_DATA" message. It means that your table hasn't had any reads or writes in a while, so there is insufficient data available for the CloudWatch metric. This happens with the CloudWatch alarms for any DynamoDB table that isn't used very often. Nothing to worry about.
EDIT: You can now change a setting in CloudWatch alarms to ignore missing data, which will leave the alarm at its previous state instead of changing it to the "INSUFFICIENT_DATA" state.
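With the CLI, that setting is the --treat-missing-data flag on put-metric-alarm. A sketch that recreates the read-capacity alarm above with missing data treated as not breaching (the period, threshold, and names simply mirror the alarm shown in the question):
aws cloudwatch put-metric-alarm \
  --alarm-name dynamo-test-ReadCapacityUnitsLimit-BasicAlarm \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedReadCapacityUnits \
  --dimensions Name=TableName,Value=dynamo-test \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 12 \
  --threshold 240 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching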