DynamoDB on-demand mode suddenly stops working

I have a table that is incrementally populated by a Lambda function every hour. The write capacity metric shows predictable spikes, and throttling was normally avoided by relying on burst capacity.
The first three loads after turning on on-demand mode worked fine. After that, the function stopped loading new entries into the table and began to time out (from ~10 seconds at first up to the current limit of 4 minutes). The Lambda function was not modified at all.
Does anyone know why this might be happening?
EDIT: I just see timeouts in the logs.
[Screenshots: logs before failure, logs after failure, errors and availability (%)]

Since you are using Lambda to perform the incremental writes, this issue is more than likely on the Lambda side, and that is where I would start looking. Do you have CloudWatch logs to look through? If you cannot find anything there, open a case with AWS support.
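If you'd rather scan those logs programmatically than in the console, something like this should work (a boto3 sketch; the function name "hourly-loader" is just a placeholder for your own):

import boto3

logs = boto3.client("logs")

# Lambda log groups follow the /aws/lambda/<function-name> naming scheme
resp = logs.filter_log_events(
    logGroupName="/aws/lambda/hourly-loader",   # placeholder function name
    filterPattern='"Task timed out"',           # the message Lambda logs when it hits its timeout
    limit=50,
)

for event in resp["events"]:
    print(event["timestamp"], event["message"].strip())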

Unless this was recently fixed, there is a known bug in Lambda where you can get a series of timeouts. We encountered it on a project I worked on: a lambda would just start up and sit there doing nothing, quite like yours.
So like Kirk, I'd guess the problem is with the Lambda, not DynamoDB.
At the time there was no fix. As a workaround, we had another Lambda check the one that suffered from failures and rerun the failed loads (roughly as sketched below). I'm not sure whether there are other solutions. Maybe deleting everything and setting it back up again (with your fingers crossed :))? That should be easy enough if everything is in CloudFormation.
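The watchdog idea, sketched in Python with placeholder names (not our actual code; Lambda timeouts show up in the function's Errors metric, so that is what it checks before re-invoking):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
lambda_client = boto3.client("lambda")

LOADER_NAME = "hourly-loader"  # placeholder for the real function name

def handler(event, context):
    now = datetime.datetime.utcnow()
    # Lambda timeouts are counted in the function's Errors metric
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": LOADER_NAME}],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=3600,
        Statistics=["Sum"],
    )
    if sum(p["Sum"] for p in stats["Datapoints"]) > 0:
        # fire-and-forget retry of the failed load
        lambda_client.invoke(FunctionName=LOADER_NAME, InvocationType="Event")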

Related

Driftctl ThrottlingException: Rate exceeded on AWS

I am facing a rate limit error from AWS. Any idea how to fix it? Is there an option to throttle the requests from driftctl?
ThrottlingException: Rate exceeded
status code: 400
I tried driftctl in a GitHub Action and expected it to work properly.
AWS API rate limits aren't really controllable directly and can't be increased through AWS support. However, all of the AWS SDKs do automatic backoff and retry on throttling errors. It also partly depends on how driftctl is implemented, and on how it uses the AWS clients in the SDK.
I haven't used the tool itself, but reading up on what it does, I suspect it is simply making a lot of API calls in a short period to scan all of your AWS infrastructure. I would start by configuring it not to do deep scans, and try it on a smaller Terraform state file to see if you still get the problem.
It looks like it's written in Go and probably uses the Go AWS SDK. If it uses version 2.x, then there are some standard environment variables you can set to increase the number of retries it performs, particularly AWS_MAX_ATTEMPTS, which usually defaults to 3.
https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html
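For example (a sketch; the values are only a guess at something more forgiving than the defaults), you could export the variables in the job that runs driftctl. The boto3 lines below only illustrate what the same knobs mean in SDK terms; driftctl itself (via the Go SDK) reads them from the environment:

# In the environment of the job that runs driftctl (values are illustrative):
#   export AWS_RETRY_MODE=adaptive
#   export AWS_MAX_ATTEMPTS=10
#
# The same retry knobs, expressed through boto3's config, just to show what they control:
import boto3
from botocore.config import Config

retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
client = boto3.client("ec2", config=retry_config)  # any service client works the same way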
Bear in mind that when you hit these rate limits, something is often happening that may not be desirable. It's worth turning on verbose logging for driftctl if possible, to see what AWS API calls it is actually making, and whether they are ones you would expect to see.
If you continue to get the problem, it's worth logging an issue on their GitHub project and trying to get someone who knows the code to help you debug it: https://github.com/snyk/driftctl

AWS CloudWatch rule schedule has irregular intervals (when it shouldn't)

There is an Elastic Container Service cluster running an application internally referred to as Deltaload. It checks the data in an Oracle production database and in a dev database in Amazon RDS, and loads whatever is missing into RDS. A CloudWatch rule is set up to trigger this process every hour.
Now, for some reason, every 20-30 hours there is one interval of a different length. Usually the odd interval is ~25 min, but on other occasions it is 80-90 min instead of 60. I could understand a difference of 1-2 minutes, but being off by 30 min from an hourly schedule sounds really problematic, especially given that the full run takes ~45 min. Does anyone have any ideas about what could be the reason for this, or at least how I can figure out why it happens?
The interesting part is that this glitch in the schedule either breaks or fixes the Deltaload app. What I mean is: if it runs successfully every hour for a whole day and then the 20 min interval happens, it will then crash every hour for the next day until the next glitch arrives, after which it will work again (the very same process, same container, same everything). It crashes because the connection to RDS times out. This 'day of crashes, day of runs' thing has been going on since early February.
I am not too proficient with AWS, and the Deltaload app is written in C#, which I don't know. The only thing I managed to do is increase the RDS connection timeout to 10 min, which did not fix the problem. The guy who wrote the app left the company a while ago and is unavailable, and there are no other developers on this project, as everyone got let go because of corona. So far, the best alternative I see is to just rewrite the whole thing in Python (which I know). If anyone has any other thoughts on how to understand/fix this, I'd greatly appreciate any input.
To restate my actual question: why does the CloudWatch rule fire at irregular intervals on what should be a regular schedule, and how can I prevent this from happening?
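One way to at least pin down exactly when the rule fired, rather than eyeballing the console graph, is to pull the rule's Invocations metric at fine granularity and look at the gaps. A boto3 sketch (the rule name is a placeholder):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="Invocations",
    Dimensions=[{"Name": "RuleName", "Value": "deltaload-hourly"}],  # placeholder rule name
    StartTime=now - datetime.timedelta(days=2),
    EndTime=now,
    Period=300,              # 5-minute buckets keep us under the 1440-datapoint limit
    Statistics=["Sum"],
)

# print the firing times in order so the irregular gaps stand out
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    if point["Sum"] > 0:
        print(point["Timestamp"], int(point["Sum"]))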

Is there an alternative way to know when a DynamoDB table was decreased?

I have a weird problem with one of my tables in DynamoDB. When I make a request to describe it, I find that it was decreased three times today, while in the AWS console I can see only one scale-down, which coincides with the one returned by LastDecreaseDateTime when calling describe_table(TableName="tableName") with the boto3 library.
Is there any other way to check when the other decrease actions were executed?
Also, is it possible that DynamoDB is fooling me somehow? I am a little bit lost with this, because all I can see in the metrics tab of the console is that the table was decreased just once. I have other tables configured exactly the same way and they work like a charm.
CloudTrail will record all UpdateTable API calls. Enable CloudTrail, and when this happens again you will be able to see all of the API calls.
If you have scaled down multiple times within 5 minutes, you will not see that reflected in the provisioned capacity metrics, since they have a 5-minute resolution.
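Once CloudTrail is enabled, you can also list the UpdateTable calls programmatically, which will show every capacity change rather than only the last one that LastDecreaseDateTime reports. A rough boto3 sketch:

import boto3

cloudtrail = boto3.client("cloudtrail")

resp = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "UpdateTable"}],
    MaxResults=50,
)

for event in resp["Events"]:
    # who changed the table, and when
    print(event["EventTime"], event.get("Username", "?"), event["EventName"])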

"Cold" start of S3, DynamoDB, KMS or whatever

I use Node.js AWS Lambdas. If I don't make calls to S3, DynamoDB, or KMS for some time (approx. 8 hours or more), the first call I make is usually painfully slow - up to 5 sec. There's nothing complex in the queries themselves - e.g. getting a 0.2 KB S3 object, or querying a DynamoDB table by index.
So it looks like AWS "hibernates" these resources when they aren't in active use, and when I call them for the first time after a while they spend some time returning from the "hibernated" state. This is my assumption, but I couldn't find any information about it in the docs. So my questions are the following two:
Is my assumption about "hibernation" correct?
If point 1 is correct, is there any way to mitigate these "cold" calls to AWS services, other than keeping those services "warm" by calling them every X minutes?
Edit
Just to avoid confusion: this is not about Lambda cold starts. I'm aware of them; they exist and have their own share in a function's latency. The times I measure are the exact times of the calls to S3/DynamoDB etc., after the Lambda has already started.
In all likelihood it is the Lambda function that is hibernating, not the other services:
A cold start occurs when an AWS Lambda function is invoked after not being used for an extended period of time, resulting in increased invocation latency.
https://medium.com/@lakshmanLD/resolving-cold-start%EF%B8%8F-in-aws-lambda-804512ca9b61
And yes, you could set up a CloudWatch Events rule to keep your Lambda function warm.
We have experienced the same issue for calls to SSM and DynamoDB. It's probably not these services that go into hibernation; rather, the parameters for calling them are cached on the Lambda container, which means they need to be recreated when a new container is spawned.
Unfortunately, we have not found a solution other than pinging the Lambda from time to time. In that case, you should execute a call to your services in the ping in order to see an improvement in the loading times. See also the benchmark below.
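The usual mitigation for the client-initialization part is to create the service clients once, outside the handler, so that the setup cost is paid only when a new container is spawned. A sketch in Python (the poster's functions are in Node.js, but the same pattern applies there; the bucket and key names are placeholders):

import boto3

# created once per container and reused across invocations
s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    # on a warm container the clients (and, after the first call, their HTTP
    # connections) are already initialized, so this call is fast
    obj = s3.get_object(Bucket="my-bucket", Key="config.json")  # placeholder names
    return obj["ContentLength"]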
AWS (zoewangg) acknowledged the slow startup issue in the 1.11.x Java SDK:
One of the main reasons is that 1.11.x SDK uses ApacheHttpClient under the hood and initializing it can be expensive.
Check out https://aws.amazon.com/blogs/developer/tuning-the-aws-java-sdk-2-x-to-reduce-startup-time/

How "Real-Time" DynamoDB stream is?

We are experimenting with a new serverless solution where an external provider writes to DynamoDB, a DynamoDB Stream reacts to each new write event and triggers an AWS Lambda function, which propagates the changes further down the road.
So far it works well; however, sometimes we notice that data is delayed, e.g. no updates come out of the Lambda for a few minutes.
After going through a lot of the DynamoDB Streams documentation, the only term they use is "near-real-time stream record", but what does "near real-time" generally mean? What are the possible delays we are looking at here?
In my experience, most of the time it is near real-time. However, on rare occasions you might have to wait a while (in my case, up to half an hour). I assume this was because of hardware or network issues in the AWS infrastructure.
In most cases, Lambda functions are triggered within half a second after you make an update to a small item in a Streams-enabled DynamoDB table. But event source changes, updates to the Lambda function, changing the Lambda execution role, etc. may introduce additional latency when the Lambda function is run for the first time.
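If you want to measure the delay yourself rather than guess, each stream record carries an approximate creation timestamp that you can compare against the time your Lambda actually processes it. A rough Python sketch (the timestamp is approximate, so it is only good for spotting delays of seconds or more):

import time

def handler(event, context):
    now = time.time()
    for record in event["Records"]:
        # epoch seconds, set by DynamoDB Streams when the record was written
        created = record["dynamodb"]["ApproximateCreationDateTime"]
        print(f"stream-to-lambda delay: {now - created:.1f}s")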