Api Gateway Api Key immediate use upon creation giving forbidden - amazon-web-services

Application creates an api key on a per user basis, meaning the process is as follows:
Lambda function creates api key and adds to a usage plan
Api key value is returned from lambda function
Api key is then immediately used to call an Api Gateway end point
Forbidden message is returned
If I delay execution between api key creation and the http request to the api gateway end point (by around 5 seconds), then it works as intended, but less than that I get an error.
I suspect that the api key takes a few seconds to propagate to the endpoint but I can't find an AWS API method that correctly lets me know when it has done so. Has anyone come across this problem before and how did you solve it?
The best solution I have at the moment is to retry the api call on a sliding timeout until an unreasonable amount of time has passed.
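A minimal sketch of that retry-on-a-sliding-timeout idea. The function name, the parameters and the delay values are illustrative assumptions, not part of any AWS SDK; `send_request` stands for whatever code performs the HTTP call and returns its status code.

```python
import time
from typing import Callable


def call_until_authorized(
    send_request: Callable[[], int],
    max_wait_seconds: float = 30.0,
    initial_delay: float = 0.5,
    backoff_factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry while the endpoint answers 403, sleeping longer each time.

    Gives up (returning the last status) once the next sleep would push the
    total time slept past max_wait_seconds.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        status = send_request()
        if status != 403:
            return status  # anything other than Forbidden is handed back
        if waited + delay > max_wait_seconds:
            return status  # still Forbidden after the allowed wait
        sleep(delay)
        waited += delay
        delay *= backoff_factor
```

Since a single success does not guarantee that the key has propagated everywhere, it may be worth keeping the same wrapper around the next few calls as well.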

"How long should I wait after applying an AWS IAM policy before it is valid?" is not the same question, but the underlying explanation is likely similar -- it's not so much a case of the API key taking time to exist but rather taking time to propagate and become visible at every possible place where it might need to exist before being valid for any subsequent request.
If those assumptions are correct, there is no mechanism for authoritatively determining whether the key is ready for use or not, because for some period of time after the key creation request succeeds, it's in a situation arguably reminiscent of Schrödinger's cat -- the key both exists and doesn't exist -- you don't know until you try it, and (unlike the cat) even a successful test does not necessarily prove that it is fully ready for use, because of the possibility (however unlikely) of a result such as fail fail fail fail pass fail pass pass pass. Such is the characteristic behavior of many large-scale, distributed systems.
From comments:
If an API call returns the api key value then I would expect it to be able to be used instantly, or at least return only when the key has been propagated fully to the end points.
That makes sense on the surface, but it becomes problematic in implementation. What if one of the endpoints has failed, is offline for maintenance, or is in the middle of recovering from an outage and lagging... what then? Fail the request? Delay the response waiting for something statistically unlikely to impact you?
The resource cost of observing replication tends to outweigh the benefits in many cases and can destabilize the control plane of a system if a replication issue causes a sufficient backlog, and is often not implemented except in cases where it has a high value, viz. the GetChange action in Route 53 which allows you to verify the propagation of a change through the system -- and note that even in this case, the change request itself succeeds without waiting -- if you need to verify the sync state, you have to ask separately.
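For comparison, here is roughly what "asking separately" looks like for Route 53. This is a sketch that assumes a boto3 Route 53 client is passed in as `route53`; the function name, poll interval and poll count are made up for illustration. API Gateway API keys have no equivalent call that I know of.

```python
import time
from typing import Any, Callable


def wait_until_insync(
    route53: Any,
    change_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 24,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll GetChange until the change reports INSYNC or we run out of polls."""
    for attempt in range(max_polls):
        status = route53.get_change(Id=change_id)["ChangeInfo"]["Status"]
        if status == "INSYNC":
            return True
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # still PENDING, wait before asking again
    return False
```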

A lot of AWS services take time to create. Usually there is a way to detect if the job has been completed. In this case it looks like you get a forbidden response until the key is created.
I think you will have to handle this in your client.

Related

Is there a way to pass along values between retries in Lambda?

For instance, if the first try failed, can we pass a few parameters to the event object of the next retry with something like the line below?
event.somevariable = somevalue
If we want to do something like that, what are our options?
I'm not a fan of Lambda retries. They are run exactly the same as the initial call and if it failed the first time, it will fail on both of the subsequent retries. What changes?
I'm going to assume that you want to pass along a variable to track which retry is being executed and potentially make changes so that the subsequent retries do succeed - this does make sense. However, unfortunately, you need to look outside of lambda to make this happen.
DynamoDB is one commonly used method for tracking the event ID and the number of executions; however, I personally find that to be a faff.
I'd rather use Amazon SNS to ping a HTTP endpoint on failure, then re-execute my lambda function with different parameters. Just be mindful (in all cases) of idempotency. You should be able to re-execute a lambda multiple times without it causing issues or overwriting what was intended to happen.
There's no way to do that directly in AWS.
You could use the request ID as a primary key in a DynamoDB table where you store that value, and always look for those values in DynamoDB at the start of a request.
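A rough sketch of the DynamoDB approach, assuming the retries carry the same request ID (as the answer above implies) and that a boto3 `Table` resource is passed in. The table layout (`request_id` as the key, an `attempts` counter) and the `use_fallback` flag are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict


def record_attempt(table: Any, request_id: str) -> int:
    """Atomically bump the attempt counter for this request and return it."""
    response = table.update_item(
        Key={"request_id": request_id},
        UpdateExpression="ADD attempts :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["attempts"])


def make_handler(table: Any) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a Lambda handler that knows which attempt it is running."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        attempt = record_attempt(table, context.aws_request_id)
        # Hypothetical behaviour change: take a fallback path on any retry.
        return {"attempt": attempt, "use_fallback": attempt > 1}

    return handler
```

Any other state you want to carry between retries could be stored on the same item in the same way.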

How to clear an AWS Lambda cache (or force a cold start)

The short version:
If I am caching values in my lambda container, how can I clear this cache? I guess I could redeploy the lambda, which will force all new requests to initiate a new cold start, but this doesn't seem like a nice solution.
The long version:
I am writing a custom authorizer for AWS API Gateway (in Python) that does two things:
It gets an api-key from an http header and looks it up in a dynamo table to verify it is valid (and get some attributes attached to it).
It verifies a JWT token (using some of the attributes from #1).
After following some code (this code), I learnt that I can cache values "globally" that can be re-used across invocations of the lambda, great! But if I cache say, the dynamodb response when looking up the api key, what if I have to revoke / issue a new api key at some point?
I'd like to be able to ensure that my lambda cache gets wiped somehow.
Short answer: You can force a new container for each invoke by calling UpdateFunctionCode or UpdateFunctionConfiguration for the same function before exiting the execution. For example, if you keep changing the function timeout before returning the response, the next invoke will spin up a new execution environment (container/sandbox) with a cold start penalty.
The right approach: If you are caching values in function variables, you can clear them inside the handler and continue with the execution logic. This ensures you don't pay cold start penalties on subsequent invocations, and you stay in control of choosing the "right" values.
This is best illustrated with database clients. You can create the client outside the handler, but on every invoke verify that the client is still valid, and recreate it inside the handler if it is not. This will save you some processing time - as the CPU is throttled when the function hits the handler.
Since you are working with API Gateway, the cold start penalties will count towards the API's integration timeout (hard limit of 29 seconds for auth and backend combined), so I would try to avoid forcing a cold start as much as possible.
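One way to stay in control of the cached values without forcing a cold start is to give each entry a time-to-live and re-check it inside the handler. The sketch below is only an illustration: the names, the 60-second TTL and the `loader` callback (which would wrap the DynamoDB lookup) are assumptions.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Lives at module level, so it survives across invocations of a warm container.
_CACHE: Dict[str, Tuple[float, Any]] = {}
CACHE_TTL_SECONDS = 60.0  # assumed value; pick what your revocation policy allows


def get_cached(
    key: str,
    loader: Callable[[str], Any],
    now: Callable[[], float] = time.monotonic,
    ttl: float = CACHE_TTL_SECONDS,
) -> Any:
    """Return the cached value for key, reloading it once it is older than ttl."""
    entry = _CACHE.get(key)
    current = now()
    if entry is not None and current - entry[0] < ttl:
        return entry[1]
    value = loader(key)
    _CACHE[key] = (current, value)
    return value


def invalidate(key: Optional[str] = None) -> None:
    """Drop one entry, or the whole cache when no key is given."""
    if key is None:
        _CACHE.clear()
    else:
        _CACHE.pop(key, None)
```

Note that `invalidate` only clears the container it runs in; other warm containers keep their own copies, which is why the TTL is what actually bounds how long a revoked key can linger.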

AWS lambda execution fails only first time I run it with 'customer function error'

I trigger a lambda function via API gateway and everything works perfectly with the one exception that the very first time I trigger it on a given day it fails.
Strangely, the lambda function logs don't show any errors. I get my usual START log statement and then the request and context of the trigger, then after 5s, it ends unexpectedly.
When I look into the API gateway logs this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is customer function error and how may I resolve this?
Without knowing much about the code you are running, I would call this a cold start. A cold start happens on the first request after your function has not been called for a long time. Notice that the error message says "Task timed out after 5.01 seconds", which means the function hit its configured timeout (5 seconds here). You can increase the timeout.
Alternatively, you could reduce the impact of cold starts by making them shorter (reference):
author your Lambda functions in a language that doesn't incur a high cold start time, e.g. Node.js, Python, or Go
choose a higher memory setting for functions on the critical path of handling user requests (i.e. anything that the user would have to wait for a response from, including intermediate APIs)
optimize your function's dependencies and package size
You can also explore scheduling a cron job through CloudWatch that pings your API at a fixed interval.
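As one variation on the ping idea, the scheduled rule could target the function directly instead of going through the API, and the handler could return early for those events. This sketch assumes the event is a standard scheduled event (`source` of `aws.events`); the handler body and return values are placeholders.

```python
from typing import Any, Dict


def is_warmup_ping(event: Any) -> bool:
    """True for events delivered by a scheduled CloudWatch/EventBridge rule."""
    return (
        isinstance(event, dict)
        and event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    if is_warmup_ping(event):
        # Keep the container warm without running the real logic.
        return {"warmed": True}
    # Placeholder for the real request handling.
    return {"statusCode": 200, "body": "hello"}
```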
Adding to Yash's answer:
I've only seen Lambda execution failed with status 200 in API Gateway execution logs, though in case it can manifest in other ways: ensure you have execution logging enabled for the endpoint. If you didn't already have it enabled you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
In the log entry with the error grab the #logStream value and the timestamp for the event; it'll be a long string of alphanumerics like a4f8115980dc83a511eeedc493a78741
Open the log group for that endpoint's execution log -> find the log stream with the identifier you just grabbed
Narrow the date/time range to a window around the time where the event occurred
If you chose a narrow window and if it's a cold start problem: I would expect the offending request to be the first one in the list. Click the "There are older events to load. Load more." link at the top of the list.
You should now see a gap of time between the last request received and the offending request.
In my case the error says connection reset by peer which leads me to think it's behaving as though a virtual machine were put to sleep then awoken in the sense that it believes TCP connections it previously had open are still valid.
In the short term the solution we're going with is to implement a retry strategy.
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be; eg $context.error.messageString
Un-quoted fields that should be. A common idiom is to leave numeric fields un-quoted because it makes insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result then the JSON response will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the context variables, though one thing to keep in mind: the context variables that are available differ between the different API Gateway endpoint types (lambda, websocket, etc).
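The checks above can be approximated offline before deploying a log format. This is a rough sketch, not an official validator: it substitutes every `$context.*` variable with a sample value and tries to parse the result as JSON. The default sample of a bare `-` rests on the assumption that this is what an unavailable variable renders as.

```python
import json
import re

# Matches variables such as $context.status or $context.error.messageString.
_CONTEXT_VAR = re.compile(r"\$context\.[A-Za-z0-9_.]+")


def renders_valid_json(log_format: str, sample: str = "-") -> bool:
    """Substitute every context variable with sample and try to parse the result."""
    rendered = _CONTEXT_VAR.sub(lambda _match: sample, log_format)
    try:
        json.loads(rendered)
    except json.JSONDecodeError:
        return False
    return True
```

For example, `renders_valid_json('{"status":$context.status}')` fails with the default sample but passes with `sample="200"`, which is exactly the "not guaranteed to be numeric" trap described above.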

Is it possible to detect an AWS account is nearing the Lambda concurrency limit?

Lambda has some concurrency limits that when hit, cause subsequent invocations to get throttled.
This makes sense, but is it possible to detect this situation ahead of time and start applying backpressure?
The problem is that (according to the docs) the concurrency limit is per-account, which means a single runaway microservice can block ALL unrelated services.
For example: a lambda fn with an s3 event source could easily lead to API Gateway handlers being throttled and unhappy API users.
Is there any QoS for lambda functions? It'd be great to be able to give public-facing functions priority. (I know the answer is no, but I wish there were.)
Short of that, is it possible to detect that you're nearing this concurrency limit and build backpressure in?
I'm not seeing anything, and the only solution I can think of at this moment is to create a metric that watches for Throttles and as soon as one happens, toggle some flag somewhere? This adds significant complexity though...
Any ideas?
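One possible starting point, short of real QoS, is an alarm on the account-level ConcurrentExecutions metric set at a fraction of the account limit. The sketch below assumes boto3 Lambda and CloudWatch clients are passed in; the alarm name and the 80% fraction are arbitrary choices, and wiring the alarm to an action that applies backpressure is left out.

```python
from typing import Any


def create_concurrency_alarm(
    lambda_client: Any,
    cloudwatch: Any,
    fraction: float = 0.8,
    alarm_name: str = "lambda-concurrency-near-limit",
) -> float:
    """Create an alarm that fires when account concurrency nears the limit."""
    settings = lambda_client.get_account_settings()
    limit = settings["AccountLimit"]["ConcurrentExecutions"]
    threshold = limit * fraction
    cloudwatch.put_metric_alarm(
        AlarmName=alarm_name,
        Namespace="AWS/Lambda",
        MetricName="ConcurrentExecutions",  # no dimensions: account-wide metric
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
    )
    return threshold
```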

The rate of control plane requests made by this account is too high

I'm using AWS DynamoDB and it keeps giving me the following error when I try to create a DB with https://www.npmjs.org/package/dynamodb:
The rate of control plane requests made by this account is too high
Does anyone know what the reason is?
Thanks
Could you share your code that is calling the create? And does this happen every time, or only sometimes? If you can get insight into whether the CreateTable API call is failing, or a DescribeTable API call is failing, that would be helpful too. If you can log the request ids of all of the requests you're making, and share them on this post, we (the DynamoDB folks) can see if we can get more details on our side.
This error may occur when you create, update, or delete many tables at once (that is, when you issue many of these API operations simultaneously). This is easy to do in Node.js because of its non-blocking programming model. The error may also happen if you call CreateTable and then call DescribeTable at the same time or immediately afterwards (this typically doesn't happen though).
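A simple way to avoid firing many control plane calls at once is to create the tables one at a time and wait for each to become active. The question uses a Node.js package; the sketch below shows the same idea in Python, assuming a boto3 DynamoDB client is passed in and that each definition is a dict of CreateTable parameters. The function name is made up for the example.

```python
from typing import Any, Dict, Iterable, List


def create_tables_sequentially(
    dynamodb: Any, definitions: Iterable[Dict[str, Any]]
) -> List[str]:
    """Create tables one by one, waiting for each before starting the next."""
    created: List[str] = []
    waiter = dynamodb.get_waiter("table_exists")
    for definition in definitions:
        dynamodb.create_table(**definition)
        # The waiter polls DescribeTable at a modest rate until the table exists.
        waiter.wait(TableName=definition["TableName"])
        created.append(definition["TableName"])
    return created
```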