How to avoid downtime when updating a stage variable on an API Gateway deployment?

I have an API example_api currently deployed on stage DEV in AWS API gateway.
I want to update one of its stage variables and make sure the change is deployed. The API is provisioned by CloudFormation and the stage variables are mapped to template parameters.
I update the stack with boto3 and CloudFormation (using the UsePreviousTemplate flag) and provide the new value.
I then use boto3 to call create_deployment for example_api on DEV (to update already deployed example_api on DEV).
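Roughly, the two calls look like this (a sketch with placeholder stack, parameter and API names; the real values differ):

    import boto3

    cfn = boto3.client("cloudformation")
    apigw = boto3.client("apigateway")

    # Update the stack, reusing the previous template and changing only the
    # parameter that is mapped to the stage variable (placeholder names).
    cfn.update_stack(
        StackName="example-api-stack",
        UsePreviousTemplate=True,
        Parameters=[{"ParameterKey": "MyStageVariable", "ParameterValue": "new-value"}],
    )
    cfn.get_waiter("stack_update_complete").wait(StackName="example-api-stack")

    # Redeploy the API to the DEV stage so the updated stage variable takes effect.
    apigw.create_deployment(restApiId="abc123", stageName="DEV")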
At this point, my API becomes unavailable for around 15-20 seconds. I keep receiving {"message":"Missing Authentication Token"} responses.
I guess I am doing something wrong here. How do I avoid such downtime and make sure the updated API is available as soon as possible?
Note: my API is accessed through a custom domain name in API gateway. The base path is mapped to the DEV stage.
Thanks

The problem was that the CloudFormation template had created the stage via the StageDescription property of the Deployment resource, and I had not understood the deployment/stage relationship properly.
The DEV Stage resource was initially bound to the original Deployment (Named000).
My first update_stack call did update the stage variables, but it also rebound the DEV stage to that initial deployment (Named000), losing any changes applied since then (such as new routes).
I was able to update stage variables and deploy with no downtime by defining a Deployment resource whose logical name has a timestamp appended, so that a new Deployment is created every time the template is generated with Troposphere. Updating the stack with new stage variables then keeps the stage bound to the latest deployment and avoids introducing downtime.
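For illustration, a rough Troposphere sketch of that setup (the resource and parameter names here are made up, not my exact template):

    import time
    from troposphere import Ref, Template
    from troposphere.apigateway import Deployment, Stage

    template = Template()

    # New logical ID on every generation, so CloudFormation creates a fresh
    # Deployment instead of rebinding the stage to the original one.
    deployment = template.add_resource(Deployment(
        "Deployment%d" % int(time.time()),
        RestApiId=Ref("ExampleApi"),        # illustrative API resource name
    ))

    # The stage is its own resource (no StageDescription), always pointing
    # at the Deployment generated last.
    template.add_resource(Stage(
        "DevStage",
        StageName="DEV",
        RestApiId=Ref("ExampleApi"),
        DeploymentId=Ref(deployment),
        Variables={"my_variable": Ref("MyVariableParam")},  # mapped to a template parameter
    ))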

Terraform has detected that the configuration specified for the backend has changed

TL;DR: I have accidentally deployed via terraform to multiple regions using back-end state management in multiple regions. Now I want to clean it all up with terraform. Is this possible? How should I approach the problem?
A few months ago I created a solution which I pushed to AWS via terraform, using back-end state management with S3 and dynamo table lock. It deployed successfully, but upon returning to it recently I have discovered that I had apparently changed both the terraform init back-end parameters and the provider region values between deployments.
What I believe I was left with, back then, was two separate deployments - one in one region and one in another. The problem, now, is that I'm not sure which region's state is used to manage which region's resources.
My documented terraform init is set up to use us-east-1 to manage back-end state. Looking at the versioning of the .tf files, I can see that at some point I had resources deployed to eu-central-1. I don't know if I have erroneously deployed to one region while managing state in another, but I suspect so.
In an attempt to destroy the eu-central-1 resources, I ran the following init locally; the output that followed is shown beneath it.
> terraform init -backend-config="bucket=app-us-east-1-lambda-state" -backend-config="key=modules/app-lambda-function/terraform.tfstate" -backend-config="region=us-east-1" -backend-config="dynamodb_table=app-us-east-1-lambda-lock"
Initializing modules...
Initializing the backend...
Backend configuration changed!
Terraform has detected that the configuration specified for the backend
has changed. Terraform will now check for existing state in the backends.
Acquiring state lock. This may take a few moments...
Acquiring state lock. This may take a few moments...
Do you want to copy existing state to the new backend?
Pre-existing state was found while migrating the previous "s3" backend to the
newly configured "s3" backend. An existing non-empty state already exists in
the new backend. The two states have been saved to temporary files that will be
removed after responding to this query.
Previous (type "s3"): C:\Users\USER\AppData\Local\Temp\terraform123456798\1-s3.tfstate
New (type "s3"): C:\Users\USER\AppData\Local\Temp\terraform123456789\2-s3.tfstate
Do you want to overwrite the state in the new backend with the previous state?
Enter "yes" to copy and "no" to start with the existing state in the newly
configured "s3" backend.
Enter a value:
Now, unfortunately (I suspect), at this point I typed no, hit return and the following is what happened...
Enter a value: no
Releasing state lock. This may take a few moments...
Releasing state lock. This may take a few moments...
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing provider plugins...
- Reusing previous version of hashicorp/archive from the dependency lock file
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using previously-installed hashicorp/archive v2.1.0
- Using previously-installed hashicorp/aws v3.30.0
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
At this point I tried to destroy the resources but this failed for various reasons. I have trimmed most of them, but what you see is a good example of them all...
> terraform destroy --auto-approve --var-file=../../local.tfvars
Acquiring state lock. This may take a few moments...
Error: error deleting CloudWatch Events Rule (my-app-rule): ValidationException: Rule can't be deleted since it has targets.
status code: 400, request id: 1234567-1234-1234-1234-12345679012
So what I believe I have done in the past is:
Deployed with both back-end management and the provider set to eu-central-1
Changed provider to us-east-1 and re-deployed
Changed back-end management to us-east-1 and re-deployed
More recently, changed the back-end management back to eu-central-1 and attempted a destroy
Now, I understand that this is all on my personal account and I can manually destroy all the resources using the console. However, I would like to understand what I should have done when I realised that (months ago) I had been repeatedly deploying while also changing the back-end and provider regions.
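For what it's worth, the way I am now trying to work out which state file tracks which resources is to pull the candidate state files and list what they contain. A rough boto3 sketch (the us-east-1 bucket and key come from the init above; the eu-central-1 bucket name is a guess and would need to be replaced):

    import json
    import boto3
    from botocore.exceptions import ClientError

    # Candidate backends: the documented us-east-1 one plus a guessed
    # eu-central-1 equivalent (hypothetical name).
    backends = [
        ("us-east-1", "app-us-east-1-lambda-state"),
        ("eu-central-1", "app-eu-central-1-lambda-state"),
    ]
    key = "modules/app-lambda-function/terraform.tfstate"

    for region, bucket in backends:
        s3 = boto3.client("s3", region_name=region)
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError as err:
            print(f"{bucket}: {err}")
            continue
        state = json.loads(body)
        print(f"{bucket} (terraform_version {state.get('terraform_version')}):")
        for resource in state.get("resources", []):
            print("  ", resource.get("type"), resource.get("name"))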

How to update an AWS Fargate service outside AWS CodeDeploy in order to change the desired task count

When setting up AWS CodeDeploy to deploy an ECS (Fargate) service, we have to provide two target groups, let's say
TargetGroupBlue and TargetGroupGreen.
In the CloudFormation template we use TargetGroupBlue when linking the service to the load balancer.
TargetGroupGreen is created only to be used by CodeDeploy during a deployment.
Step 1: We execute the create-stack command to create the service and load balancer. We now have a working service, with traffic routed via TargetGroupBlue.
Step 2: We then use CodeDeploy to do another deployment, which swaps the target group to TargetGroupGreen once done.
Step 3: Now we need to update the desired task count of the service, so we use the CloudFormation update-stack command. This fails because the active target group is now TargetGroupGreen (CodeDeploy changed it in step 2), while our CloudFormation template links the service to the load balancer via TargetGroupBlue.
A workaround could be to do all service-related updates outside CodeDeploy only after an even-numbered release (i.e. always run CodeDeploy twice, so that we know traffic is routed via TargetGroupBlue again).
Is this really how we are supposed to handle service updates with CloudFormation and CodeDeploy?
Please help to get this figured out.
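For context, the only change we actually need to make outside CodeDeploy is the desired count, which on its own is a tiny call (a boto3 sketch with placeholder cluster/service names):

    import boto3

    ecs = boto3.client("ecs")

    # Change only the desired task count; note that CloudFormation's view of
    # DesiredCount will drift from the real value after this.
    ecs.update_service(
        cluster="my-cluster",
        service="my-service",
        desiredCount=4,
    )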
Even though AWS provides many cool ways to do blue/green deploys with CodeDeploy or CloudFormation, this particular combination really sucks.
The workaround they suggested was to use a Custom Resource in CloudFormation, which triggers a Lambda function to update the service directly, effectively bypassing the CloudFormation stack update. They point to a sample for this.
But there are no complete samples showing how to do that, so it takes a lot of time to get it working the way you need.
Furthermore, CloudFormation with deployment hooks does not really work for bigger projects, as the load balancers cannot be shared.
So here is the open ticket; please give it a thumbs up so that AWS will prioritize this on their roadmap.
https://github.com/aws-cloudformation/aws-cloudformation-coverage-roadmap/issues/483
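For reference, the custom-resource route mentioned above boils down to a small Lambda that performs the service update itself and reports back to CloudFormation. A minimal sketch (this is not the AWS sample, and the resource property names are made up for illustration):

    import boto3
    import cfnresponse  # provided for inline (ZipFile) Lambda code in CloudFormation

    ecs = boto3.client("ecs")

    def handler(event, context):
        props = event.get("ResourceProperties", {})
        try:
            if event["RequestType"] in ("Create", "Update"):
                # Update the live service directly, sidestepping the stack's
                # TargetGroupBlue/TargetGroupGreen association.
                ecs.update_service(
                    cluster=props["Cluster"],
                    service=props["Service"],
                    desiredCount=int(props["DesiredCount"]),
                )
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        except Exception as err:
            cfnresponse.send(event, context, cfnresponse.FAILED, {"Error": str(err)})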

CloudWatch alert when a new deployment is made in AWS

There are some issues in a company I'm working for. Basically the dev team is pushing new deploys to the API Gateway before consulting with the security guy.
This means the security person only notices that a new endpoint was released once security issues start to arise.
I was wondering if there is any simple way of creating an alert that pops up in AWS CloudWatch when a new deployment is created. If I recall correctly, these are called "alarms".
I have looked a bit into alarms but they seem to be based on metrics and I was not able to find a metric that shows a new endpoint being created on deploy.
This is certainly not the best approach to the problem, but it should work for now until the deploy process is changed.
I was thinking you could come up with a script that runs aws cloudformation list-stacks and checks whether the output contains more stacks than last time. But this method will only work for new stacks, not for modifications to existing stacks.
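A rough boto3 sketch of that idea (the local file used to remember the previous count is arbitrary, and as noted this only catches new stacks):

    import boto3
    from pathlib import Path

    cfn = boto3.client("cloudformation")

    # Count stacks that currently exist (ignore deleted ones).
    paginator = cfn.get_paginator("list_stacks")
    statuses = ["CREATE_COMPLETE", "UPDATE_COMPLETE", "UPDATE_ROLLBACK_COMPLETE"]
    count = sum(
        len(page["StackSummaries"])
        for page in paginator.paginate(StackStatusFilter=statuses)
    )

    # Compare against the count recorded on the previous run.
    state_file = Path("last_stack_count.txt")
    previous = int(state_file.read_text()) if state_file.exists() else count
    if count > previous:
        print(f"New stack(s) detected: {count - previous}")
    state_file.write_text(str(count))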

Serverless keeps trying to create a user pool domain when it already exists

I have an AWS Cognito user pool configured in my serverless.yml. Whenever I do a serverless deploy, it tries to create the same user pool domain even though it already exists, returning the error:
[aws-cognito-idp-userpool] domain already exist
The only workaround is for me to delete the user pool domain in the AWS console every time I want to do a serverless deploy. Has anyone faced this issue before?
I believe there's no way to skip it.
Check this: https://github.com/serverless/serverless/issues/3183
You can try breaking the serverless.yml file into multiple files and deploying them separately for easier management, so that each file only creates/deploys the resources you need to create freshly.
The serverless.yml gets converted into the vendor-specific infrastructure-as-code format, e.g. CloudFormation for AWS.
Hope this helps.
This is actually a CloudFormation issue rather than a Serverless issue. I ran into it in my Serverless app, but had my UserPool* resources independently defined in the resources section of the serverless.yml file. I changed the domain prefix, and that requires the resource to be recreated. Here's the issue: CloudFormation always creates the new resource before deleting the old one, which blocks the new domain from being associated with the user pool.
I've seen this behavior with other resources and the recommended behavior is to:
1. Blank out the resource from the template
2. Update the stack (deletes resource)
3. Restore the resource in template
4. Update the stack (creates a new resource rather than replacing it in place).
This way you still leverage your automation tools without going to the console. It is not perfect, and it would be preferable if there were a way to force the replacement order in CloudFormation. If your setup has Serverless generating the resource, then deleting via the console may be your only option.
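If you do end up deleting the domain before each deploy, that manual step can at least be scripted rather than done in the console (a boto3 sketch; the domain prefix and user pool ID are placeholders):

    import boto3

    idp = boto3.client("cognito-idp")

    # Placeholder values; use the real domain prefix and user pool ID.
    idp.delete_user_pool_domain(
        Domain="my-app-auth",
        UserPoolId="eu-west-1_XXXXXXXXX",
    )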

Unable to make Canary Deployment with API Gateway and Lambda

I am trying to use the Canary deployment option of API Gateway, but I am not able to get it working. It looks like all the configuration is done properly, but when I make calls I only get responses from the latest code version. To sum up, this is what I am doing:
I have an API Gateway stage called dev, in which I have a Lambda already deployed. I have added to my base code an endpoint that returns the version of the code currently running.
I enable the Canary deployment option in the target stage (dev) in the API Gateway Console.
I make changes in the code and update the version number in the previously created endpoint.
I make a new deploy (Lambda) with the expected Canary settings. In my case, I'm using a percentTraffic of 50%.
Everything looks fine, even the percent is changed automatically in the Canary tab in the API Gateway Console. But once I start making calls to my endpoint, I only get the latest version. So it looks like I am missing something, but I don't know what.
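For reference, the deployment call I am making is roughly this (a sketch; the IDs are placeholders and the real call lives in my deploy scripts):

    import boto3

    apigw = boto3.client("apigateway")

    # Placeholder REST API id; percentTraffic matches the 50% mentioned above.
    apigw.create_deployment(
        restApiId="abc123",
        stageName="dev",
        canarySettings={
            "percentTraffic": 50.0,
            "useStageCache": False,
        },
    )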
Any ideas? :)