Jenkins triggered code deploy is failing at ApplicationStop step even though same deployment group via code deploy directly is running successfully - amazon-web-services

When I trigger via Jenkins (code deploy plugin), I get the following error -
No such file or directory - /opt/codedeploy-agent/deployment-root/edbe4bd2-3999-4820-b782-42d8aceb18e6/d-8C01LCBMG/deployment-archive/appspec.yml
However, if I trigger deployment into the same deployment group via code deploy directly, and specify the same zip in S3 (obtained via Jenkins trigger), this step passes.
What does this mean, and how do I find a workaround to this? I am currently working on integrating a few things and so, will need to deploy via code deploy and via Jenkins simultaneously. I will run the code deploy triggered deployment when I will need to ensure that the smaller unit is functioning well.
Update
Just mentioning another point, in case it applies. I was previously using a different codedeploy "application" and "deployment group" on the same ec2 instances, and deplying using jenkins and code deploy directly as well. In order to fix some issue (not allowing to overwrite existing files due to failed deployments, allegedly), I had deleted everything inside the /opt/codedeploy-agent/deployment-root/<directory containing deployments> directory, trying to follow what was mentioned in this answer. However, note that I deleted only items inside that directory. Thereafter, I started getting this error appspec.yml not found in deployment archive. So, then I created a new application and deployment group and since then, I am working on it.
So, another point to consider is whether I should do some further cleanup, if the jenkins triggered deployment is somehow still affected by those deletions (even though it is referring to the new application and deployment group).

As part of its process, CodeDeploy needs to reference previous deployments for Redeployments and Deployment Rollbacks operations. These references are maintained outside of the deployment archive folders. If you delete these archives manually as you indicate, then a CodeDeploy install can get fatally corrupted: the references left to previous deployments are no longer correct or consistent, and deploys will fail.
The best thing at this point is to remove the old installation completely, and re-install. This will allow the code deploy agent to work correctly again.
I have learned the hard way not to remove/modify any of the CodeDeploy install folders or files manually. Even if you change apps or deployment groups, CodeDeploy will figure it out itself, without the need for any manual cleanup.

In order to do a deployment, the bundle needs to contain a appspec.yml file, and the file needs to be put at the top directory. Seems the error message is due to the host agent can't find the appspec.yml file.

Related

How to manage automatic deployment to ECS using Terraform Cloud and CircleCI?

I have an ECS task which has 2 containers using 2 different images, both hosted in ECR. There are 2 GitHub repos for the two images (app and api), and a third repo for my IaC code (infra). I am managing my AWS infrastructure using Terraform Cloud. The ECS task definition is defined there using Cloudposse's ecs-alb-service-task, with the containers defined using ecs-container-definition. Presently I'm using latest as the image tag in the task definition defined in Terraform.
I am using CircleCI to build the Docker containers when I push changes to GitHub. I am tagging each image with latest and the variable ${CIRCLE_SHA1}. Both repos also update the task definition using the aws-ecs orb's deploy-service-update job, setting the tag used by each container image to the SHA1 (not latest). Example:
container-image-name-updates: "container=api,tag=${CIRCLE_SHA1}"
When I push code to the repo for e.g. api, a new version of the task definition is created, the service's version is updated, and the existing task is restarted using the new version. So far so good.
The problem is that when I update the infrastructure with Terraform, the service isn't behaving as I would expect. The ecs-alb-service-task has a boolean called ignore_changes_task_definition, which is true by default.
When I leave it as true, Terraform Cloud successfully creates a new version whenever I Apply changes to the task definition. (A recent example was to update environment variables.) BUT it doesn't update the version used by the service, so the service carries on using the old version. Even if I stop a task, it will respawn using the old version. I have to manually go in and use the Update flow, or push changes to one of the code repos. Then CircleCI will create yet aother version of the task definition and update the service.
If I instead set this to false, Terraform Cloud will undo the changes to the service performed by CircleCI. It will reset the task definition version to the last version it created itself!
So I have three questions:
How can I get Terraform to play nice with the task definitions created by CircleCI, while also updating the service itself if I ever change it via Terraform?
Is it a problem to be making changes to the task definition from THREE different places?
Is it a problem that the image tag is latest in Terraform (because I don't know what the SHA1 is)?
I'd really appreciate some guidance on how to properly set up this CI flow. I have found next to nothing online about how to use Terraform Cloud with CI products.
I have learned a bit more about this problem. It seems like the right solution is to use a CircleCI workflow to manage Terraform Cloud, instead of having the two services effectively competing with each other. By default Terraform Cloud will expect you to link a repo with it and it will auto-plan every time you push. But you can turn that off and use the terraform orb instead to run plan/apply via CircleCI.
You would still leave ignore_changes_task_definition set to true. Instead, you'd add another step to the workflow after the terraform/apply step has made the change. This would be aws-ecs/run-task, which should relaunch the service using the most recent task definition, which was (possibly) just created by the previous step. (See the task-definition parameter.)
I have decided that this isn't worth the effort for me, at least not at this time. The conflict between Terraform Cloud and CircleCI is annoying, but isn't that acute.

AWS CodeDeploy - deploy using the incorrect revision files

been banging my head against the wall trying to get an unbelievably simple CodeDeploy run going. The behavior I'm seeing suggests either a configuration issue or an issue local to the running agent. Basically, deployments are not using the files explicitly supplied to them - they're stuck implicitly using a prior version.
Having created an application and deployment group (and ensuring all prerequisites are in place such as the agent and roles are correctly assigned), I'm creating a deployment with via zipping up my code folder (at the root, not including the code's containing folder). There were a few issues to fix in a few of the hook steps, but I was able to fix a couple of them by changing the code, re-zipping and re-uploading before things got particularly weird. There was a syntax issue in my ApplicationStart hook script (when I finally got that far), so I fixed it and re-uploaded as before. However, the same syntax error occurred. I tried re-uploading, deleting all my S3 files and re-uploading, downloading the listed revision files and checking their contents (changes were reflected), but the same syntax error occurred. I even deleted the script and hook step completely from the yml file and it still happened, so clearly the deployment system is "stuck" in some sense. I went as far as feeding it a completely empty text file, told it it was a tar file, and it's still running my old revision. It's as though the runner agent's local files are stale and it's failing to clear local contents.
What's the deal? I feel like I've missed something fundamental.
edit - I created an entirely identical but new deployment group and re-tried the deployment with my new files and it worked. So that deployment group itself is stuck.

Cloud Function build error - failed to get OS from config file for image

I'm seeing this Cloud Build error when I try to deploy a Cloud Function:
"Step #2 - "analyzer": [31;1mERROR: [0mfailed to initialize cache: failed to create image cache: accessing cache image "us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest": failed to get OS from config file for image 'us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest'"
I'm able to build and emulate the cloud function locally, but I can't deploy it due to this error. I was able to deploy just fine until now. I've looked everywhere and I can't find any discussion about this. Anyone know what's going on here?
UPDATE: I deployed a new function 3 days ago and now I can't seem to deploy an update to it. I get the same error. I'm fairly sure this is happening due to the lifecycle rule I set up to ensure I don't keep storing images of functions: Firebase storage artifacts is huge and keeps increasing. This rule is important to keep around because I don't want to pay for unnecessary storage, but it seems like it might be the source of our problem here. Can someone from Google look into this?
I got the same error, even for code that deployed successfully before.
A workaround is to delete the Docker images for the failing Firebase functions inside Container Registry and re-deploying the functions. (The images will be re-created upon deploying.)
The error still occurs sporadically, so I suspect this may be a bug introduced in Firebase's deployment process. Thankfully for now, the workaround above resolves the issue every time the error comes up.
I also encountered the same problem, and solved it by deleting the images in the Container Registry of Firebase Project.
I made a Script at that time, and I'll put it here. The usage is as follows. Please use it if you like.
Install the Google Cloud SDK.
Download the Script
Edit CONTAINER_REGISTRY to your registry name. For example: CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1
Grant execute permission. - $ chmod +x script.sh
Execute it. - $ sh script.sh
Deploy your functions.
I'm having the same problem for the last few days and in contact with the support. I had the same log and in my case it wasn't connected to the artifacts because the artifacts rebuild themselves automatically on deploy (read below about a subtle case related to the artifacts and how to fix it), but deleting the functions and redeploying solved it for me.
Artifacts auto cleanup
Note that if the artifacts bucket is empty, then the problem is somewhere else.
But if it's not empty, what you can do to resolve any possible problems related to the artifacts auto cleanup, is to delete the whole "container" folder manually in the artifacts which should solve it. Then just redeploy again.
Make sure not to delete the artifacts bucket itself!
Dough from firebase confirmed in the question you referring to that removing the artifacts content is safe.
So, here is how to delete it:
go to the google cloud console, select your project -> storage -> browser https://console.cloud.google.com/storage/browser
Select the "artifacts" bucket
Choose "containers" and delete it
If the problem was here, it should work fine after that.
This happens because the deletion rule you refer to in your question checks the "last updated" timestamp of each file while on redeploy only some files are updated. So the next day the rule will delete some of the files while leaving the others which will lead to the inconsistent state of the bucket in this case. So you just remove everything manually.

Aws Code pipeline is failing at Deployment stage by timing out

I am trying to work my way to have a ci/cd for the Api part of the application.
I have 3 steps:
1: Source (git hub version2)
2: Build (currently has no commands)
3: Deploy(provider is code deploy(application))
Here is the screenshot of the events in code deploy.
.
While creating the Deployment Group. I chose the option of downloading the code deploy provider from the option(though it was necessary).
While setting up the code pipeline chose
Felt that was appropriate.
This code pipeline has put an object into the S3 bucket for this pipeline.
Code deploy is acting on that source artifact.
Note:
We have nothing on this Ec2 image it's just a place where we have our API.
Currently, Ec2 is empty.
What would be the proper way to implement this? How can I overcome the issues I am facing.
Without appspec.yml your deployment will fail. From docs:
An AppSpec file must be a YAML-formatted file named appspec.yml and it must be placed in the root of the directory structure of an application's source code. Otherwise, deployments fail.

How come Google Cloud Build trigger can't find custom named .yaml files?

The Problem
When using, cloudbuild.yaml files, specifically named for their build environment, such as cloudbuild-dev.yaml and cloudbuild-prod.yaml, and configured/targeted in the Trigger settings they aren't found/recognized when GCB reacts to a GitHub event (push etc).
However, it's working just fine when manually running the Trigger from GCB console.
When using an ordinarily named cloudbuild.yaml, in the root of the project, Cloud Build correctly runs the expected steps.
The Workaround
In short, there isn't an easy one (imo). But to get it run you need to use just a single _cloudbuild.yaml).
However, to effectively re-use that for both dev and prod environments one is blocked by this issue