I'm incredibly confused about how health checks work for a Docker container running in ECS using AWS Fargate. I think what makes this confusing is that there are three core components working in tandem, each of which I've seen have its own "health check" concerns:
ECS
EC2
ALB
First, if I check the health check docs, they make it very clear that the built-in HEALTHCHECK in my Docker image won't be used. However, I've seen comments from others on SO saying that it is used, so which is it?
Concerning the health check setup for ECS, I'm not seeing any way to configure health check commands when I create a Task Definition for my ECS service via Fargate in the AWS dashboard (web interface). I'm setting up the infrastructure using the CDK in C#, but for learning purposes I look at the AWS dashboard and edit things from there. I figure I need to learn how to set things up manually before I try to automate it.
I'll mention what I do see, but I'm not sure how it all pieces together.
ECS -> Clusters -> Click cluster name -> Click service name: I see "Healthy Targets" and "Unhealthy Targets"
ECS -> Clusters -> Click cluster name -> Click service name -> Deployments and events tab: There's a log that says "service X port 80 is unhealthy in target-group Y due to (reason Health checks failed with these codes: [404])". If I click the link for Y, it takes me to "EC2 -> Target groups -> Y (fargate)", which has a "Health checks" tab. There, I can click "Edit" and specify the health check "Path". This seems to eliminate the error.
ECS -> Task definitions -> Click task def name -> Click revision name -> JSON tab: No mention of "health" anywhere in this file
From the CDK, it looks like you can set up health checks after creating ApplicationLoadBalancedFargateService, at which point you can invoke ApplicationLoadBalancedFargateService.TargetGroup.ConfigureHealthCheck(), which takes an IHealthCheck that I haven't figured out how to create yet.
Also in the CDK there is QueueProcessingFargateService (not sure how that's different from the ALB version of FargateService) that has a HealthCheck property I can initialize, whereas the ALB version does not. That just adds more confusion. I don't necessarily care about QueueProcessingFargateService itself, but it does show up in the code example for HealthCheck in the CDK docs.
All of this is very confusing. The AWS web UI is absolutely horrid and difficult to navigate. I'm seeing a lot of conflicting information on SO and in Google search results in general about how to set up health checks. Can someone please help make sense of all of this?
Concerning the health check setup for ECS, I'm not seeing any way to configure health check commands when I create a Task Definition for my ECS service via Fargate in the AWS dashboard
You would have to do that by editing the Task Definition JSON manually, instead of using the point-and-click features of the ECS web console. The ECS web console is currently missing a lot of features.
All of this is very confusing. The AWS web UI is absolutely horrid and difficult to navigate. I'm seeing a lot of conflicting information on SO and web search results in general about how to set up health checks. What can I try next?
I recommend not using the web UI at all. Use the CDK, or use Terraform, for creation of resources. Use the web UI just for looking at what was created.
As for exactly how to set up health checks, it depends on what you are trying to do. If you are using a load balancer, then the target group health checks are required. You set those up on the Load Balancer's Target Group, and you could do that through the UI, since that lives over in the EC2 web UI, which is fully featured. Target Group health checks perform a network request against the ECS task periodically and ensure that it is receiving a proper response.
If you are not using a load balancer, or if you just want extra health checks in addition to the Target Group checks, you can set up health check commands in the ECS task definition. These run a command inside the container periodically. You can't really set these up via the web UI, and even the higher-level CDK constructs probably mask this or make it less than obvious. This is an optional, advanced feature of ECS that most people don't use, and I believe you would have to drop down to lower-level CDK constructs if you were using the CDK.
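To make the two layers concrete, here is a minimal CDK sketch in Python (the question uses C#, but the construct names map over directly). The nginx image, the /health path, and the curl command are assumptions about your app; the container-level check also requires curl to be present in the image:

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from constructs import Construct


class WebStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Lower-level task definition so a container health check can be attached.
        task_def = ecs.FargateTaskDefinition(
            self, "TaskDef", cpu=256, memory_limit_mib=512
        )
        task_def.add_container(
            "web",
            image=ecs.ContainerImage.from_registry("nginx"),
            port_mappings=[ecs.PortMapping(container_port=80)],
            # ECS container health check: a command run *inside* the container
            # (requires curl to exist in the image).
            health_check=ecs.HealthCheck(
                command=["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
                interval=Duration.seconds(30),
                retries=3,
                start_period=Duration.seconds(60),
            ),
        )

        service = ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "Web", task_definition=task_def
        )

        # ALB Target Group health check: a periodic HTTP request from the load
        # balancer -- this is the one that was failing with a 404 until the
        # path was corrected in the console.
        service.target_group.configure_health_check(
            path="/health",
            healthy_http_codes="200",
            interval=Duration.seconds(30),
        )
```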
CONTEXT:
We have a platform where users can create their own projects - multiple projects per user. We need to provide them with a browser-based IDE to edit those projects.
We decided to go with code-server. For this we need to configure an auto-scalable cluster on AWS. When the user clicks "Edit Project" we will bring up a new container each time.
https://hub.docker.com/r/codercom/code-server
QUESTION:
How to pass parameters from the url query (my-site.com/edit?project=1234) into a startup script to pre-configure the workspace in a docker container when it starts?
Let's say the stack is AWS + ECS + Fargate. We could use kubernetes instead of ECS if it helps.
I don't have any experience in cluster configuration. Will appreciate any help or at least a direction where to dig further.
This can be achieved in multiple ways in AWS ECS. The basic requirements for such a system are to launch and terminate containers on the fly while persisting changes to the files. (I will focus on launching the containers.)
Using AWS SDKs:
The task can easily be achieved using the AWS SDKs with a base task definition: the SDK allows starting tasks with overrides on top of the base task definition.
E.g. if the task definition specifies 2 GB of memory, the SDK can override the memory with a parameterised value while launching a task from the task definition.
Refer to the boto3 (AWS SDK for Python) docs.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.run_task
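For illustration, here is a minimal boto3 sketch of run_task with overrides; the cluster name, task definition, network IDs, container name, and the PROJECT_ID variable are all placeholder assumptions:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # region is an assumption

response = ecs.run_task(
    cluster="my-cluster",                # placeholder
    taskDefinition="code-server-base",   # base task definition
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {
                "name": "code-server",
                # Pass the project id from the URL query into the container.
                "environment": [{"name": "PROJECT_ID", "value": "1234"}],
            }
        ],
        # Resource overrides on top of the base task definition
        # (must still be a valid Fargate CPU/memory combination).
        "cpu": "1024",
        "memory": "4096",
    },
)
task_arn = response["tasks"][0]["taskArn"]
```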
Overall Solution
Now that we know how to run custom tasks with the Python SDK on demand, the overall flow for your application is: your API calls an AWS Lambda function with the parameters, the function spins up a task, keeps checking the task status, and routes traffic to it once the status is healthy (a rough sketch of this flow follows after the note below).
Your API calls an AWS Lambda function with the parameters.
The Lambda function uses the AWS SDK to create a new task with overrides from the base task definition (assuming the base task definition already exists).
Keep checking the status of the new task in the same function call and set a flag in your database so your front end can react to it.
Once the status is healthy, add a rule in the Application Load Balancer using the AWS SDK to route traffic to the task's IP without exposing the IP address to the end client. (An AWS Application Load Balancer can get expensive; I'd advise using Nginx or HAProxy on EC2 to manage dynamic routing.)
Note:
Ensure your image is lightweight and its start-up takes less than 15 minutes, as a Lambda function cannot execute beyond that. If it does, create a microservice for launching ad-hoc containers and host it on EC2.
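A rough sketch of steps 2-4 above as a Lambda handler (all names, ARNs, and the port are assumptions; a real handler would also handle timeouts and record progress in your database for the front end):

```python
import time

import boto3

ecs = boto3.client("ecs")
elbv2 = boto3.client("elbv2")


def handler(event, context):
    """Start a task with overrides, wait for RUNNING, register its IP behind the ALB."""
    project_id = event["project_id"]

    task_arn = ecs.run_task(
        cluster="my-cluster",
        taskDefinition="code-server-base",
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "code-server",
                    "environment": [{"name": "PROJECT_ID", "value": project_id}],
                }
            ]
        },
    )["tasks"][0]["taskArn"]

    # Poll until the task reports RUNNING.
    while True:
        task = ecs.describe_tasks(cluster="my-cluster", tasks=[task_arn])["tasks"][0]
        if task["lastStatus"] == "RUNNING":
            break
        time.sleep(5)

    # Private IP of the ENI attached to the Fargate task (awsvpc networking).
    details = task["attachments"][0]["details"]
    private_ip = next(d["value"] for d in details if d["name"] == "privateIPv4Address")

    # Route traffic to it via an ip-type target group on the load balancer.
    elbv2.register_targets(
        TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/edit-1234/abc",  # placeholder
        Targets=[{"Id": private_ip, "Port": 8080}],
    )
    return {"taskArn": task_arn, "ip": private_ip}
```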
Using Terraform:
If you're looking for infrastructure provisioning, Terraform is the way to go. It has a learning curve, so I recommend it as a secondary option.
Terraform is popular for parameterising with variables, and it can be plugged in easily as a backend for an API. The flow of your application remains the same from step 1, but instead of AWS Lambda, the API calls your ad-hoc container microservice, which in turn calls the Terraform script and passes variables to it.
Refer to the Terraform docs for AWS:
https://registry.terraform.io/providers/hashicorp/aws/latest
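If you go this route, the ad-hoc container microservice can simply shell out to Terraform and pass the URL parameter through as a variable. A rough Python sketch (the ./workspace directory and the project_id variable are assumptions about your Terraform configuration):

```python
import subprocess


def provision_workspace(project_id: str) -> None:
    """Run terraform apply for one workspace, passing the project id as a variable.

    Assumes a Terraform configuration in ./workspace that declares variable "project_id".
    """
    subprocess.run(["terraform", "init", "-input=false"], cwd="workspace", check=True)
    subprocess.run(
        [
            "terraform",
            "apply",
            "-auto-approve",
            "-input=false",
            f"-var=project_id={project_id}",
        ],
        cwd="workspace",
        check=True,
    )


provision_workspace("1234")
```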
I'm hoping I can get some help with this deployment issue that I'm facing:
I have created an RDS instance and can see it is "Available" by looking at the dashboard. I then use the Elastic Beanstalk CLI to deploy my application and the deployment is successful.
However, when I access the endpoint I am getting a 502 Bad Gateway from nginx. After checking the logs I can see the following error from my Node.js app:
Error: connect ETIMEDOUT x.x.x.x:5432 (IP omitted)
As per the AWS documentation on this I have tried to assign the auto generated security group from my Elastic Beanstalk instance to my RDS instance, but I am still getting the same error.
Is there something I have misunderstood in the documentation here? I would be very grateful if anyone can point me in the right direction here.
Thank you in advance.
Managed to figure this out after a lot of trial and error. Turns out that it wasn't too tricky.
Go to your EB environment -> Configuration
Click "Edit" next to "Instances"
Note down the security group ID that is selected at the bottom
Create a new security group e.g. "my-eb-instance-rds-access"
Under "Inbound rules" select "Add rule". Choose whichever DB service you are using and it should automatically fill the port. Set source to "Custom" and then click in the search box. Select the security group that your EB instance has that you noted down earlier.
Click "Create security group"
Find your RDS instance and click "Modify"
Scroll down and find "Connectivity". Then select the security group that you just created from the drop down box.
Scroll all the way to the bottom and hit Continue. Here I found there to be two options: one that applies the changes immediately and one that waits for the next scheduled maintenance window. I'm no expert, but I selected the "immediately" option since the database is not being used in production yet, so some downtime was not a problem.
Your EB instance should now be able to connect! This worked for me even after re-deploying.
Disclaimer: I am by no means an expert. This was done purely by trial and error. If anyone has any tips or improvements I'd be happy to hear them and edit the answer.
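For anyone scripting this instead of clicking through the console, roughly the same steps can be done with boto3; all IDs and names below are placeholders, and note that modify_db_instance replaces the full list of VPC security groups on the instance:

```python
import boto3

ec2 = boto3.client("ec2")
rds = boto3.client("rds")

# Create the new security group and allow the EB instances' group in on 5432.
sg = ec2.create_security_group(
    GroupName="my-eb-instance-rds-access",
    Description="Allow EB instances to reach RDS",
    VpcId="vpc-0123456789abcdef0",  # placeholder
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,  # PostgreSQL
            "ToPort": 5432,
            # Source = the security group noted down from the EB environment.
            "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],  # placeholder
        }
    ],
)

# Attach the new group to the RDS instance and apply the change immediately.
rds.modify_db_instance(
    DBInstanceIdentifier="my-db-instance",  # placeholder
    VpcSecurityGroupIds=[sg["GroupId"]],    # replaces the existing list
    ApplyImmediately=True,
)
```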
I'm trying to deploy backend application to the AWS Fargate using cloudformation templates that I found. When I was using the docker image training/webapp I was able to successfully deploy it and access with the externalUrl from the networking stack for the app.
When I try to deploy our backend image I can see the stacks are deploying correctly, but when I try to go to the externalUrl I get 503 Service Temporarily Unavailable and I'm unable to see it... Another thing I've noticed is that on Docker Hub the image is continuously being pulled the whole time the CloudFormation services are running...
The backend is some kind of Maven project, I don't know exactly what, but I know that it works locally; however, getting the container with this backend image up and running takes about 8 minutes... I'm not sure if this affects Fargate? Any idea how to get it working?
It sounds like you need to find the actual error that you're experiencing, the 503 isn't enough information. Can you provide some other context?
I'm not familiar with Fargate but have been using ECS quite a bit this year, and I would generally find that by going to (on the dashboard) ECS -> cluster -> service -> events. The events tab gives more specific errors as to what is happening.
My ECS deployment problems generally come down to:
1. The container is not exposing the same port as is in the definition; this could be the case if you're deploying from a stack written by someone else.
2. The task definition memory/CPU restrictions don't grant enough space for the application and it has trouble placing (probably a problem with ECS more than Fargate, but you never know).
3. Your timeout in the task definition is not set to 8 minutes: see this question, it has a lot of this covered.
4. Your start command in the task definition does not work as expected with the container you're trying to deploy.
If it is pulling from Docker Hub continuously, my bet would be that it's 1, 3 or 4, and it's attempting to pull the image over and over again.
Try adding a health check grace period of 60 (seconds) by going to ECS -> cluster -> service -> update -> Network Access section.
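If you prefer to script it, the same setting can be applied with boto3 (cluster and service names are placeholders); given the ~8-minute start-up, a grace period well above 60 seconds is probably safer:

```python
import boto3

ecs = boto3.client("ecs")

# Give the task time to boot before load balancer health checks can mark it unhealthy.
ecs.update_service(
    cluster="my-cluster",               # placeholder
    service="my-backend-service",       # placeholder
    healthCheckGracePeriodSeconds=600,  # comfortably covers an 8-minute start-up
)
```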
Recently I've been testing AWS CodeDeploy to validate that it will be useful, and so far so good. But after seeing its workflow I started to wonder: "How can someone validate that the new environment is good, in a human way?"
Explaining in more detail:
On my "Deployment Group > Deployment Settings" using the Traffic Rerouting policy of "I will choose whether to reroute traffic", when the new environment boots up, the deployment pauses waiting for me to verify that everything is fine in this new environment. Then, after validation, I can push the "reroute traffic" button and it will proceed as expected.
To validate that the new environment is good I, as someone who has access to the machines, can SSH into one of them and do some tests. Or I can grab the Public DNS of one new machine and access it through the browser and verify that it is OK.
But is there a simpler way of validating the application on these new machines? Like having a load balancer that always points to the soon-to-be-new environment, which I can send to the QA people. Or will I have to, for each and every deploy, manually grab information about the new environment and then send it to the QA people?
In order to validate the new environment, you can add scripts as validation hooks in the AppSpec; they will be run after the new revision is installed on the new hosts. Also, the new environment will be registered behind any load balancer you specify in the deployment configuration.
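As an illustration, a validation hook can be as small as a health probe against the freshly installed revision. A Python sketch you could wire up as the hook script (the localhost URL, port, and /health path are assumptions about your application):

```python
#!/usr/bin/env python3
"""Minimal validation-hook-style check: exit non-zero to fail the deployment."""
import sys
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:80/health", timeout=10) as resp:
        if resp.status != 200:
            sys.exit(f"Unexpected status: {resp.status}")
except Exception as exc:
    sys.exit(f"Validation failed: {exc}")

print("New revision looks healthy")
```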
I have an Auto Scaling group that works great, with a launch configuration where I defined a user data script that is executed when a new instance launches.
The user data script updates the code base and generates a cache, which takes some seconds. But as soon as the instance is "created" (and not "ready"), Auto Scaling adds it to the load balancer.
It's a problem because while the user data script is executing, the instance does not answer with a good response (basically, 500 errors are thrown).
I would like to avoid that. Of course, I saw this documentation: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/InstallingAdd
As with a standalone EC2 instance, you have the option of configuring instances launched into an Auto Scaling group using user data. For example, you can specify a configuration script using the User data field in the AWS Management Console, or the --userdata parameter in the AWS CLI.
If you have software that can't be installed using a configuration script, or if you need to modify software manually before Auto Scaling adds the instance to the group, add a lifecycle hook to your Auto Scaling group that notifies you when the Auto Scaling group launches an instance. This hook keeps the instance in the Pending:Wait state while you install and configure the additional software.
It looks like I'm not in this case. Also, modifying the pending hook from the user data script is complicated. There must be a simple solution to my problem.
Thank you for your help!
EC2 instance user data does not use a lifecycle hook, so it will not stop a newly launched instance from being brought into service before it has finished executing.
Stopping your web server at the start of your user data script sounds a little unreliable to me, and therefore I would urge you to utilize the features AutoScaling provides that were designed to solve this very problem.
I have two suggestions:
Option 1:
Using lifecycle hooks isn't at all complicated once you read through the docs. And in your user data, you can easily use the CLI to control the hook; check this out. In fact, a hook can be controlled from any supported language or scripting language.
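For example, the last step of your user data script could complete the hook once the cache is generated. A boto3 sketch (the hook and group names are assumptions, and the metadata call assumes IMDSv1 is enabled):

```python
import urllib.request

import boto3

autoscaling = boto3.client("autoscaling")

# Instance ID from the instance metadata service (IMDSv1 assumed here).
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

# Tell Auto Scaling the instance is ready: it leaves Pending:Wait and goes InService.
autoscaling.complete_lifecycle_action(
    LifecycleHookName="launch-hook",   # assumed hook name
    AutoScalingGroupName="my-asg",     # assumed group name
    LifecycleActionResult="CONTINUE",
    InstanceId=instance_id,
)
```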
Option 2:
If manually taking care of lifecycle hooks doesn't appeal to you, then I would recommend scrapping your user data script and doing a workaround with AWS CodeDeploy. You could have CodeDeploy deploy nothing (e.g. an empty S3 folder) but use the deployment hook scripts to replace your user data script. CodeDeploy integrates with Auto Scaling seamlessly and handles lifecycle hooks automatically. A newly launched instance won't be brought into service by Auto Scaling until a deployment has succeeded. Read the docs here and here for more info.
However, I would urge you to go with option 1. Lifecycle hooks were designed to solve the very problem you have. They're powerful, robust, awesome and free. Use them.
@Brooks said the easiest way to "wait" before the ELB serves the instance is to deal with the ELB health status.
I solved my problem by shutting down the HTTP server at the start of the user data script, so the ELB can't get a green health status and does not send clients to the instance. I restart the HTTP server at the end of the script; the health status is good again, so the ELB serves it.