AWS ECS service monitoring (using multiple endpoints)

We are currently deploying multiple instances (a front-end, a back-end, a database, etc.). All are deployed and configured using a CloudFormation script so we can deploy our solution quickly. One of our central servers has connections to several other services and, for some of them, we expose very simple REST endpoints that reply with 200 or 500 depending on whether the server can reach the other service or the database (a GET on /dbConnectionStatus, for example).
We would like to call those endpoints periodically and have a view of the results. A little bit like a health check, but without restarting the instance in case of trouble, and possibly with multiple endpoints to check per service.
Is there an AWS service that can achieve that? If not what alternative do you suggest?

Amazon CloudWatch Synthetics can do what you want. By default it will just perform checks against your endpoints and log the success or failure, without triggering a redeployment or restart the way a load balancer health check would.
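For illustration, a minimal Python canary handler could look like the sketch below. The endpoint URLs are placeholders, and the exact runtime wiring (handler name, runtime version) depends on how you create the canary; the key behaviour is that Synthetics marks a run as failed when the handler raises an exception.

import urllib.request

# Hypothetical status endpoints; replace with your service's routes.
ENDPOINTS = [
    "https://example.com/dbConnectionStatus",
    "https://example.com/queueConnectionStatus",
]

def check(url):
    # urlopen raises HTTPError on 4xx/5xx, which fails the canary run.
    with urllib.request.urlopen(url, timeout=10) as resp:
        if resp.status != 200:
            raise RuntimeError(f"{url} returned {resp.status}")

def handler(event, context):
    # One canary can cover several endpoints of the same service.
    for url in ENDPOINTS:
        check(url)
    return "ok"

Each run's pass/fail result is logged and graphed in the Synthetics console, and you can attach a CloudWatch alarm to the canary's SuccessPercent metric for alerting.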

Related

Kubernetes Dashboard by Request ID - Distributed Tracing (for AWS EKS using Istio Service Mesh)

I have several applications deployed on AWS EKS as microservices.
They are also deployed across different AWS accounts and have dependencies on each other.
I would like some kind of dashboard that shows where exactly a request failed in a long flow across, say, 10 different microservices (m1 calls m2 and so on; if one request fails at m2 and another at m4, I would like to see a dashboard that shows where this flow got interrupted for each request).
How could I get such a dashboard?
Found this tool named Zipkin which provides pretty much what I am looking for.
Any alternatives available? Does ELK provide this dashboard? How about Kiali?
I am using Istio for the service mesh. Is there any dashboard that works well with Istio for distributed tracing?
To cover the scenario you mention here, first make sure you have centralized logging. I have used ELK and found it good at covering logs from multiple services, and it comes with a good dashboard view for debugging the logs.
You can use different source types for the logs across the microservices to differentiate them while debugging. Use something like a request-id which flows across all the 10 different services a request hits on its path. This makes identification easier; there are other ways to handle it too, but this lets someone new to the flow debug faster. A minimal sketch of propagating such an id is shown below.
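This is only an illustration of the pattern, not a specific library: the X-Request-ID header name is a common convention, and the framework glue is up to you.

import logging
import uuid

logging.basicConfig(
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("m2")

def handle(incoming_headers: dict) -> dict:
    # Reuse the caller's request-id, or mint one at the edge of the system.
    request_id = incoming_headers.get("X-Request-ID", str(uuid.uuid4()))
    logger.info("processing request", extra={"request_id": request_id})
    # Send the same id on every downstream call (m3, m4, ...) so Kibana
    # can correlate the log lines of one request across all services.
    return {"X-Request-ID": request_id}

handle({"X-Request-ID": "7f9c2ba4"})  # hypothetical id received from m1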
You can use Filebeat to push the logs, with their different log levels, from the log files generated at each microservice to ELK.
The Kibana dashboard is good for monitoring and comes with multiple search options, as basic as HTTP status code 500, which would directly give you all internal server errors.
To improve monitoring further, use alerts and graphs to get triggers.

Configure cloud-based vscode ide on AWS

CONTEXT:
We have a platform where users can create their own projects - multiple projects per user. We need to provide them with a browser-based IDE to edit those projects.
We decided to go with code-server. For this we need to configure an auto-scalable cluster on AWS. When the user clicks "Edit Project" we will bring up a new container each time.
https://hub.docker.com/r/codercom/code-server
QUESTION:
How to pass parameters from the url query (my-site.com/edit?project=1234) into a startup script to pre-configure the workspace in a docker container when it starts?
Let's say the stack is AWS + ECS + Fargate. We could use kubernetes instead of ECS if it helps.
I don't have any experience in cluster configuration, so I will appreciate any help, or at least a direction to dig further in.
The above can be achieved in multiple ways on AWS ECS. The basic requirement for such a system is to launch and terminate containers on the fly while persisting the changes to the files. (I will focus on launching the containers.)
Using AWS SDKs:
The task can be achieved easily using the AWS SDKs with a base task definition: the SDKs allow starting tasks with overrides on the base task definition.
E.g. if the task definition specifies 2 GB of memory, the SDK can override the memory to a parameterised value while launching a task from the task definition.
Refer to the boto3 (AWS SDK for Python) docs.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.run_task
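As a rough sketch with boto3 (the cluster name, task definition, subnet, container name, and the PROJECT_ID variable are all hypothetical):

import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="ide-cluster",                 # hypothetical cluster
    taskDefinition="code-server-base",     # hypothetical base task definition
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # placeholder
            "assignPublicIp": "ENABLED",
        }
    },
    overrides={
        "containerOverrides": [{
            "name": "code-server",         # container name from the task definition
            "memory": 4096,                # override the base definition's memory
            "environment": [
                # pass the project id from the URL query into the container
                {"name": "PROJECT_ID", "value": "1234"},
            ],
        }]
    },
)
task_arn = response["tasks"][0]["taskArn"]

The container's startup script can then read PROJECT_ID from the environment and pre-configure the workspace.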
Overall Solution
Now that we know how to run custom tasks with the Python SDK (on demand), the overall flow for your application is: your API calls an AWS Lambda function with parameters, which spins up a task, keeps checking the task status, and routes traffic to it once the status is healthy.
The API calls an AWS Lambda function with parameters.
The Lambda function, using the AWS SDK, creates a new task with overrides from the base task definition (assuming the base task definition already exists).
Keep checking the status of the new task in the same function call, and set a flag in your database so your front end can react to it.
Once the status is healthy, you can add a rule in the Application Load Balancer using the AWS SDK to route traffic to the task's IP without exposing the IP address to the end client. (An AWS Application Load Balancer can get expensive; I'd advise using Nginx or HAProxy on EC2 to manage the dynamic routing.) A sketch of the status-polling and routing steps follows below.
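A rough sketch of those steps with boto3, assuming the simple ALB target-group route (the cluster name, target group ARN, and port are placeholders):

import time
import boto3

ecs = boto3.client("ecs")
elbv2 = boto3.client("elbv2")

task_arn = "arn:aws:ecs:..."  # from the run_task call above

def wait_until_running(cluster, task_arn):
    # Poll the task until it reaches RUNNING (or give up after ~5 minutes).
    for _ in range(60):
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
        if task["lastStatus"] == "RUNNING":
            return task
        time.sleep(5)
    raise TimeoutError("task did not reach RUNNING")

task = wait_until_running("ide-cluster", task_arn)

# For awsvpc tasks, the private IP sits in the ENI attachment details.
ip = next(
    d["value"]
    for a in task["attachments"] if a["type"] == "ElasticNetworkInterface"
    for d in a["details"] if d["name"] == "privateIPv4Address"
)

# Register the task's IP with a (hypothetical) target group behind the ALB.
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/ide/abc",
    Targets=[{"Id": ip, "Port": 8080}],
)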
Note:
Ensure your image is lightweight and the startup times are under 15 minutes, as Lambda cannot execute beyond that. If they are longer, create a microservice for launching the ad-hoc containers and host it on EC2.
Using Terraform:
If you are looking for infrastructure provisioning, Terraform is the way to go. It has a learning curve, so I recommend it as a secondary option.
Terraform is popular for parameterising through variables, and it can easily be plugged in as a backend for an API. The flow of your application remains the same from step 1, but instead of AWS Lambda, the API calls your ad-hoc container microservice, which in turn runs the Terraform script, passing variables to it.
Refer to the Terraform docs for AWS:
https://registry.terraform.io/providers/hashicorp/aws/latest
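A sketch of how that microservice might shell out to Terraform (the working directory and variable names are hypothetical and must match your Terraform configuration):

import subprocess

def provision(project_id: str, memory_mb: int) -> None:
    # Run terraform apply non-interactively, passing per-request variables.
    subprocess.run(
        [
            "terraform", "apply", "-auto-approve",
            f"-var=project_id={project_id}",
            f"-var=memory_mb={memory_mb}",
        ],
        cwd="/srv/terraform/ide-task",  # hypothetical config directory
        check=True,
    )

provision("1234", 4096)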

Is there a way to load test every component of an AWS solution using Distributed Load Testing?

Is there a way to load test every component of an AWS solution using Distributed Load Testing? https://aws.amazon.com/solutions/implementations/distributed-load-testing-on-aws/
I have an AWS serverless e-commerce solution, which has a Step Function (with a few Lambda functions), an API Gateway, and RDS. I want to load test the solution at different points, e.g. load the Step Function, then load the API Gateway, and so on.
So, I've deployed https://aws.amazon.com/solutions/implementations/distributed-load-testing-on-aws/ and I am facing 2 issues:
To test the entire solution, I have the target URL for the S3 bucket that is the entry point to the solution. The problem is that the authentication key and password are cycled every week, so I have to keep updating the script with the latest key ID and password. Is there a way for me to use some other mechanism, like a Jenkins-authorised user integrated with the Distributed Load Testing (DLT) solution, or some other way to keep the entire process automated without compromising security?
Secondly, I have to load test endpoints that do not have external URLs, like the Step Function (there is an async Lambda that initiates it), and to send a payload to the Step Function through DLT I need a target URL. Is it even possible to load test in such a scenario? If yes, how? I have tried serverless-artillery, but it needs a target URL too.
Load Testing
So if I understand your question correctly, you're looking for ways to load-test your AWS setup. Well, you're using serverless technologies, which are scalable by default, so if you load-test the environment you'll most probably just reach the service limits, depending on the load you generate. All these limits are already well documented in the AWS documentation.
Load testing only makes sense (to me) when you're using EC2 instances (or Fargate) and want to know how many instances you need for a particular load, or how much time it takes for the system's scaling to kick in.
Database
To load test your RDS instance you don't have to load test all the components of your setup; you can load test it independently using JMeter or any other tool.
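For instance, a rough Python sketch of an independent database load test (this assumes PostgreSQL via psycopg2; the DSN and query are placeholders to replace with something representative of your workload):

import time
from concurrent.futures import ThreadPoolExecutor

import psycopg2  # pip install psycopg2-binary

DSN = "host=mydb.xxxx.rds.amazonaws.com dbname=shop user=loadtest password=..."  # placeholder

def one_query(_):
    start = time.monotonic()
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM orders")  # representative query
            cur.fetchone()
    finally:
        conn.close()
    return time.monotonic() - start

# 50 concurrent workers issuing 1000 queries in total.
with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = list(pool.map(one_query, range(1000)))

print(f"avg {sum(latencies)/len(latencies):.3f}s  max {max(latencies):.3f}s")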
Other Consideration
If you're going with distributed load testing then you may have to notify AWS beforehand: your distributed load might trigger DDoS-like alarms in AWS.

Dynamic Stage Routing / Multi-Cluster Setup with Fargate

I have a Fargate cluster with a service running two containers:
a container running nginx for terminating mTLS (it accepts a defined list of CAs) and forwarding calls to the app container with the DN of the client certificate
a Spring App running on tomcat which does fine-grained authorization checks (per route & HTTP method) based on the incoming DN via a filter
The endpoints from nginx are exposed to the internet via a NAT gateway.
Infrastructure is managed via Terraform, and rolling out a new version is done by replacing the task definition so it points to the new images in ECR. ECS takes care of starting the new containers and then switches the DNS to them within 5 to 10 minutes.
Problems with this setup:
I can't do canary or blue/green deployments
If the new app version has issues (app is not able to start, we have huge error spikes, ...) the rollback will take a lot of time.
I can't test my service in an integrated way without applying a new version and therefore probably breaking everything.
What I'm aiming for is some concept with multiple clusters and routing based on a specific header, so that I can spin up a new cluster with my new app version, and traffic will not be routed to this version until I either a) send a specific header or b) completely switch to the new version with, for example, a specific SSM parameter.
Basically the same thing you can do easily on CloudFront with Lambda@Edge for static front-end deployments (using multiple origin buckets and switching the origin with a Lambda based on the incoming request).
As I have the requirement for mTLS and those fine-grained authorisations, I can use neither a standard ALB nor API Gateway.
Are there any other smart solutions for my requirements?
To finally solve this, we went on to replicate the task definitions (xxx-blue and xxx-green) and ELBs, and created two different weighted A records. The deployment process:
find out which task definition is inactive by checking the weights of both records (one will have a weight of 0%)
replace the inactive definition so it points to the new images in ECR
wait for the apps to become healthy
switch the traffic via the weighted records to the ELB of the replaced task definition (a sketch of this switch follows the list)
run integration tests and verify that there are no log anomalies
(manually triggered) set the desired task count of the other task definition to zero to scale the old version down; otherwise, if there is unexpected behaviour, the records can be used to switch the traffic back to the ELB of the old task
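If the records are hosted in Route 53, the weight flip can be scripted with boto3 along these lines (the hosted zone id, record name, ELB names, and the ELB's canonical zone id are placeholders):

import boto3

r53 = boto3.client("route53")

def set_weight(set_identifier: str, weight: int, elb_dns: str, elb_zone_id: str) -> None:
    # UPSERT one half of the weighted pair; call once per colour.
    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com.",
                "Type": "A",
                "SetIdentifier": set_identifier,  # "blue" or "green"
                "Weight": weight,
                "AliasTarget": {
                    "HostedZoneId": elb_zone_id,  # the ELB's canonical zone id
                    "DNSName": elb_dns,
                    "EvaluateTargetHealth": True,
                },
            },
        }]},
    )

# Route all traffic to green, none to blue.
set_weight("green", 100, "green-elb.eu-west-1.elb.amazonaws.com.", "Z32O12XQLNTSW2")
set_weight("blue", 0, "blue-elb.eu-west-1.elb.amazonaws.com.", "Z32O12XQLNTSW2")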
What we didn't achieve with this: having client-based routing to different tasks.

Kubernetes: Get mail once deployment is done

Is there a way to get a post-deployment mail in Kubernetes on GCP/AWS?
Maintaining deployments on Kubernetes becomes harder once the deployment team grows. Having a post-deployment mail service would ease the process, as it would also say who applied the deployment.
You could try to watch deployment events using https://github.com/bitnami-labs/kubewatch and a webhook handler.
Another option is implementing a customized solution with the Kubernetes API, for instance in Python: https://github.com/kubernetes-client/python. Then run it as a separate notification pod in your cluster.
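A minimal sketch with that Python client (the SMTP relay and addresses are placeholders; the pod needs a service account allowed to list and watch deployments):

import smtplib
from email.message import EmailMessage

from kubernetes import client, config, watch  # pip install kubernetes

config.load_incluster_config()  # use config.load_kube_config() outside the cluster
apps = client.AppsV1Api()

def notify(text: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "Deployment event"
    msg["From"], msg["To"] = "k8s@example.com", "team@example.com"  # placeholders
    msg.set_content(text)
    with smtplib.SMTP("smtp.example.com") as s:  # placeholder SMTP relay
        s.send_message(msg)

w = watch.Watch()
for event in w.stream(apps.list_deployment_for_all_namespaces):
    dep = event["object"]
    st = dep.status
    # A rollout is done when all desired replicas are updated and available.
    if st.updated_replicas == dep.spec.replicas and st.available_replicas == dep.spec.replicas:
        notify(f"{event['type']}: {dep.metadata.namespace}/{dep.metadata.name} rolled out")

Note that the Deployment object itself does not record who applied it; the audit log (see the last answer) covers that.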
A third option is to manage deployments in a CI/CD pipeline where the actual deployment execution step is an "approval" type step; you can see the user who approved it, and the next step in the pipeline after approval could be the email notification.
Approval in CircleCI: https://circleci.com/docs/2.0/workflows/#holding-a-workflow-for-a-manual-approval
I don't think such a feature is built into Kubernetes.
There is a watch mechanism, though, which you could use. Run the following GET query:
https://<api-server-url>/apis/apps/v1/namespaces/<namespace>/deployments?watch=true
The connection will not close, and you'll get a "notification" about each deployment change. Check the status fields. Then you can send the mail or do something else.
You'll need to pass an authorization token to gain access to the API server. If you have kubectl set up, you can run a local proxy, which doesn't need the token: kubectl proxy.
You can attach handlers to container lifecycle events. Kubernetes supports preStop and postStart events. Kubernetes sends the postStart event immediately after the container is started. Here is a snippet of the pod manifest:
spec:
  containers:
  - name: <******>
    image: <******>
    lifecycle:
      postStart:
        exec:
          command: [********]
Considering GCP, one option could be to create a filter in Stackdriver Logging that captures the info about your deployment's finalization, and with that filter use the CREATE METRIC option, also in Stackdriver Logging.
With the metric created, use Stackdriver Monitoring to create an alert that sends e-mails. More details in the official documentation.
It looks like no one has mentioned the "native tool" Kubernetes provides for this yet.
Please note that there is a concept of Audit in Kubernetes.
It provides a security-relevant, chronological set of records documenting the sequence of activities that have affected the system, whether by individual users, administrators, or other components of the system.
Each request, at each stage of its execution, generates an event, which is then pre-processed according to a certain policy and processed by a certain backend.
That allows cluster administrator to answer the following questions:
what happened?
when did it happen?
who initiated it?
on what did it happen?
where was it observed?
from where was it initiated?
to where was it going?
An administrator can specify which events should be recorded and what data they should include with the help of audit policies.
There are a few backends that persist audit events to external storage:
Log backend, which writes events to a disk
Webhook backend, which sends events to an external API
Dynamic backend, which configures webhook backends through an AuditSink API object.
If you use the log backend, it is possible to collect the data with tools such as Fluentd. With that data you can achieve more than just a post-deployment mail in Kubernetes.
Hope that helps!