Imagine a Django application with the following dependencies: a database, Celery (workers and beat), and RabbitMQ (as the broker for Celery). The liveness probe is more or less clear: if it fails, the pod is restarted. The readiness probe, however, I don't fully understand. The documentation says that readiness probes determine whether the container is ready to accept traffic. This raises the following questions:
For the application configuration described above (with its dependencies), what would a readiness probe look like? Would it be an endpoint that checks the availability of the database and RabbitMQ, and the health of the Celery workers?
If the readiness check is that comprehensive (availability of the database, RabbitMQ, and the Celery workers), what will Kubernetes do if, for example, RabbitMQ becomes unavailable? Since all pods talk to the same RabbitMQ, it makes no sense to shift traffic to another pod.
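For context, one common pattern for the first question is a lightweight readiness endpoint that only checks what this particular pod needs in order to serve requests. A minimal sketch in Django, assuming a URL is wired to this view and that checking the default database is enough (adding RabbitMQ or Celery checks would run straight into the concern raised in the second question):

    # views_health.py -- hypothetical module name; a minimal readiness view
    # that only verifies this pod can reach its default database.
    from django.db import connections
    from django.http import JsonResponse

    def readiness(request):
        try:
            # Cheap round trip to the default database.
            with connections["default"].cursor() as cursor:
                cursor.execute("SELECT 1")
        except Exception as exc:
            # If this view backs the readiness probe, a 503 marks the pod
            # NotReady without restarting it (unlike a liveness failure).
            return JsonResponse({"status": "unavailable", "detail": str(exc)}, status=503)
        return JsonResponse({"status": "ok"})

The readinessProbe in the pod spec would then be an httpGet check against whatever path this view is mounted on.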
The other day, we came across an issue where one of the containers in our ECS Cluster was unresponsive. The troubling part was that the instance's health checks, administered via Docker, seemed to indicate that nothing was wrong.
In addition to ECS, we use Route53 service discovery, and all of our containers use these service entries to communicate.
For reference, Docker health checks can be used by ECS to determine if a container should be replaced. We use something like the following (pseudo code):
HEALTHCHECK CMD curl -f http://localhost:3000 ... more docker params here
When the incident occurred, connections to the container were timing out, but I could see in the task logs that the health checks were completing successfully. I even tried logging into a Cloud9 instance in the VPC and connecting to the task, and got the same timeouts. Nothing else seemed out of the ordinary.
All that was required to fix the issue was stopping the container in question.
How can this be avoided? In an ideal situation, ECS should have detected that the container was inaccessible. Is there a way to health check at the container level, AKA "can the container accept connections?", rather than just at the application level?
I am trying to scale a Flask microservice in AWS ECS to handle production workloads. My application uses flask-apscheduler to handle long-running tasks. I am using the uwsgi web server for deployment in ECS, so I am packaging the application inside the container along with the uwsgi server. The nginx container runs separately on the ECS cluster.
My uwsgi config uses a single process, single thread right now.
I have successfully deployed it on AWS ECS, but I am wondering how to scale it for production workloads. I am debating between these options:
1) I can spin up multiple containers, and nginx would round-robin across all of them, distributing requests equally via the Route 53 DNS service.
2) I can increase the number of processes in the uwsgi config, but that interferes with my flask-apscheduler, as I only need one instance of it running, and the workarounds I found are not that neat.
It would be great if someone could share how to go about this.
The docker mentality is more along the lines of 'one process per task'. Any time you have more than one process running in a container, you should rethink.
I would advise the first approach. Create a service to wrap your task in ECS and simply vary the 'Desired' number of tasks for that service to scale the service up and down as desired.
If you only need the scheduler running on one of the tasks, you should set up a separate service using the same image, but with an environment variable that tells your container to start the scheduler. Make it true on the scheduler service/task and false on the worker service/tasks. Those ENV variables can be set in the container definition inside your ECS task definition.
This would be the "docker way".
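A rough sketch of that environment-variable gate on the Flask side (the RUN_SCHEDULER name and module layout here are assumptions, not part of the original setup):

    # app.py -- hypothetical layout; the same image runs as both the worker
    # service and the scheduler service, and only the latter sets RUN_SCHEDULER=true.
    import os
    from flask import Flask
    from flask_apscheduler import APScheduler

    app = Flask(__name__)

    if os.environ.get("RUN_SCHEDULER", "false").lower() == "true":
        # Started on exactly one task, so scheduled jobs run only once.
        scheduler = APScheduler()
        scheduler.init_app(app)
        scheduler.start()

    @app.route("/")
    def index():
        return "ok"

The scheduler service would then run with a desired count of 1, while the worker service scales to as many tasks as needed.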
I am new to Kubernetes and I am having trouble tracking down the exponential-backoff pattern I am observing in response times during my JMeter load tests. I have a Kubernetes service that runs between 4 and 32 pods with horizontal pod autoscaling. Each pod runs a Gunicorn WSGI server serving a Django backend. All of the different k8s services sit behind an nginx reverse proxy, which forwards incoming traffic directly to each Service's VIP. Nginx sits behind an Amazon ELB, which is exposed to end-user web traffic. The ELB ultimately times out a request after 60 seconds.
Each Gunicorn server runs one worker with 3 greenlets and has a backlog limit of 1, so it can only serve 4 requests at any given time and immediately returns an error response for any extra requests nginx tries to send its way. I am guessing that these failed requests are then being caught and retried with exponential backoff, but I can't quite make out where this is happening.
As far as I know, nginx can't be the source of the exponential retries, as it serves only one upstream endpoint per request. And I couldn't find anything in the documentation that discusses exponentially timed retries on error responses at any stage of Kubernetes routing. The k8s cluster is running version 1.9.
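For reference, the Gunicorn settings described above correspond roughly to a config like this (a sketch; the file name, the gevent worker class, and the exact option values are inferred from the description, not taken from the actual deployment):

    # gunicorn.conf.py -- approximate settings matching the setup described above.
    workers = 1                # a single worker process per pod
    worker_class = "gevent"    # greenlet-based worker (assumed from "3 greenlets")
    worker_connections = 3     # at most 3 concurrent greenlets per worker
    backlog = 1                # effectively no connection queue; excess requests fail fast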
Wikipedia says:
In a variety of computer networks, binary exponential backoff or truncated binary exponential backoff refers to an algorithm used to space out repeated retransmissions of the same block of data, often as part of network congestion avoidance.
The 'truncated' simply means that after a certain number of increases, the exponentiation stops; i.e. the retransmission timeout reaches a ceiling, and thereafter does not increase any further.
Kubernetes components do not have request retransmission capability. They just route traffic between network components, and if a packet is dropped for some reason, it is lost forever.
Istio has this kind of feature, so if you have it installed, it could be the source of the exponential backoff.
Istio is not a part of the default Kubernetes cluster distribution, so you have to install it manually to use this feature.
However, if you don't have Istio installed, retransmission of packets at the connection level can only be done by the participants in the TCP connection, which are JMeter, nginx, and your application. I assume that nginx in your configuration just forwards traffic to the backend pods and nothing more.
At the application level, retransmission is also possible, but in that case it would involve only JMeter and the backend application.
I have deployed a Redis container using Amazon ECS, behind an Application Load Balancer. It seems the health checks are failing, even though the container is running and ready to accept connections. They seem to be failing because the health check is HTTP, and Redis of course isn't an HTTP server.
# Possible SECURITY ATTACK detected. It looks like somebody is sending
POST or Host: commands to Redis. This is likely due to an attacker
attempting to use Cross Protocol Scripting to compromise your Redis
instance. Connection aborted.
Fair enough.
Classic Load Balancers, I figure, would be fine, since I can explicitly use a TCP health check. Is it feasible to use Redis with an ALB?
Change your health check protocol to HTTPS; all Amazon load balancers support this. The closer your health check is to what the user actually accesses, the better. Checking an HTML page is better than a TCP check, and checking a page that requires backend services to respond is better still. TCP will sometimes succeed even if your web server is not serving pages.
Deploy your container with nginx installed and point the health check at the port nginx is handling.
I encountered a similar problem recently: My Redis container was up and working correctly, but the # Possible SECURITY ATTACK detected message appeared in the logs once every minute. The healthcheck was curl -fs http://localhost:6379 || exit 1; this was rejected by the Redis code (search for "SECURITY ATTACK").
My solution was to use a non-CURL healthcheck: redis-cli ping || exit 1 (taken from this post). The healthcheck status shows "healthy", and the logs are clean.
I know the solution above will not be sufficient for all parties, but hopefully it is useful in forming your own solution.
We are planning to run a service on ECS which isn't a web server, but a (Node.js-based) background daemon that we are going to use for processing asynchronous tasks. I wanted to add a health check to it so that the task is restarted in case the daemon dies or gets killed. The target group health checks only support the HTTP and HTTPS protocols and are probably not meant for this purpose. Any insights into how I can monitor non-web-based services on ECS and ensure they are always up and running?