How to detect temporary network partition in Kubernetes?

We have a Kubernetes cluster set up in an AWS VPC with 10+ nodes. We encountered an incident where one node was not reachable from the others, and vice versa, for ~10 minutes. Finding this out took quite a lot of time.
Is there a tool for Kubernetes or AWS to detect these kinds of network problems? Maybe something like a DaemonSet where each pod pings the others in the network and logs it when a ping fails.

If you are mostly interested in being alerted when such a problem happens, I would set up a monitoring system and hook it up with something like Alertmanager. For collecting metrics, you can look at an open source project such as Prometheus. Once you have that set up, it is easy to integrate it with Grafana (for dashboards) and Alertmanager (for alerting based on rules you specify in Prometheus). And they are all open source projects.
https://prometheus.io/
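If you also want the DaemonSet-style ping probe from the question, here is a minimal sketch of a prober that Prometheus could scrape. It assumes the prometheus_client package is installed and that peer node addresses are injected via an environment variable; PEER_NODES and port 9102 are placeholders, not part of any existing tool.

```python
# Minimal sketch of a node-to-node reachability prober, meant to run as a
# DaemonSet pod on every node. PEER_NODES is assumed to be provided (e.g. via
# an env var populated from the node list); names and port are placeholders.
import os
import subprocess
import time

from prometheus_client import Gauge, start_http_server

PEER_NODES = os.environ.get("PEER_NODES", "").split(",")  # e.g. "10.0.1.5,10.0.2.7"
PROBE_INTERVAL_SECONDS = 15

peer_reachable = Gauge(
    "peer_node_reachable",
    "1 if the peer node answered a ping, 0 otherwise",
    ["peer"],
)

def ping(host: str) -> bool:
    """Send a single ICMP echo request and report whether it was answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    start_http_server(9102)  # Prometheus scrapes this port
    while True:
        for peer in filter(None, PEER_NODES):
            peer_reachable.labels(peer=peer).set(1 if ping(peer) else 0)
        time.sleep(PROBE_INTERVAL_SECONDS)
```

An alerting rule on peer_node_reachable staying at 0 for a few minutes would then surface exactly the kind of partition described in the question.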

Related

GKE Fluent bit partial logs

I have a K8s cluster in GCP (version 1.20.8-gke.900 from the regular update channel).
All cluster pods write logs to STDOUT or STDERR of their Docker containers.
A couple of weeks ago we found that some log entries are missing from the GCP logging console. I can see them via the kubectl tool, but it looks like they don't reach the logging bucket. For example, I can hit an API in the pod with an invalid payload to emulate an error in the logs, and sometimes this error reaches the logging bucket, sometimes it doesn't. Super weird to me...
The traffic and resource utilization in the cluster are very low.
As I understand it, the fluent bit DaemonSet is responsible for fetching logs from pods and passing them to the logging bucket. Current version of fluent bit: gke.gcr.io/fluent-bit:v1.5.7-gke.1 & gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0.
I don't see any errors in the fluent bit logs...
Could you please suggest what can be done to trace/debug/troubleshoot such a case?
Thanks!
It appears the issue is with the log volume. The managed GKE logging agent is guaranteed at least 100 KiB/s of throughput, and performance can be higher depending on other node factors.
If your workloads on a GKE node are generating significantly more than 100 KiB/s, then it's possible that the logs are not being collected due to the log volume.
If you're generating more than 100 KiB/s, there are a few workarounds:
Generate fewer logs.
Leave the node in question partially idle. This will allow fluent bit to pick up extra CPU cycles and process more logs.
Run your own instance of fluent bit with a higher resource allocation.
The underlying root cause of the 100 KiB/s limitation is that we only give fluent bit a small resource allocation, so as to leave more resources available for your workloads.
Refer to link for additional information.
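If you want to check whether a node is actually exceeding that budget, here is a rough sketch that samples how fast the container log files on a node grow. The /var/log/containers path is the usual location on GKE nodes; the 60-second window and running it directly on the node (or in a privileged pod with the host path mounted) are assumptions for illustration.

```python
# Rough sketch: estimate per-container log throughput on a node by sampling
# how much the files under /var/log/containers grow over a short window.
import glob
import os
import time

LOG_GLOB = "/var/log/containers/*.log"
WINDOW_SECONDS = 60

def snapshot() -> dict:
    """Map each log file to its current size in bytes."""
    return {path: os.path.getsize(path) for path in glob.glob(LOG_GLOB)}

before = snapshot()
time.sleep(WINDOW_SECONDS)
after = snapshot()

total = 0.0
for path, size in sorted(after.items()):
    # max(..., 0) guards against files shrinking due to log rotation.
    rate = max(size - before.get(path, 0), 0) / WINDOW_SECONDS  # bytes/s
    total += rate
    if rate > 0:
        print(f"{rate / 1024:8.1f} KiB/s  {os.path.basename(path)}")

print(f"{total / 1024:8.1f} KiB/s  total (GKE agent budget is ~100 KiB/s)")
```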

AWS EC2 Immediate Scaling Up?

I have a web service running on several EC2 boxes. Based on the Cloudwatch latency metric, I'd like to scale up additional boxes. But, given that it takes several minutes to spin up an EC2 from an AMI (with startup code to download the latest application JAR and apply OS patches), is there a way to have a "cold" server that could instantly be turned on/off?
Not by using Auto Scaling. At least not instantly, in the way you describe. You could make it much faster, however, by building your own modified AMI that already contains the JAR and the latest OS patches. These AMIs can be generated as part of your build pipeline. In that case, your only real wait time is for the OS and services to start, similar to a "cold" server.
Packer is a tool commonly used for such use cases.
Alternatively, you can manage it yourself by keeping servers stopped and starting them with custom Lambda functions triggered by CloudWatch alarms. But since stopped servers aren't exactly free either (you keep paying for their EBS volumes), I would recommend against that for cost reasons.
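As a minimal sketch of that Lambda approach, assuming standby instances are marked with a hypothetical Role=standby tag and the function is wired to a CloudWatch alarm:

```python
# Minimal sketch of a Lambda function that starts "standby" instances when a
# CloudWatch alarm fires. The tag key/value and the alarm wiring are
# assumptions for illustration, not an existing convention.
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Find stopped instances that were tagged as warm standby capacity.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Role", "Values": ["standby"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        # Starting a stopped instance is much faster than booting a fresh AMI,
        # but you still pay for its EBS volumes while it sits stopped.
        ec2.start_instances(InstanceIds=instance_ids)

    return {"started": instance_ids}
```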
Before you venture into auto scaling your infrastructure and spend the time and effort, perhaps you should do a bit of analysis on the traffic pattern day over day, week over week and month over month and see whether it's even necessary. Try answering some of these questions.
What was the highest traffic your app has ever handled? How did the servers fare under that traffic? How was the user response time?
When does your traffic ramp up or hit its peak? Some apps get traffic during business hours, others in the evening.
What is your current throughput? For example, say you can handle 1k requests/min and two EC2 hosts are averaging 20% CPU. If the requests triple to 3k requests/min, do you see around 60%-70% average CPU? That is a good indication that your app's usage is fairly predictable and can scale linearly by adding more hosts. But if you've never seen traffic burst like that, there's no point over-provisioning.
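As a rough sketch of how you might pull those numbers out of CloudWatch with boto3 (the instance ID and the two-week window here are placeholders):

```python
# Rough sketch: hourly average CPU for one instance over the last two weeks,
# to see how utilization tracks traffic. The instance ID is a placeholder;
# repeat per host or aggregate across your Auto Scaling group.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start,
    EndTime=end,
    Period=3600,          # one datapoint per hour
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), f"{point['Average']:.1f}% avg CPU")
```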
Unless you have a Zynga-like application where you can see huge amounts of traffic at once, better understanding your traffic pattern and throwing in an additional host as insurance could be more helpful. I'm making these assumptions as I don't know the nature of your business.
If you do want to auto scale anyway, one solution would be to containerize your application with Docker or create your own AMI as others have suggested. Still, it will take a few minutes to boot them up. The next option is to keep hosts on standby and add them to your load balancers using scripts (or Lambda functions) that watch metrics you define (I'm assuming your app is running behind load balancers).
Good luck.

What are best practices for health checks?

We have a REST API. Right now our /health runs a smoke test against every dependency we have (a database and a couple of microservices) and returns 200 if there are no errors.
The problem is that not all dependencies are mandatory for our application to work. So while a problem accessing the database can be critical, problems accessing some microservices will only affect a small portion of our app.
On top of that we have Amazon ELB. It doesn't seem right to flag our app as unhealthy only because one dependency is unhealthy. ELB should only try to recover the unhealthy dependency, and with that our app would be healthy again.
Which leads to the question: what should we actually check in our health check? Because it looks like we shouldn't be checking any dependency at all. On the other hand, it's actually really helpful to know the status of our app's access to all its dependencies (e.g. for troubleshooting problems), so is it common to use some other endpoint for that purpose (say /sanity or /diagnostics)?
Do not go overboard trying to check for every service, every dependency, etc. in your health check. Basically think of your health check as a Go / No Go test so that the load balancer knows if the service is running.
Load balancers will not recover failed instances. They will just take the unhealthy instance out of rotation. Auto Scaling Groups can recover failed instances by creating new instances and terminating failed ones. CloudWatch can monitor your instances, report problems and trigger actions (e.g. rebooting).
You can implement more comprehensive tests that run internally on your server and choose a reporting / recovery path. Examples might include sending an SNS notification to your email or phone, rebooting the server, etc.
Amazon has a number of services to help monitor, report and manage services. Look into CloudWatch for monitoring, SNS or SES for reporting, ASG for auto scaling, etc.
Think through what type of fault tolerance, high availability and recovery strategy you need for your service. Then implement an approach that is simple enough that the monitoring itself does not become a point of failure.
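As a minimal sketch of that split (a shallow check for the load balancer, a deep check for humans), using Flask purely as an example framework; check_database and check_payments_service are hypothetical helpers standing in for your real dependency checks:

```python
# Minimal sketch: shallow /health for the load balancer, deeper /diagnostics
# for troubleshooting. Helper functions are placeholders, not a real API.
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # Placeholder: replace with e.g. a "SELECT 1" against your database,
    # using a short timeout so the check itself cannot hang.
    return True

def check_payments_service() -> bool:
    # Placeholder: replace with a call to the dependency's own health endpoint.
    return True

@app.route("/health")
def health():
    # Go / No Go only: is this process able to serve traffic at all?
    # No dependency checks here, so one flaky microservice cannot take
    # the whole fleet out of the load balancer.
    return jsonify(status="ok"), 200

@app.route("/diagnostics")
def diagnostics():
    # Deep check for humans and dashboards; do not wire this into the ELB.
    results = {
        "database": check_database(),
        "payments_service": check_payments_service(),
    }
    # Only the truly critical dependency gates the status code.
    status = 200 if results["database"] else 503
    return jsonify(results), status

if __name__ == "__main__":
    app.run(port=8080)
```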

High-availability for Restcomm

I'm planning a high-availability setup with autoscaling for RestComm and have some general doubts about the best way to plan it.
This is what I have now:
Restcomm instance using Amazon ECS (docker), so we can launch more instances very easily.
All of them share the Amazon RDS database.
Workspace is shared and persisted between instances.
To move to the next step, I have some questions:
Amazon's load balancer isn't an option because it doesn't support UDP, so I'm considering the Telestax LB; is that correct? Is it possible to deploy it using Docker?
Move the RestComm MS outside of the RestComm Docker image so it can scale independently. RestComm provides env variables to specify the MS, so I would have an LB and several MS behind it. Correct?
How much RAM does a RestComm instance need, and how many concurrent sessions does it support? How can we find out how many concurrent sessions there are, in real time and programmatically?
Is there an "automatic scaling" mechanism implemented in RestComm? More info would be appreciated. Ubuntu Juju isn't an option for me.
We are considering Graylog2 or Logstash for log management. Any insight here? How do you install the agent in the Docker images?
The only documentation I found was this very good document: https://docs.google.com/document/d/13xlaioF065pDnQUoZgfIpi6Noh0qHfAZ7U6afcPd2Y0/edit
Is there any other doc?
Thanks in advance!
Very good questions:
Yes. See https://hub.docker.com/r/restcomm/load-balancer/
You would have one LB (better to have two, active-passive, to avoid a single point of failure) with X RestComm instances behind it, speaking to Z Media Servers behind them.
It depends on the complexity of the application on top, but here are some numbers: https://github.com/RestComm/Restcomm-Connect/wiki/Load-Testing-on-Docker
Not yet. You could potentially use Mesos or Kubernetes if Juju is not an option. We have a set of open issues for Kubernetes right now, but Mesos should be working.
You can check https://hub.docker.com/r/restcomm/graylog-restcomm/ which contains a Docker image preloaded with everything needed to poll a RestComm server and gather metrics.

Collectd on AWS

We have instances set up in an autoscaling group on AWS. We want to collect metrics in order to determine our scalability needs. As far as I know, collectd collects the stats on the same machine and puts them all in RRD files. However, in an autoscaling cluster, if another instance is spawned, and assuming the AMI it was spawned from already has collectd, how are we supposed to gather the stats of that second instance in the group? It might stay up for only five to six minutes and then go down, but we would need the metrics from before it goes down. Is there any way we can combine these metrics for the whole cluster or something similar? Or can collectd report them somewhere remote?
Found the answer. This can be done by using the client-server architecture of collectd: each instance runs collectd as a client and forwards its metrics over the network to a central collectd server, which stores them all (for example in RRD files). More details can be found here.