GKE Fluent bit partial logs - google-cloud-platform

I have a K8S cluster in GCP (version 1.20.8-gke.900 from the regular update channel).
All cluster pods write logs to STDOUT or STDERR of their Docker containers.
A couple of weeks ago we found that some log entries are missing from the GCP logging console. I can see them via the kubectl tool, but it looks like they don't reach the logging bucket. For example, I can hit an API in a pod with an invalid payload to produce an error in the logs, and sometimes this error reaches the logging bucket, sometimes not. Super weird to me...
The traffic and resource utilization in the cluster are super low.
As I understand it, the fluent bit DaemonSet is responsible for fetching logs from pods and passing them to the logging bucket. Current versions: gke.gcr.io/fluent-bit:v1.5.7-gke.1 and gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0.
I don't see any errors in the fluent bit logs...
Could you please suggest what can be done to trace/debug/troubleshoot such a case?
Thanks!

It appears the issue is with the log volume. The managed GKE logging agent is guaranteed at least 100 KiB/s of throughput, and performance can be higher depending on other node factors.
If your workloads on a GKE node are generating significantly more than 100 KiB/s, then it's possible that the logs are not being collected due to the log volume.
If you're generating more than 100 KiB/s, there are a few workarounds:
Generate fewer logs.
Leave the node in question partially idle. This will allow fluent bit to pick up extra CPU cycles and process more logs.
Run your own instance of fluent bit with a higher resource allocation.
The underlying root cause of the 100 KiB/s limitation is that we only give a small resource allocation to fluent bit so as to leave more resources available for your workloads.
Refer to link for additional information.
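If you want to confirm that log volume is the culprit, here is a rough sketch; the fluentbit-gke DaemonSet name in kube-system is an assumption and may differ by GKE version:
# Inspect the managed agent's CPU/memory allocation (DaemonSet name is an assumption)
kubectl -n kube-system get daemonset fluentbit-gke -o jsonpath='{.spec.template.spec.containers[*].resources}'
# Rough per-pod log throughput: bytes logged in the last 60s; divide by 60 and compare with ~100 KiB/s
kubectl logs <pod-name> --since=60s | wc -c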

Related

Is there data for AWS spot interruption rate over time?

We are running an EMR cluster with spot instances as task nodes. The EMR cluster is executing Spark jobs which sometimes run for several hours. Interruptions of spot instances can cause the failure of the Spark job, which then requires us to restart the job entirely.
I can see that there is some basic information on the "Frequency of interruption" on AWS Spot Advisor. However, this data seems to be very generic: I can't see historic trends, and I'm also missing the probability of interruption based on how long the spot instance has been running (which should have a significant impact on the probability of interruption).
Is this data available somewhere? Or are there other data points that can be used as proxy?
I found this GitHub issue, which provides a link to this JSON file in the Spot Advisor S3 bucket that includes interruption rates.
https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json
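For anyone who wants to poke at that file, here is a minimal sketch with curl and jq; the exact key layout (spot_advisor, region, OS, instance type) is an assumption, so check the top-level keys first:
# List the top-level keys of the Spot Advisor data
curl -s https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json | jq 'keys'
# Example lookup for one region/OS/instance type (field names are assumptions)
curl -s https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json | jq '.spot_advisor["us-east-1"]["Linux"]["m5.xlarge"]'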

Daemon service in ECS throwing no space left on device

Recently, after a production deployment, our primary service was not reaching steady state. On analysis we found that the filebeat service running as a daemon service was unsteady. The stopped tasks were throwing "no space left on the device". Also, the CPU and memory utilization for filebeat was higher than for the primary service.
A large amount of log files were being stored as part of the release. After reverting the change, the service came back to steady state.
Why did filebeat become unsteady? If memory was an issue, then why didn't the primary service also throw the "no space" error, since both filebeat and the primary service run on the same EC2 instance?
Sounds like your primary disk is full. Check (assuming Linux):
df -h
Better still, install the AWS CloudWatch Agent on your EC2 instance to get additional metrics, such as disk space usage, reported into CloudWatch to help you get to the bottom of these things.
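If it does turn out to be container logs filling the disk, a quick sketch to find the biggest offenders (the path is the Docker default data root and may differ on your AMI):
# List the largest per-container directories, which include the JSON log files
sudo du -sh /var/lib/docker/containers/* | sort -h | tail -n 10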

gke-resource-quotas applied on clusters with 10+ nodes

The GKE documentation about resource quotas says that those hard limits are only applied for clusters with 10 or fewer nodes.
Even though we have more than 10 nodes, this quota has been created and cannot be deleted.
Is this a bug on GKE side or intentional and the documentation is invalid?
I experienced a really strange error today using GKE. Our hosted gitlab-runner stopped running new jobs, and the message was:
pods "xxxx" is forbidden: exceeded quota: gke-resource-quotas, requested: pods=1, used: pods=1500, limited: pods=1500
So the quota resource is non-editable (as the documentation says). The problem, however, was that there were just 5 pods running, not 1500. So it may be a Kubernetes bug in the way it calculated the node count, not sure.
After upgrading the control plane and nodes, the error didn't go away, and I didn't know how to reset the counter.
What did work for me was to simply delete this resource quota. I was surprised that it was even allowed /shrug.
kubectl delete resourcequota gke-resource-quotas -n gitlab-runner
After that, the same resource quota was recreated, and the pods were able to run again.
The "gke-resource-quotas" protects the control plane from being accidentally overloaded by the applications deployed in the cluster that creates excessive amount of kubernetes resources. GKE automatically installs an open source kubernetes ResourceQuota object called ‘gke-resource-quotas’ in each namespace of the cluster. You can get more information about the object by using this command [kubectl get resourcequota gke-resource-quotas -o yaml -n kube-system].
Currently, GKE resource quotas include four kubernetes resources, the number of pods, services, jobs, and ingresses. Their limits are calculated based on the cluster size and other factors. GKE resource quotas are immutable, no change can be made to them either through API or kubectl. The resource name “gke-resource-quotas” is reserved, if you create a ResourceQuota with the same name, it will be overwritten.
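A quick way to sanity-check the quota counter against reality in an affected namespace (using the gitlab-runner namespace from the question as an example):
# Compare the quota's "used" pod count with what actually exists in the namespace
kubectl describe resourcequota gke-resource-quotas -n gitlab-runner
kubectl get pods -n gitlab-runner --no-headers | wc -l
# Note: the quota counts non-terminal pods, so Completed/Failed pods are excluded from "used"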

Reduce Cloud Run on GKE costs

It would be great if I could have answers to the following questions on Google Cloud Run.
If I create a cluster with resources upwards of 1 vCPU, will those extra vCPUs be utilized in my Cloud Run service, or is it always capped at 1 vCPU irrespective of my cluster configuration? In the docs here, this line has me confused: "Cloud Run allocates 1 vCPU per container instance, and this cannot be changed." I know this holds for managed Cloud Run, but does it also hold for Cloud Run on GKE?
If the resources specified for the cluster actually get utilized (say, I create a node pool of 2 nodes of n1-standard-4 with 15 GB memory), then why am I asked to choose memory again when creating/deploying to Cloud Run on GKE? What is its significance?
The memory allocated dropdown
If Cloud Run autoscales from 0 to N according to traffic, why can't I set the number of nodes in my cluster to 0 (I tried and started seeing error messages about unscheduled pods)?
I followed the docs on custom mapping and set it up. Can I restrict the requests that a container instance handles by the domain name or IP they come from (even if it's only artificially set up by specifying a Host header, as in the Run docs)?
curl -v -H "Host: hello.default.example.com" YOUR-IP
So that I don't incur charges if I get HTTP requests from anywhere but my verified domain?
Any help will be very much appreciated. Thank you.
1: The Cloud Run managed platform always allocates 1 vCPU per revision. On GKE that is also the default, but only on GKE you can override it with the --cpu param (see the deploy sketch after this list).
https://cloud.google.com/sdk/gcloud/reference/beta/run/deploy#--cpu
2: Can you clarify what is being asked, and when performing which operation?
3: Cloud Run is built on top of Kubernetes thanks to Knative. Cloud Run is in charge of scaling pods up and down based on traffic; Kubernetes is in charge of scaling pods and nodes based on CPU and memory usage. The mechanisms aren't the same. Moreover, node scaling is "slow" and can't keep up with spiky traffic. Finally, something has to run on your cluster to listen for incoming requests and serve/scale your pods correctly. That something has to run on a cluster with more than 0 nodes.
4: Cloud Run doesn't allow you to configure this. I think Knative can't either. But you can deploy an ESP (Extensible Service Proxy) in front to route requests to a specific Cloud Run service. This way, you split the traffic upstream and send it to different services, which then scale independently. Each service can have its own max scale param and concurrency param, and ESP can implement rate limiting.
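For point 1, a minimal deploy sketch showing the GKE-only CPU override; the service, image, cluster, and zone names are placeholders, and these flags live in the beta track:
# Deploy to Cloud Run on GKE with 2 vCPU per container instance (GKE-only override)
gcloud beta run deploy my-service \
  --image=gcr.io/my-project/my-image \
  --platform=gke \
  --cluster=my-cluster \
  --cluster-location=us-central1-a \
  --cpu=2 \
  --memory=512Mi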

How to detect temporary network partition in Kubernetes?

We have a Kubernetes cluster set up in an AWS VPC with 10+ nodes. We encountered an incident where one node was not accessible to the others, and vice versa, for ~10 minutes. Finding this out took quite a lot of time.
Is there a tool for Kubernetes or AWS to detect these kinds of network problems? Maybe something like a DaemonSet where each pod pings the others in the network and logs it when the ping fails.
If you are mostly interested in being alerted when such a problem happens, I would set up a monitoring system and hook it up with something like Alertmanager. For collecting metrics, you can look at an open source project such as Prometheus. Once that is set up, it is really easy to integrate with Grafana (for dashboards) and Alertmanager (for alerting based on rules you specify in Prometheus). They are all open source projects.
https://prometheus.io/
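If you also want an ad-hoc way to spot-check node-to-node reachability (as opposed to the continuous monitoring above), here is a rough sketch, assuming ICMP is allowed between nodes and busybox can be pulled; note it only tests from whichever node the throwaway pod lands on:
# Ping every node's InternalIP from a short-lived busybox pod; unreachable IPs get flagged
for ip in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  kubectl run "netcheck-$RANDOM" --rm -i --restart=Never --image=busybox -- \
    ping -c 2 -W 2 "$ip" >/dev/null || echo "unreachable: $ip"
done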