Recently after a production deployment, our primary service was not reaching steady state. On analysis we found out that the filebeat service running as a daemon service was unsteady. The stopped tasks were throwing "no space left on the device". Also, the CPU and memory utilization for the filebeat was higher than the primary service.
A large amount of log files were being stored as part of the release. After reverting the change, the service came back to steady state.
Why did filebeat become unsteady? If memory was an issue, then why didn't the service also throw "no space" issue as both filebeat and primary service runs on the same EC2 instance?
Check (assuming Linux)
df -h
Better still, install AWS CloudWatch Agent on your EC2 instance to get additional metrics such as disk space usage reported into CloudWatch to help you get to the bottom of these things.
Sounds like your primary disk is full.
Related
I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering them from the target group and draining all http connections to the old tasks before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This is resulting in 8-12 seconds of API down time for us while new http requests continue to be routed to the now-stopped tasks that are still in the target group. This happens now whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot showing a sample sequence of Events from ECS demonstrating the issue as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments, but ultimately this turned out to be because of the high startup overhead on the new versions of the tasks spinning up briefly pegging the CPU of the instances in the cluster to 100% when there was also sufficient taffic happening during the deployment, thus causing some requests to get dropped.
A good-enough for now solution was to adjust our minimum healthy deployment percentage up to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to "slow down", only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads out the high task startup overhead to two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
We've started using GCP Instance schedule for one of our VMs which needs to be up for 3 hours every night. For some reason, about once per week the VM is not up - services can't access it.
Checking from Logs Explorer, there are no errors or warnings, but on those days when it is not working, there are a few events which are not published/logged. These are the GCE Agent Started and OSConfig Agent Started events which happen on days where everything is OK (09-11, 09-12, 09-14) but are missing on days when the instance is not up (09-13).
The VM is Windows Server 2012 R2.
There is no retry policy implemented in the GCP instance schedule feature.
We know there are other ways to schedule VMs but we'd prefer to use the instance schedule feature if possible and if it is stable.
Is there somewhere else we should look for understanding why the VM is not starting properly?
This is the image from logs:
Instance schedules do not provide capacity guarantees, so if the resources required for a scheduled VM instance are not available at the scheduled time, your VM instance might not start when scheduled. Although you can reserve VM instances before starting them to provide capacity guarantees, reservations cannot be automatically scheduled.(Assuming that randomly VM instances are showing up this behaviour every week, not a particular VM every week.)
If it's with the same VM everytime then high memory utilization can also cause VM not being responsive. Manual reboot would fix this since it would close whatever is consuming the memory and re-open processes or services that may have been killed due to being OOM.
Please consider monitoring the VM memory usage by installing a monitoring agent, and increase the memory request based on the utilization.
I have K8S cluster in GCP (version is 1.20.8-gke.900 from the regular update channel).
All cluster pods write logs in STDOUT or STDERR from Docker containers.
A couple of weeks ago we found that some log entries are missing in the GCP logging console. I can see them via kubectl tool but looks like they don't reach the logging bucket. For example, I can hit API in the pod with invalid payload to emulate error in the logs, and sometimes this error reaches the logging bucket, sometimes no. Super weird to me...
The traffic and resource utilization in the cluster is super small.
As I understood fluent bit daemonset is responsible to fetch logs from pods and pass them into logging bucket. Current version of fluent bit: gke.gcr.io/fluent-bit:v1.5.7-gke.1 & gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0.
I don't see any errors in the fluent bit logs...
Could you please suggest what can be done to trace/debug/troubleshoot such case?
Thanks!
It appears the issue is with the log volume. The managed GKE logging agent is guaranteed at least 100KiB/s throughput and performance can be higher depending on other node factors.
If your workloads on a GKE node are generating significantly more than 100KiB/s, then it's possible that the logs are not being collected due to the log volume.
If you're generating more than 100kb/s, then there's a few workarounds:
Generate less logs.
Leave the node in question partially idle. This will allow fluentbit to pick up extra cpu cycles and process more logs.
Run your own instance of fluentbit with a higher resource allocation.
The underlying root cause of the 100kb/s limitation is that we only give a small resource allocation to fluentbit so as to leave more resources available for your workloads.
Refer to link for additional information.
We have a VM server on GCP. Yesterday, the server stopped responding, we could not even SSH into the server, but everything was ok after restarting the server. I am having a look at the metrics and this is what I have noticed:
There is no Memory Utilization data for that period. Before this, the Memory Utilization was 90%.
Read Through Put is quite high; 13 MiB/s
What could have gone wrong? What else should I consider looking at?
Harith:
The applications processes running in your VM consumed the totality of the memory assigned to the VM.
Analyze each application hosted on the VM and evaluate its MTRs (Minimal Technical Requirements) and the actual work load that each one represent, this in order to estimate if the memory amount assigned is enough to support that load.
Consult log entries if available on those applications to see if they can reveal the consumption level just after the unresponsive condition.
Consider changing the machine type if you have to increase any resource capcity assigned to your vm.
If the resource consumption of your applications running on your VM will be very variable, you will need consider the implementation of autoscaling groups of instances.
According to me, as it is mentioned in PCF's 4 levels of High Availability, when an instance(process) fails, the Monit should recognize it and shourd restart it. And then it'll just send the report to BOSH. But if the whole VM goes down, it's BOSH's responsibility to recognize and restart it.
With this belief I answered one question in : https://djitz.com/guides/pivotal-cloud-foundry-pcf-certification-exam-review-questions-and-answers-set-4-logging-scaling-and-high-availability/
Question and answer
According to me, the answer for this question should be option 3, but it says I'm wrong and answer should be option 2. Now I'm confused. So please help me if my belief is wrong.
BOSH is responsible for creating new instance for failed VM.
I know that there is not much information available on internet for this but if you get chance, there is tutorial on pluralsight you can enroll. There instructor has explained high availability very well.
But you can get high level idea from PCF documents as well.
Process Monitoring PCF uses a BOSH agent, monit, to monitor the
processes on the component VMs that work together to keep your
applications running, such as nsync, BBS, and Cell Rep. If monit
detects a failure, it restarts the process and notifies the BOSH agent
on the VM. The BOSH agent notifies the BOSH Health Monitor, which
triggers responders through plugins such as email notifications or
paging.
Resurrection for VMs BOSH detects if a VM is present by listening for
heartbeat messages that are sent from the BOSH agent every 60 seconds.
The BOSH Health Monitor listens for those heartbeats. When the Health
Monitor finds that a VM is not responding, it passes an alert to the
Resurrector component. If the Resurrector is enabled, it sends the
IaaS a request to create a new VM instance to replace the one that
failed.