Datadog AWS EBS Monitoring - Drive Space - Exclude /dev/loop* devices

Okay, here is my setup:
Platform: AWS
Monitoring: DataDog
Metric: system.disk.in_use
Question: So I am running Ubuntu 18.04 LTS instances and, as time goes on, they seem to spawn additional devices periodically:
device:/dev/loop1, /dev/loop2 and so on.
When I first spun up these instances there were only three /dev/loop(1-3) devices; however, over time a /dev/loop4 showed up and our drive space alert paged me, since these devices are 100% utilized when created.
So I have to go into each of the monitors (one per environment) and add an exclusion for the new /dev/loop4, but I cannot set the exclusion until the device has been created on at least one of the monitored instances.
Is there a way in DataDog that you can just add a blanket exclusion like:
device:/dev/loop*?
I have been combing through documentation and have not been able to find anything, so I thought I would ask here.

I solved this by adding 'squashfs' to the list of filesystem types to be ignored by the datadog agent.
Create a file /etc/datadog-agent/conf.d/disk.d/conf.yaml:
init_config:
  file_system_global_blacklist:
    - iso9660$
    - squashfs
instances:
  - use_mount: false
Then restart the Datadog Agent (systemctl restart datadog-agent).
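To confirm which devices the check now reports (and that the loop devices are gone), you can run the disk check manually; assuming Agent v6/v7:
sudo -u dd-agent datadog-agent check disk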

You can use ! and *, like in:
avg:system.disk.in_use{!device:/dev/loop*} by {host,device}
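If the goal is for the monitor itself to never page on loop devices, the same wildcard exclusion works in the full monitor query; a sketch where the threshold and evaluation window are placeholders:
avg(last_5m):avg:system.disk.in_use{!device:/dev/loop*} by {host,device} > 0.9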

Related

'Kubelet stopped posting node status' and node inaccessible

I am having some issues with a fairly new cluster where a couple of nodes (always seems to happen in pairs but potentially just a coincidence) will become NotReady and a kubectl describe will say that the Kubelet stopped posting node status for memory, disk, PID and ready.
All of the running pods are stuck in Terminating (I can use k9s to connect to the cluster and see this), and the only solution I have found is to cordon and drain the nodes. After a few hours they seem to get deleted and new ones created. Alternatively, I can delete them using kubectl.
They are completely inaccessible via ssh (timeout) but AWS reports the EC2 instances as having no issues.
This has now happened three times in the past week. Everything does recover fine but there is clearly some issue and I would like to get to the bottom of it.
How would I go about finding out what has gone on if I cannot get onto the boxes at all? (Actually just occurred to me to maybe take a snapshot of the volume and mount it so will try that if it happens again, but any other suggestions welcome)
Running Kubernetes v1.18.8.
There are two common possibilities here, both most likely caused by heavy load:
An Out of Memory condition on the kubelet host. This can be solved by adding appropriate --kubelet-extra-args to the BootstrapArguments, for example: --kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%" (a sketch of how this is passed on EKS follows below).
An issue explained here:
kubelet cannot patch its node status sometimes, because more than 250
resources stay on the node and kubelet cannot watch more than 250 streams
with kube-apiserver at the same time. So I just adjusted kube-apiserver
--http2-max-streams-per-connection to 1000 to relieve the pain.
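For reference, on a self-managed control plane (e.g. kubeadm) that flag is set on the API server's static pod manifest; on a managed control plane such as EKS it is not user-configurable. A rough sketch of the kubeadm layout:
# /etc/kubernetes/manifests/kube-apiserver.yaml
spec:
  containers:
  - command:
    - kube-apiserver
    - --http2-max-streams-per-connection=1000
    # ...existing flags left unchanged...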
You can either adjust the values provided above or try to find the cause of high load/iops and try to tune it down.
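As for the first option, on EKS self-managed nodes the BootstrapArguments typically end up in the instance user data that calls the EKS bootstrap script; a rough sketch, assuming an EKS-optimized AMI (cluster name is a placeholder):
#!/bin/bash
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"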
I had the same issue: after 20-30 minutes my nodes went into NotReady status, and all pods scheduled on those nodes became stuck in Terminating.
I tried to connect to my nodes via SSH; sometimes I hit a timeout, sometimes I could (barely) connect, and I ran the top command to check the running processes. The most CPU-consuming process was kswapd0. My instance's memory and CPU were both maxed out, because it was swapping heavily (due to a lack of memory), causing the kswapd0 process to consume more than 50% of the CPU!
Root cause: some pods consumed 400% of their memory request (as defined in the Kubernetes deployment) because they were initially under-provisioned. As a consequence, Kubernetes scheduled them onto nodes on the basis of only 32Mi of memory request per pod (the value I had defined), which was far too little.
Solution: the solution was to increase the containers' requests, from:
requests:
  memory: "32Mi"
  cpu: "20m"
limits:
  memory: "256Mi"
  cpu: "100m"
to these values (in my case):
requests:
  memory: "256Mi"
  cpu: "20m"
limits:
  memory: "512Mi"
  cpu: "200m"
Important:
After that, I performed a rolling update (cordon > drain > delete) of my nodes to ensure that Kubernetes reserves enough memory for my freshly started pods; see the commands below.
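For completeness, the cordon > drain > delete cycle per node looks roughly like this (the node name is a placeholder; --delete-local-data was later renamed --delete-emptydir-data in newer kubectl versions):
kubectl cordon ip-10-0-1-23.ec2.internal
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-local-data
kubectl delete node ip-10-0-1-23.ec2.internal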
Conclusion:
Regularly check your pods' memory consumption and adjust your resource requests over time.
The goal is to never let your nodes be surprised by memory saturation, because swapping can be fatal for them.
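A quick way to spot pods whose real consumption has drifted far from their requests (assuming metrics-server is installed; --sort-by needs a reasonably recent kubectl):
kubectl top pods --all-namespaces --sort-by=memory
kubectl describe node <node-name>   # "Allocated resources" shows requests vs. allocatable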
The answer turned out to be an issue with IOPS, as a result of du commands coming from (I think) cAdvisor. I have moved to io1 volumes and have had stability since then, so I am going to mark this as closed, with the move to io1 as the resolution.
Thanks for the help!

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But some mornings I see that a few tasks failed during the night, without any reason.
When checking the logs in the UI I see no log, and I see no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the dependency is the dag run itself.
Although the DAG is set with 5 retries and an email message, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush its logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the missing logs, is strong evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
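For example, worker concurrency can be lowered with an Airflow config override on the Composer environment; a rough sketch (environment name, location and value are placeholders, and Composer blocks some overrides depending on version):
gcloud composer environments update my-composer-env \
  --location us-central1 \
  --update-airflow-configs=celery-worker_concurrency=6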
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

Why is AWS EC2 CPU usage shooting up to 100% momentarily from IOWait?

I have a large web-based application running in AWS with numerous EC2 instances. Occasionally -- about twice or thrice per week -- I receive an alarm notification from my Sensu monitoring system notifying me that one of my instances has hit 100% CPU.
This is the notification:
CheckCPU TOTAL WARNING: total=100.0 user=0.0 nice=0.0 system=0.0 idle=25.0 iowait=100.0 irq=0.0 softirq=0.0 steal=0.0 guest=0.0
Host: my_host_name
Timestamp: 2016-09-28 13:38:57 +0000
Address: XX.XX.XX.XX
Check Name: check-cpu-usage
Command: /etc/sensu/plugins/check-cpu.rb -w 70 -c 90
Status: 1
Occurrences: 1
This seems to be a momentary occurrence and the CPU goes back down to normal levels within seconds. So it seems like something not to get too worried about. But I'm still curious why it is happening. Notice that the CPU is taken up with the 100% IOWaits.
FYI, Amazon's monitoring system doesn't notice this blip. See the images below showing the CPU and IO levels at 13:38.
Interestingly, AWS tells me that this instance will be retired soon. Might the two be related?
AWS is only displaying a 5 minute period, and it looks like your CPU check is set to send alarms after a single occurrence. If your CPU check's interval is less than 5 minutes, the AWS console may be rolling up the average to mask the actual CPU spike.
I'd recommend narrowing down the AWS monitoring console to a smaller period to see if you see the spike there.
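If the console's granularity isn't enough, you can also pull the raw datapoints with the CLI; a sketch (instance ID and time window are placeholders, and 1-minute periods require detailed monitoring to be enabled):
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2016-09-28T13:30:00Z --end-time 2016-09-28T13:45:00Z \
  --period 60 --statistics Maximum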
I would add this as a comment, but I don't have the reputation to do so.
I have noticed my EC2 instances doing this too, but for far longer and after apt-get update + upgrade.
I thought it was an Apache thing, then started using Nginx on a new instance to test, and it did the same: I ran apt-get a few hours earlier, then came back to find the instance using full CPU - for hours! Good thing it is just a test machine, but I wonder what is wrong with Ubuntu/apt-get that might have caused this. From now on I guess I will have to reboot the machine after apt-get, as that seems to be the only way to put it back to normal.

How to determine that a jvm app does more GC than normal work?

We recently had a problem where our EC2 instances ran at 90-100 percent CPU load because of a bug in a library we include that created too many objects instead of reusing them (easily solvable), so we spent too much time in GC.
Unfortunately the AWS health checks and instance status metrics didn't cause the overloaded instances to be stopped and new ones started, so after some time we hit the max autoscaling number and... died. Also, our own health checks inside the app, which are used by the ELB, are so simple that they still answered often enough to not cause the instances to be terminated and restarted, which would have mitigated the problem for quite some time.
My idea now is to use our custom health check, which is already included in the ELB health checks, to report a failure if we spend too much time in GC.
How would I do such a thing inside the app?
There are a number of JVM parameters that allow GC monitoring
-Xloggc:<file> // logs gc activity to a file
-XX:+PrintGCDetails // tells you how different generations are impacted
You can either parse these logs yourself or use a specific tool such as GCViewer to analyse GC activity.
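As an aside, on Java 9 and later these flags were superseded by unified GC logging; the rough equivalent (exact decorators vary by JDK version) is:
-Xlog:gc*:file=gc.log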
Use GarbageCollectorMXBean:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Total time (ms) spent in all collectors since JVM start
long gcTime = 0;
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
    gcTime += gcBean.getCollectionTime();
}
// Compare with total JVM uptime to get the share of time spent in GC
long jvmUptime = ManagementFactory.getRuntimeMXBean().getUptime();
System.out.println("GC ratio: " + (100 * gcTime / jvmUptime) + "%");
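A sketch of how this ratio could back the custom ELB health check mentioned in the question; the class name, port, path and 30% threshold are illustrative assumptions, not a prescribed implementation:
import com.sun.net.httpserver.HttpServer;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.net.InetSocketAddress;

public class GcHealthCheck {
    // Illustrative threshold: report unhealthy above 30% of uptime spent in GC
    private static final long MAX_GC_PERCENT = 30;

    static long gcPercent() {
        long gcTime = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcTime += bean.getCollectionTime();
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime == 0 ? 0 : (100 * gcTime / uptime);
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/healthz", exchange -> {
            boolean healthy = gcPercent() < MAX_GC_PERCENT;
            byte[] body = (healthy ? "OK" : "GC overloaded").getBytes();
            // 200 keeps the instance in service; 503 makes the ELB health check fail
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}
Note that this measures cumulative GC time over the whole JVM lifetime; a production check would more likely compare deltas over a sliding window so that one bad minute, rather than ancient history, is what flips the health check.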
You can use VisualVM to monitor what happens inside the JVM and you can monitor remote instances via JMX. You did not describe which application container that you are using (Apache Tomcat, GlassFish etc.), you can set up a JMX connector like this in the case of Tomcat.
Don't forget to adjust Security Groups in AWS to have the proper permission to access the JMX port.
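For a plain JVM (outside a container-specific connector), remote JMX is usually enabled with system properties along these lines; the port and hostname are placeholders, and authentication/SSL should stay on for anything beyond testing:
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.rmi.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.rmi.server.hostname=<instance-private-ip>
VisualVM can then attach with "Add JMX Connection" to <host>:9010, provided the Security Group allows that port.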
The JVM flags PrintGCApplicationConcurrentTime and PrintGCApplicationStoppedTime will log how long the application was active or suspended. They're a bit of a misnomer, since they actually measure time spent in and out of safepoints, not just GCs.

Forwarding journald to Cloudwatch Logs

I'm a newbie to CentOS and wanted to know the best way to parse journal logs to CloudWatch Logs.
My thought processes so far are:
Use a FIFO to parse the journal logs and ingest them into CloudWatch Logs. It looks like this could come with drawbacks, where logs could be dropped if we hit buffering limits.
Forward journal logs to syslog and send the syslog output to CloudWatch Logs.
The idea is essentially to have everything logging to journald as JSON and then forward this across to CloudWatch Logs.
What is the best way to do this? How have others solved this problem?
Take a look at https://github.com/advantageous/systemd-cloud-watch
We had problems with journald-cloudwatch-logs. It just did not work for us at all.
It does not limit the size of the message or commandLine that it sends to CloudWatch, and CloudWatch sends back an error that journald-cloudwatch-logs cannot handle, which makes it fall out of sync.
systemd-cloud-watch is stateless and it asks CloudWatch where it left off.
systemd-cloud-watch also creates the log-group if missing.
systemd-cloud-watch also uses the name tag and the private ip address so that you can easily find the log you are looking for.
We also include a packer file to show you how to build and configure a systemd-cloud-watch image with EC2/Centos/Systemd. There is no question about how to configure systemd because we have a working example.
Take a look at https://github.com/saymedia/journald-cloudwatch-logs by Martin Atkins.
This open source project creates a binary that does exactly what you want - ship your (systemd) journald logs to AWS CloudWatch Logs.
The project depends on libsystemd to forward directly to CloudWatch. It does not rely on forwarding to syslog. This is a good thing.
The project appears to use golang's concurrent channels to read the logs and batches writes.
Vector can be used to ship logs from journald to AWS CloudWatch Logs.
journald can be used as a source and AWS Cloudwatch Logs as a sink.
I'm working on integrating this with an existing deployment of about 6 EC2 instances that generate about 30 GB of logs daily. I'll update this answer with any caveats or gotchas after we've used Vector in production for a few weeks.
EDIT 8/17/2020
A few things to be aware of: the max batch size for PutLogEvents is 1 MB, and there is a maximum of 5 requests per second per stream (see the CloudWatch Logs limits documentation).
To help with that, in my setup each journald unit has its own log stream. Also, the Vector journald source includes a lot of fields, so I used a Vector transform to remove the ones I didn't need. However, I'm still running into rate limits.
EDIT 10/6/2020
I have this running in production now. I had to update the version of Vector I was using from 0.8.1 to 0.10.0 to take care of an issue with Vector not respecting the max-bytes-per-batch requirement for AWS CloudWatch Logs. As for the rate limit issues I was experiencing, it turns out I wasn't actually having any. I was seeing this message in the Vector logs: tower_limit::rate::service: rate limit exceeded, disabling service. What it actually means is that Vector temporarily pauses sending logs to respect the sink's rate limit. Also, each CloudWatch Log Stream can ingest up to 18 GB per hour, which is fine for my 30 GB per day requirement for over 30 different services on 6 VMs.
One issue I did run into was Vector causing the CPU to spike on our main API service. I had a source for each service unit tailing the journald logs; I believe this somehow blocked our API from being able to write to journald (not 100% sure, though). What I did was use one source and specify multiple units to follow, so there is only one process tailing the logs, and I increased the batch size since each service generates a lot of logs. I then used Vector's template syntax to split the Log Group and Log Stream based on the service name. Below is an example configuration:
[sources.journald_logs]
type = "journald"
units = ["api", "sshd", "vector", "review", "other-service"]
batch_size = 100
[sinks.cloud_watch_logs]
type = "aws_cloudwatch_logs"
inputs = ["journald_logs"]
group_name = "/production/{{host}}/{{_SYSTEMD_UNIT}}"
healthcheck = true
region = "${region}"
stream_name = "{{_SYSTEMD_UNIT}}"
encoding = "json"
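If you want to sanity-check a config like this before rolling it out, recent Vector releases include a validate subcommand (availability and exact flags vary by version):
vector validate /etc/vector/vector.toml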
I have one final issue I need to iron out, but it's not related to this question. I'm using a file source for nginx since it writes to an access log file. Vector is consuming 80% of the CPU on that machine getting the logs and sending them to AWS CloudWatch. Filebeat also runs on the same box sending the logs to Logstash, but it's never caused any issues. Once we get vector working reliably we'll retire the Elastic Stack, but for now we have them running side by side.