Cleaning/deleting /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io - disk space

I've been a Linux admin for many years but am new to the Rancher environment. I'm sure this is a simple question for most, but we are experiencing disk pressure and disk space issues. One of the items I found after running df -hT is high usage of overlay filesystems located in /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io. What is the easiest way to mitigate this buildup? It's on most of our nodes. Any help would be appreciated.
Example snapshot below:
overlay 24G 21G 2.1G 92% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/954cea5f085723d49415df2f811a637ef27fe6bf312d3abe475c69788e712141/rootfs
overlay 24G 21G 2.1G 92% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/cb581789c4072843c26e1a56c51682b31016f3821013fa8b8a33114d03f52a46/rootfs
overlay 24G 21G 2.1G 92% /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/ceb865ea0620cc7a117f647b8f93f19a31d71ccf6326178f839e3f59cbfb0b2a/rootfs
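These overlay mounts typically report the usage of the filesystem that backs containerd's storage (usually the root filesystem or whatever holds /var/lib/rancher), so the space is reclaimed by pruning unused images and stopped containers rather than by deleting anything under /run/k3s directly. A minimal sketch using k3s's bundled crictl; the container ID prefix below is just the one from the first df line and is only illustrative:
sudo k3s crictl ps | grep 954cea5f   # map an overlay mount back to its container/pod
sudo k3s crictl rmi --prune          # remove images not referenced by any container
If the filesystem keeps filling up, also check container log sizes and whether /var/lib/rancher should live on a larger dedicated volume.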

Related

Scheduler node always above 100% utilization

We're running Cloud Composer on a 5-node n1-standard-2 cluster running composer-1.11.3-airflow-1.10.9. Private IP is enabled. Python 3 is selected. We currently have around 30 DAGs, some containing over 100 tasks. Most DAGs run once a day.
The node running the airflow scheduler workload is consistently running at around 150% CPU utilisation regardless of the number of running tasks. The only way to lower the CPU usage is to remove DAGs until only 5 or 6 remain (obviously not an option). What we've tried:
We have followed this Medium article detailing how to run the scheduler service on a dedicated node; however, we cannot find a configuration that reduces the CPU usage. We've tried a node as powerful as an e2-highcpu-32 running on a 100 GB SSD. Usage remained at 150%.
We've tried updating the airflow.cfg variables to reduce how frequently the DAGs directory is parsed, via settings such as store_serialized_dags and max_threads. Again, this did not have any impact on the CPU usage.
For reference the other nodes all run at 30-70% CPU for the majority of the time, spiking to over 100% for short periods when big DAGs are running. No node has any issue with memory usage, with between 2 GB and 4 GB used.
We plan on adding more DAGs in future and are concerned the scheduler may become a bottleneck with this current setup. Are there any other configuration options available to reduce the CPU usage to allow for a future increase in DAG numbers?
Edit in response to Ines' answer:
I'm seeing the CPU usage as a percentage in the Monitoring tab; the node running the scheduler service is coloured orange.
Additionally, when I look at the pod running the airflow scheduler, its CPU usage is pretty much always 100%.
Please have a look at the official documentation, which describes the CPU usage per node metric. Can you elaborate on where you see the percentage values, since the documentation mentions a core time usage ratio:
A chart showing the usage of CPU cores aggregated over all running Pods in the node, measured as a core time usage ratio. This does not include CPU usage of the App Engine instance used for the Airflow UI or Cloud SQL instance. High CPU usage is often the root cause of Worker Pod evictions. If you see very high usage, consider scaling out your Composer environment or changing the schedule of your DAG runs.
In the meantime, there is a workaround that is worth trying. You should follow these steps to limit the CPU usage of the syncing pod:
Go to the environment configuration page and click "view cluster workloads"
Click airflow-scheduler, then edit
Find name: gcs-syncd and add:
resources:
  limits:
    cpu: some value (you can try with 300m)
  requests:
    cpu: 10m
then click save (at the bottom).
Repeat the procedure for airflow-worker.
We also have to edit the airflow-scheduler section of the airflow-scheduler workload. Click to edit the YAML file and, for the airflow-scheduler section, add:
resources:
  limits:
    cpu: 750m
  requests:
    cpu: 300m
It would be great if you could try the aforementioned steps and see if it improves the performance.
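For reference, a sketch of the equivalent change made with kubectl against the Composer GKE cluster; the namespace is a placeholder, since its name varies per environment:
kubectl get namespaces                                      # find the composer-* namespace
kubectl -n <composer-namespace> edit deployment airflow-scheduler
# then set resources.requests.cpu / resources.limits.cpu on the gcs-syncd and
# airflow-scheduler containers as described above, and repeat for airflow-worker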
Sometimes the bucket's /logs folder contains a lot of files, which causes gcs-syncd to use a lot of CPU while doing internal synchronization of the logs. You can try to remove some of the oldest logs from the bucket gs://<composer-env-name>/logs. As an example, if you would like to remove all logs from May, use the following command:
gsutil -m rm -r gs://europe-west1-td2-composter-a438b8eb-bucket/logs/*/*/2020-05*
Ideally, the GCE instances shouldn't be running at over 70% CPU at all times, or the Composer environment may become unstable under heavy resource usage.

How to reduce the disk size of a VM in Google Cloud [duplicate]

This question already has answers here:
Reduce Persistent Disk Size
I created two VMs, each with a 1 TB disk. Even though the VMs are not running, I found that GCP was still charging me for the disk space. How can I reduce the disk size to lower the cost? What are the other alternatives? There are a lot of services installed on this VM, and creating a new VM from scratch is not an option.
I have explored the following:
Google documentation: which says that reducing disk size is not an option.
Creating a VM from a snapshot: apparently, this also does not allow reducing the disk size.
Creating a VM from a machine image: no luck here either.
You can't reduce the disk size; you can only spin up a new instance from a startup script.
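A sketch of that approach with gcloud; the instance name, zone, disk size, and startup script are placeholders, not values from the question:
gcloud compute instances create my-smaller-vm \
    --zone=us-central1-a \
    --boot-disk-size=200GB \
    --metadata-from-file startup-script=provision.sh   # re-installs the needed services
The data itself still has to be copied over from the old disk (or a snapshot of it), for example with rsync.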
Is this working for you? The documentation has detailed steps:
Select the “Compute -> Compute Engine -> VM Instances” menu item.
Select the instance you wish to resize.
In the “Boot disk” section, select the boot disk of the instance.
On the “Disks” detail page, click the “Edit” button.
Enter a new size (GB) for the disk in the “Size” field.
Click the “Save” button at the bottom of the page.
The last step is to restart the instance.
UPDATE
As @Kerem commented, the official doc says:
You can only resize a zonal persistent disk to increase its size. You cannot reduce the size of a zonal persistent disk.
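For completeness, growing a disk (increase only) can also be done with gcloud; a sketch with the disk name, size, and zone as placeholders:
gcloud compute disks resize my-boot-disk --size=100GB --zone=us-central1-a
# afterwards grow the partition and filesystem inside the VM
# (e.g. growpart plus resize2fs or xfs_growfs, depending on the filesystem)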

Keras model not using GPU on AI Platform training

I have a simple Keras model that I am submitting to Google Cloud AI Platform training, and would like to make use of a GPU for processing.
The job submits and completes successfully.
Looking at the usage statistics, the GPU never goes beyond 0% utilization. However, CPU usage increases as training progresses.
Any idea what might be preventing my model from making use of the GPU?
Are there any ways that I might be able to troubleshoot a situation like this?
config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
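For context, a sketch of how such a config might be passed when submitting the training job; the job name, region, bucket, package path, and module name are placeholders, not values from the question:
gcloud ai-platform jobs submit training my_keras_job \
    --region=us-central1 \
    --runtime-version=1.13 \
    --staging-bucket=gs://my-staging-bucket \
    --package-path=./trainer \
    --module-name=trainer.task \
    --config=config.yaml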
I am using runtime version 1.13, which comes with tensorflow already installed. My additional required packages in my setup.py include:
REQUIRED_PACKAGES = [
    'google-api-core==1.14.2',
    'google-cloud-core==1.0.3',
    'google-cloud-logging==1.12.1',
    'google-cloud-storage==1.18.0',
    'gcsfs==0.2.3',
    'h5py==2.9.0',
    'joblib==0.13.2',
    'numpy==1.16.4',
    'pandas==0.24.2',
    'protobuf==3.8.0',
    'scikit-learn==0.21.2',
    'scipy==1.3.0',
    'Keras==2.2.4',
    'Keras-Preprocessing==1.1.0',
]
Looking at the logs, it appears the GPU is found:
master-replica-0 Found device 0 with properties:
master-replica-0 name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
Update:
The model is using a GPU, but is under-utilized.
Within AI Platform, the utilization graphs in the Job overview page are about 5 minutes behind the activity displayed in the logs.
As a result, your logs could show an epoch being processed, but the utilization graphs can still show 0% utilization.
How I resolved it:
I am using the fit_generator function.
I set use_multiprocessing=True, max_queue_size=10, workers=5. I am currently tweaking these parameters to determine what works best; however, I now see ~30% utilization on my GPU.

Replace HDD with SSD on Google Cloud Compute Engine

I am running a GETH node on a Google Cloud Compute Engine instance and started with an HDD. It has grown to 1.5 TB now, but it is very slow. I want to move from HDD to SSD now.
How can I do that?
I found a solution along these lines:
- Make a snapshot of the existing disk (HDD).
- Edit the instance and attach a new SSD created from that snapshot.
- Disconnect the old disk afterwards.
One problem I saw here: for example, if my HDD is 500 GB, it does not allow an SSD smaller than 500 GB. My data is in TBs now, so it will cost a lot.
But I want to understand whether this actually works, because this is a node I want to use for production. I have already been waiting too long and cannot afford to wait more.
One problem here I saw is: if my HDD is 500 GB, it does not allow an SSD smaller than 500 GB. My data is in TBs now. It will cost a lot.
You should try to use zonal SSD persistent disks.
As stated in the documentation:
Each persistent disk can be up to 64 TB in size, so there is no need to manage arrays of disks to create large logical volumes.
The description of the issue is confusing, so I will try to help from my current understanding of the problem. First, you can use a boot disk snapshot to create a new boot disk that meets your requirements; see here. The size limit for a persistent disk is 2 TB, so I don't understand your comment about the 500 GB minimum size. If your disk is 1.5 TB, then it will meet the restriction.
Anyway, I don't recommend having such a big disk as a boot disk. A better approach could be to use a smaller boot disk and expand the total capacity by attaching additional disks as needed; see this link.
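A sketch of the snapshot-to-SSD migration described above using gcloud; the disk, snapshot, and instance names, the size, and the zone are placeholders, not values from the question:
gcloud compute disks snapshot geth-hdd --snapshot-names=geth-snap --zone=us-central1-a
gcloud compute disks create geth-ssd --source-snapshot=geth-snap --type=pd-ssd --size=2TB --zone=us-central1-a
gcloud compute instances attach-disk geth-node --disk=geth-ssd --zone=us-central1-a
Note that the new disk cannot be smaller than the source snapshot, which matches the 500 GB observation in the question.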

AWS ECS instances running out of space

Since this morning I've been having trouble updating services in AWS ECS. The tasks fail to start. The failed tasks show this error:
open /var/lib/docker/devicemapper/metadata/.tmp928855886: no space left on device
I have checked disk space and there is space available:
/dev/nvme0n1p1 7,8G 5,6G 2,2G 73% /
Then I checked the inode usage and found that 100% are used:
/dev/nvme0n1p1 524288 524288 0 100% /
Narrowing the search, I found that Docker volumes are the ones using the inodes.
I'm using the standard CentOS AMI.
Does this mean that there is a maximum number of services that can run on an ECS cluster? (At the moment I'm running 18 services.)
Can this be solved? At the moment I can't do updates.
Thanks in advance
You need to tweak the following environment variables on your EC2 hosts:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
ECS_IMAGE_CLEANUP_INTERVAL
ECS_IMAGE_MINIMUM_CLEANUP_AGE
ECS_NUM_IMAGES_DELETE_PER_CYCLE
You can find the full docs on all these settings here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
The default behavior is to check every 30 minutes, and only delete 5 images that are more than 1 hour old and unused. You can make this behavior more aggressive if you want to clean up more images more frequently.
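A sketch of what a more aggressive /etc/ecs/ecs.config could look like; the values are illustrative assumptions, not recommendations from the docs:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=15m
ECS_IMAGE_CLEANUP_INTERVAL=10m
ECS_IMAGE_MINIMUM_CLEANUP_AGE=30m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=10
Restart the ECS agent afterwards so the new settings take effect (for example, sudo systemctl restart ecs on hosts where the agent runs as a systemd service).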
Another thing to consider to save space: rather than squashing your image layers together, make use of a common shared base image layer for your different images and image versions. This can make a huge difference. If you have 10 different images that are each 1 GB in size, that takes up 10 GB of space. But if you have a single 1 GB base image layer and 10 small application layers that are only a few MB each, that takes up only a little more than 1 GB of disk space.
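To see how much of your images' size is actually shared on a host, a quick check with standard Docker CLI commands:
docker system df -v   # per-image breakdown of shared size vs unique size
docker images         # the SIZE column counts shared layers in every image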