I have installed the GCP monitoring and logging agents on my Compute Engine instance. Memory consumption has increased by more than 50% since they were installed.
Is there any way to stop the extra memory utilization and get back to the initial memory consumption?
I have 3.75 GB of RAM, of which more than 3 GB is consumed, and more than 2 GB of that is being consumed by this process: "/opt/google-fluentd/embedded/bin/ruby -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --log /var/log/google-fluentd/google-fluentd.log --daemon /var/run/google-fluentd/google-fluentd.pid --under-supervisor"
Update:
After restarting the google-fluentd service, memory usage comes back down. But I need to know the reason for the increased memory consumption. Is it a bug in the fluentd service?
Yes, it seems to be a known issue. The Logging Agent product team is still working on a fix. You can track google-fluentd (monitoring service) memory usage increase for updates.
Meanwhile, the only workaround is to schedule a cron job that restarts the fluentd agent periodically.
To restart the agent periodically run the following command on your instance:
$ sudo service google-fluentd restart
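For example, a root crontab entry along these lines (added via sudo crontab -e) restarts the agent nightly; the 03:00 schedule here is just an illustration, adjust it to your workload:
# restart the Logging agent daily at 03:00 to cap its memory growth
0 3 * * * /usr/sbin/service google-fluentd restart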
Another recommendation is to periodically check that there are not multiple Logging agent instances running on the VM.
Use ps aux | grep "/usr/sbin/google-fluentd" to show running agent processes (there should be only two: one supervisor and one worker), and sudo netstat -nltp | grep :24231 to show running processes that occupy the port. Kill older instances as you see fit.
Edit:
Check whether your fluent-plugin-systemd version is upgraded to 1.0.5 by using the command:
$ /opt/google-fluentd/embedded/bin/gem list | grep fluent-plugin-systemd
If it is not upgraded to 1.0.5, you can upgrade using fluent-plugin-systemd 1.0.5.
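A sketch of the upgrade, assuming the agent's embedded gem binary (the same path as above); restart the agent afterwards:
$ sudo /opt/google-fluentd/embedded/bin/gem install fluent-plugin-systemd -v 1.0.5
$ sudo service google-fluentd restart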
If you have fluent-plugin-systemd 1.0.5 but are still seeing the issue, it might be the buffer output plugin issue that is still under investigation in https://github.com/fluent/fluentd/issues/3401
Oftentimes a rogue process gets into a busy-spin mode, using up 100% of the CPUs. I have a GCP Ubuntu instance with 4 CPU cores and 32 GB of RAM. I still get into this situation of 100% CPU usage, and then I can't even SSH into the VM instance.
Does GCP provide a way of killing the offending process, through a gcloud SDK command or the web console?
As Serhii Rohoza mentioned, GCP does not provide any tool to kill a process.
Instead, you can SSH into your VM instance, figure out which process is eating your CPU, and stop it, by executing these commands:
Open a terminal with Ctrl+Alt+T
Execute the command top
Note the process using the most CPU
If the process isn't a system process, kill it with sudo pkill [processname],
where [processname] is the name of the process you want to kill.
If it is a system process, don't kill it; instead, Google its name and figure out what functionality it provides in Ubuntu.
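If you prefer a non-interactive check instead of top, a plain ps invocation (a sketch using procps field selection and sorting) lists the top CPU consumers:
$ ps -eo pid,comm,%cpu --sort=-%cpu | head -n 5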
I've created an AWS Lightsail instance with Ubuntu 20.04 and installed python3 and pip3.
I installed the AWS Shell tool using the pip3 install aws-shell command.
However, when I try to run it, it hangs and outputs Killed after several minutes.
This is what it looks like:
root@ip-...:/home/ubuntu# aws-shell
First run, creating autocomplete index...
Killed
root@ip-...:/home/ubuntu# aws-shell
First run, creating autocomplete index...
Killed
On the Metrics page of AWS Lightsail, it shows a CPU utilization spike into the Burstable zone.
So I'm quite sad that this just wastes CPU quota by loading the CPU for several minutes and doesn't work.
I've done the same steps on Ubuntu 16.04 in a virtual machine and it worked fine there. So I'm completely lost here and don't know how I can fix it. I tried to google this problem and didn't find anything related.
Update: I've also just tried Python 2.7 to install aws-shell, and it still doesn't work. So it fails with both Python 3.8.5 and 2.7.18.
The aws-shell tool should be used on a local machine, not on an AWS Lightsail instance. The Killed output most likely means the kernel's OOM killer terminated the process when the low-memory instance ran out of RAM while building the autocomplete index.
I wish it had a warning or info message about this, beyond me now knowing that it was an incorrect endeavor.
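To confirm an OOM kill, you can check the kernel log (a quick sketch; the exact message wording varies by kernel version):
$ dmesg | grep -iE "out of memory|killed process"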
I have been using Google Cloud Platform to offer services to clients. A few days ago, I found that the CPU usage of my VMs keeps increasing continuously. To uncover the reason for this problem, I created empty (new) VMs to watch their status, and these new VMs also keep increasing their CPU usage.
I used the top command to find out which process takes the CPU resources, and the result shocked me: google_osconfig keeps consuming CPU resources, and it is eating more and more like a pig.
What is google_osconfig, and does anyone know how to solve this problem?
I restarted google-osconfig-agent to make it release its CPU. After running service google-osconfig-agent restart, the CPU usage decreased.
google_osconfig is part of VM Manager; this definition is from the documentation:
VM Manager is a suite of tools that can be used to manage operating systems for large virtual machine (VM) fleets running Windows and Linux on Compute Engine.
The following services are available as part of the VM Manager suite:
OS inventory management: osinventory
OS patch management: tasks
OS configuration management: guestpolicies
The OS Config agent is installed by default on Red Hat Enterprise Linux (RHEL), Debian, CentOS, and Windows images that have a build date of v20200114 or later.
You could check the status of this service with the following command:
sudo systemctl status google-osconfig-agent
If the CPU consumption was started by some subprocess, the restart you did will fix it.
But it might be a problem with the service itself; maybe the version you are using has a bug, so you could consider updating the OS Config agent.
To update the agent on CentOS and RHEL operating systems, run the following command:
sudo yum update google-osconfig-agent
To update the agent on Debian and Ubuntu operating systems, run the following commands:
sudo apt update
sudo apt install google-osconfig-agent
sudo service google-osconfig-agent restart
This is a known bug in some older versions of osconfig (pre-Dec 2020). To fix it permanently, update to a current version:
sudo apt update
sudo apt install google-osconfig-agent
sudo service google-osconfig-agent restart
Project Detail
I am running the open-source code of a GAN-based research paper named "Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition".
source code: here
The dependencies include:
Python 2.7
TensorFlow 1.4.0
I pulled a Docker image of TensorFlow 1.4.0 with Python 2.7 onto my GPU virtual machine (connected over SSH) with this command:
docker pull tensorflow/tensorflow:1.4.0-gpu
I am running
bash rsrgan/run_gan_rnn_placeholder.sh
according to the readme of the source code.
Issue Details
Everything is working: the model is training and the loss is decreasing. But there is one issue: after some iterations the terminal shows no more output, the GPU still shows the PID but no memory is freed, and sometimes GPU-Util drops to 0%. Training on the VM's GPU and on its CPU behave the same way.
It is not a memory issue, because GPU memory usage by the model is 5,400 MB out of 11,000 MB, and the machine's RAM is also very large.
When I ran 21 iterations on my local computer (1st-gen i5, 4 GB RAM), with each iteration taking 0.09 hours, all iterations executed. But whenever I run it over SSH inside Docker, the issue happens again and again, on both GPU and CPU.
Just keep in mind that the issue happens inside Docker on a machine connected over SSH, and the SSH connection itself does not disconnect very often.
Exact numbers
If an iteration takes 1.5 hours, the issue happens after two to three iterations; if a single iteration takes 0.06 hours, the issue happens exactly after 14 iterations out of 25.
Perform operations inside Docker container
The first thing you can try is to build the Docker image and then enter the Docker container by specifying the -ti flag and a /bin/bash parameter in your docker run command.
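For example, using the image pulled above (a sketch; --runtime=nvidia assumes the NVIDIA container runtime is installed for GPU access, older setups used nvidia-docker instead):
$ docker run --runtime=nvidia -ti tensorflow/tensorflow:1.4.0-gpu /bin/bash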
Clone the repository inside the container, and while building the image you should also copy your training data from the local machine into the container. Run the training there and commit the changes, so that you need not repeat these steps in future runs; after you exit the container, all changes are lost if not committed.
You can find the reference for docker commit here.
$ docker commit <container-id> <image-name:tag>
While training is going on check for the GPU and CPU utilization of the VM, see if everything is working as expected.
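For a live view while training runs, something like this works (watch re-runs nvidia-smi every 5 seconds; use top in another terminal for the CPU side):
$ watch -n 5 nvidia-smi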
Use an Anaconda environment on your VM
Anaconda is a great package manager. You can install Anaconda, create a virtual environment, and run your code inside it.
$ wget <url_of_anaconda.sh>
$ bash <path_to_sh>
$ source anaconda3/bin/activate  # or source anaconda2/bin/activate
$ conda create -n <env_name> python=2.7
$ conda activate <env_name>
Install all the dependencies via conda (recommended) or pip.
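For instance, to match the paper's pinned dependency (tensorflow-gpu is the GPU build and assumes a matching CUDA/cuDNN on the VM; use plain tensorflow for CPU-only):
$ pip install tensorflow-gpu==1.4.0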
Run your code.
Q1: GAN Training with Tensorflow 1.4 inside Docker Stops without Prompting
Although Docker provides OS-level virtualization, some processes that run with ease on the host face issues inside a container. So to debug the issue, you should go inside the container and perform the steps above.
Q2: Training stops without Releasing Memory connected to VM with SSH Connection
Yes, this is an issue I have also faced earlier. The best way to release the memory is to stop the Docker container. You can find more resource allocation options here.
Also, earlier versions of TensorFlow had issues with allocating and clearing memory properly. You can find some references here and here. These issues have been fixed in recent versions of TensorFlow.
Additionally, check for Nvidia bug reports
Step 1: Install nvidia-utils via the following command. You can find the driver version in the nvidia-smi output (also shown in the question).
$ sudo apt install nvidia-utils-<driver-version>
Step 2: Run the nvidia-bug-report.sh script:
$ sudo /usr/bin/nvidia-bug-report.sh
A log file named nvidia-bug-report.log.gz will be generated in your current working directory. You can also access the installer log at /var/log/nvidia-installer.log.
You can find additional information about Nvidia logs at these links:
Nvidia Bug Report Reference 1
Nvidia Bug Report Reference 2
You can also log the GPU load over time to see exactly when utilization drops.
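For example, nvidia-smi can sample and append utilization figures at an interval (a sketch; the 60-second interval is arbitrary):
$ nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 >> gpu_load.log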
Hope this helps.
I'm trying to automate VMware Workstation on Windows 7 to suspend all VMs before I run a backup job each night. I used to have a script that did this, but I've noticed that it won't suspend anymore with the same command that used to work.
If I do vmrun list I get a list of the running vms with no issue.
If I do vmrun suspend "V:\Virtual Machines\RICHARD-DEV\RICHARD-DEV.vmx" it just hangs and I have to kill the command with CTRL+C.
I've even tried a newer form of the command with -T to specify that it's Workstation, i.e. vmrun -T ws suspend "V:\Virtual Machines\RICHARD-DEV\RICHARD-DEV.vmx", and still no love.
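(For reference, vmrun's suspend command also accepts an optional power option: soft asks VMware Tools in the guest to quiesce first, while hard suspends without waiting on the guest. When guest communication is suspect, a variant worth trying is:
vmrun -T ws suspend "V:\Virtual Machines\RICHARD-DEV\RICHARD-DEV.vmx" hard
)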
If I have the vm already stopped, I can issue vmrun start "V:\Virtual Machines\RICHARD-DEV\RICHARD-DEV.vmx" and it starts fine.
As well as the suspend command, the stop command also does not work. I'm running VMWare Workstation 11.1.3 build-3206955 on Windows 7.
Any ideas?
Update:
I installed the latest VMware Tools on the guest, as well as the latest VIX on the host, so everything should be up to date.
I can start a VM with no problem using vmrun -T ws start <path to vmx>, but the command doesn't come back to the command prompt, so I'm assuming it's not getting confirmation from the VM that it is now running.
If I cancel the start command and then try to suspend, I get the same lack of communication from the guest. If I manually suspend the VM, then once it's suspended I get "Error: vm is not running" and the suspend command finally times out and comes back.
So, it looks to me like there is no communication from vmrun to the guest about its state. Is there a way to debug the communication from the host to the guest using vmrun or other means? Are there ports I need to open in the guest OS?
So, I never did get vmrun to work properly on my main system, although I did get it to behave OK on my laptop, so there is something weird happening on this machine. I also installed a trial of the latest VMware Workstation 12, and the same thing happened.
As a workaround, I ended up changing the power management settings in my guest OS so that it would sleep after 1 hour of inactivity. When this happens, VMware detects it and automatically suspends the guest, which is really what I'm looking for. Not the slickest solution, but it does unlock the files that my nightly backup needs.