OS error: load averages unobtainable (greenlet / Celery) - concurrency

I am frequently running into this error.
Error Image
This happens when I am running a Celery worker with the gevent pool.
After its first occurrence, the error keeps coming back frequently.
I am using
Python 2.7
Celery 4.0.2
OS Ubuntu 14.04
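For reference, the worker is started along these lines (the app name proj is a placeholder and the concurrency value is only illustrative):
# Celery 4.x worker using the gevent pool with high concurrency ("proj" is a placeholder app name)
celery -A proj worker --pool=gevent --concurrency=2000 --loglevel=INFO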

This is essentially an OS error saying that the process has run out of file descriptors.
Celery is running 2000 tasks in parallel, which need 4000 file descriptors.
The default OS limits (ulimit -n) are:
Soft: 1024
Hard: 65000
Raising these limits resolved the issue.
To update these limits, follow the steps in this link:
How to increase Neo4j's maximum file open limit (ulimit) in Ubuntu?
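As a minimal sketch of that approach on Ubuntu (the 65535 value is just an example; pick whatever limit fits your workload):
# Check the current soft and hard limits for open file descriptors
ulimit -Sn
ulimit -Hn
# Raise the limits persistently by adding entries to /etc/security/limits.conf
# (takes effect after logging in again)
echo '* soft nofile 65535' | sudo tee -a /etc/security/limits.conf
echo '* hard nofile 65535' | sudo tee -a /etc/security/limits.conf
# Raise the limit for the current shell session only
ulimit -n 65535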

Related

Tensorboard hangs on startup using 100% CPU

I'm trying to visualise the logs generated by Stable Baselines 3, but when I run TensorBoard from the terminal with tensorboard --logdir logs, it only outputs the message TensorFlow installation not found - running with reduced feature set. and says nothing about where it's hosted. It also uses 100% of one CPU core, sometimes switching which core is used, as can be seen below. Going to localhost:6006, or any other host if I specify one using --host, just says "Unable to connect". As far as I can tell, it stays like this indefinitely.
I have also tried with TensorFlow installed; same thing. I am running on Linux Mint and have an AMD GPU.

Cannot find NVIDIA driver after stopping and starting a deep learning VM

[TL;DR] First, wait for a couple of minutes and check if the Nvidia driver starts to work properly. If not, stop and start the VM instance again.
I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, when I run nvidia-smi, I get the following error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
But if I type which nvidia-smi, I get
/usr/bin/nvidia-smi
It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a deep learning VM? The first time I created and started the instance, the driver was installed automatically.
The system information is (using uname -m && cat /etc/*release):
x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
I tried the installation script from GCP. First, run
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
And then run
sudo python3 install_gpu_driver.py
which gives the following message:
Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
After posting the question, the Nvidia driver started to work properly once I had waited a couple of minutes.
In the following days, I tried stopping/starting the VM instance multiple times. Sometimes nvidia-smi works right away; sometimes it still does not work after more than 20 minutes of waiting. My current best answer to this question is to first wait several minutes. If nvidia-smi still does not work, stop and start the instance again.
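If you manage the VM from the command line, the stop/start cycle can be done with gcloud (the instance name and zone below are placeholders):
# Stop and start the Deep Learning VM again
gcloud compute instances stop my-dl-vm --zone=us-central1-a
gcloud compute instances start my-dl-vm --zone=us-central1-a
# After SSH-ing back in, check the driver
nvidia-smi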
What worked for me (not sure whether it will hold up on future starts) was to remove all drivers with sudo apt remove --purge '*nvidia*', and then force the installation with sudo python3 install_gpu_driver.py.
In install_gpu_driver.py, change line 230 inside the check_driver_installed function to return False. Then run the script.
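Put together, the sequence looks roughly like this (a sketch; the edit to check_driver_installed still has to be done by hand before the last step):
# Remove all installed Nvidia driver packages
sudo apt remove --purge '*nvidia*'
# Re-download the GCP installer script
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
# Force the installation (after changing check_driver_installed to return False)
sudo python3 install_gpu_driver.py
# Verify the driver
nvidia-smi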
Those who use Docker may face the error docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] and have to reinstall Docker too. This thread helped me.

aws-shell not working in Ubuntu 20, on AWS Lightsail

I've created an AWS Lightsail instance with Ubuntu 20.04 and installed python3 and pip3.
I installed the AWS Shell tool using the pip3 install aws-shell command.
However, when I try to run it, it hangs and outputs Killed after several minutes.
This is what it looks like:
root@ip-...:/home/ubuntu# aws-shell
First run, creating autocomplete index...
Killed
root@ip-...:/home/ubuntu# aws-shell
First run, creating autocomplete index...
Killed
On the Metrics page of AWS Lightsail it shows a CPU utilization spike in the Burstable zone.
So I'm quite sad that this just wastes the CPU quota by loading the CPU for several minutes and doesn't work.
I've done the same steps on Ubuntu 16.04 in a virtual machine and it worked fine there. So I'm completely lost here and don't know how I can fix it. I tried to google this problem and didn't find anything related.
UPD: I've also just tried installing aws-shell with Python 2.7, and it still doesn't work. So it fails with both Python 3.8.5 and Python 2.7.18.
The aws-shell tool should be used on a local machine, not on an AWS Lightsail instance.
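For example, on a local machine with Python 3 and pip available, something like this should work (credentials are read from the usual ~/.aws configuration):
# Install and run aws-shell locally, not on the Lightsail instance
pip3 install --user aws-shell
aws-shell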
I wish it had a warning or info message about this, but at least I now know it was the wrong approach.

GAN Training with TensorFlow 1.4 inside Docker Stops without Prompting and without Releasing Memory, Connected to a VM over SSH

Project Details
I am running the open-source code of a GAN-based research paper named "Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition".
source code: here
The dependencies include:
Python 2.7
TensorFlow 1.4.0
I pulled a Docker image of TensorFlow 1.4.0 with Python 2.7 on my GPU virtual machine (which I connect to over SSH) with this command:
docker pull tensorflow/tensorflow:1.4.0-gpu
I am running
bash rsrgan/run_gan_rnn_placeholder.sh
according to the README of the source code.
Issue Details
Everything works: the model trains and the loss decreases. But there is one issue: after some iterations the terminal shows no further output, the GPU still shows the PID but no memory is freed, and sometimes GPU utilization drops to 0%. Training on the VM's GPU and on its CPU behaves the same way.
It is not a memory issue, because the model's GPU memory usage is 5,400 MB out of 11,000 MB, and the CPU RAM is also very large.
When I ran 21 iterations on my local computer (1st-gen i5, 4 GB RAM), with each iteration taking 0.09 hours, all iterations completed. But whenever I run it over SSH inside Docker, the issue happens again and again, with both GPU and CPU.
Just keep in mind that the issue happens inside Docker on a machine I connect to over SSH, and the SSH connection does not drop very often.
Exact numbers
If an iteration takes 1.5 hours, the issue happens after two to three iterations; if a single iteration takes 0.06 hours, the issue happens exactly after 14 out of 25 iterations.
Perform operations inside the Docker container
The first thing you can try is to build the Docker image and then enter the Docker container by specifying the -ti flags or the /bin/bash parameter in your docker run command.
Clone the repository inside the container, and while building the image also copy your training data from your local machine into the Docker image. Run the training there and commit the changes so that you need not repeat these steps in future runs; after you exit the container, all changes are lost if they are not committed.
You can find the reference for docker commit here.
$ docker commit <container-id> <image-name:tag>
While training is going on, check the GPU and CPU utilization of the VM and see if everything is working as expected.
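A rough sketch of those steps (the container name gan-train, the /data mount path, and the --runtime=nvidia flag are assumptions that depend on how nvidia-docker is set up on the VM):
# Start an interactive container from the TF 1.4.0 GPU image, mounting the training data
docker run -ti --runtime=nvidia --name gan-train -v /path/to/training-data:/data tensorflow/tensorflow:1.4.0-gpu /bin/bash
# Inside the container: clone the repository and start training, e.g.
#   git clone <repo-url> && cd <repo> && bash rsrgan/run_gan_rnn_placeholder.sh
# Back on the host, commit the container so the work is not lost
docker commit gan-train gan-train:after-training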
Use an Anaconda environment on your VM
Anaconda is a great package manager. You can install Anaconda, create a virtual environment, and run your code in that environment.
$ wget <url_of_anaconda.sh>
$ bash <path_to_sh>
$ source anaconda3/bin/activate or source anaconda2/bin/activate
$ conda create -n <env_name> python==2.7.*
$ conda activate <env_name>
Install all the dependencies via conda (recommended) or pip.
Run your code.
Q1: GAN Training with Tensorflow 1.4 inside Docker Stops without Prompting
Although Docker provides OS-level virtualization, inside Docker we sometimes face issues with processes that run with ease directly on the host system. To debug the issue, go inside the container and perform the steps above.
Q2: Training stops without Releasing Memory connected to VM with SSH Connection
Yes, this is an issue I have faced earlier as well. The best way to release the memory is to stop the Docker container. You can find more resource-allocation options here.
Also, earlier versions of TensorFlow had issues with allocating and clearing memory properly. You can find some reference here and here. These issues have been fixed in recent versions of TensorFlow.
Additionally, check for Nvidia bug reports
Step 1: Install nvidia-utils via the following command. You can find the driver version in the nvidia-smi output (also mentioned in the question).
$ sudo apt install nvidia-utils-<driver-version>
Step 2: Run the nvidia-bug-report.sh script
$ sudo /usr/bin/nvidia-bug-report.sh
A log file named nvidia-bug-report.log.gz will be generated in your current working directory. You can also access the installer log at /var/log/nvidia-installer.log.
You can find additional information about Nvidia logs at these links:
Nvidia Bug Report Reference 1
Nvidia Bug Report Reference 2
Log GPU load
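For example, GPU utilization and memory usage can be logged periodically with nvidia-smi (the interval and output file are arbitrary choices):
# Append a CSV line with GPU utilization and memory usage every 60 seconds
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv --loop=60 >> gpu_load.log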
Hope this helps.

Python thread Error: Thread is running with limited resource on GCE vm instance

We have our code built in Python, and it runs on Google Compute Engine. The code processes data files from Cloud Storage into BigQuery. We are using 8 threads for multiprocessing. It has been tested successfully in some environments, but in one environment it keeps giving this error:
{'status':'Service Running with limited resources - one or more worker threads have been terminated', 'deadthreads':7, 'threadpoolsize':8, 'alivethreads':1}
The second and all other threads are dying after it.
Can anyone help with the above error message?
The potential reason for the issue was that the code was not compatible with the latest version of the google-auth package. On VM spin-up the default version installed was google-auth 1.4.1, whereas on the other environments it was
google-auth 1.3.0.
We downgraded this package to 1.3.0 and also downgraded the grpcio package from 1.9.1 to 1.8.6 to bring the environment in sync with the tested environments.
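A quick sketch of pinning the packages to the tested versions (versions taken from above):
# Downgrade to the versions used in the tested environments
pip install google-auth==1.3.0 grpcio==1.8.6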
The threading issue is resolved now.