google cloud vm cannot find GPU after restart - google-cloud-platform

I created a Google Cloud VM with a GPU. The GPU was working after the instance was created.
But after restarting the VM, the GPU was gone.
I ran nvidia-smi and got this error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Does anyone know how to fix this? Thanks.

Related

NVIDIA Driver not installing correctly

I am currently working in a Google Cloud environment with a Tesla T4 GPU. I need to install an NVIDIA driver for it (which I did using the .run file). I downloaded the NVIDIA-Linux-x86_64-515.43.04.run file from the NVIDIA website. I also needed the CUDA Toolkit installer, which I downloaded as a .deb file onto my Google Cloud instance (cuda-repo-debian11-11-7-local_11.7.0-515.43.04-1_amd64.deb).
I followed these instructions to finish installing CUDA: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local. I also tried to follow the pre- and post-installation steps to set up the driver: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-overview.
I get some odd errors. For example, when I run the nvidia-smi command in the Cloud CLI I get this error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. I am confused, since I'm pretty sure I downloaded the latest version.
I am a noob at Python and GPU-related things, so I don't really know what I'm doing. Can someone help me install the NVIDIA driver on my Google Cloud instance? Thanks!!
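For reference, the deb(local) flow on that download page boils down to roughly the following on Debian 11 (a sketch following NVIDIA's documented steps for this release; mixing it with a .run-file driver install is a common cause of exactly this error, so it is usually safer to let the packages pull in the driver):
# Kernel headers are required so the NVIDIA kernel module can be built.
sudo apt-get install -y linux-headers-$(uname -r)
# Register the local CUDA repository that was downloaded from NVIDIA.
sudo dpkg -i cuda-repo-debian11-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo cp /var/cuda-repo-debian11-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
# The NVIDIA packages live in Debian's contrib component.
sudo add-apt-repository contrib
sudo apt-get update
# Installs the toolkit together with the matching 515.43.04 driver.
sudo apt-get -y install cuda
# Reboot so the new kernel module is actually loaded.
sudo reboot
After the reboot, nvidia-smi should show the T4. If it still fails, the driver from the .run file probably has to be removed first (the .run installer normally provides an nvidia-uninstall command for that) before reinstalling from the packages.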

Cannot find NVIDIA driver after stopping and starting a deep learning VM

[TL;DR] First, wait for a couple of minutes and check if the Nvidia driver starts to work properly. If not, stop and start the VM instance again.
I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, when I run nvidia-smi I get the following error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
But if I type which nvidia-smi, I get
/usr/bin/nvidia-smi
It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a deep learning VM? The first time I created and started the instance, the driver was installed automatically.
The system information is (using uname -m && cat /etc/*release):
x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
I tried the installation script from GCP. First, run
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
And then run
sudo python3 install_gpu_driver.py
which gives the following message:
Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
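That output suggests the script only checks whether nvidia-smi is on PATH, not whether the kernel module is actually loaded, so it can report "Already installed." while the driver is unusable. A quick way to see the real state (the dkms check only applies if the driver was installed through DKMS):
# No output here means no NVIDIA kernel module is loaded.
lsmod | grep nvidia
# If the driver was installed through DKMS, check that a module was built
# for the kernel that is currently running.
sudo dkms status
uname -r
# Try loading the module by hand and look at the kernel log for errors.
sudo modprobe nvidia
sudo dmesg | grep -i nvidia | tail -n 20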
After posting the question, the NVIDIA driver started to work properly after a couple of minutes of waiting.
Over the following days, I tried stopping and starting the VM instance multiple times. Sometimes nvidia-smi works right away; sometimes it still does not after more than 20 minutes of waiting. My current best answer to this question is to first wait several minutes. If nvidia-smi still does not work, stop and start the instance again.
What worked for me (not sure if it will hold across future restarts) was to remove all drivers with sudo apt remove --purge '*nvidia*', and then force the installation with sudo python3 install_gpu_driver.py.
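Spelled out as a full sequence (the autoremove step and the final check are additions on top of what the answer above lists):
# Remove every NVIDIA package left over from the broken install.
sudo apt remove --purge '*nvidia*'
sudo apt autoremove -y
# Re-run Google's installer so it puts a fresh driver in place.
sudo python3 install_gpu_driver.py
# Verify that the driver can talk to the GPU again.
nvidia-smi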
In install_gpu_driver.py, change line 230 to return False inside the check_driver_installed function, then run the script.
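If you prefer not to open an editor, the same effect can be had from the shell (a sketch; the exact line number and indentation can differ between revisions of the script, so inspect it first):
# Look at the area around line 230 to find check_driver_installed.
sed -n '220,240p' install_gpu_driver.py
# Make the function always report "not installed" so the driver is reinstalled
# (the four-space indentation is an assumption about the script's formatting).
sed -i '230s/.*/    return False/' install_gpu_driver.py
sudo python3 install_gpu_driver.py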
Anyone using Docker may also face this error: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]], and may have to reinstall Docker as well. This thread helped me.
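For the Docker side, reinstalling the piece that wires the driver into Docker and restarting the daemon usually clears that error (a sketch, assuming the NVIDIA container toolkit apt repository is already configured; the CUDA image tag is only an example):
# Reinstall the component that exposes the NVIDIA driver to Docker.
sudo apt-get install --reinstall -y nvidia-container-toolkit
sudo systemctl restart docker
# Check that containers can see the GPU again.
sudo docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi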

Why is nvidia-smi not working on deep learning ami + aws g5.xlarge

I want to use TensorFlow on an AWS g5.xlarge. For the AMI, I used AWS Deep Learning AMI (Ubuntu 18.04) version 50.0. But when I start the instance and try nvidia-smi, I get the following error. Why am I getting this error even though I used a Deep Learning AMI?
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Installing NVIDIA drivers for an application on K8s

We have a Flask app deployed on k8s. The base image of the app is https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/, and we build our app on top of it. We ship our Docker image to ECR and then deploy pods on k8s.
We want to start running ML models in our k8s nodes. The underlying nodes have GPUs (we're using g4dn instances), and they are using a GPU AMI.
When running our app, I'm seeing the following error:
/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
What's the right way to get CUDA installed on our nodes? I would have expected it to be built into the AMI shipped with the GPU instances, but that doesn't seem to be the case.
There are a couple of options:
Use tensorflow:latest-gpu as the base image and set up additional configuration for your system.
Set up the CUDA drivers yourself in your Docker image (either way, a rough check of GPU access at each layer is sketched below).
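A rough way to confirm where GPU access breaks, assuming the g4dn nodes already have a driver from the GPU AMI (the CUDA image tag and node name are only examples):
# 1. On the g4dn node itself: the GPU AMI should already ship a driver.
nvidia-smi
# 2. Through the container runtime on that node (needs the NVIDIA container
#    runtime on the node).
sudo docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
# 3. From the cluster: the node should advertise nvidia.com/gpu, which requires
#    the NVIDIA device plugin to be running.
kubectl describe node <node-name> | grep nvidia.com/gpu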

Can't start VM in Virtualbox after update

I just updated VirtualBox to the latest 4.1.16 r78094 and realized that I can no longer start my virtual machines.
If I start a VM, the error displayed is:
Failed to open a session for the virtual machine Historical Image.
Failed to load VMMR0.r0 (VERR_NO_MEMORY).
Result Code: NS_ERROR_FAILURE (0x80004005)
Component: Console
Interface: IConsole {1968b7d3-e3bf-4ceb-99e0-cb7c913317bb}
I have 8 GB of RAM in my VM host, which is running OS X 10.6, and over 4 GB of it is free. Does anyone know how I can get my VMs working again?
It is a well-known issue, and at the moment you have two options:
install the new version from http://download.virtualbox.org/virtualbox/4.1.18/VirtualBox-4.1.18-78361-OSX.dmg
downgrade to the previous version: https://www.virtualbox.org/wiki/Download_Old_Builds_4_1