GCP VM not installing nVidia driver properly - google-cloud-platform

I created the VM using the GCP Console in the browser.
While creating the VM, I selected the image "c2-deeplearning-pytorch-1-8-cu110-v20210619-debian-10" and a T4 GPU.
The VM gets created and started, and it shows a green icon in the browser.
Then I try to connect with "gcloud compute ssh". It asks if I want to install the NVIDIA driver and I answer Y; it then gives an error about a lock file and the driver is not installed:
This VM requires Nvidia drivers to function correctly. Installation
takes ~1 minute. Would you like to install the Nvidia driver? [y/n] y
Installing Nvidia driver. install linux headers:
linux-headers-4.19.0-16-cloud-amd64 E: dpkg was interrupted, you must
manually run 'sudo dpkg --configure -a' to correct the problem.
Nvidia driver installed.
I try to verify whether the driver is installed by running this Python code:
import torch
torch.cuda.is_available()  # returns False
Has anybody else faced this issue?

This is the correct way to install the NVIDIA driver on a GCP instance:
cd /
sudo apt purge nvidia-*
Reboot
cd /
sudo wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
sudo sh cuda_11.2.2_460.32.03_linux.run
Adjust your configuration as the installer presents options in the terminal.
Reboot
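After the final reboot, a quick way to confirm the driver is loaded and visible to PyTorch (a minimal check I added, not part of the original answer) is:
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"  # should print True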

Solution to my problem was:
Run manually: sudo dpkg --configure -a
Disconnect from the machine.
Connect again using SSH. Select Y again when asked to install the NVIDIA driver.
It works then.
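As a rough sketch of that sequence (the instance name and zone below are placeholders):
sudo dpkg --configure -a
exit
gcloud compute ssh my-instance --zone=us-central1-a   # answer Y when prompted to install the driver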

Make sure you are running as root. I know this sounds silly, but if you use their notebook instances, the default user is not root, and if you SSH into the instance and run something like gpustat or other custom code, you might get errors such as "NVIDIA drivers are not loaded".
If you make sure your user (which is called jupyter in the default case) is in the sudoers, then all will work fine.
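A minimal sketch of checking and fixing this, assuming the default jupyter user and the standard sudo group on Debian-based images:
sudo -l -U jupyter              # list what jupyter is allowed to run with sudo
sudo usermod -aG sudo jupyter   # add jupyter to the sudo group if it is missing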
It is often very complicated to install or reinstall GPU drivers on GCP instances. Make sure you actually need to reinstall before you attempt other solutions.

Related

Can not find NVIDIA driver after stop and start a deep learning VM

[TL;DR] First, wait for a couple of minutes and check if the Nvidia driver starts to work properly. If not, stop and start the VM instance again.
I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, when I run nvidia-smi I get the following error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
But if I type which nvidia-smi, I get:
/usr/bin/nvidia-smi
It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a deep learning VM? The first time I created and opened the instance, the driver was installed automatically.
The system information is (using uname -m && cat /etc/*release):
x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
I tried the installation scripts from GCP. First run
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
And then run
sudo python3 install_gpu_driver.py
which gives the following message:
Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
After posting the question, the NVIDIA driver started to work properly after I waited a couple of minutes.
In the following days, I tried stopping/starting the VM instance multiple times. Sometimes nvidia-smi works right away, sometimes it does not even after waiting more than 20 minutes. My current best answer to this question is to first wait several minutes. If nvidia-smi still does not work, stop and start the instance again (as shown below).
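For reference, stopping and starting the instance can be done from another machine with gcloud (the instance name and zone are placeholders):
gcloud compute instances stop my-dl-vm --zone=us-central1-a
gcloud compute instances start my-dl-vm --zone=us-central1-a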
What worked for me (I am not sure it will hold across future restarts) was to remove all drivers with sudo apt remove --purge '*nvidia*', and then force the installation with sudo python3 install_gpu_driver.py.
In install_gpu_driver.py, change line 230 to return False inside the check_driver_installed function, then run the script.
Anyone using Docker may also face the error docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] and have to reinstall Docker too. This thread helped me.
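If you hit that Docker error, one common remedy (a sketch under my own assumptions, not necessarily what the linked thread does) is to reinstall and reconfigure the NVIDIA Container Toolkit and restart Docker:
sudo apt-get install --reinstall -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker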

Unable to install NVIDIA driver on various GCP Ubuntu VMs with Tesla K80 GPU

I have followed this GCP guide with Ubuntu 18 and 20 (I have also tried Ubuntu Lite, Debian and CentOS 7) but, unfortunately, after completing the lengthy install I get this:
me@gpu:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
I have tried installing via the script and via the direct downloads from the NVIDIA site for CUDA 10. Ready to pull my hair out, if that helps! I don't understand how a company that builds a bazillion GPUs can't make the installation process robust.
I have also tried these recommendations with no luck.
I was able to get it working. The mistake I was making was not doing the pre-installation steps before running the cuda_10.1.243_418.87.00_linux.run script. I was under the impression the *.run file would do everything for me. It would help if users were told they MUST do the pre-installation steps. Specifically, I had to do this for Ubuntu 18:
sudo nano /etc/modprobe.d/blacklist-nouveau.conf
Add these two lines to that file:
blacklist nouveau
options nouveau modeset=0
Then rebuild the initramfs and reboot:
sudo update-initramfs -u
reboot
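After the reboot, you can confirm the nouveau driver is no longer loaded before running the .run installer (the command should print nothing):
lsmod | grep nouveau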
This seems like a bit of a “hack”, so not sure why nvidia can’t make the installation process more robust? They make a bazillion of these cards. It’s not like some homemade product with a niche user base…
If you've installed the driver several times and nvidia-smi is still failing to communicate, take a look at prime-select.
Run prime-select query to list the available options; it should show at least nvidia | intel.
Select the NVIDIA driver with prime-select nvidia.
If nvidia is already selected, switch to a different one first, e.g. prime-select intel, and then switch back with prime-select nvidia.
Reboot and check nvidia-smi (the full sequence is sketched below).
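Put together, the sequence looks roughly like this (sudo may or may not be needed, depending on how the machine is set up):
prime-select query         # shows the currently selected driver
sudo prime-select intel    # switch away first if nvidia is already selected
sudo prime-select nvidia   # switch back to the NVIDIA driver
sudo reboot
nvidia-smi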
Plus, it could be a good idea to run again:
sudo apt install nvidia-cuda-toolkit
When it finishes, reboot the machine, and nvidia-smi should work then.
In other cases, it works to follow these instructions to install cuDNN and CUDA on VMs: cuda_11.2_installation_on_Ubuntu_20.04.
And finally, in some other cases the problem is caused by unattended-upgrades. Take a look at its settings and adjust them if they are causing unexpected results. This URL has the documentation for Debian (UnattendedUpgrades), and I was able to see that you already tested with that distro.
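A minimal sketch of turning unattended upgrades off while debugging, assuming the stock Debian/Ubuntu configuration files:
sudo dpkg-reconfigure -plow unattended-upgrades   # answer "No" to disable
Or set both periodic options to "0" in /etc/apt/apt.conf.d/20auto-upgrades:
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";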

How to add cuda drivers to gcp Ubuntu vm?

When I run a PyTorch model on a Google Cloud virtual machine using:
model.cuda()
I get this error:
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
As of 2020-05-01, GCP default machines run 9.12, so it's hard to use the default repository:
ppa:graphics-drivers
I recommend just downloading the driver yourself and installing it:
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/440.82/NVIDIA-Linux-x86_64-440.82.run
Then run it with sudo:
sudo bash NVIDIA-Linux-x86_64-440.82.run
Accept everything, and you should be able to use your GPU.
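To confirm PyTorch can now see the GPU (a quick check I added, mirroring the model.cuda() call in the question):
nvidia-smi
python3 -c "import torch; torch.zeros(1).cuda(); print('CUDA OK')"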

CloudInit, Kernel Upgrades & DKMS?

I'm using cloud-init to configure my EC2 instances at launch time, currently just on CentOS 7. I need to upgrade to the latest kernel, etc., so first I have:
package_upgrade: true
Then I add a bunch of repos and install some packages with yum that ultimately compile some kernel modules with DKMS (the NVIDIA drivers).
Finally I reboot the system with:
power_state:
  mode: reboot
  timeout: 30
This all works great! However, when the system comes back up, DKMS reports that the nvidia driver is "added" but not installed, and the NVIDIA driver doesn't work. If I run yum reinstall nvidia-kmod, everything works. So obviously what's happening is that the kernel module is being compiled and installed for the previous kernel rather than the new kernel.
So what is the suggested way to solve this? Is there a way to reboot after the package_upgrade but before any of the other steps? Is there a way to force nvidia-kmod to compile for the new kernel and not the current kernel? Any other ideas?
Looks like the only real option is to create a cloud-init per-boot script that runs dkms autoinstall. On every boot, this attempts to compile any "added" kernel modules that aren't yet installed.
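A minimal sketch of such a per-boot script, assuming cloud-init's standard per-boot directory (the file name is arbitrary and the script must be marked executable):
#!/bin/bash
# /var/lib/cloud/scripts/per-boot/50-dkms-autoinstall.sh
# Rebuild any DKMS modules still in the "added" state for the running kernel.
dkms autoinstall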

Launching anaconda spyder gui in cygwin

I am connecting my Windows 7 computer to a Linux-based cluster using Cygwin. Within a specific node in the cluster, I want to launch the Anaconda Spyder GUI.
To launch Spyder, you simply type spyder into Cygwin,
but that returns:
QXcbConnection: Could not connect to display
Aborted (core dumped)
I also tried:
QT_QPA_PLATFORM=offscreen spyder
but that returns:
QFontDatabase: Cannot find font directory /home/spotter/anaconda2/lib/fonts - is Qt installed correctly?
I installed qt4 dev-tools, but it didn't change anything.
EDIT:
I installed xinit and xorg, and now I try this:
Before logging in with SSH, I run:
export DISPLAY=localhost:0.0
Then I log in using SSH:
ssh -Y -X username@machine
And now when I try to use Spyder, I get:
connect localhost port 6000: Connection refused
QXcbConnection: Could not connect to display localhost:11.0
So it sounds like you are running Cygwin on your local Windows machine, logging into a remote server with ssh, and running spyder from that machine with the intent of having it show up on your local screen. Now that you have startx working, you are close to a solution.
Between steps 5 and 6, you need to run the export DISPLAY command on the remote machine and set it to the name of your local computer. You will need to know your hostname for this. The steps will look like this:
startx
ssh -Y -X username@machine
export DISPLAY=win-machine-name:0.0
spyder
The last two commands are executed on the remote machine. I just made up win-machine-name; in its place, put the IP address or machine name of your Windows machine. That is how you set the DISPLAY environment variable on the remote machine, so X clients know where to send the graphics commands.
Hope this helps!
For me, what I did was:
Install the packages associated with startx
Change the sshd_config file to allow X11 forwarding (see the sketch after this list)
export DISPLAY=localhost:0.0
startx
log in with ssh -Y -X username@machine
spyder
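For step 2, a sketch of what that change looks like (assuming the usual /etc/ssh/sshd_config location on the remote machine):
X11Forwarding yes             # line to set in /etc/ssh/sshd_config
sudo systemctl restart sshd   # then restart sshd so the change takes effect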