When I run a PyTorch model on a Google Cloud virtual machine using:
model.cuda()
I get this error:
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
As of 2020-05-01, GCP default machines run 9.12 (presumably Debian 9.12), so it's hard to use the default driver repository:
ppa:graphics-drivers
I recommend just downloading the driver yourself and installing it:
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/440.82/NVIDIA-Linux-x86_64-440.82.run
Then run it with sudo:
sudo bash NVIDIA-Linux-x86_64-440.82.run
Accept all the prompts and you should be able to use your GPU.
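As a quick sanity check after installing the driver, a minimal sketch like the one below can pick the device instead of calling model.cuda() unconditionally. The helper name select_device and its cuda_available parameter are illustrative assumptions, not part of PyTorch; the only real API used is torch.cuda.is_available().

```python
def select_device(cuda_available=None):
    """Return "cuda" if a GPU looks usable, otherwise "cpu".

    cuda_available is a hypothetical hook (a zero-argument callable) so the
    logic can be exercised without a GPU. By default it falls back to
    torch.cuda.is_available when PyTorch is installed.
    """
    if cuda_available is None:
        try:
            import torch  # imported lazily so the sketch loads without PyTorch
        except ImportError:
            return "cpu"
        cuda_available = torch.cuda.is_available
    return "cuda" if cuda_available() else "cpu"
```

With this, model.to(select_device()) falls back to the CPU when no driver is found, rather than failing with the AssertionError shown above.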
I am currently working on a Google Cloud environment with a Tesla T4 GPU type. I need to install an NVIDIA driver with it (which I did using the .run file). I downloaded the NVIDIA-Linux-x86_64-515.43.04.run file from the NVIDIA website. I also needed the CUDA Toolkit installer which I also downloaded as a .deb file into my Google Cloud instance (cuda-repo-debian11-11-7-local_11.7.0-515.43.04-1_amd64.deb).
I followed these instructions to finish installing CUDA: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local. I also tried to follow these pre and post-installation steps to set up the driver: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-overview.
I get some weird errors. For example, when I run the nvidia-smi command in the Cloud CLI, I get this error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. I am confused, since I'm pretty sure I downloaded the latest version.
I am a noob in Python and GPU-related things, so I don't really know what I'm doing. Can someone help me install the NVIDIA driver on my Google Cloud instance? Thanks!!
I have followed this GCP guide with Ubuntu 18 and 20 (and have also tried Ubuntu Lite, Debian and CentOS 7) but, unfortunately, after completing the lengthy install I get this:
me@gpu:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
I have tried installing via the script and via the direct downloads from the NVIDIA site for CUDA 10. Ready to pull my hair out if that helps! I don't understand how a company that builds a bazillion GPUs can't make the installation process robust.
I have also tried these recommendations with no luck.
I was able to get it working. The mistake I was making was not doing the pre-installation steps before running the cuda_10.1.243_418.87.00_linux.run script. I was under the impression the *.run file would do everything for me. It would help if users were told they MUST do the pre-installation steps. Specifically I had to do this for Ubuntu 18:
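sudo nano /etc/modprobe.d/blacklist-nouveau.conf
# add these two lines to the file, then save it:
blacklist nouveau
options nouveau modeset=0
# back in the shell, rebuild the initramfs and reboot:
sudo update-initramfs -u
sudo reboot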
This seems like a bit of a “hack”, so not sure why nvidia can’t make the installation process more robust? They make a bazillion of these cards. It’s not like some homemade product with a niche user base…
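To confirm that the nouveau blacklist took effect after the reboot, one option is to look for the module in /proc/modules. The sketch below is a minimal illustration under that assumption; the function names module_loaded and nouveau_loaded are hypothetical helpers, not part of any NVIDIA tooling.

```python
from pathlib import Path


def module_loaded(name, modules_text):
    """Return True if `name` appears as a module in /proc/modules-style text."""
    for line in modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def nouveau_loaded(path="/proc/modules"):
    """Check the running kernel (Linux only) for the nouveau module."""
    return module_loaded("nouveau", Path(path).read_text())
```

If nouveau_loaded() still returns True after rebooting, the blacklist file was probably not picked up, and the NVIDIA installer is likely to fail in the same way again.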
If you've installed the driver many times and nvidia-smi still fails to communicate, take a look at prime-select.
Run prime-select query to see which profile is currently selected; the available profiles should include at least nvidia | intel.
Select the NVIDIA profile with sudo prime-select nvidia.
If nvidia was already selected, switch to a different profile first, e.g. sudo prime-select intel, and then switch back with sudo prime-select nvidia.
Reboot and check nvidia-smi.
It may also help to install the toolkit again:
sudo apt install nvidia-cuda-toolkit
When it finishes, reboot the machine, and nvidia-smi should work then.
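For the repeated "reboot and check nvidia-smi" steps, a small script can tell a missing binary apart from a driver that is installed but not loaded. This is only a sketch: the helper names and result labels are made up for illustration, and it assumes nvidia-smi exits with a non-zero status when it prints the failure message quoted in this thread.

```python
import shutil
import subprocess

# Phrase taken from the nvidia-smi error message quoted in this thread.
FAILURE_MARKER = "couldn't communicate with the NVIDIA driver"


def classify_smi(returncode, output):
    """Map an nvidia-smi exit status and output to a short label."""
    if returncode == 0:
        return "ok"
    if FAILURE_MARKER in output:
        return "driver-not-loaded"
    return "other-error"


def check_driver():
    """Run nvidia-smi if it is on PATH and classify the result."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi-not-found"
    proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return classify_smi(proc.returncode, proc.stdout + proc.stderr)
```

A result of "driver-not-loaded" points at the kernel module (nouveau conflict, missing headers, wrong prime-select profile) rather than at the userspace packages.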
In other cases, it works to follow these instructions for installing cuDNN and CUDA on VMs: cuda_11.2_installation_on_Ubuntu_20.04.
Finally, in some cases the problem is caused by unattended-upgrades. Take a look at its settings and adjust them if it is causing unexpected results. The Debian documentation (UnattendedUpgrades) covers this, and I see that you have already tested with that distro.
I created the VM using the GCP Console in the browser.
While creating the VM, I selected the image "c2-deeplearning-pytorch-1-8-cu110-v20210619-debian-10" and a T4 GPU.
The VM is created and started, and it shows a green icon in the browser.
When I then connect with "gcloud compute ssh ", it asks whether I want to install the NVIDIA driver. I answer Y, but it gives a lock file error and the driver is not installed:
This VM requires Nvidia drivers to function correctly. Installation
takes ~1 minute. Would you like to install the Nvidia driver? [y/n] y
Installing Nvidia driver. install linux headers:
linux-headers-4.19.0-16-cloud-amd64 E: dpkg was interrupted, you must
manually run 'sudo dpkg --configure -a' to correct the problem.
Nvidia driver installed.
I then check whether the driver is installed by running this Python code:
import torch
torch.cuda.is_available()  # returns False
Has anybody else faced this issue?
This is the correct way to install NVIDIA driver on a GCP instance:
cd /
sudo apt purge 'nvidia-*'
sudo reboot
cd /
sudo wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
sudo sh cuda_11.2.2_460.32.03_linux.run
Answer the installer's prompts in the terminal to suit your setup, then reboot:
sudo reboot
The solution to my problem was:
Run manually: sudo dpkg --configure -a
Disconnect from the machine.
Connect again using SSH and answer Y again when asked to install the NVIDIA driver.
It works after that.
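Note that the installer in the question printed "Nvidia driver installed." even though dpkg had failed. A rough sketch of scanning captured installer output for that error follows; the helper name dpkg_interrupted is hypothetical, and the marker is simply the phrase from the log quoted above.

```python
# Phrase taken from the installer log quoted in this thread.
DPKG_MARKER = "dpkg was interrupted"


def dpkg_interrupted(installer_output):
    """Return True if the output contains the dpkg interruption error."""
    # Collapse line breaks first, since the message may wrap across lines.
    flattened = " ".join(installer_output.split())
    return DPKG_MARKER in flattened
```

When this returns True, running sudo dpkg --configure -a before retrying the installer is the step that resolved it here.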
Make sure you are running as root. I know this sounds silly, but on their notebook instances the default user is not root, so if you SSH into the instance and run something like gpustat or your own code, you might get errors saying the NVIDIA drivers are not loaded.
If you make sure your user (called jupyter by default) is in sudoers, everything will work fine.
It is often very complicated to install or reinstall GPU drivers on GCP instances. Make sure you actually need to reinstall before you attempt other solutions.
I am new to Docker and I am trying my hand at it.
I have an Ubuntu 18.04 image and am running an interactive container from it.
I want to install .NET Core 3.1 in it and commit it to my image.
I am following this link to install .NET Core.
I have installed .NET Core on one of my EC2 machines in a similar way.
Here is the command that I am running:
wget https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
I am getting an error below:
--2021-01-08 11:38:46-- https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb
Resolving packages.microsoft.com (packages.microsoft.com)... failed: Temporary failure in name resolution.
wget: unable to resolve host address 'packages.microsoft.com'
Could someone help me out with this so that I can install .NET Core in this container?
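The wget output points at DNS rather than at the package host itself. To check name resolution from inside the container independently of wget, a minimal sketch (assuming Python is available in the container) could look like this. The helper name can_resolve and its resolver parameter are illustrative; socket.getaddrinfo is the standard-library call doing the lookup.

```python
import socket


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address."""
    try:
        return len(resolver(host, 443)) > 0
    except socket.gaierror:
        # Raised for failures such as "Temporary failure in name resolution".
        return False
```

If can_resolve("packages.microsoft.com") is False inside the container but True on the host, the container's DNS configuration is the likely culprit rather than the Microsoft repository.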
Trying to get set up with Vagrant but getting the error:
The "VBoxManage" command or one of its dependencies could not be found.
Please verify VirtualBox is properly installed. You can verify everything
is okay by running "VBoxManage --version" and verifying that the VirtualBox
version is outputted.
Just confused because the Vagrant documentation states:
"The getting started guide will use Vagrant with VirtualBox, since it is free, available on every major platform, and built-in to Vagrant."
I don't want to install VirtualBox separately if it's supposed to be included when I installed Vagrant. I'm running OS X 10.8, if that's relevant. I'm guessing I just need to install VirtualBox? If that's the case, what do they mean in the documentation when they say it's "built-in"?
Installing VirtualBox is required if you plan on using VirtualBox with Vagrant. I'm guessing they meant that the VirtualBox integration is built in?
Recently, they've abstracted out the VirtualBox-specific code and are working on allowing multiple providers. I believe VMware is now supported in addition to VirtualBox.
I had this message but my problem was different. I use VMware Fusion as the provider, and Vagrant was not able to detect which provider I was using, so it assumed VirtualBox. I fixed the issue by passing the --provider flag to vagrant up. Here is the full command:
vagrant up --provider vmware_fusion