Unable to get a Debian VM to work with a K80 - google-cloud-platform

I created a Deep Learning VM to run a project that uses some custom TensorFlow models together with the Google Vision API and the Google NLU API.
I set up a machine with Debian 10 and TensorFlow 2.4 (CUDA 11), and I chose 1 NVIDIA K80 GPU. I installed CUDA 11 using this link. When I run nvidia-smi, I get this famous ugly message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I tried to install CUDA 10 or any other version, but it does not exist for Debian at all: see this cuda 10
How can I resolve this problem, please?

I tried to reproduce this error in my own project.
I have installed a VM Instance with the following characteristics:
Machine type: n1-standard-1
GPUs: 1 x NVIDIA Tesla K80
Boot disk: debian-10-buster-v20201216
As you mentioned in your post, there are no Debian drivers listed for CUDA Toolkit 10, so I used the steps described in this link to install it. I had some complications installing the drivers, and in the end I was able to reproduce your issue; I got the following message after the installation:
$ sudo nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
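At this point (a hedged aside, not part of the original answer) a few standard checks can show whether the kernel module ever loaded:
$ lspci | grep -i nvidia            # is the K80 visible on the PCI bus?
$ lsmod | grep nvidia               # is the nvidia kernel module loaded?
$ dmesg | grep -i nvidia | tail     # any driver load errors at boot?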
I tried again, but now I changed my installation a little bit:
Machine type: n1-standard-1
GPUs: 1 x NVIDIA Tesla K80
Boot disk: c0-common-gce-gpu-image-20200128
The boot disk I used this time, c0-common-gce-gpu-image-20200128, is a GPU-optimized Debian image, m32 (with CUDA 10.0): a Debian 9 based image with CUDA/cuDNN/NCCL pre-installed.
When I accessed this instance through SSH for the first time, I received the following question:
This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n] y
Installing Nvidia driver.
And it automatically installed the drivers.
$ sudo nvidia-smi
Thu Jan  7 19:08:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    91W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I also tried with a TensorFlow image, since you mentioned that you are using TensorFlow: c0-deeplearning-tf-1-15-cu110-v20201229-debian-10
According to the information for this image, it is a Deep Learning Image: TensorFlow 1.15, m61 CUDA 11.0, a Debian 10 based Linux image with TensorFlow 1.15 (with CUDA 11.0 and Intel(TM) MKL-DNN, Intel(R) MKL) plus Intel(TM)-optimized NumPy, SciPy, and scikit-learn.
In this case I verified the TensorFlow installation too:
$ python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2021-01-07 20:29:02.854218: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
Tensor("Sum:0", shape=(), dtype=float32)
And it works well.
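As an additional hedged check (not in the original output), TensorFlow 1.x can also report whether it actually sees the GPU:
$ python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"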
Hence, there seems to be a compatibility problem between the installed image (Debian 10) and the CUDA Toolkit needed for this GPU type (NVIDIA K80).
My suggestion here is to use a Deep Learning VM image. You can see the full list at this link: Choosing an image
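For example, here is a minimal sketch of creating such an instance with gcloud (the instance name, zone, and image family are placeholders/assumptions; the install-nvidia-driver metadata key asks the image to install the driver on first boot):
$ gcloud compute instances create my-dl-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-1 \
    --accelerator=type=nvidia-tesla-k80,count=1 \
    --image-family=tf-latest-gpu \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"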

Related

How to install DPDK kernel module?

I have installed DPDK 18.08 on a CentOS 7 machine, which has the kernel sources installed.
I built dpdk using:
$ make -j T=x86_64-native-linuxapp-gcc install
<snip>
Build complete [x86_64-native-linuxapp-gcc]
Installation cannot run with T defined and DESTDIR undefined
I want to interface DPDK to an HP NIC that uses the Intel X722 chipset. So I then run:
$ /opt/dpdk/dpdk-18.08/usertools/dpdk-devbind.py -b igb_uio `lspci | grep X722 | awk '{print $1}'`
Error - no supported modules(DPDK driver) are loaded
I think that this error means that the DPDK kernel module is not installed.
How can I fix this?
Based on the comment interaction, the root cause was the missing kernel module. Loading igb_uio (or vfio-pci) solved the problem.
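A minimal sketch of that fix, assuming DPDK 18.08 was built under /opt/dpdk with the x86_64-native-linuxapp-gcc target (paths taken from the question; adjust to your build directory):
# Option 1: load the igb_uio module built alongside DPDK
$ sudo modprobe uio
$ sudo insmod /opt/dpdk/dpdk-18.08/x86_64-native-linuxapp-gcc/kmod/igb_uio.ko
# Option 2: use the kernel's vfio-pci driver instead (requires IOMMU support)
$ sudo modprobe vfio-pci
# Then retry the binding
$ sudo /opt/dpdk/dpdk-18.08/usertools/dpdk-devbind.py -b igb_uio `lspci | grep X722 | awk '{print $1}'`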

Wrong gcc version linked with nvidia

I had gcc-5 and gcc-7 installed, and when I tried to compile a CUDA sample with 'make' I got lots of errors. After some research I saw that I needed to downgrade my gcc, so I thought the system was using gcc-7 instead of the other one, and I uninstalled it using purge; but then gcc was not even recognized, and gcc --version gave an error. So I purged the other gcc too and installed it again with 'sudo apt-get update' and 'sudo apt-get install build-essential'. 'gcc --version' works again now, but my CUDA drivers aren't working anymore: nvidia-smi results in "command not found" and I can't run any CUDA sample, although now I can compile them. For example, deviceQuery returns:
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
'nvcc --version' also works; here's the output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Running 'lshw -numeric -C display' results in:
WARNING: you should run this program as super-user.
*-display
description: 3D controller
product: GM107M [GeForce GTX 950M] [10DE:139A]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci#0000:01:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:38 memory:f6000000-f6ffffff memory:e0000000-efffffff memory:f0000000-f1ffffff ioport:e000(size=128) memory:f7000000-f707ffff
*-display
description: VGA compatible controller
product: 4th Gen Core Processor Integrated Graphics Controller [8086:416]
vendor: Intel Corporation [8086]
physical id: 2
bus info: pci#0000:00:02.0
version: 06
width: 64 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0
resources: irq:34 memory:f7400000-f77fffff memory:d0000000-dfffffff ioport:f000(size=64) memory:c0000-dffff
WARNING: output may be incomplete or inaccurate, you should run this program as super-user.
I didn't change anything about my drivers, but reinstalling gcc broke them. How can I solve this?
Thanks
-- EDIT --
When I do 'locate nvidia-smi' I get the following result:
/etc/alternatives/x86_64-linux-gnu_nvidia-smi.1.gz
/usr/bin/nvidia-smi
/usr/share/man/man1/nvidia-smi.1.gz
However, when I go into those directories there is no nvidia-smi executable under /usr/bin, and there is no nvidia-smi.1.gz under /usr/share/man/man1/.
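(Side note, not from the original post: locate reads a cached database, so it can list files that no longer exist; refreshing the database and checking the filesystem directly makes the discrepancy obvious.)
$ sudo updatedb                  # rebuild the locate database
$ locate nvidia-smi              # the stale entries should now be gone
$ command -v nvidia-smi || echo "nvidia-smi is not on PATH"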
Doing 'cat /proc/driver/nvidia/version' I get:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017
GCC version: gcc version 7.2.0 (Ubuntu 7.2.0-1ubuntu1~16.04)
It still shows the old gcc; I now have gcc-5, not 7.
I managed to solve this, and actually it was very simple: I just had to reinstall my NVIDIA drivers by doing:
sudo apt-get purge nvidia*
sudo apt-get update
sudo apt-get install nvidia-384
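After the reinstall, a quick sanity check is that the userspace tools and the kernel module agree again:
$ nvidia-smi
$ cat /proc/driver/nvidia/version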

Building PCL library with GPU/CUDA support

I'm building the PCL library master branch on Windows 8.1 using CMake:
All other modules are successfully built except the gpu/cuda modules!
Here is the error log
Observation:
-ccbin $(VCInstallDir)bin -> the environment variable is not being expanded, and because of that this error is generated: '$' is not recognized as an internal or external command, operable program or batch file. Am I right? What else could be the problem?
Note that only the pcl_gpu_containers module was successfully built.
Can someone please help me fix this?
Version Details:
Microsoft Visual Studio Version: 11 (VS Prof 2012)
cuda toolkit: 7.5
boost version: boost-1_57
eigen: 3.3
VTK Version: 6.2
PC Info:
OS Name Microsoft Windows 8.1 Pro N
Version 6.3.9600 Build 9600
System Type x64-based PC
Processor AMD FX(tm)-9590 Eight-Core Processor, 4700 Mhz, 4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM) 8.00 GB
Name NVIDIA GeForce GT 610
Adapter Type GeForce GT 610, NVIDIA compatible
Adapter RAM (2,147,483,648) bytes
Name NVIDIA GeForce GT 730
Adapter Type GeForce GT 730, NVIDIA compatible
Adapter RAM (2,147,483,648) bytes
Here is my CMakeCache.txt
IIRC, that issue was related to a missing environment variable setting.
On my system, this setting was missing after installing CUDA as admin and then working as a non-admin user.
After fixing this, the variables are now set as follows:
CUDA_PATH =
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5
and
PATH =
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin;
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\libnvvp;
...

Linux VMware 11 guest mesa drivers install to update OpenGL

I am currently using a Linux Mint guest in VMware 11, on a Windows 8.1 host.
I am basically trying to update OpenGL so that I can program with higher versions of GLSL and OpenGL, such as 3.3. As it currently stands, when I check with glxinfo | grep OpenGL, I get:
$ glxinfo | grep OpenGL
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: Gallium 0.4 on SVGA3D; build: RELEASE;
OpenGL version string: 2.1 Mesa 10.1.3
OpenGL shading language version string: 1.20
OpenGL extensions:
Now, I went to the Mesa3D website and downloaded 10.5.6 (as you can see above, the one I have installed is 10.1.3), in which the OpenGL version has been updated to 3.3, which I would prefer over the current version (GLSL 1.20). I downloaded the tar, extracted it, ran ./configure as instructed, ran make (as ./configure finished by telling me to run make), and finally ran sudo make install, which was the last autoconf instruction. Everything completed successfully; I even ran the steps a second time when I found out that glxinfo | grep OpenGL was still giving me the same output as before I "installed" the Mesa 3D drivers.
So I am trying to find out what is missing here and why this is not installing.
Is it because I am running a guest Linux OS under VMware, or is it something less complicated and actually fixable?
Did you uninstall the system Mesa installation? If not, you've got two Mesa installations side by side, and that will not end well. Also, when configuring, you should select the parts of Mesa you want, which in your case would be none of the GPU drivers and just the softpipe implementation.
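A minimal sketch of such a configure line for the autoconf-era Mesa 10.x releases (flag names vary between versions, so check ./configure --help before relying on these):
$ ./configure --with-gallium-drivers=swrast --with-dri-drivers=swrast --disable-egl
$ make -j"$(nproc)"
$ sudo make install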

Is there a VM that I can do OpenGL 3+ with? VirtualBox and VMware don't

I am trying to write some openFrameworks (C++) code in a VM. My host is Windows 8 and I've tried both Arch Linux and Ubuntu guests. My host computer runs the graphics code just fine with an NVidia Optimus setup and 8GB of RAM.
I do my main development in Visual Studio, however I do prefer to create Android and test packages from Linux. For this reason I just want to fire up a VM and take care of business. The problem is that some of my graphics apps need OpenGL 3+
Has anybody else had the same problem and solved it?
Give up on VirtualBox. VB's OpenGL guest support craps out at 2.1, even then only after you install VB Guest Additions from the command line with switches and then add some Registry keys to actually enable the OpenGL guest drivers.
If you're willing to shell out money, VMware Fusion for Mac and VMware Workstation for Windows both support DirectX 10 and OpenGL 3.3.
A bit late to the party here, but hopefully helpful for someone encountering similar issues these days:
The Mesa software renderer now supports OpenGL 4.5, so for me, the solution is to disable 3D acceleration in the settings of the VirtualBox machine!
The Mesa software OpenGL support then takes over and provides its capabilities. It's for sure not that fast, but for my purpose (testing whether an OpenGL application starts and displays something under Linux) it's sufficient!
Tested both on Fedora 34 and Ubuntu 20.04.
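For reference, the same switch can be flipped from the host command line ("MyVM" is a placeholder VM name); this is equivalent to unticking "Enable 3D Acceleration" in the GUI:
$ VBoxManage modifyvm "MyVM" --accelerate3d off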
Try VirtualBox and prepend MESA_GL_VERSION_OVERRIDE=3.0 MESA_GLSL_VERSION_OVERRIDE=130 to your Linux command line. Some of the OpenGL 3 functions may work, though not all of them will. I used that to bring up Civ 5; the animations did not show up, nor did the on-screen fonts.
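For example (the application name here is just a placeholder):
$ MESA_GL_VERSION_OVERRIDE=3.0 MESA_GLSL_VERSION_OVERRIDE=130 ./my_gl_app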
If you want to see the source code:
VirtualBox uses Chromium 1.9, which is OpenGL 2.1. This can be verified with the glxinfo command. Use the following commands to track down the VirtualBox OpenGL lib file:
$ ldd /usr/bin/glxinfo
$ apt-file search /usr/lib/x86_64-linux-gnu/libGL.so.1.2
$ LIBGL_DEBUG=verbose glxinfo
Then follow links:
$ ls -l x86_64-linux-gnu/dri/
lrwxrwxrwx Apr 14 2014 vboxvideo_dri.so -> ../../VBoxOGL.so
$ apt-file search /usr/lib/VBoxOGL.so
virtualbox-dbg: /usr/lib/debug/usr/lib/VBoxOGL.so
virtualbox-guest-x11: /usr/lib/VBoxOGL.so
$ dpkg -l virtualbox*
ii virtualbox-guest-x11 4.1.18-dfsg-2+deb7 amd64
$ apt-file list virtualbox-guest-x11
...
The source code tarball was virtualbox-4.3.10-dfsg.orig.tar.gz from the trusty repo. The version string can be grep'ed with $ grep -r CR_OPENGL_VERSION_STRING * and $ grep -r CR_VERSION_STRING * in the source code directory.
Update 6/1/2017: Someone told me that KVM works for Civ 5. A quick search turned up a thread titled "GPU Passthrough with KVM: Have Your Cake and Eat it Too". The thread is too long to read, but I hope it can be useful to somebody.