We have a Flask app deployed on k8s. The base image of the app is this: https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/, and we build our app on top of it. We ship our Docker image to ECR and then deploy pods on k8s.
We want to start running ML models on our k8s nodes. The underlying nodes have GPUs (we're using g4dn instances), and they are using a GPU AMI.
When running our app, I'm seeing the following error:
/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
What's the right way to get CUDA installed on our nodes? I would have expected it to be built into the AMI shipped with the GPU instances, but that doesn't seem to be the case.
There are a couple of options:
Use tensorflow:latest-gpu as the base image and set up any additional configuration your system needs.
Set up the CUDA runtime libraries yourself in your Docker image (the NVIDIA kernel driver itself stays on the host); a sketch of this option follows below.
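To make the second option concrete, here is a minimal sketch, assuming an NVIDIA CUDA runtime base image and a pod that explicitly requests a GPU so the device plugin can expose the host driver inside the container. All image names, tags, and file names below are placeholders, not taken from the question:

# Build the app on a CUDA runtime base instead of the plain uwsgi-nginx-flask image.
cat <<'EOF' > Dockerfile
FROM nvidia/cuda:11.2.2-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt   # torch, flask, uwsgi, ...
COPY ./app /app
CMD ["python3", "/app/main.py"]
EOF

# Request a GPU in the pod spec so Kubernetes schedules the pod onto a GPU node
# and the NVIDIA device plugin mounts the host driver libraries into the container.
cat <<'EOF' > flask-ml-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: flask-ml
spec:
  containers:
  - name: app
    image: <account>.dkr.ecr.<region>.amazonaws.com/flask-ml:latest
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl apply -f flask-ml-pod.yaml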
I am trying to use the images found here to deploy a VM to GCP's Compute Engine with a GPU enabled. I have successfully created a VM from a publicly available NVIDIA image (e.g. nvidia-gpu-cloud-image-2022061 from the nvidia-ngc-public project), but the VM forces a prompt to install drivers when it starts. So I have to SSH into the VM and answer 'y' to the install-drivers prompt before the drivers get installed.
My issue is that I need to automate this GPU driver installation process so that I can cleanly and deterministically (fixed driver version) create these images with drivers installed via CI/CD pipelines. What is the best way to achieve this automation? I would like to avoid creating my own base image and installing all the drivers/dependencies if possible.
I have created a VM with this image using the following command:
gcloud compute instances create $INSTANCE_NAME --project=$PROJECT --zone=$ZONE --machine-type=n1-standard-16 \
  --maintenance-policy=TERMINATE --network-interface=network-tier=PREMIUM,subnet=default \
  --service-account=my-service-account@$PROJECT.iam.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform \
  --accelerator=count=1,type=nvidia-tesla-t4 --image=nvidia-gpu-cloud-image-2022061 --image-project=nvidia-ngc-public \
  --boot-disk-size=200 --boot-disk-type=pd-standard --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring \
  --reservation-affinity=any --no-restart-on-failure
I have then SSH'd into the VM and answered yes to the prompt.
I have then saved the image using gcloud compute images create --source-disk $INSTANCE_NAME for future use.
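Spelled out, that image-save step looks something like this (the image name and family are placeholders of mine, not part of the original workflow):

# Stop the instance so the boot disk is consistent, then capture it as an image.
gcloud compute instances stop "$INSTANCE_NAME" --zone="$ZONE"
gcloud compute images create "nvidia-drivers-preinstalled-$(date +%Y%m%d)" \
  --source-disk="$INSTANCE_NAME" \
  --source-disk-zone="$ZONE" \
  --family=nvidia-drivers-preinstalled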
How can I automate this cleanly?
I'm trying to run XGBoost with GPU support in a docker image hosted on an AWS GPU machine (p3 family), but I keep getting:
xgboost.core.XGBoostError: [05:48:26] ../src/tree/updater_gpu_hist.cu:786: Exception in gpu_hist: [05:48:26] ../src/tree/updater_gpu_hist.cu:795: Check failed: device_ >= 0 (-1 vs. 0) : Must have at least one device
i.e. the GPU is not found by xgboost inside my docker image. The host machine runs this AWS AMI:
amzn2-ami-ecs-gpu-hvm-2.0.20210301-x86_64-ebs
which comes with NVIDIA CUDA, cuDNN, NCCL, Intel MKL-DNN, Docker, NVIDIA-Docker support.
I run the docker image with the following command:
docker run --rm --gpus all --runtime nvidia <imageid> someparameter
which runs some xgboost fit with tree_method='gpu_hist' but crashes with the error message mentioned above. The code itself runs fine on the host machine, so I must be doing something stupid in my Docker image.
My docker base image is FROM ubuntu:18.04
Also, when I open a shell inside the running container, I can run nvidia-smi and the GPU is detected correctly, but XGBoost still doesn't find it.
I thought that with this particular AMI, passing '--runtime nvidia' (or '--gpus all') to docker run would take care of exposing everything required for GPU support, but that doesn't seem to be the case. I'm a Docker beginner, so I might be missing something obvious. I also couldn't find a public Docker image with GPU-enabled XGBoost, but if someone can point me to one, that would be great too.
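For reference, a minimal sketch of what a CUDA-based image for this might look like, assuming the missing piece is the set of CUDA user-space libraries absent from the plain ubuntu:18.04 base; the tag, script name, and layout below are placeholders, not a tested setup:

# Sketch only: start from a CUDA runtime base so the CUDA user-space libraries
# are already present inside the container (the tag is an example).
cat <<'EOF' > Dockerfile
FROM nvidia/cuda:10.2-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install xgboost
WORKDIR /app
COPY . /app
ENTRYPOINT ["python3", "train.py"]
EOF
docker build -t xgb-gpu .
docker run --rm --gpus all xgb-gpu someparameter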
We push a Java app to Cloud Foundry using cf push with the manifest file below:
applications:
- name: xyz-api
  instances: 1
  memory: 1G
  buildpack: java_buildpack_offline
  path: target/xyz-api-0.1-SNAPSHOT.jar
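The push itself is just the following (assuming the manifest is saved as manifest.yml next to the project):

cf push -f manifest.yml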
I understand that a PaaS (e.g. Cloud Foundry) is a layer on top of an IaaS (e.g. vCenter hosting Linux and Windows VMs).
In the manifest file, the buildpack just describes the userspace runtime libraries required to run the app.
Coming from a non-cloud background and reading this manifest file, I would like to understand:
1) How do I find out the operating system (OS) environment that the app runs in, i.e. on which operating system?
2) How is an app running on a BOSH instance different from a Docker container?
1) How do I find out the operating system (OS) environment that the app runs in, i.e. on which operating system?
The stack determines the operating system on which your app will run. There is a stack attribute in the manifest or you can use cf push -s to indicate the stack.
You can run cf stacks to see all available stacks.
In most environments at the time of writing, you will have cflinuxfs2, which is Ubuntu Trusty 14.04. It will be replaced by cflinuxfs3, which is Ubuntu Bionic 18.04, because Trusty is only supported through April of 2019. You will always have some cflinuxfs* stack, though; the number will just vary depending on when you read this.
In some environments you might also have a Windows-based stack. The original Windows-based stack is windows2012r2. This is quite old as I write this, so you probably won't see it any more. What you're likely to see is windows2016, or possibly something even newer depending on when you read this.
If you need more control than that, you can always push a docker container. That would let you pick the full OS image for your app.
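For example (the app name is a placeholder):

cf stacks                                          # list the stacks available in your environment
cf push my-app -s cflinuxfs3                       # push against a specific stack
cf push my-app --docker-image myrepo/my-app:1.0    # or push a Docker image instead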
2) How is an app running on a BOSH instance different from a Docker container?
Apps running on Cloud Foundry aren't deployed by BOSH directly. The app runs in a container. The container is scheduled and run by Diego. Diego is a BOSH-deployed VM. So there's an extra layer in there.
At the core, the difference between running your app on Cloud Foundry and running an app in a docker container is minimal. They both run in a Linux "container" which has limitations put on it by kernel namespaces & cgroups.
The difference comes in a.) how you build the container and b.) how the container is deployed.
With Cloud Foundry, you don't build the container. You provide your app to CF & CF builds the container image based on the selected stack and the additional software added by buildpacks. The output in CF terminology is called a "droplet", but it is basically an OCI image (this will be even more so with buildpacks v3). When you need to upgrade or add new code, you just repeat the process and push again. The stack and buildpacks, which are automatically updated by the platform, will in turn provide you with a patched & up-to-date app image.
With Docker, you manually create your image, building it up from scratch or from some trusted base image. You add your own runtimes & application code. When you need to upgrade, that's on you: pull in updates from the base image & runtimes or, worse, update your from-scratch image yourself.
When it comes to deployment, CF handles this all for you automatically. It can run any number of instances of your app that you'd like & it will automatically place those so that your app is resilient to failures in the infrastructure & in CF.
With Docker, that's on you or increasingly often on some other tool like Kubernetes.
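For instance, on the CF side scaling out and back is a single command (the instance count here is arbitrary):

cf scale xyz-api -i 3    # run three instances; Diego spreads them across cells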
Hope that helps!
I would like to train a neural network whilst utilising all 4 GPUs on my g2.8xlarge EC2 instance using MXNet. I am using the following AWS Deep Learning Linux community AMI:
Deep Learning AMI Amazon Linux - 3.3_Oct2017 (ami-999844e0)
As per these instructions, when I connect to the instance I switch to keras v1 with the MXNet backend by issuing this command:
source ~/src/anaconda3/bin/activate keras1.2_p2
I have also added the context flag to my Python model compile code to utilise the GPUs in MXNet:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'], context=gpu_list)
where gpu_list is meant to utilise all 4 GPUs.
However every time I run my code, I get this error message:
Epoch 1/300
[15:09:52] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [15:09:52] src/storage/storage.cc:113: Compile with USE_CUDA=1 to enable GPU usage
and
RuntimeError: simple_bind error. Arguments:
dense_input_1: (25, 34L)
[15:09:52] src/storage/storage.cc:113: Compile with USE_CUDA=1 to enable GPU usage
I have checked the config.mk file in /home/ec2-user/src/mxnet and it contains USE_CUDA=1. I have also issued the 'make' command to try and recompile MXNet with the USE_CUDA=1 flag - no change.
Am I having this issue as I'm using the virtual environment the AWS documentation says to use? Has anyone else had this issue with MXNet on the AWS Deep Learning Ubuntu AMI using this virtual env?
Any suggestions greatly appreciated -
This is because the Keras Conda environment has a dependency on the CPU-only mxnet pip package. You can install the GPU version inside the Conda environment with:
pip install mxnet-cu80
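A sketch of how that might look inside that same environment (the uninstall step and the verification one-liner are my additions, not from the AMI docs):

source ~/src/anaconda3/bin/activate keras1.2_p2
pip uninstall -y mxnet    # remove the CPU-only build so it doesn't shadow the GPU one
pip install mxnet-cu80    # CUDA 8.0 build of MXNet
# quick check that MXNet can now allocate memory on a GPU
python -c "import mxnet as mx; print(mx.nd.zeros((1,), ctx=mx.gpu(0)))"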
Question: Can BOSH Lite be deployed on an OpenStack VM without using VirtualBox?
Use case: I want a BOSH Lite setup that can be used by CI systems.
I am not sure whether BOSH Lite can be installed directly on an OpenStack VM or whether I need to set up VirtualBox first (does this extra layer of virtualization even work?).
I followed the docs at https://bosh.io/docs/bosh-lite and am currently stuck at defining the three variables for the OpenStack CPI, getting the following error:
Parsing release set manifest '/root/workspace/bosh-deployment/bosh.yml':
Evaluating manifest:
- Expected to find variables:
- default_key_name
- net_id
- private_key
No, BOSH Lite is explicitly a VirtualBox solution for when you don't have a full IaaS/CPI like OpenStack.
BOSH Lite v2 is a Director VM running in VirtualBox (typically locally)
That is the very first line of the docs you linked to.
You can run full-blown BOSH on OpenStack with very little effort, though:
https://bosh.io/docs/init-openstack.html
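If you go that route, supplying the variables the error complains about looks roughly like this; all values below are placeholders, and the full list of OpenStack variables is documented in the bosh-deployment repo:

bosh create-env bosh-deployment/bosh.yml \
  --state=state.json \
  --vars-store=creds.yml \
  -o bosh-deployment/openstack/cpi.yml \
  -v director_name=bosh-1 \
  -v internal_cidr=10.0.0.0/24 \
  -v internal_gw=10.0.0.1 \
  -v internal_ip=10.0.0.6 \
  -v auth_url=https://openstack.example.com:5000/v3 \
  -v az=nova \
  -v region=RegionOne \
  -v default_key_name=bosh \
  -v default_security_groups=[bosh] \
  -v net_id=<openstack-network-uuid> \
  -v openstack_username=admin \
  -v openstack_password=<password> \
  -v openstack_domain=default \
  -v openstack_project=bosh \
  --var-file private_key=~/.ssh/bosh.pem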