Docker image with XGBoost GPU support on AWS DL AMI

I'm trying to run XGBoost with GPU support in a docker image hosted on an AWS GPU machine (p3 family), but I keep getting:
xgboost.core.XGBoostError: [05:48:26] ../src/tree/updater_gpu_hist.cu:786: Exception in gpu_hist: [05:48:26] ../src/tree/updater_gpu_hist.cu:795: Check failed: device_ >= 0 (-1 vs. 0) : Must have at least one device
i.e. the GPU is not found by xgboost inside my docker image. The host machine runs this AWS AMI:
amzn2-ami-ecs-gpu-hvm-2.0.20210301-x86_64-ebs
which comes with NVIDIA CUDA, cuDNN, NCCL, Intel MKL-DNN, Docker, NVIDIA-Docker support.
I run the docker image with the following command:
docker run --rm --gpus all --runtime nvidia <imageid> someparameter
which runs some xgboost fit with tree_method='gpu_hist' but crashes with the error message mentioned above. The code itself runs fine on the host machine, so I must be doing something stupid in my Docker image.
My docker base image is FROM ubuntu:18.04
Also, when I open a shell in my Docker container, I can run nvidia-smi and the GPU seems to be detected correctly, but XGBoost doesn't find it.
I thought that, with this particular AMI, passing --runtime nvidia to docker run would take care of loading all the libraries required for GPU support, but that seems not to be the case. I'm a Docker beginner, so I might be missing something obvious. I also couldn't find a public Docker image with GPU-enabled XGBoost, but if someone can point me to one, that would be great too.
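For context, this is roughly the kind of Dockerfile I was expecting to need; the base-image tag and versions here are guesses rather than my actual setup, and my understanding is that recent xgboost wheels on PyPI are built with GPU support, which is worth verifying for whatever version gets installed:

# Hypothetical sketch: use a CUDA runtime base image instead of plain ubuntu:18.04.
# The tag should roughly match the CUDA/driver version available on the host.
FROM nvidia/cuda:11.0.3-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
# Recent xgboost wheels on PyPI include gpu_hist support (assumption: verify for your version).
RUN pip3 install xgboost
# Placeholder entry point; the real image would copy in and run the training code.
CMD ["python3", "-c", "import xgboost; print(xgboost.__version__)"]

Such an image would then be run with docker run --rm --gpus all <imageid> as above.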

Related

How do you automate installation of NVIDIA drivers with a compute image VM from nvidia-ngc-public on GCP?

I am trying to use the images found here to deploy a VM to GCP's Compute Engine with a GPU enabled. I have successfully created a VM from a publicly available NVIDIA image (e.g. nvidia-gpu-cloud-image-2022061 from the nvidia-ngc-public project), but the VM forces a prompt to install drivers when it is started. So I have to SSH into the VM and manually install the GPU drivers by answering 'y' to the install-drivers prompt; it then installs the drivers.
My issue is that I need to automate this GPU driver installation process so that I can cleanly and deterministically (fixed driver version) create these images with drivers installed via CI/CD pipelines. What is the best way to achieve this automation? I would like to avoid creating my own base image and installing all the drivers/dependencies if possible.
I have created a VM with this image using the following command:
gcloud compute instances create $INSTANCE_NAME --project=$PROJECT --zone=$ZONE --machine-type=n1-standard-16 --maintenance-policy=TERMINATE --network-interface=network-tier=PREMIUM,subnet=default --service-account=my-service-account@$PROJECT.iam.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=1,type=nvidia-tesla-t4 --image=nvidia-gpu-cloud-image-2022061 --image-project=nvidia-ngc-public --boot-disk-size=200 --boot-disk-type=pd-standard --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --reservation-affinity=any --no-restart-on-failure
I have then SSH'd into the VM and answered yes to the prompt.
I have then saved the image using gcloud compute images create --source-disk $INSTANCE_NAME for future use.
How can I automate this cleanly?
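One possible direction, sketched below purely as an assumption about how this could be wired up, is to pass a startup script via instance metadata that installs a pinned driver non-interactively and then image the boot disk. The driver package name/version, the $IMAGE_NAME variable, and the exact install mechanism expected by the NVIDIA image are all placeholders that would need to be adapted:

# Hypothetical sketch: non-interactive driver install via a startup script, then image the disk.
gcloud compute instances create $INSTANCE_NAME \
  --image=nvidia-gpu-cloud-image-2022061 --image-project=nvidia-ngc-public \
  --accelerator=count=1,type=nvidia-tesla-t4 --maintenance-policy=TERMINATE \
  --zone=$ZONE \
  --metadata=startup-script='#!/bin/bash
# Placeholder: install a fixed driver version without prompting (adapt to the image).
apt-get update && apt-get install -y nvidia-driver-510'
# Once the script has finished, stop the instance and create the reusable image.
gcloud compute instances stop $INSTANCE_NAME --zone=$ZONE
gcloud compute images create $IMAGE_NAME --source-disk=$INSTANCE_NAME --source-disk-zone=$ZONE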

Installing NVIDIA drivers for application on K8S

We have a flask app that's deployed on k8s. The base image of the app is this: https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/, and we build our app on top of this. We ship our docker image to ECR, and then deploy pods on k8s.
We want to start running ML models in our k8s nodes. The underlying nodes have GPUs (we're using g4dn instances), and they are using a GPU AMI.
When running our app, I'm seeing the following error:
/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
What's the right way to get CUDA installed on our nodes? I would have expected it to be built into the AMI shipped with the gpu instances but that doesn't seem to be the case.
There are a couple of options:
Use tensorflow:latest-gpu as the base image and set up additional configuration for your system.
Set up the CUDA drivers yourself in your Docker image (a sketch along these lines follows below).
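As a rough illustration of the second option, a minimal sketch might look like the following; the base-image tag, the torch version pin, and the app path are assumptions, and this drops the uwsgi/nginx setup provided by the original base image, which you would have to re-add. The nodes themselves still need the NVIDIA driver and device plugin so that pods can request nvidia.com/gpu resources:

# Hypothetical sketch: build the app on a CUDA runtime image so the CUDA libraries are present.
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
# Torch wheel built against a matching CUDA version (version pins are assumptions).
RUN pip3 install flask torch==1.10.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
# Placeholder paths: copy and start the Flask app however the existing image does it.
COPY . /app
CMD ["python3", "/app/main.py"]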

AWS Batch Failing to launch Dockerfile - standard_init_linux.go:219: exec user process caused: exec format error

I am attempting to use AWS Batch to launch a Linux server, which will in essence perform the fetch-and-run example included within AWS (download a shell script from S3 and run it).
Does AWS Batch work at all for anyone?
The AWS fetch-and-run example always fails, even when following someone else's guide online that mimicked the AWS example.
I have tried creating Dockerfiles for amazonlinux:latest and ubuntu:20.04 with numerous RUN and CMD instructions.
The scripts always seem to fail with the error:
standard_init_linux.go:219: exec user process caused: exec format error
I thought at first this was related to my deployment access rights, maybe within the amazonlinux image, so I have played with chmod 777, chmod +x, etc. on the shell script.
The final nail in the coffin: my current Dockerfile is literally:
FROM ubuntu:20.04
I launch this using AWS Batch with no command or parameters passed through, and it still fails with the same error code. This almost hints to me that there is either a setup issue with my AWS Batch configuration (I'm using the default wizard settings, except changing to an a1.medium instance) or that AWS Batch has some major issues.
Has anyone had any success with AWS Batch launching their own Dockerfiles ? Could they share their examples and/or setup parameters?
Thank you in advance.
A1 instances use ARM-based first-generation Graviton CPUs. It is highly likely that the image you are trying to run expects an x86 CPU (Intel or AMD). Any instance class with a "g" in it (e.g. "c6g" or "m6g") is Graviton2, which is also ARM-based and will not work for the default examples.
You can test whether a specific container will run by launching an A1 instance yourself and running the container (after installing docker). My guess is that you will get the same error. Running on Intel or AMD instances should work.
To leverage Batch with ARM your containerized application will need to work on ARM. If you point me to the exact example, I can give more details on how to adjust to run on A1 or Graviton2 instances.
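As a quick check (a sketch, assuming the image is available locally), you can compare the architecture the image was built for with the architecture of the instance it runs on:

# Architecture the image was built for (e.g. amd64 vs arm64)
docker inspect --format '{{.Architecture}}' <imageid>
# Architecture of the machine it is running on (x86_64 vs aarch64)
uname -m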
I had the same issue, and it was because I built the image locally on my M1 Mac.
Try adding --platform linux/amd64 to your docker build command before pushing, if this is your case.
In addition to the other comment: you can create multi-arch images yourself, which will provide the correct architecture.
https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/
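For example, a multi-arch build with buildx might look roughly like this (the image name is a placeholder, and cross-building usually requires the QEMU/binfmt setup described in the linked post):

# Hypothetical sketch: build and push an image for both x86 and ARM under one tag.
docker buildx create --use
docker buildx build --platform linux/amd64,linux/arm64 -t <registry>/<image>:<tag> --push .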

GAN Training with Tensorflow 1.4 inside Docker Stops without Prompting and without Releasing Memory connected to VM with SSH Connection

Project Detail
I am running the open-source code of a GAN-based research paper named "Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition".
source code: here
The dependencies include:
Python 2.7
TensorFlow 1.4.0
I pulled a Docker image of TensorFlow 1.4.0 with Python 2.7 onto my GPU virtual machine (which I reach over an SSH connection) with this command:
docker pull tensorflow/tensorflow:1.4.0-gpu
I am running
bash rsrgan/run_gan_rnn_placeholder.sh
according to the readme of the source code.
Issue Details
Everything works: the model trains and the loss decreases. The only issue is that after some iterations the terminal stops showing output; the GPU still shows the PID, but no memory is freed and sometimes GPU utilization drops to 0%. Training on the VM's GPU and on its CPU behaves the same way.
It is not a memory issue, because the model uses 5,400 MB of the 11,000 MB of GPU memory, and the CPU RAM is also very large.
When I ran 21 iterations on my local computer (1st-gen i5, 4 GB RAM), each iteration took 0.09 hours and all iterations completed. But whenever I run it over SSH inside Docker, the issue happens again and again, on both GPU and CPU.
Just keep in mind that the issue happens inside Docker on a machine connected over SSH, and the SSH connection itself does not disconnect very often.
Exact numbers
If an iteration takes 1.5 hours, the issue happens after two to three iterations; if a single iteration takes 0.06 hours, the issue happens after exactly 14 of 25 iterations.
Perform operations inside Docker container
The first thing you can try is to build the Docker image and then get a shell inside the Docker container by passing the -ti flag and /bin/bash in your docker run command.
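For example (a sketch, assuming the NVIDIA runtime from the question is set up on the host; on newer Docker versions --gpus all can be used instead):

# Start an interactive shell inside the TF 1.4 GPU image from the question.
docker run --runtime=nvidia -ti tensorflow/tensorflow:1.4.0-gpu /bin/bash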
Clone the repository inside the container, and while building the image also copy your training data from your local machine into the image. Run the training there and commit the changes so that you do not need to repeat these steps in future runs; after you exit the container, all changes are lost if they are not committed.
You can find the reference for docker commit here.
$ docker commit <container-id> <image-name:tag>
While training is going on, check the GPU and CPU utilization of the VM and see if everything is working as expected.
Use an Anaconda environment on your VM
Anaconda is a great package manager. You can install Anaconda, create a virtual environment, and run your code inside that environment.
$ wget <url_of_anaconda.sh>
$ bash <path_to_sh>
$ source anaconda3/bin/activate   # or: source anaconda2/bin/activate
$ conda create -n <env_name> python=2.7
$ conda activate <env_name>
Install all the dependencies via conda (recommended) or pip.
Run your code.
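For example, the dependencies from the question could be installed into the environment roughly like this (the exact pin is an assumption based on the versions stated above; the repository's own requirements, if listed, take precedence):

$ pip install tensorflow-gpu==1.4.0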
Q1: GAN Training with TensorFlow 1.4 inside Docker Stops without Prompting
Although Docker provides OS-level virtualization, some processes that run with ease on the host system can still run into problems inside a container. So, to debug the issue, you should go inside the container and perform the steps above.
Q2: Training Stops without Releasing Memory when Connected to the VM over SSH
Yes, this is an issue I have also faced earlier. The best way to release the memory is to stop the Docker container. You can find more resource allocation options here.
Also, earlier versions of TensorFlow had issues with allocating and clearing memory properly. You can find some references here and here. These issues have been fixed in more recent versions of TensorFlow.
Additionally, check for Nvidia bug reports
Step 1: Install nvidia-utils via the following command. You can find the driver version in the nvidia-smi output (also mentioned in the question).
$ sudo apt install nvidia-utils-<driver-version>
Step 2: Run the nvidia-bug-report.sh script
$ sudo /usr/bin/nvidia-bug-report.sh
A log file named nvidia-bug-report.log.gz will be generated in your current working directory. You can also find the installer log at /var/log/nvidia-installer.log.
You can find additional information about Nvidia logs at these links:
Nvidia Bug Report Reference 1
Nvidia Bug Report Reference 2
Log GPU load
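For example, GPU load and memory can be sampled periodically with nvidia-smi (the interval and fields below are just one possible choice) and inspected around the time the training stalls:

# Append a CSV sample every 10 seconds to a log file.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 10 >> gpu_load.log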
Hope this helps.

"Compile with USE_CUDA=1 to enable GPU usage" Message with MXNet on AWS Deep Learning AMI

I would like to train a neural network whilst utilising all 4 GPUs on my g2.8xlarge EC2 instance using MXNet. I am using the following AWS Deep Learning Linux community AMI:
Deep Learning AMI Amazon Linux - 3.3_Oct2017 (ami-999844e0)
As per these instructions, when I connect to the instance I switch to keras v1 with the MXNet backend by issuing this command:
source ~/src/anaconda3/bin/activate keras1.2_p2
I have also added the context flag to my Python model compile code to utilise the GPUs in MXNet:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'], context=gpu_list)
where gpu_list is meant to utilise all 4 GPUs.
However every time I run my code, I get this error message:
Epoch 1/300
[15:09:52] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [15:09:52] src/storage/storage.cc:113: Compile with USE_CUDA=1 to enable GPU usage
and
RuntimeError: simple_bind error. Arguments:
dense_input_1: (25, 34L)
[15:09:52] src/storage/storage.cc:113: Compile with USE_CUDA=1 to enable GPU usage
I have checked the config.mk file in /home/ec2-user/src/mxnet and it contains USE_CUDA=1. I have also run the 'make' command to try and recompile MXNet with the USE_CUDA=1 flag, with no change.
Am I having this issue as I'm using the virtual environment the AWS documentation says to use? Has anyone else had this issue with MXNet on the AWS Deep Learning Ubuntu AMI using this virtual env?
Any suggestions greatly appreciated.
This is because the Keras Conda environment has a dependency on the CPU-only mxnet pip package. You can install the GPU version inside the Conda environment with:
pip install mxnet-cu80
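It may also help to remove the CPU package first so the two builds do not conflict, and then verify that a GPU context actually works; this is just a sketch, and the exact package name to uninstall is an assumption:

pip uninstall -y mxnet
pip install mxnet-cu80
# Quick check that MXNet can allocate an array on the first GPU.
python -c "import mxnet as mx; print(mx.nd.zeros((1,), ctx=mx.gpu(0)))"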