Train on AWS using GPU not CPU - amazon-web-services

I just launched an AWS P2 instance to train a model. However, it seems to be training on the CPU rather than the GPU. How can I force it to train on the GPU instead?
$ nano ~/.keras/keras.json says this:
{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}
I am getting a message "Failed to load the native TensorFlow runtime."
I then changed the backend to Theano, so ~/.keras/keras.json now says this:
{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}
It's training, but very slowly, and it seems to be using the CPU.

It looks like the answer is to add a --gpus flag:
python cnn_homework_solution.py --gpus 0,1
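Before relying on a flag like that, it is worth confirming that TensorFlow can actually see the GPU at all. Here is a minimal sketch (assuming a TensorFlow 1.x install, as in this question) that lists the devices TensorFlow has registered; if no GPU device shows up, the CPU-only build is installed or the CUDA libraries are not being found, and no flag will help:

from tensorflow.python.client import device_lib
import tensorflow as tf

# List every device TensorFlow has registered; a working GPU setup should
# show a /gpu:0 (or /device:GPU:0) entry in addition to the CPU.
for d in device_lib.list_local_devices():
    print(d.name, d.device_type)

# True only when a CUDA-enabled build has found a usable GPU.
print("GPU available:", tf.test.is_gpu_available())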

Related

CUDA GPU __global__ function does not complete

#include <cstdio>

__global__ void functionA()
{
    printf("functionA");
}

int main()
{
    printf("main1");
    functionA<<<1,1>>>();   // kernel launch is asynchronous
    printf("main2");
    return 0;
}
I'm trying to run a simple test with the code above, but the program only outputs "main1". It should output "functionA" and "main2" too.
This can have a couple of causes:
First of all, you need to add
cudaDeviceSynchronize();
after the kernel launch in order to block the host until the device has completed all pending work. Without it, the program can exit before the device-side printf output is ever flushed.
Furthermore, this can happen if you specify the wrong GPU architecture/compute capability XX when compiling the code:
$ nvcc -gencode=arch=compute_XX,code=sm_XX -o my_app my_app.cu
In that case only the host code runs, while the parts targeting the accelerator are silently skipped. You can find an overview of the corresponding number XX for the different hardware generations over here. The K20m you are running has compute capability 3.5, so in your case it should be
$ nvcc -gencode=arch=compute_35,code=sm_35 -o my_app my_app.cu
This might also occur if you have multiple graphics accelerators in your system and the code is executed on the wrong one. Each graphics card/accelerator is assigned a particular device ID. Device 0 is supposed to go to the most powerful device and is used by default. On my system, which contains a powerful Tesla K80 (architecture 37) and a low-power Quadro P620 (architecture 60), compiling for 37 produced the same error you are seeing, while compiling for 60 made the code run. I then used the Querying Device Properties example to list the CUDA-capable devices and their corresponding device IDs, only to find that on my system the Tesla K80 shows up as devices 1 and 2 while the simple Quadro P620 graphics card is device 0. I assume this is because the K80 is deprecated in CUDA 11.
You can select the device inside your code with cudaSetDevice or change it when launching the program with
$ CUDA_VISIBLE_DEVICES="1" ./my_app
where 1 has to be replaced by the device id you wish to use. Doing so should make your code run without any problems.
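For completeness, CUDA_VISIBLE_DEVICES works the same way for the Python frameworks discussed elsewhere in this thread, as long as it is set before the framework initializes CUDA. A minimal sketch (assuming PyTorch is installed; the same idea applies to TensorFlow):

import os

# Expose only device 1 to any CUDA-based framework. This must happen before
# the framework touches the GPU for the first time.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# The hidden devices no longer exist from the framework's point of view,
# so the visible device is renumbered to cuda:0.
print(torch.cuda.device_count())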
You can also test whether this really is the issue by cloning the GitHub repository of "Learn CUDA Programming", browsing to Chapter01/01_cuda_introduction/01_hello_world/, compiling with $ make, and finally running it with $ ./hello_world. That example compiles for multiple architectures/compute capabilities automatically and should therefore run without any issue.

AWS EC2 Deep Learning instance cuda 3.0

I just launched (and paid for) the Deep Learning AMI (Ubuntu 18.04) Version 27.0 (ami-0dbb717f493016a1a), instance type g2.2xlarge. I activated the environment for PyTorch with Python3 (CUDA 10.1 and Intel MKL):
source activate pytorch_p36
When I run my PyTorch network I see a warning:
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/__init__.py:134: UserWarning:
Found GPU0 GRID K520 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.
Is this real?
This is my code to put my neural net on the GPU:
if torch.cuda.is_available():
    device = torch.device("cuda:0")  # you can continue going on here, like cuda:1, cuda:2, etc.
    print("Running on the GPU")
else:
    device = torch.device("cpu")
    print("Running on the CPU")

net = Net(image_height, image_width)
net.to(device)
I had to use a g3s.xlarge instance instead. I guess the g2 instances use older GPUs.
Also, I had to set num_workers=0 on my dataloaders, following this: https://discuss.pytorch.org/t/oserror-errno-12-cannot-allocate-memory-but-memory-usage-is-actually-normal/56027.
And here is another PyTorch gotcha when moving tensors to a device: https://stackoverflow.com/a/51606286/3614578.
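To spot this kind of problem before paying for training time, you can query the compute capability of whatever GPU the instance provides. A minimal sketch (assuming PyTorch is installed in the active environment):

import torch

# Print each visible GPU together with its compute capability. PyTorch builds
# of this era require capability >= 3.5, so the GRID K520 (3.0) on a g2
# instance is rejected, while newer instance families are fine.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(torch.cuda.get_device_name(i), "compute capability %d.%d" % (major, minor))
else:
    print("No usable GPU detected; running on CPU.")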

tensorflow places softmax op on cpu instead of gpu

I have a TensorFlow model with multiple inputs, several layers, and a final softmax layer. The model is trained in Python (using the Keras framework), then saved, and inference is done using a C++ program built against a CMake build of TensorFlow (basically following these instructions: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake).
In Python (tensorflow-gpu) all ops use the GPU (checked using log_device_placement):
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.005837: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006201: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/BiasAdd: (BiasAdd)/job:localhost/replica:0/task:0/gpu:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006535: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/Softmax: (Softmax)/job:localhost/replica:0/task:0/gpu:0
To save the graph, the freeze_graph script is used (the script producing the log above reloads the frozen graph in .pb format).
When I use the C++ program to load the frozen graph (closely following the LoadGraph() function in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/label_image/main.cc - ReadBinaryProto() and session->Create()) and log the device placements again, I find that the Softmax is placed on the CPU (all other ops are on the GPU):
dense_6/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/device:CPU:0
This placement is also confirmed by high CPU/low GPU utilization, and is also apparent when profiling the application. The data type of the out layer is float32 (out/Softmax -> (<tf.Tensor 'out/Softmax:0' shape=(?, 1418) dtype=float32>,)).
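For reference, the Python-side check described above (reloading the frozen .pb and logging placements) can look roughly like the sketch below, assuming TensorFlow 1.x; frozen_graph.pb is a placeholder for whatever file freeze_graph produced:

import tensorflow as tf

# Load the frozen GraphDef. "frozen_graph.pb" stands in for the real file name.
with tf.gfile.GFile("frozen_graph.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

# log_device_placement prints where each op is placed; the messages appear
# on the first sess.run call, which needs the model's real inputs to be fed.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    out = graph.get_tensor_by_name("out/Softmax:0")
    print(out)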
Further investigation revealed:
Creating the softmax op in C++ and placing it on the GPU explicitly throws this error message:
Cannot assign a device for operation 'tsoftmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
A call to tensorflow::LogAllRegisteredKernels() also showed that Softmax is only available for the CPU.
The build directory contains many files related to "softmax" (e.g. tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.obj.Release.cmake). I don't know how to check every compilation step, though.
When I look into tf_core_gpu_kernels.lib (one can open a .lib with 7-Zip ;)), there are files like "tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.lib" - so I believe there is nothing wrong with compiling the kernels themselves.
But: inspecting tensorflow.dll (Dependency Walker) shows that only CPU kernels for Softmax are included (there are functions like const tensorflow::SoftmaxOp<struct Eigen::ThreadPoolDevice,double>, but no GPU equivalents such as const tensorflow::SoftplusGradOp<struct Eigen::GpuDevice,float>).
Setup: TensorFlow 1.3.0, Windows 10, GPU: NVIDIA GTX 1070 (8 GB RAM; GPU memory utilization is also very low).
I found a workaround: include tf_core_gpu_kernels.lib in some of the build steps (create_def_file.py). More details here: GitHub Issue 15254

How to get TensorFlow to detect all GPUs on AWS?

I am running an LSTM net on an EC2 p2.8xlarge. Of course I'd like to take advantage of all the available GPUs (8). I can run it on one GPU easily, but not more. I get the following error when calling multi_gpu_model:
"To call multi_gpu_model with gpus=8, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']. However this machine only has: ['/cpu:0']. Try reducing gpus."
When I type nvidia-smi, all 8 GPUs show up in the terminal. How can I make them visible to my TF (Keras) environment?
When I run device_lib.list_local_devices() in a Jupyter notebook it returns only the CPU device, when it should also return the 8 GPUs. Here is the relevant bit of code:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model=multi_gpu_model(model, gpus=8)
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)
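When nvidia-smi sees the GPUs but TensorFlow does not, that usually points to the CPU-only TensorFlow build (or mismatched CUDA/cuDNN libraries) being active in the environment. A minimal sketch of the checks worth running before calling multi_gpu_model (assuming TensorFlow 1.x with Keras, as above):

import tensorflow as tf
from tensorflow.python.client import device_lib

# A CPU-only wheel will never register /gpu:0, no matter what nvidia-smi shows.
print("Built with CUDA:", tf.test.is_built_with_cuda())

# multi_gpu_model(model, gpus=8) needs eight GPU entries to appear here.
print([d.name for d in device_lib.list_local_devices()])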

Tensorflow does not recognize GPU on AWS

So here it goes: I wanted to use TensorFlow with a GPU on AWS, on the p2.xlarge plan. Unfortunately, something must have gone wrong and I keep getting:
InvalidArgumentError (see above for traceback): Cannot assign a device to node 'Variable_1': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
I checked both CUDA and cuDNN:
nvcc -V
cat /usr/local/cuda/include/cudnn.h
and got 8.0 and 5.1, respectively.
I call the GPU like this:
with tf.device('/gpu:0'):
    a = tf.Variable(tf.truncated_normal([100, 100]))
    b = tf.Variable(tf.truncated_normal([100, 1000]))

with tf.Session() as sess:
    sess.run(tf.matmul(a,b))
Happy to post more details if necessary - I don't know what will be useful yet.
I suppose you're trying to set up an EC2 instance from scratch? That can be difficult.
Instead, I'd strongly recommend using the Deep Learning AMI (https://aws.amazon.com/machine-learning/amis/). It comes preinstalled with everything you need (drivers, popular DL libraries, etc.). It's also free to use; you just pay for the instance itself.
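Once an instance is running (whether hand-built or from the DLAMI), a quick sanity check along these lines, assuming TensorFlow 1.x, shows whether a GPU device is actually registered; allow_soft_placement also avoids the hard failure above by falling back to the CPU when '/gpu:0' cannot be satisfied:

import tensorflow as tf

# allow_soft_placement lets TensorFlow fall back to the CPU when an op cannot
# be placed on the requested GPU; log_device_placement prints the final
# placement of every op.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.device('/gpu:0'):
    a = tf.Variable(tf.truncated_normal([100, 100]))
    b = tf.Variable(tf.truncated_normal([100, 1000]))
    product = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(product)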