How to test pytorch GPU code on a CPU machine

How to test pytorch GPU code on a CPU machine - unit-testing

I want to verify on CPU machine that I've moved all tensors to CUDA device correctly. Is there a way to create a mock device that is still computed on CPU but marked to be incompatible with tensors I forget to move to this device?
Motivation: this would speed up my development cycle as I don't need to deploy code and run it on the GPU server only to find out I forgot to specify device on some tensor / model.

Maybe the meta device is what you need.
>>> a = torch.ones(2) # defaults to cpu
>>> b = torch.ones(2, device='meta')
>>> c = a + b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, meta and cpu!
Note that original intention for meta device is to compute shape and dtype of output without actually doing full computation. Tensors on meta device have no data in them, so I am not sure if there are some aspects of the test where meta device behaves different from a "real" device.
Interestingly I couldn't find much documentation about this meta device at all. I found it mentioned here https://github.com/pytorch/pytorch/issues/61654#issuecomment-879989145

Related

Training model on AWS Deep Learning AMI instance - gets 'killed' with warnings

I am trying to train inception ResNetV2 model on my own dataset on Amazon's Deep Learning AMI
When I try to train on local machine the training starts as usual but when I try to train on aws instance it gets killed.
First I tried to train with MXNET backend . It gave the following error :
Notice that it gets killed.
So in
nano ~/.keras/keras.json
I tried to set image data format to channels_first :
{
"image_data_format": "channels_first",
"backend": "mxnet"
}
Then I got the error:
Traceback (most recent call last):
File "train.py", line 17, in <module>
model = applications.inception_resnet_v2.InceptionResNetV2(include_top=False, weights='imagenet', input_shape = (img_width, img_height, 3))
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/keras_applications/inception_resnet_v2.py", line 243, in InceptionResNetV2
weights=weights)
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/keras_applications/imagenet_utils.py", line 296, in _obtain_input_shape
'`input_shape=' + str(input_shape) + '`')
ValueError: The input must have 3 channels; got `input_shape=(182, 182, 3)`
Then I tried to switch to tensorflow backend to see how it plays out because there might be some misunderstanding on my part on how this process works. But when I switched to tensorflow backend and started training I got the following error :
As you can see it gets killed again.
I am not sure what to do next. Some help would be great.
P.S I am sorry for the screenshots. You're going to have to zoom in a little to get a better view.

Deep Learning AMI was mostly not supported on t2 instance type. It should work on most of the good cpu instance type (like C4, C5) or GPU instance type (G3, P2 and P3) and many other instance type.

multiprocessing for keras model predict with single GPU

Background
I want to predict pathology images using keras with Inception-Resnet_v2. I have trained the model already and got a .hdf5 file. Because the pathology image is very large (for example: 20,000 x 20,000 pixels), so I have to scan the image to get small patches for prediction.
I want to speed up the prediction procedure using multiprocessing lib with python2.7. The main idea is using different subprocesses to scan different lines and then sending patches to model.
I saw somebody suggests importing keras and loading model in subprocesses. But I don't think it is suitable for my task. Loading model usingkeras.models.load_model() one time will take about 47s, which is very time-consuming. So I can't reload the model every time when I start a new subprocess.
Question
My question is can I load the model in my main process and pass it as a parameter to subprocesses?
I have tried two methods but both of them didn't work.
Method 1. Using multiprocessing.Pool
The code is :
import keras
from keras.models import load_model
import multiprocessing
def predict(num,model):
print dir(model)
print num
model.predict("image data, type:list")
if __name__ == '__main__':
model = load_model("path of hdf5 file")
list = [(1,model),(2,model),(3,model),(4,model),(5,model),(6,model)]
pool = multiprocessing.Pool(4)
pool.map(predict,list)
pool.close()
pool.join()
The output is
cPickle.PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed
I searched the error and found Pool can't map unpickelable parameters, so I try method 2.
Method 2. Using multiprocessing.Process
The code is
import keras
from keras.models import load_model
import multiprocessing
def predict(num,model):
print num
print dir(model)
model.predict("image data, type:list")
if __name__ == '__main__':
model = load_model("path of hdf5 file")
list = [(1,model),(2,model),(3,model),(4,model),(5,model),(6,model)]
proc = []
for i in range(4):
proc.append(multiprocessing.Process(predict, list[i]))
proc[i].start()
for i in range(4):
proc[i].join()
In Method 2, I can print dir(model). I think it means the model is passed to subprocesses successfully. But I got this error
E tensorflow/stream_executor/cuda/cuda_driver.cc:1296] failed to enqueue async memcpy from host to device: CUDA_ERROR_NOT_INITIALIZED; GPU dst: 0x13350b2200; host src: 0x2049e2400; size: 4=0x4
The environment which I use:
Ubuntu 16.04, python 2.7
keras 2.0.8 (tensorflow backend)
one Titan X, Driver version 384.98, CUDA 8.0
Looking forward to reply! Thanks!

Maybe you can use apply_async() instead of Pool()
and you can find more details here:
Python multiprocessing pickling error

Multi-processing works on CPU, while model prediction happened in GPU, which there is only one. I cannot see how multi-processing can help you on prediction.
Instead, I think you can use multi-processing to scan different patches, which you seems to have already managed to achieve. Then stack these patches into a batch or batches to predict in parallel in GPU.

As noted by Statham multiprocess requires all args to be compatible with pickle. This blog post describes how to save a keras model as a pickle: [http://zachmoshe.com/2017/04/03/pickling-keras-models.html][1]
It may be a sufficient workaround to get your keras model passed as an arg to multiprocess, but I have not tested the idea myself.
I will also add that I had better luck running two keras processes on a single gpu using windows rather than linux. On linux I was getting out of memory errors on the 2nd process, but the same memory allocation (45% of total GPU ram for each) worked on windows. In my case they were fits - for running predictions only, maybe the memory requirements are less.

tensorflow places softmax op on cpu instead of gpu

I have a tensorflow model with multiple inputs and several layers, and a final softmax layer. The model is trained in Python (using the Keras framework), then saved and inference is done using a C++ program that facilitates a CMake build of TensorFlow (following basically those instructions: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake).
In python (tensorflow-gpu) all ops use the GPU (using log_device_placement):
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.005837: I C:\tf_jenkins\home\workspace\rel-in\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006201: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872]
out/BiasAdd: (BiasAdd)/job:localhost/replica:0/task:0/gpu:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006535: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/Softmax: (Softmax)/job:localhost/replica:0/task:0/gpu:0
To save the graph, the freeze_graph script is used (the script producing the log above loads again the freezed graph in .pb format).
When I use the C++ program and load the freezed graph (following closely the LoadGraph() function in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/label_image/main.cc - ReadBinaryProto() and session->Create()), and log again the device placements, I find that the Softmax is placed on CPU (all others ops are on GPU):
dense_6/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/device:CPU:0
This placement is also confirmed by high CPU/low GPU utilization, and also apparent from profiling the application. The data type of the out layer is float32 (out/Softmax -> (<tf.Tensor 'out/Softmax:0' shape=(?, 1418) dtype=float32>,)).
Further investigation revealed:
Creating the softmax-op in C++ and placing it on GPU explicitly throws this error message:
Cannot assign a device for operation 'tsoftmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
A call to tensorflow::LogAllRegisteredKernels() showed also that Softmax is only available for CPU!
The build directory contains many files related to "softmax" (e.g. `tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.obj.Release.cmake). Don't know how to check every compilation step, though.
when I look into the "tf_core_gpu_kernels.lib" (one can open a .lib with 7Z ;)), there are files like "tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.lib" - so I believe there is nothing wrong with compiling the kernels itself
But: inspecting the "tensorflow.dll" (Dependency Walker) shows that only CPU kernels for Softmax are included (there are functions like const tensorflow::SoftmaxOp<struct Eigen::ThreadPoolDevice,double>, but no functions with GPU such as const tensorflow::SoftplusGradOp<struct Eigen::GpuDevice,float>).
Setup: Tensorflow 1.3.0, Windows 10, GPU: NVidia GTX 1070 (8GB RAM, memory utilization also very low).

I found a workaround - the workaround is to include the tf_core_gpu_kernels.lib in some of the steps (create_def_file.py). More details here: GitHub Issue 15254

RAM consumption regarding cores/

I am working on a 31 ,available, Go of RAM, 12 cores Linux KUbuntu computer.
I produce simulations which calculate functions over 4 dimensions (x,y,z,t).
I define my dimensions as arrays that I numpy.meshgrid for use. So, for each point of time, I calculate for each point x,y,z the result. It comes as heavy calculations with heavy data.
First, I learned how to use it with only one core. It works well and whatever are the size of my "boxs" ( x,y,z). Because of the fact I work a lot with Fourier transform, I define x,y,z,t as powers of 2 : 64,128,256,...
I can,without dificulties, go to x = y = z = t = 512, even if it takes a lot of time to run it (which makes sense). When I do that, I use around 20-30% of the available RAM of the computer. Great.
Then I wanted to use more cores. So I implemented this code :
import multiprocessing as mp
pool = mp.Pool(processes=8)
results = [pool.apply_async(conv_green, args=(tstep, S_, )) for tstep in t]
So here I ask my script to use 8 cores, and define my results as the use of the function "conv_green" with the args "tstep,S_" all along t.
It works pretty well, use 8 cores as expected BUT I can not run any more simulations who use figures equal or above to 512 for x,y,z,t.
This is where my problem is. Technically, switching from the mono core system to multi chanegd nothing to the routine of my calculations. I do not understand why I have enough RAM for 512... in mono core and why,sudenly, when I switch to multi cores, computer does not even want to launch it ( and the error occurs at the" results = pool.apply ..." line)
So if you guys know how this works and why I get this "treshold", thanks for helping me solving out !
Best regards.
PS : this is the error which pops out when it crashes with 512 in multi cores :
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist
packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/alexis/Heat/Simu⁄Lecture Propre/Test Tkinter/Simulation N spots SCAN Tkinter.py", line 280, in
XYslice = array([p.get()[0] for p in results])
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call

For multiprocessing in any language each thread will need private storage which it can write to without interference from the other threads. As soon as interference is possible the data structure has to be locked, which (in the worst case) takes us back to single threading.
It would appear that your large data structure is being copied for each of the threads, effectively multiplying your memory usage by eight when you have eight processors ... or up to 200% of your available RAM.
The best solution would be to prevent the unnecessary copying.
If that's not feasible then all you can do is limit the number of processors it can run on, four should be ok in your instance but make sure your machine has lots of swap space. The swap space also gives you some play to allow the virtual memory to exceed the physical RAM, if the "working set" is small enough you may be able to significantly exceed your physical RAM given enough swap.

Analog output from USB6009 using python and NIDAQmx base on Mac OSX

All,
I'm attempting to use Python and DAQmx Base to record analog input and generate analog output from my USB 6009 device. I've been using a wrapper I found and have been able to get AI but am struggling with AO.
There is a base class NITask which handles task generation etc. The class i'm calling is below. The function throws an error when I try to configure the clock. When I do not there is no error but nor is there voltage generated on the output. Any help would be appreciated.
Thanks!
class AOTask(NITask):
def __init__(self, min=0.0, max=5.0,
channels=["Dev1/ao0"],
timeout=10.0):
NITask.__init__(self)
self.min = min
self.max = max
self.channels = channels
self.timeout = timeout
self.clockSource ="OnboardClock"
sampleRate=100
self.sampleRate = 100
self.timeout = timeout
self.samplesPerChan = 1000
self.numChan = chanNumber(channels)
if self.numChan is None:
raise ValueError("Channel specification is invalid")
chan = ", ".join(self.channels)
self.CHK(self.nidaq.DAQmxBaseCreateTask("",ctypes.byref(self.taskHandle)))
self.CHK(self.nidaq.DAQmxBaseCreateAOVoltageChan(self.taskHandle, "Dev1/ao0", "", float64(self.min), float64(self.max), DAQmx_Val_Volts, None))
self.CHK(self.nidaq.DAQmxBaseCfgSampClkTiming(self.taskHandle, "", float64(self.sampleRate), DAQmx_Val_Rising, DAQmx_Val_FiniteSamps, uInt64(self.samplesPerChan)))
"""Data needs to be of type ndarray"""
def write(self, data):
nWritten = int32()
# data = numpy.float64(3.25)
data = data.astype(numpy.float64)
self.CHK(self.nidaq.DAQmxBaseWriteAnalogF64(self.taskHandle,
int32(1000), 0,float64(-1),DAQmx_Val_GroupByChannel,
data.ctypes.data,None,None))
# if nWritten.value != self.numChan:
# print "Expected to write %d samples!" % self.numChan

Your question covers two problems:
Why does DAQmxBaseCfgSampClkTiming return an error?
Without using that function, why isn't any output generated?
1. Hardware vs Software Timing
rjb3 wrote:
The function throws an error when I try to configure the clock. When I do not there is no error but nor is there voltage generated on the output.
Your program receives the error because the USB 600x devices do not support hardware-timed analog output [1]:
The NI USB-6008/6009 has two independent analog output channels that can generate outputs from 0 to 5 V. All updates of analog output channels are software-timed. GND is the ground-reference signal for the analog output channels.
"Software-timed" means a sample is written on demand by the program whenever DAQmxBaseWriteAnalogF64 is called. If an array of samples is written, then that array is written one at a time. You can learn more about how NI defines timing from the DAQmx help [2]. While that document is for DAQmx, the same concepts apply to DAQmx Base since the behavior is defined by the devices and not their drivers. The differences are in how much of the hardware's capabilities are implemented by the driver -- DAQmx implements everything, while DAQmx Base is a small select subset.
2. No Output When Software Timed
rjb3 wrote:
When I do not there is no error but nor is there voltage generated on the output.
I am not familiar with the Python bindings for the DAQmx Base API, but I can recommend two things:
Try using the installed genVoltage.c C example and confirm that you can see voltage on the ao channel.
Examples are installed here: /Applications/National Instruments/NI-DAQmx Base/examples
If you see output, you've confirmed that the device and driver are working correctly, and that the bug is likely in the python file.
If you don't see output, then the device or driver has a problem, and the best place to get help troubleshooting is the NI discussion forums at http://forums.ni.com.
Try porting genVoltage.c using the python bindings. At first glance, I would try:
Use DAQmxBaseStartTask before DAQmxBaseWriteAnalogF64
Or set the autostart parameter in your call to DAQmxBaseWriteAnalogF64 to true.
References
[1] NI USB-6008/6009 User Guide And Specifications :: Analog Output (page 16)
http://digital.ni.com/manuals.nsf/websearch/CE26701AA052E1F0862579AD0053BE19
[2] Timing, Hardware Versus Software
http://zone.ni.com/reference/en-XX/help/370466V-01/TOC11.htm

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js