Tensorflow C API Selecting GPU - c++

I am using the TensorFlow C API to run models saved/frozen in Python. We used to run these models on the CPU but recently switched to the GPU for performance. To interact with the C API we use a wrapper library called CPPFlow (https://github.com/serizba/cppflow). I recently updated this library so that we can pass in GPU config options and control GPU memory allocation. However, we now also have systems with multiple GPUs, which is causing some issues. It seems like I can't get TensorFlow to use the same GPU as our software.
I use the visible_device_list parameter with the same GPU ID as our software. If I set our software to run on device 1 and TensorFlow to device 1, TensorFlow picks device 2. If I set our software to device 1 and TensorFlow to device 2, both use the same GPU.
How does TensorFlow order GPU devices, and do I need to use another method to select the device manually? Everywhere I look suggests it can be done with the GPU config options.

One way to set the device is to serialize the config in Python, print the resulting bytes as hex, and then pass those bytes to the C API. For example:
Sample 1:
import tensorflow as tf

gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list='1')
config = tf.ConfigProto(gpu_options=gpu_options)
serialized = config.SerializeToString()
print(list(map(hex, serialized)))
Sample 2:
import tensorflow as tf
config = tf.compat.v1.ConfigProto(device_count={"CPU":1}, inter_op_parallelism_threads=1,intra_op_parallelism_threads=1)
ser = config.SerializeToString()
list(map(hex,ser))
Out[]:
['0xa', '0x7', '0xa', '0x3', '0x43', '0x50', '0x55', '0x10', '0x1',
 '0x10', '0x1', '0x28', '0x1']
Use these bytes in the C API as:
uint8_t config[13] = {0xa, 0x7, 0xa, ... , 0x28, 0x1};
TF_SetConfig(opts, (void*)config, 13, status);
For more details:
https://github.com/tensorflow/tensorflow/issues/29217
https://github.com/cyberfire/tensorflow-mtcnn/issues/1
https://github.com/tensorflow/tensorflow/issues/27114
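For reference, here is a minimal sketch of how such serialized bytes are applied before creating a session with the C API. The byte values below are simply the Sample 2 output above; for the GPU case you would paste the bytes printed by Sample 1 instead. Only the TF_* calls shown (TF_NewSessionOptions, TF_SetConfig, TF_NewGraph, TF_NewSession and the matching delete functions) are real C API entry points; graph loading and session running are only indicated by comments.

#include <tensorflow/c/c_api.h>
#include <cstdint>
#include <cstdio>

int main() {
  // Serialized ConfigProto (here: the Sample 2 bytes; substitute your own).
  const std::uint8_t config[13] = {0xa, 0x7, 0xa, 0x3, 0x43, 0x50, 0x55,
                                   0x10, 0x1, 0x10, 0x1, 0x28, 0x1};

  TF_Status* status = TF_NewStatus();
  TF_SessionOptions* opts = TF_NewSessionOptions();
  TF_SetConfig(opts, config, sizeof(config), status);
  if (TF_GetCode(status) != TF_OK) {
    std::fprintf(stderr, "TF_SetConfig failed: %s\n", TF_Message(status));
    return 1;
  }

  // Load or build the graph here (e.g. TF_GraphImportGraphDef), then:
  TF_Graph* graph = TF_NewGraph();
  TF_Session* session = TF_NewSession(graph, opts, status);
  if (TF_GetCode(status) != TF_OK) {
    std::fprintf(stderr, "TF_NewSession failed: %s\n", TF_Message(status));
    return 1;
  }

  // ... TF_SessionRun(...) with the configured session ...

  TF_CloseSession(session, status);
  TF_DeleteSession(session, status);
  TF_DeleteGraph(graph);
  TF_DeleteSessionOptions(opts);
  TF_DeleteStatus(status);
  return 0;
}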

You can control which GPUs TensorFlow sees, and their order, by setting the environment variable CUDA_VISIBLE_DEVICES. It must be set before TensorFlow initializes CUDA (i.e. before the first session touches the GPU).
// Set TF to use GPU:1 and GPU:0 (in that order)
setenv( "CUDA_VISIBLE_DEVICES", "1,0", 1 );
// Set TF to use only GPU:0
setenv( "CUDA_VISIBLE_DEVICES", "0", 1 );
// Set TF to use no GPUs at all
setenv( "CUDA_VISIBLE_DEVICES", "-1", 1 );

Related

How to change per_process_gpu_memory_fraction in tensorflow using c++?

I have developed two applications based on TF in C++; both are used as libraries. In the calling executable, library1 is called first, then library2. In library1's initialization the GPU memory fraction is set to 0.5, some inference is run, and the session is closed. Then library2 is called and the GPU memory fraction is set to 0.8, but the setting has no effect: the GPU memory allocation does not change. Both libraries have the same initialization code, just with different fraction values.
int XXXLib::init(double per_process_gpu_memory_fraction)
{
    SessionOptions options;
    ConfigProto* config = &options.config;
    GPUOptions* gpu_options = config->mutable_gpu_options();

    // For library1, fraction = 0.5; for library2, fraction = 0.8.
    gpu_options->set_per_process_gpu_memory_fraction(per_process_gpu_memory_fraction);

    Status status = NewSession(options, &_session);
    return status.ok() ? 0 : -1;
}
It seems that once set_per_process_gpu_memory_fraction() has been called, the GPU memory fraction for this process is fixed; even when another NewSession() is created, the original fraction value is used.
Should different apps (libraries) use different sessions?
Is the GPU memory fraction tied to the session or to the process?
How can I change the fraction in a different session within the same process?
Some env info:
Have I written custom code? NO
OS Platform and Distribution? Win10 Pro
TensorFlow installed from? Source code
TensorFlow version? 1.9
CUDA/cuDNN version? CUDA9.0, cudnn 7.05
GPU model and memory? GTX1080 with 8GB memory
It is unfortunate, but in the current TensorFlow (1.11) the GPU memory allocator is created once (per GPU device) - the first time a session is created in the process. Changing per_process_gpu_memory_fraction in the following sessions will not have any effect.
Regarding your library, I would suggest not creating sessions inside it. Ask the user to provide a session that they configure however they wish. Alternatively, you can just build a graph and return the operations to run; the user can then run them as they see fit.
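As a rough illustration of that suggestion (a sketch only; the XXXLib names mirror the question and are otherwise hypothetical), the library could accept a session that the caller owns and configures, so NewSession - and therefore the one-time GPU allocator setup - happens exactly once per process:

#include "tensorflow/core/public/session.h"

using tensorflow::NewSession;
using tensorflow::Session;
using tensorflow::SessionOptions;
using tensorflow::Status;

class XXXLib {
 public:
  // The caller creates and configures the session once and hands it in;
  // the library never calls NewSession itself.
  int init(Session* session) {
    _session = session;
    return _session != nullptr ? 0 : -1;
  }

 private:
  Session* _session = nullptr;  // not owned
};

// Caller side: configure the fraction once for the whole process.
// SessionOptions options;
// options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.8);
// Session* session = nullptr;
// Status status = NewSession(options, &session);
// lib1.init(session);
// lib2.init(session);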

tensorflow places softmax op on cpu instead of gpu

I have a TensorFlow model with multiple inputs, several layers, and a final softmax layer. The model is trained in Python (using the Keras framework), then saved, and inference is done using a C++ program built against a CMake build of TensorFlow (following basically these instructions: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake).
In python (tensorflow-gpu) all ops use the GPU (using log_device_placement):
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.005837: I C:\tf_jenkins\home\workspace\rel-in\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006201: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872]
out/BiasAdd: (BiasAdd)/job:localhost/replica:0/task:0/gpu:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006535: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/Softmax: (Softmax)/job:localhost/replica:0/task:0/gpu:0
To save the graph, the freeze_graph script is used (the script that produced the log above loads the frozen graph in .pb format again).
When I use the C++ program to load the frozen graph (closely following the LoadGraph() function in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/label_image/main.cc - ReadBinaryProto() and session->Create()) and log the device placements again, I find that the Softmax is placed on the CPU while all other ops are on the GPU:
dense_6/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/device:CPU:0
This placement is also confirmed by high CPU/low GPU utilization, and also apparent from profiling the application. The data type of the out layer is float32 (out/Softmax -> (<tf.Tensor 'out/Softmax:0' shape=(?, 1418) dtype=float32>,)).
Further investigation revealed:
Creating the softmax op in C++ and placing it on the GPU explicitly throws this error message:
Cannot assign a device for operation 'tsoftmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
A call to tensorflow::LogAllRegisteredKernels() also showed that Softmax is only available for CPU!
The build directory contains many files related to "softmax" (e.g. tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.obj.Release.cmake). I don't know how to check every compilation step, though.
When I look into tf_core_gpu_kernels.lib (one can open a .lib with 7-Zip ;)), there are files like tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.lib, so I believe there is nothing wrong with compiling the kernels themselves.
But: inspecting tensorflow.dll (with Dependency Walker) shows that only CPU kernels for Softmax are included (there are functions like const tensorflow::SoftmaxOp<struct Eigen::ThreadPoolDevice,double>, but no GPU variants such as const tensorflow::SoftplusGradOp<struct Eigen::GpuDevice,float>).
Setup: Tensorflow 1.3.0, Windows 10, GPU: NVidia GTX 1070 (8GB RAM, memory utilization also very low).
I found a workaround: include tf_core_gpu_kernels.lib in one of the build steps (create_def_file.py). More details here: GitHub Issue 15254
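If you run into the same symptom, a quick check from the C++ side before rebuilding anything is to turn on device placement logging and soft placement in the SessionOptions. set_log_device_placement and set_allow_soft_placement are standard ConfigProto fields; the surrounding snippet is just a sketch:

#include "tensorflow/core/public/session.h"

// Returns a session that prints every op's device assignment and falls back
// to the CPU when a GPU kernel (e.g. Softmax in the broken build) is missing,
// instead of failing with "no supported kernel for GPU devices".
tensorflow::Session* MakeLoggingSession() {
  tensorflow::SessionOptions options;
  options.config.set_log_device_placement(true);
  options.config.set_allow_soft_placement(true);
  tensorflow::Session* session = nullptr;
  tensorflow::Status status = tensorflow::NewSession(options, &session);
  return status.ok() ? session : nullptr;
}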

Tensorflow does not recognize GPU on AWS

So here it goes: I wanted to use TensorFlow with a GPU on AWS, on a p2.xlarge instance. Unfortunately, something must have gone wrong and I keep getting:
InvalidArgumentError (see above for traceback): Cannot assign a device to node 'Variable_1': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
I checked both CUDA and cuDNN:
nvcc -V
cat /usr/local/cuda/include/cudnn.h
and got 8.0 and 5.1, respectively.
I use the GPU like this:
with tf.device('/gpu:0'):
    a = tf.Variable(tf.truncated_normal([100, 100]))
    b = tf.Variable(tf.truncated_normal([100, 1000]))

with tf.Session() as sess:
    sess.run(tf.matmul(a, b))
Happy to post more details if necessary - I just don't know yet what would be useful.
I suppose you're trying to set up an EC2 instance from scratch? That can be difficult.
Instead, I'd strongly recommend using the Deep Learning AMI (https://aws.amazon.com/machine-learning/amis/). It comes preinstalled with everything you need (drivers, popular DL libraries, etc.). It's also free to use, you just pay for the instance itself.

Analog output from a USB-6009 using Python and NI-DAQmx Base on Mac OS X

All,
I'm attempting to use Python and DAQmx Base to record analog input and generate analog output from my USB-6009 device. I've been using a wrapper I found and have been able to get AI working, but I'm struggling with AO.
There is a base class, NITask, which handles task generation etc. The class I'm calling is below. The function throws an error when I try to configure the clock; when I don't, there is no error, but no voltage is generated on the output either. Any help would be appreciated.
Thanks!
class AOTask(NITask):
    def __init__(self, min=0.0, max=5.0,
                 channels=["Dev1/ao0"],
                 timeout=10.0):
        NITask.__init__(self)
        self.min = min
        self.max = max
        self.channels = channels
        self.timeout = timeout
        self.clockSource = "OnboardClock"
        sampleRate = 100
        self.sampleRate = 100
        self.timeout = timeout
        self.samplesPerChan = 1000
        self.numChan = chanNumber(channels)
        if self.numChan is None:
            raise ValueError("Channel specification is invalid")

        chan = ", ".join(self.channels)
        self.CHK(self.nidaq.DAQmxBaseCreateTask("", ctypes.byref(self.taskHandle)))
        self.CHK(self.nidaq.DAQmxBaseCreateAOVoltageChan(self.taskHandle, "Dev1/ao0", "",
                                                         float64(self.min), float64(self.max),
                                                         DAQmx_Val_Volts, None))
        self.CHK(self.nidaq.DAQmxBaseCfgSampClkTiming(self.taskHandle, "", float64(self.sampleRate),
                                                      DAQmx_Val_Rising, DAQmx_Val_FiniteSamps,
                                                      uInt64(self.samplesPerChan)))

    def write(self, data):
        """Data needs to be of type ndarray."""
        nWritten = int32()
        # data = numpy.float64(3.25)
        data = data.astype(numpy.float64)
        self.CHK(self.nidaq.DAQmxBaseWriteAnalogF64(self.taskHandle,
                                                    int32(1000), 0, float64(-1),
                                                    DAQmx_Val_GroupByChannel,
                                                    data.ctypes.data, None, None))
        # if nWritten.value != self.numChan:
        #     print "Expected to write %d samples!" % self.numChan
Your question covers two problems:
Why does DAQmxBaseCfgSampClkTiming return an error?
Without using that function, why isn't any output generated?
1. Hardware vs Software Timing
rjb3 wrote:
The function throws an error when I try to configure the clock. When I do not there is no error but nor is there voltage generated on the output.
Your program receives the error because the USB 600x devices do not support hardware-timed analog output [1]:
The NI USB-6008/6009 has two independent analog output channels that can generate outputs from 0 to 5 V. All updates of analog output channels are software-timed. GND is the ground-reference signal for the analog output channels.
"Software-timed" means a sample is written on demand by the program whenever DAQmxBaseWriteAnalogF64 is called. If an array of samples is written, then that array is written one at a time. You can learn more about how NI defines timing from the DAQmx help [2]. While that document is for DAQmx, the same concepts apply to DAQmx Base since the behavior is defined by the devices and not their drivers. The differences are in how much of the hardware's capabilities are implemented by the driver -- DAQmx implements everything, while DAQmx Base is a small select subset.
2. No Output When Software Timed
rjb3 wrote:
When I do not there is no error but nor is there voltage generated on the output.
I am not familiar with the Python bindings for the DAQmx Base API, but I can recommend two things:
Try using the installed genVoltage.c C example and confirm that you can see voltage on the ao channel.
Examples are installed here: /Applications/National Instruments/NI-DAQmx Base/examples
If you see output, you've confirmed that the device and driver are working correctly, and that the bug is likely in the python file.
If you don't see output, then the device or driver has a problem, and the best place to get help troubleshooting is the NI discussion forums at http://forums.ni.com.
Try porting genVoltage.c using the python bindings. At first glance, I would try:
Use DAQmxBaseStartTask before DAQmxBaseWriteAnalogF64
Or set the autostart parameter in your call to DAQmxBaseWriteAnalogF64 to true.
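Since the answer points at genVoltage.c, here is a rough C-style outline of that flow for the software-timed case (start the task, then write on demand). Treat it as a sketch: the function names follow the DAQmx Base C API, but check the exact signatures against NIDAQmxBase.h on your system.

#include <NIDAQmxBase.h>
#include <stdio.h>

// Abort on any DAQmx Base error (negative return codes are errors).
#define CHK(call) do { int32 e = (call); if (e < 0) { \
    printf("DAQmx Base error %d\n", (int)e); return 1; } } while (0)

int main(void) {
  TaskHandle task = 0;
  float64 sample = 3.25;   // one on-demand sample per write
  int32 written = 0;

  CHK(DAQmxBaseCreateTask("", &task));
  CHK(DAQmxBaseCreateAOVoltageChan(task, "Dev1/ao0", "",
                                   0.0, 5.0, DAQmx_Val_Volts, NULL));
  // No DAQmxBaseCfgSampClkTiming here: USB-6008/6009 analog output is
  // software-timed only, which is why that call fails in the question.
  CHK(DAQmxBaseStartTask(task));
  CHK(DAQmxBaseWriteAnalogF64(task, 1, 0, 10.0, DAQmx_Val_GroupByChannel,
                              &sample, &written, NULL));
  DAQmxBaseStopTask(task);
  DAQmxBaseClearTask(task);
  return 0;
}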
References
[1] NI USB-6008/6009 User Guide And Specifications :: Analog Output (page 16)
http://digital.ni.com/manuals.nsf/websearch/CE26701AA052E1F0862579AD0053BE19
[2] Timing, Hardware Versus Software
http://zone.ni.com/reference/en-XX/help/370466V-01/TOC11.htm

OpenCL/OpenGL Interop with Multiple GPUs

I'm having trouble using multiple GPUs with OpenCL/OpenGL interop. I'm trying to write an application which renders the result of an intensive computation. In the end it will run an optimization problem, and then, based on the result, render something to the screen. As a test case, I'm starting with the particle simulation example code from this course: http://web.engr.oregonstate.edu/~mjb/sig13/
The example code creates an OpenGL context, then creates an OpenCL context that shares state with it, using the cl_khr_gl_sharing extension. Everything works fine when I use a single GPU. Creating a context looks like this:
3. create an opencl context based on the opengl context:
cl_context_properties props[] =
{
    CL_GL_CONTEXT_KHR,   (cl_context_properties) glXGetCurrentContext( ),
    CL_GLX_DISPLAY_KHR,  (cl_context_properties) glXGetCurrentDisplay( ),
    CL_CONTEXT_PLATFORM, (cl_context_properties) Platform,
    0
};

cl_context Context = clCreateContext( props, 1, Device, NULL, NULL, &status );
if( status != CL_SUCCESS )
{
    PrintCLError( status, "clCreateContext: " );
    exit(1);
}
Later on, the example creates shared CL/GL buffers with clCreateFromGLBuffer.
Now, I would like to create a context from two GPU devices:
cl_context Context = clCreateContext( props, 2, Device, NULL, NULL, &status );
I've successfully opened the devices, and can query that they both support cl_khr_gl_sharing, and both work individually. However, when attempting to create the context as above, I get
CL_INVALID_OPERATION
which is an error code added by the cl_khr_gl_sharing extension. The extension description (linked above) says:
CL_INVALID_OPERATION if a context or share group object was specified for one of CGL, EGL, GLX, or WGL and any of the following conditions hold:
The OpenGL implementation does not support the window-system binding API for which a context or share group object was specified.
More than one of the attributes CL_CGL_SHAREGROUP_KHR, CL_EGL_DISPLAY_KHR, CL_GLX_DISPLAY_KHR, and CL_WGL_HDC_KHR is set to a non-default value.
Both of the attributes CL_CGL_SHAREGROUP_KHR and CL_GL_CONTEXT_KHR are set to non-default values.
Any of the devices specified in the devices argument cannot support OpenCL objects which share the data store of an OpenGL object, as described in section 9.12.
That description doesn't seem to fit any of my cases exactly. Is it not possible to do OpenCL/OpenGL interop with multiple GPUs? Or is it because I have heterogeneous hardware? I printed out a few parameters from my enumerated devices; I've just taken two random GPUs that I could get my hands on.
PlatformID: 18483216
Num Devices: 2
-------- Device 00 ---------
CL_DEVICE_NAME: GeForce GTX 285
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DEVICE_VERSION: OpenCL 1.0 CUDA
CL_DRIVER_VERSION: 304.88
CL_DEVICE_MAX_COMPUTE_UNITS: 30
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1476
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
-------- Device 01 ---------
CL_DEVICE_NAME: Quadro FX 580
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DEVICE_VERSION: OpenCL 1.0 CUDA
CL_DRIVER_VERSION: 304.88
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1125
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
cl_khr_gl_sharing is supported on dev 0.
cl_khr_gl_sharing is supported on dev 1.
Note that if I create the context without the interop portion (so that the props array looks like the one below), the context is created successfully, but obviously it cannot share buffers with the OpenGL side of the application.
cl_context_properties props[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties) Platform,
    0
};
Several related questions and examples:
Here's a related example of a pure OpenGL approach to shared processing between multiple GPUs.
Another pure OpenGL multiple-GPU question.
A producer/consumer example using multiple GPUs; see the producer source file for the calls that make a context current (it looks Windows-specific, but the flow will be similar elsewhere). See glContext for details.
bool stageProducer::preExecution()
{
    if(!glContext::getInstance().makeCurrent(_rc))
    {
        window::getInstance().messageBoxWithLastError("wglMakeCurrent");
        return false;
    }
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, _fboID);
    return true;
}
OpenCL specific, but relevant to this question:
"If you enqueue a write to the buffer on queueA(deviceA) then OpenCL will use that device to do the write. However, if you then use the buffer on queueB(deviceB) in the same context, OpenCL will recognize that deviceA has the most recent data and will move it over to deviceB before using it. In short, as long as you use events to ensure that no two devices are trying to access the same memory object at the same time, OpenCL will make sure that each use of the memory object has the most recent data, regardless of which device last used it."
I assume that when you take OpenGL out of the equation, sharing memory between GPUs works as expected?
When you call these two lines:
CL_GL_CONTEXT_KHR, (cl_context_properties) glXGetCurrentContext( ),
CL_GLX_DISPLAY_KHR, (cl_context_properties) glXGetCurrentDisplay( ),
the calls need to come from inside a thread that has its own OpenGL context current (glXGetCurrentContext and glXGetCurrentDisplay return whatever is current on the calling thread). You can usually only associate one OpenCL context with one OpenGL context, for one device, at a time per thread.
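If only the rendering GPU needs to talk to OpenGL, one option (depending on your driver) is to keep the GL sharing on a single device and give the second GPU its own, non-shared context. A rough sketch, assuming the Khronos headers and a driver that exposes clGetGLContextInfoKHR, the query that reports which device owns the current GL context:

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/glx.h>

// Sketch: create the GL-sharing context only for the device that drives the
// current OpenGL context, plus a plain (non-shared) context for the other GPU.
// clGetGLContextInfoKHR comes from cl_khr_gl_sharing and is loaded at runtime;
// on OpenCL 1.2+ use clGetExtensionFunctionAddressForPlatform instead.
cl_context CreateInteropAndComputeContexts(cl_platform_id platform,
                                           cl_device_id devices[2],
                                           cl_context* computeOnlyCtx) {
  cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties) glXGetCurrentContext(),
    CL_GLX_DISPLAY_KHR,  (cl_context_properties) glXGetCurrentDisplay(),
    CL_CONTEXT_PLATFORM, (cl_context_properties) platform,
    0
  };

  // Ask which device the current GL context actually lives on.
  clGetGLContextInfoKHR_fn getGLCtxInfo = (clGetGLContextInfoKHR_fn)
      clGetExtensionFunctionAddress("clGetGLContextInfoKHR");
  cl_device_id glDevice = 0;
  getGLCtxInfo(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
               sizeof(glDevice), &glDevice, NULL);

  // Interop context: only the GL-owning device.
  cl_int status;
  cl_context interopCtx = clCreateContext(props, 1, &glDevice,
                                          NULL, NULL, &status);

  // Separate context for the other device, without GL sharing.
  cl_device_id other = (devices[0] == glDevice) ? devices[1] : devices[0];
  cl_context_properties plainProps[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties) platform, 0
  };
  *computeOnlyCtx = clCreateContext(plainProps, 1, &other, NULL, NULL, &status);

  return interopCtx;
}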