C++ design for CUDA codes

I have a piece of C++ CUDA code which I have to write with the data variables declared as float. I also have to rewrite the same code with the data variables declared as double.
What is a good design to handle a situation like this in CUDA?
I do not want to maintain two copies of the same code, because then any future change would have to be made in two otherwise identical places. I also want to keep the code clean, without too many #ifdefs to switch between float and double.
Can anyone please suggest a good (in terms of maintenance and readability) design?

CUDA supports type templating, and it is without doubt the most efficient way to implement kernel code that must handle multiple types.
As a trivial example, consider a simple BLAS AXPY type kernel:
template<typename Real>
__global__ void axpy(const Real *x, Real *y, const int n, const Real a)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (; tid < n; tid += stride) {
        Real yval = y[tid];
        yval += a * x[tid];
        y[tid] = yval;
    }
}
This templated kernel can be instantiated for both double and single precision without loss of generality:
template __global__ void axpy<float>(const float *, float *, const int, const float);
template __global__ void axpy<double>(const double *, double *, const int, const double);
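A thin host-side wrapper keeps the launch logic single-sourced too. A minimal sketch (run_axpy is a hypothetical helper, not from the answer; the grid-stride loop tolerates any grid size):

template<typename Real>
void run_axpy(const Real *x, Real *y, const int n, const Real a)
{
    const int block = 256;
    const int grid = (n + block - 1) / block;
    axpy<Real><<<grid, block>>>(x, y, n, a);
}

// Both precisions now share a single code path:
// run_axpy<float>(d_xf, d_yf, n, 2.0f);
// run_axpy<double>(d_xd, d_yd, n, 2.0);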
The thrust template library, which ships with all recent versions of the CUDA toolkit, makes extensive use of this facility to implement type-agnostic algorithms.

In addition to templating, you may be able to achieve what you want with a single typedef:
typedef float mysize; // or double
Then just use mysize throughout where you would use float or double.
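As a sketch of how that plays out (reusing the AXPY example from the previous answer, written against the alias), the whole program is rebuilt in the other precision by changing one line:

typedef float mysize; // change this one line to double and recompile

__global__ void axpy(const mysize *x, mysize *y, const int n, const mysize a)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        y[tid] += a * x[tid];
}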
You might be interested in the simpleTemplates sample code, and there are other templated CUDA examples as well, in addition to thrust where, as talonmies states, templating is used extensively. Thrust provides many other benefits to C++ programmers as well.

Related

TensorFlow Eigen underlying execution model

I have been writing a TensorFlow custom op and hence am dealing with some matrix operations using the Eigen library. I am trying to understand how the Eigen library executes the operations. I have the code below:
void __attribute__((optimize("O0"))) quantizeDequantize(const GPUDevice& d, TTypes<float>::ConstMatrix inputs,
                                                        float delta, float offset, float minVal, float maxVal,
                                                        TTypes<float>::Matrix outputs, int channel)
{
    float invScale = 1.0f / delta;
    const auto clampedTensor = inputs.chip<1>(channel).cwiseMax(minVal).cwiseMin(maxVal);
    const auto tensor = (clampedTensor * invScale).round() + offset;
    const auto tensor_2 = (tensor - offset) * delta;
    outputs.chip<1>(channel).device(d) = clampedTensor; // line taking the most time
}
If I disable the line below, the code is almost 7 times faster when running on a large model compared to having the line in (I understand the output won't be correct).
outputs.chip<1>(channel).device(d) = clampedTensor;
But if I use the following code instead, the execution time is pretty much the same as with all the code in.
void __attribute__((optimize("O0"))) quantizeDequantize(const GPUDevice& d, TTypes<float>::ConstMatrix inputs,
                                                        float delta, float offset, float minVal, float maxVal,
                                                        TTypes<float>::Matrix outputs, int channel)
{
    outputs.chip<1>(channel).device(d) = inputs.chip<1>(channel);
}
The above two experiments lead me to infer the following:
The Eigen backend does not run operations whose intermediate results are never used to generate the output. Is that correct?
If the above is true, how does the Eigen library know the graph? Does it figure these details out at compile time, similar to how GCC optimizes code?
Does adding __attribute__((optimize("O0"))) make any difference to the way the Eigen backend executes the above code?
Eigen seems to have answered these questions here: https://eigen.tuxfamily.org/dox/TopicLazyEvaluation.html
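In short, Eigen builds expression templates: operators return lightweight expression objects, and the actual arithmetic only runs when an expression is assigned to a concrete destination. That matches the experiments above: the chip/cwiseMax/round lines only build expression objects, and no device work is enqueued until the .device(d) = assignment. A minimal sketch of the idea with plain Eigen (outside TensorFlow, assuming Eigen/Dense is available):

#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(3, 3);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(3, 3);

    auto expr = a * 2.0f + b; // builds an expression object; no arithmetic yet

    Eigen::MatrixXf c = expr; // evaluation happens here, on assignment

    return 0;
}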

Fastest (or most elegant) way of passing constant arguments to a CUDA kernel

Let's say I want a CUDA kernel that needs to do lots of stuff, but there are some parameters that are constant across all launches. These arguments are passed to the main program as input, so they cannot be defined in a #define.
The kernel will run multiple times (around 65K launches) and it needs those parameters (and some other inputs) to do its maths.
My question is: what is the fastest (or, failing that, the most elegant) way of passing these constants to the kernel?
The constants are float* or int* arrays of 2 or 3 elements each, and there will be around 5-10 of them.
A toy example with two constants, const1 and const2:
__global__ void kernelToyExample(int inputdata, ?????)
{
    value = inputdata * const1[0] + const2[1] / const1[2];
}
Is it better to do
__global__ void kernelToyExample(int inputdata, float *const1, float *const2)
{
    value = inputdata * const1[0] + const2[1] / const1[2];
}
or
__global__ void kernelToyExample(int inputdata, float const1x, float const1y, float const1z, float const2x, float const2y)
{
    value = inputdata * const1x + const2y / const1z;
}
Or is it better to declare them in some global read-only memory and let the kernels read from there? If so, L1, L2, global? Which one?
Is there a better way I don't know of?
Running on a Tesla K40.
Just pass them by value. The compiler will automagically put them in the optimal place to facilitate cached broadcast to all threads in each block - either shared memory in compute capability 1.x devices, or constant memory/constant cache in compute capability >= 2.0 devices.
For example, if you had a long list of arguments to pass to the kernel, a struct passed by value is a clean way to go:
struct arglist {
    float magicfloat_1;
    float magicfloat_2;
    //......
    float magicfloat_19;
    int magicint1;
    //......
};

__global__ void kernel(...., const arglist args)
{
    // you get the idea
}
[standard disclaimer: written in browser, not real code, caveat emptor]
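Usage is then straightforward (a hypothetical sketch in the same spirit as the pseudocode above): fill the struct on the host and pass it at launch, and the driver copies it into the kernel's parameter space:

arglist args;
args.magicfloat_1 = 1.0f;
// ... fill in the remaining members ...
args.magicint1 = 42;

kernel<<<grid, block>>>(/* other arguments, */ args);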
If it turned out one of your magicints actually only takes one of a small number of values which you know beforehand, then templating is an extremely powerful tool:
template<int magiconstant1>
__global__ void kernel(....)
{
    for (int i = 0; i < magiconstant1; ++i) {
        // .....
    }
}

template __global__ void kernel<3>(....);
template __global__ void kernel<4>(....);
template __global__ void kernel<5>(....);
The compiler is smart enough to recognise that magiconstant1 makes the loop trip count known at compile time and will automatically unroll the loop for you. Templating is a very powerful technique for building fast, flexible codebases, and you would be well advised to become accustomed to it if you haven't already done so.
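A common companion pattern, sketched here under the same caveat-emptor terms as the code above, is a host-side switch that maps the run-time value onto the compile-time instantiations:

void launch(int magicvalue, dim3 grid, dim3 block)
{
    switch (magicvalue) {
    case 3: kernel<3><<<grid, block>>>(/* .... */); break;
    case 4: kernel<4><<<grid, block>>>(/* .... */); break;
    case 5: kernel<5><<<grid, block>>>(/* .... */); break;
    default: break; // fall back to a non-templated kernel or report an error
    }
}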

error: cannot convert 'float*' to 'qreal* {aka double*}' in initialization

I'm trying to compile an old Qt project and I encounter this error:
error: cannot convert 'float*' to 'qreal* {aka double*}' in initialization
Here's the fragment of code:
void Camera::loadProjectionMatrix()
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    qreal *dataMat = projectionMatrix_.data();
    GLfloat matriceArray[16];
    for (int i = 0; i < 16; ++i)
        matriceArray[i] = dataMat[i];
    glMultMatrixf(matriceArray);
}
What are my options to overcome this error?
The projection matrix will return a float* to you, as per the documentation:
float *QMatrix4x4::data()
Returns a pointer to the raw data of this matrix.
The best practice would be to eliminate qreal usage from your codebase regardless of this particular case. During the Qt 5 refactoring, the ancient qreal concept was dropped as much as possible, and it definitely should not be used much in new code where the API deals with float.
The recommendation these days is to use float in such cases. This is a bit historical, really. Back then, it made sense to typedef qreal to double where available, and to float where not, e.g. on ARM platforms. See the old documentation:
typedef qreal
Typedef for double on all platforms except for those using CPUs with ARM architectures. On ARM-based platforms, qreal is a typedef for float for performance reasons.
In Qt 5, the documentation is slightly different, although the main concept seems to have remained the same:
typedef qreal
Typedef for double unless Qt is configured with the -qreal float option.
I would fix your code the following way:
void Camera::loadProjectionMatrix()
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    float *dataMat = projectionMatrix_.data();
    GLfloat matriceArray[16];
    for (int i = 0; i < 16; ++i)
        matriceArray[i] = dataMat[i];
    glMultMatrixf(matriceArray);
}
Strictly speaking, you could also solve the issue in an alternative way, namely by using this method rather than data():
float &QMatrix4x4::operator()(int row, int column)
Returns a reference to the element at position (row, column) in this matrix so that the element can be assigned to.
In which case, you could even eliminate the dataMat variable and assign the items directly to your matriceArray in the iteration.
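A sketch of that variant, assuming projectionMatrix_ is a QMatrix4x4 (whose storage, like data(), is column-major):

GLfloat matriceArray[16];
for (int col = 0; col < 4; ++col)
    for (int row = 0; row < 4; ++row)
        matriceArray[col * 4 + row] = projectionMatrix_(row, col);
glMultMatrixf(matriceArray);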
Going even further, you should consider using a Qt library for this common task, e.g. the OpenGL classes in QtGui or Qt3D. It only makes sense to mess with low-level OpenGL API calls if you are doing something custom.
Apparently, projectionMatrix_.data() returns a float*, and you cannot assign a float* to a double* (which is what qreal* is in this case).
Use
float *dataMat = projectionMatrix_.data();
or
auto dataMat = projectionMatrix_.data();
instead. The latter has the advantage that it remains correct even if the return type of the function changes for some reason (although that is nothing to expect from a mature library), and you cannot get the type wrong by accident.

ptxas "double is not supported" warning when using thrust::sort on a struct array

I'm trying to sort an array of structs on my GPU with thrust::sort. However, when I compile with nvcc, I get this warning:
ptxas /tmp/tmpxft_00005186_00000000-5_antsim.ptx, line 1520; warning : Double is not supported. Demoting to float
I've isolated the problem to my call to thrust::sort, here:
thrust::sort(thrustAnts, thrustAnts + NUM_ANTS, antSortByX());
thrustAnts is an array of Ant structs located on the GPU, while antSortByX is a functor as defined below:
typedef struct {
    float posX;
    float posY;
    float direction;
    float speed;
    u_char life;
    u_char carrying;
    curandState rngState;
} Ant;

struct antSortByX {
    __host__ __device__ bool operator()(const Ant &antOne, const Ant &antTwo) const {
        return antOne.posX < antTwo.posX;
    }
};
It seems to me as though there aren't any doubles in this, though I suspect the less-than operator in my functor evaluates those floats as doubles. I can solve this problem by compiling with -arch sm_13, but I'm curious as to why this is complaining at me in the first place.
The demotion happens because CUDA devices only support double precision calculations from compute capability 1.3 onwards. NVCC knows the device specifications and demotes every double to float for devices with CC < 1.3, simply because the hardware cannot handle double precision.
A good feature list can be found on Wikipedia: CUDA
That you can't see any doubles in this code doesn't mean they are not there. Most commonly this error results from a missing f suffix on a floating point constant: the compiler performs an implicit promotion of float to double when a double is part of an expression, and a floating point constant without the f suffix is a double value. However, for the less-than operator no such promotion should happen here, since no constant expressions are involved.
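For illustration, the classic offender looks like this (a made-up device function, not from the question's code):

__device__ float half_of(float x)
{
    return x * 0.5;    // 0.5 is a double literal: x is promoted and the
                       // multiplication happens in double precision
    // return x * 0.5f; // the f suffix keeps the whole expression in float
}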
I can only speculate, but it seems to me that a double precision value could be used somewhere within the thrust::sort implementation itself, since you only supply the comparison functor to a higher-order function (a function that takes functions as parameters).

Int or Unsigned Int to float without getting a warning

Sometimes I have to convert from an unsigned integer value to a float. For example, my graphics engine exposes SetScale(float x, float y, float z), which takes floats, and I have an object whose size is an unsigned int. I want to convert the unsigned int to a float to properly scale an entity (the example is very specific, but I hope you get the point).
Now, what I usually do is:
unsigned int size = 5;
float scale = float(size);
My3DObject->SetScale(scale , scale , scale);
Is this good practice at all, under certain assumptions (see Notes)? Is there a better way than littering the code with float() conversions?
Notes: I cannot touch the graphics API; I have to use the SetScale() function, which takes floats. Moreover, I also cannot touch size; it has to be an unsigned int. I am sure there are plenty of other examples with the same 'problem'. The above applies to any conversion that needs to be done where you, as a programmer, have little choice in the matter.
My preference would be to use static_cast:
float scale = static_cast<float>(size);
but what you are doing is functionally equivalent and fine.
There is an implicit conversion from unsigned int to float, so the cast is strictly unnecessary.
If your compiler issues a warning, then there isn't really anything wrong with using a cast to silence the warning. Just be aware that if size is very large it may not be representable exactly by a float.
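To illustrate the representability caveat: a float has a 24-bit significand, so the first unsigned integer it cannot hold exactly is 2^24 + 1. A minimal example:

#include <iostream>

int main()
{
    unsigned int size = 16777217u;            // 2^24 + 1
    float scale = static_cast<float>(size);   // rounds to nearest float: 2^24
    std::cout << std::fixed << scale << '\n'; // prints 16777216.000000
    return 0;
}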