I have been writing a TensorFlow custom op and hence am dealing with some matrix operations using the Eigen library. I am trying to understand how Eigen executes the operations. I have the code below:
void __attribute__((optimize("O0"))) quantizeDequantize(const GPUDevice& d, TTypes<float>::ConstMatrix inputs,
float delta, float offset, float minVal, float maxVal,
TTypes<float>::Matrix outputs, int channel)
{
float invScale = 1.0f / ((float)delta);
const auto clampedTensor = inputs.chip<1>(channel).cwiseMax(minVal).cwiseMin(maxVal);
const auto tensor = (clampedTensor * invScale).round() + offset;
const auto tensor_2 = (tensor - offset) * delta;
outputs.chip<1>(channel).device(d) = clampedTensor; // line taking the most time
}
If I disable the line below, the code is almost 7 times faster when running on a large model compared to having the line in (I understand the output won't be correct).
outputs.chip<1>(channel).device(d) = clampedTensor;
But if I have the following code instead, the execution time is pretty much the same as what I see with all the code in.
void __attribute__((optimize("O0"))) quantizeDequantize(const GPUDevice& d, TTypes<float>::ConstMatrix inputs,
float delta, float offset, float minVal, float maxVal,
TTypes<float>::Matrix outputs, int channel)
{
outputs.chip<1>(channel).device(d) = inputs.chip<1>(channel);
}
The above two experiments lead me to infer the following:
The Eigen backend would not run any operations whose intermediate results are not used to generate the output. Is that correct?
If the above is true, how does the Eigen library know the expression graph? Does it figure these details out at compile time, similar to how GCC optimizes code?
Does adding __attribute__((optimize("O0"))) make any difference to the way the Eigen backend executes the above code?
Eigen seems to have answered these questions here: https://eigen.tuxfamily.org/dox/TopicLazyEvaluation.html
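To illustrate the lazy-evaluation point, here is a minimal sketch using plain dense Eigen (outside TensorFlow; the matrices and sizes are just for illustration):
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(4, 4);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(4, 4);

    // No arithmetic happens here: 'expr' is an expression template that
    // merely records the operation tree (a + b, then a coefficient-wise max).
    auto expr = (a + b).cwiseMax(0.0f);

    // The work is performed only when the expression is assigned to a
    // concrete destination; expressions that never feed into an assignment
    // are never evaluated at all.
    Eigen::MatrixXf c = expr;

    return 0;
}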
I'm trying to compile an old Qt project and I encounter this error:
error: cannot convert 'float*' to 'qreal* {aka double*}' in
initialization
Here's the fragment of code:
void Camera::loadProjectionMatrix()
{
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
qreal *dataMat = projectionMatrix_.data();
GLfloat matriceArray[16];
for (int i = 0; i < 16; ++i)
matriceArray[i] = dataMat[i];
glMultMatrixf(matriceArray);
}
What are my options to overcome this error?
The projection matrix's data() will return a float* to you, as per the documentation:
float * QMatrix4x4::data()
Returns a pointer to the raw data of this matrix.
The best practice would be to eliminate qreal usage in your codebase regardless of this case. During the Qt 5 refactoring, the ancient qreal concept was dropped as much as possible, and it definitely should not be used much in new code where the API deals with float.
The recommendation these days is to use float in such cases. This is a bit historical, really. Back then, it made sense to define qreal as double where available, and as float where not, e.g. on ARM platforms. See the old documentation:
typedef qreal
Typedef for double on all platforms except for those using CPUs with ARM architectures. On ARM-based platforms, qreal is a typedef for float for performance reasons.
In Qt 5, the documentation is slightly different, although the main concept seems to have remained the same:
typedef qreal
Typedef for double unless Qt is configured with the -qreal float option.
I would fix your code the following way:
void Camera::loadProjectionMatrix()
{
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
float *dataMat = projectionMatrix_.data();
GLfloat matriceArray[16];
for (int i = 0; i < 16; ++i)
matriceArray[i] = dataMat[i];
glMultMatrixf(matriceArray);
}
Strictly speaking, you could also go an alternative way to solve the issue, namely by using this method rather than data():
float & QMatrix4x4::operator()(int row, int column)
Returns a reference to the element at position (row, column) in this matrix so that the element can be assigned to.
In which case, you could even eliminate the dataMat variable and assign the items directly to your matriceArray in the iteration.
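A minimal sketch of that approach, assuming projectionMatrix_ is a QMatrix4x4 and that the column-major ordering expected by glMultMatrixf is wanted:
void Camera::loadProjectionMatrix()
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    GLfloat matriceArray[16];
    // Copy element by element via operator(); column-major to match OpenGL.
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            matriceArray[col * 4 + row] = projectionMatrix_(row, col);
    glMultMatrixf(matriceArray);
}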
Going even further than that, you should consider using a Qt library for this common task, e.g. the OpenGL classes in either QtGui or Qt3D. It would only make sense to mess with low-level OpenGL API calls if you are doing something custom.
Apparently, projectionMatrix_.data() returns a float*, and you cannot assign a float* to a double* (which is what qreal* is in this case).
Use
float *dataMat = projectionMatrix_.data();
or
auto dataMat = projectionMatrix_.data();
instead. The latter sometimes has the advantage that it might still be correct code if the return type of the function changes for some reason, although that is nothing to expect from a mature library. Additionally, you cannot get the type wrong by accident.
I am trying to get the minimum value from a collection of float values by taking advantage of the atomic operations provided by CUDA. I cannot use a reduction because of memory constraints. However, I get the error message Instruction '{atom,red}.shared' requires .target sm_12 or higher when I try compiling the code below with a __shared__ variable passed as the "SharedMem" argument.
I have a 9400m GPU, which has compute capability 1.1.
__device__ static float* atomicMin(float* SharedMem, float value, float *old)
{
old[0] = *SharedMem;
float assumed;
if (old[0] <= value)
{
return old;
}
do
{
assumed = old[0];
old[0] = ::atomicCAS((unsigned int*)SharedMem, __float_as_int(assumed), __float_as_int(value));
} while (old[0] != assumed);
return old;
}
Take for example calling the function "getMin_Kernel" below:
__shared__ __device__ float LowestDistance;
__global__ void getMin_Kernel(float* AllFloats, int* NumberOfFloats)
{
int j = (blockDim.x * blockIdx.x + threadIdx.x);
if (j < NumberOfFloats[0])
{
float myFloat;
myFloat=*(atomicMin(&LowestDistance, NumberOfFloats[0], &myFloat));
}
}
However, if I pass a non-shared variable it compiles without issues, but I get a runtime error. I am guessing the runtime error occurs because atomicCAS requires a global or shared variable. Can anyone please help with a way to get around the compilation error?
Thanks.
This table http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__feature-support-per-compute-capability provides a full description of the various compute capabilities and their matching feature support. Atomic operations on shared memory require compute capability 1.2 or higher, which is why the code does not compile for your 1.1 device.
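For reference, a common workaround is the usual compare-and-swap loop over the float's bit pattern; the sketch below assumes the value lives in global memory, where 32-bit atomicCAS is already available on compute capability 1.1:
// Sketch of an atomic minimum on a float in global memory, built on atomicCAS.
__device__ float atomicMinFloat(float* addr, float value)
{
    unsigned int* addrAsUint = (unsigned int*)addr;
    unsigned int old = *addrAsUint;
    unsigned int assumed;

    do
    {
        assumed = old;
        // If the current value is already smaller, there is nothing to do.
        if (__uint_as_float(assumed) <= value)
            break;
        // Try to swap in our value; fails if another thread changed it first.
        old = atomicCAS(addrAsUint, assumed, __float_as_uint(value));
    } while (old != assumed);

    return __uint_as_float(old);
}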
Thanks guys, I didn't notice the extra bullet points in the documentation stating the conditions for atomicCAS and shared memory variables. I'm still learning the ropes of CUDA.
Thanks.
float data = matrixm.ptr<float>(i)[j]; - working
float data = matrixm.at<float>(i,j); - working
float data = matrixm.data[i*matrixm.step+j*matrixm.elemSize()]; - is not giving correct output
How can we access floating point data directly without using templates (.at, .ptr)?
So you want the functionality of at<> without using it?
Here are the interesting src lines for at<>:
template<typename _Tp> inline _Tp& Mat::at(int i0, int i1)
...
return ((_Tp*)(data + step.p[0]*i0))[i1];
So you should have:
float data = ((float*)(matrixm.data + matrixm.step.p[0]*i))[j];
But how does this have any advantage over calling at<>?
You have to cast the data pointer (of type unsigned char*) to a float pointer first:
float data = ((float*)matrixm.data)[j+i*matrixm.cols];
This works only if the image is contiguous. Or cast it after:
float data = *(float*)(matrixm.data + i*matrixm.step[0] + j*matrixm.elemSize());
This also works for non-contiguous images.
To me it doesn't look like an XY problem. It looks like you are new to the world of OpenCV and want to understand how it works.
ptr should be the best way (with good performance); OpenCV doesn't guarantee that a Mat occupies a contiguous range in RAM. For a large enough image it can be divided by rows and saved in different locations. That's why you need to use ptr to get the location of each row.
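As a small sketch of the row-pointer approach (assuming a single-channel CV_32FC1 Mat named matrixm), iterating through ptr makes no assumption about contiguity:
#include <opencv2/core/core.hpp>

// Sum all elements of a single-channel float Mat, reading each row
// through its own row pointer so no contiguity is assumed.
float sumElements(const cv::Mat& matrixm)
{
    CV_Assert(matrixm.type() == CV_32FC1);
    float sum = 0.0f;
    for (int i = 0; i < matrixm.rows; ++i)
    {
        const float* row = matrixm.ptr<float>(i); // start of row i
        for (int j = 0; j < matrixm.cols; ++j)
            sum += row[j];
    }
    return sum;
}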
There is a piece of code that confuses me, which runs on Windows!
Here is the code:
#include <stdio.h>

#define point_float2uint(x) *((unsigned int *)&x)
float divide_1000(float y)
{
float v = y / 1000.0f;
return v;
}
float divide_1000(int y)
{
float v = float(y) / 1000.0f;
return v;
}
void float_test(void)
{
int num[5] = {67975500, 67251500, 67540620, 69435500, 70171500};
for (int i = 0; i < 5; ++i)
{
int a = num[i];
float af_f = divide_1000(float(a));
float af_i = divide_1000((a));
printf("src num:%d, af_f:%f, %x, af_i:%f, %x\n", num[i], af_f, point_float2uint(af_f), af_i, point_float2uint(af_i));
}
}
Here is the output, compiled by vs2005:
src num:67975500, af_f:67975.507813, 4784c3c1, af_i:67975.500000, 4784c3c0
src num:67251500, af_f:67251.507813, 478359c1, af_i:67251.500000, 478359c0
src num:67540620, af_f:67540.625000, 4783ea50, af_i:67540.617188, 4783ea4f
src num:69435500, af_f:69435.507813, 47879dc1, af_i:69435.500000, 47879dc0
src num:70171500, af_f:70171.507813, 47890dc1, af_i:70171.500000, 47890dc0
The question is: why do I get different results from the two divide_1000 overloads on Windows? This is not what I want!
And I find that not all integers give different results, only some, like the ones in the code above.
Here is the output, compiled by gcc 4.4.5 on Debian:
src num:67975500, af_f:67975.507812, 4784c3c1, af_i:67975.507812, 4784c3c1
src num:67251500, af_f:67251.507812, 478359c1, af_i:67251.507812, 478359c1
src num:67540620, af_f:67540.625000, 4783ea50, af_i:67540.625000, 4783ea50
src num:69435500, af_f:69435.507812, 47879dc1, af_i:69435.507812, 47879dc1
src num:70171500, af_f:70171.507812, 47890dc1, af_i:70171.507812, 47890dc1
I get the same result when using the two different divide_1000 functions. That's what I want.
There are quite a few code generation settings involved here that affect the outcome. The difference that you report is observable in non-optimized code under the default floating-point model (i.e. the "precise" model) when using the "classic" FPU instructions for floating-point computations.
The compiler translates the first call literally: the original integer value is first converted to float (a 4-byte floating-point value) and stored in memory (as a function argument). This conversion rounds the value to +6.7975504e+7, which is already not precise. Later that float value is read from memory inside the first function and used for further computations.
The second call passes an int value to the function, which is directly loaded into high-precision FPU register and used for further computations. Even though you specified an explicit conversion from int to float inside the second function, the compiler decided to ignore your request. This value is never literally converted to float, meaning that the aforementioned loss of precision never occurs.
That is what is causing the difference you observed.
If you rewrite your second function as
float divide_1000(int y)
{
float fy = y;
float v = fy / 1000.0f;
return v;
}
i.e. add an additional step that saves the float value to a named location in memory, the compiler will perform that step in non-optimized code. This will cause the results to become identical.
Again, the above applies to the code compiled without optimizations, when the compiler normally attempts to translate all statements very closely (but not always exactly). In optimized code the compiler eliminates the "unnecessary" intermediate conversions to float and all "unnecessary" intermediate memory stores in both cases, producing identical results.
You might also want to experiment with other floating-point models (i.e. "strict" and "fast") to see how it affects the results. These floating-point models exist specifically to deal with issues like the one you observed.
If you change code generation settings of the compiler and make it use SSE instructions for floating-point arithmetic, the results might also change (in my experiment the difference disappears when SSE2 instruction set is used instead of FPU instructions).
I have a piece of C++ CUDA code which I have to write with the data variables declared as float. I also have to rewrite the same code with the data variables declared as double.
What is a good design to handle a situation like this in CUDA?
I do not want to have two sets of the same code, because then in the future, for any change, I will have to change two sets of otherwise identical code. I also want to keep the code clean, without too many #ifdefs to switch between float and double within the code.
Can anyone please suggest any good (in terms of maintenance and "easy to read") design?
CUDA supports type templating, and it is without doubt the most efficient way to implement kernel code where you need to handle multiple types in the same code.
As a trivial example, consider a simple BLAS AXPY type kernel:
template<typename Real>
__global__ void axpy(const Real *x, Real *y, const int n, const Real a)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for(; tid<n; tid += stride) {
Real yval = y[tid];
yval += a * x[tid];
y[tid] = yval;
}
}
This templated kernel can be instantiated for both double and single precision without loss of generality:
template __global__ void axpy<float>(const float *, float *, const int, const float);
template __global__ void axpy<double>(const double *, double *, const int, const double);
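A minimal host-side usage sketch (the launch configuration and pointer names are purely illustrative; the template argument is deduced from the argument types):
// Assume xf, yf were allocated with cudaMalloc as float*, and xd, yd as double*.
int n = 1 << 20;

axpy<<<256, 256>>>(xf, yf, n, 2.0f);   // instantiates and runs axpy<float>
axpy<<<256, 256>>>(xd, yd, n, 2.0);    // instantiates and runs axpy<double>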
The thrust template library, which ships with all recent versions of the CUDA toolkit, makes extensive use of this facility for implementing type agnostic algorithms.
In addition to templating, you may be able to achieve what you want with a single typedef:
typedef float mysize; // or double
Then just use mysize throughout where you would use float or double.
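A short sketch of that approach (the kernel name and body are just illustrative):
typedef float mysize;   // change this one line to double to switch precision

__global__ void scaleKernel(mysize* data, const mysize factor, const int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] *= factor;
}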
You might be interested in the simpleTemplates sample code, and there are other templatized CUDA examples as well, in addition to Thrust where, as talonmies states, it's used extensively. Thrust provides many other benefits as well to C++ programmers.