weird performance in C++ (VC 2010)

I have this loop written in C++, which, compiled with MSVC 2010, takes a long time to run (300 ms):
for (int i=0; i<h; i++) {
    for (int j=0; j<w; j++) {
        if (buf[i*w+j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            float val = 0;
            for (int k=sy; k < ey; k++) {
                for (int m=sx; m < ex; m++) {
                    val += original[k*w + m] * ds[k - i + hr][m - j + hr];
                }
            }
            heat_map[i*w + j] = val;
        }
    }
}
It seemed a bit strange to me, so I did some tests and then changed a few bits to inline assembly (specifically, the code that sums val):
for (int i=0; i<h; i++) {
    for (int j=0; j<w; j++) {
        if (buf[i*w+j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            __asm {
                fldz
            }
            for (int k=sy; k < ey; k++) {
                for (int m=sx; m < ex; m++) {
                    float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
                    __asm {
                        fld val
                        fadd
                    }
                }
            }
            float val1;
            __asm {
                fstp val1
            }
            heat_map[i*w + j] = val1;
        }
    }
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?

I suggest you try different floating-point calculation models supported by the compiler - precise, strict or fast (see /fp option) - with your original code before making any conclusions. I suspect that your original code was compiled with some overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at the first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with precise model, the intermediate results would have to be rounded to the precision of float type, even if the variable val was optimized away and the internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which is what could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast mode and see how their performances compare in that case.
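As a rough illustration of that difference (a hedged sketch only, not the original code; the names are illustrative): under a precise model the accumulator has to be rounded back to float precision after every addition, while the hand-written fld/fadd sequence keeps the running sum in a wider x87 register and rounds only once at the end.

// Hedged illustration, not the original code.
float sum_rounded_each_step(const float* p, int n) {
    float val = 0.0f;                 // rounded to float on every iteration
    for (int i = 0; i < n; ++i)
        val += p[i];
    return val;
}

float sum_rounded_once(const float* p, int n) {
    double val = 0.0;                 // wider accumulator, like the FPU register
    for (int i = 0; i < n; ++i)
        val += p[i];
    return (float)val;                // single rounding at the end
}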

A few things to check out:
You need to check that it actually is the same code. As in, are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val to zero, the second is the extra variable val1 (unlikely, since it will most likely just change the constant subtracted from the stack pointer), the third is that your inline assembly version may not put the interim results back into val.
You need to make sure your sample space is large. You didn't mention whether you'd done only one run of each version or a hundred runs, but the more runs the better, so as to remove the effect of "noise" in your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.
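For the last two points, a hedged sketch of a measurement harness (the helper name is mine, and it uses std::chrono, so it assumes a newer toolset than VC 2010): repeat the run many times and keep the best time, which filters out one-off interference from other processes.

#include <algorithm>
#include <chrono>

// Hypothetical helper: call the code under test 'runs' times and return the
// best elapsed time in milliseconds.
template <typename F>
double best_time_ms(F f, int runs = 100) {
    double best = 1e300;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::high_resolution_clock::now();
        f();
        auto t1 = std::chrono::high_resolution_clock::now();
        best = std::min(best, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return best;
}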

Related

Optimizing MatMult with AVX

I decided to play a little bit with AVX. For this reason I wrote a simple matrix multiplication "benchmark code" and started applying some optimizations to it - just to see how fast I can make it. Below is my naive implementation, followed by the simplest AVX one I could think of:
void mmult_naive()
{
    int i, j, k = 0;
    // Traverse through each row element of Matrix A
    for (i = 0; i < SIZE; i++) {
        // Traverse through each column element of Matrix B
        for (j = 0; j < SIZE; j++) {
            for (k = 0; k < SIZE; k++) {
                matrix_C[i][j] += matrix_A[i][k] * matrix_B[k][j];
            }
        }
    }
}
AVX:
void mmult_avx_transposed()
{
    __m256 row_vector_A;
    __m256 row_vector_B;
    int i, j, k = 0;
    __m256 int_prod;
    // Transpose matrix B
    transposeMatrix();
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            int_prod = _mm256_setzero_ps();
            for (k = 0; k < (SIZE / 8); k++) {
                row_vector_A = _mm256_load_ps(&matrix_A[i][k * 8]);
                row_vector_B = _mm256_load_ps(&T_matrix[j][k * 8]);
                int_prod = _mm256_fmadd_ps(row_vector_A, row_vector_B, int_prod);
            }
            matrix_C[i][j] = hsum_single_avx(int_prod);
        }
    }
}
I chose to transpose the second matrix to make it easier to load the values from memory into the vector registers. This part works fine, gives all the nice expected speed-up, and makes me happy.
While measuring the execution time for some larger matrix sizes (NxN matrices, N > 1024), I thought the transpose might not be necessary if I found a "smarter" way to access the elements. The transpose function itself was roughly 4-5% of the execution time, so it looked like low-hanging fruit.
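(For reference, transposeMatrix is not shown in the question; a plausible sketch of what it does, assuming the same global names, would simply be:)

void transposeMatrix()
{
    // Hypothetical reconstruction: copy matrix_B into T_matrix with rows and
    // columns swapped, so columns of B can be loaded as contiguous rows.
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            T_matrix[i][j] = matrix_B[j][i];
}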
I replaced the second _mm256_load_ps with the following line and got rid of the transposeMatrix():
row_vector_B = _mm256_setr_ps(matrix_B[k * 8][j], matrix_B[(k * 8) + 1][j], matrix_B[(k * 8) + 2][j], matrix_B[(k * 8) + 3][j],
matrix_B[(k * 8) + 4][j], matrix_B[(k * 8) + 5][j], matrix_B[(k * 8) + 6][j], matrix_B[(k * 8) + 7][j]);
But now the code runs even worse! The results I got are the following:
MMULT_NAIVE execution time: 195,499 us
MMULT_AVX_TRANSPOSED execution time: 127,802 us
MMULT_AVX_INDEXED execution time: 1,482,524 us
I wanted to see if I had better luck with clang but it only made things worse:
MMULT_NAIVE execution time: 2,027,125 us
MMULT_AVX_TRANSPOSED execution time: 125,781 us
MMULT_AVX_INDEXED execution time: 1,798,410 us
My questions are in fact two:
Why does the indexed version run slower?
What's going on with clang? Apparently, even the "slow version" is much slower.
Everything was compiled with -O3, -mavx2, and -march=native on an i7-8700 running Arch Linux. g++ was version 12.1.0 and clang was version 14.0.6.

Are floating point operations faster than int operations?

So I'm doing a little benchmark measuring operations per second for different operator/type combinations in C++, and now I'm stuck. My tests for +/int and +/float look like
int int_plus(){
    int x = 1;
    for (int i = 0; i < NUM_ITERS; ++i){
        x += i;
    }
    return x;
}

float float_plus(){
    float x = 1.0;
    for (int i = 0; i < NUM_ITERS; ++i){
        x += i;
    }
    return x;
}
And time measurement looks like
//same for float
start = chrono::high_resolution_clock::now();
int res_int = int_plus();
end = chrono::high_resolution_clock::now();
diff = end - start;
ops_per_sec = NUM_ITERS / (diff.count() / 1000);
When I run tests I get
3.65606e+08 ops per second for int_plus
3.98838e+08 ops per second for float_plus
But as I understand it, float operations are always slower than int operations, yet my tests show a higher value for the float type.
So there is the question: am I wrong, or is there something wrong with the code? Or maybe it's something else?
There are a few things that could be going on. Optimization can be part of it. Using a #define constant could be letting the compiler do who-knows-what.
Note that the loop code is also being counted. Now, that's a constant for both loops, but it's part of your time, and that means you're doing a lot of int operations, not just 1 * NUM_ITERS.
If NUM_ITERS is relatively small, then the execution time is going to be very low, and that means the overhead of a method call probably dwarfs the cost of the operations inside the method.
Optimization level will also matter.
I'm not sure what else.
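For illustration only, a minimal sketch of a measurement that addresses some of these points (the volatile sinks and the iteration count are my assumptions, not the original harness): keeping the results observable stops the optimizer from deleting the loops, and a large NUM_ITERS amortizes the call and loop overhead.

#include <chrono>
#include <cstdio>

volatile int sink_int;      // writing the results to volatile globals keeps
volatile float sink_float;  // the loops from being optimized away entirely

int main() {
    const int NUM_ITERS = 100000000;

    auto t0 = std::chrono::high_resolution_clock::now();
    int xi = 1;
    for (int i = 0; i < NUM_ITERS; ++i) xi += i;
    sink_int = xi;
    auto t1 = std::chrono::high_resolution_clock::now();

    float xf = 1.0f;
    for (int i = 0; i < NUM_ITERS; ++i) xf += i;
    sink_float = xf;
    auto t2 = std::chrono::high_resolution_clock::now();

    std::printf("int:   %.3e ops per second\n",
                NUM_ITERS / std::chrono::duration<double>(t1 - t0).count());
    std::printf("float: %.3e ops per second\n",
                NUM_ITERS / std::chrono::duration<double>(t2 - t1).count());
    return 0;
}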

writing slower than the operation itself?

I am struggling to understand the behavior of my functions.
My code is written in C++ in Visual Studio 2012, running on Windows 7 64-bit. I am working with 2D arrays of float numbers. When I time my function, I see that its running time is reduced by 10x or more if I just stop writing my results to the output pointer. Does that mean that writing is slow?
Here is an example:
void TestSpeed(float** pInput, float** pOutput)
{
    UINT32 y, x, i, j;
    for (y = 3; y < 100-3; y++)
    {
        for (x = 3; x < 100-3; x++)
        {
            float fSum = 0;
            for (i = y - 3; i <= y+3; i++)
            {
                for (j = x-3; j <= x+3; j++)
                {
                    fSum += pInput[y][x]*exp(-(pInput[y][x]-pInput[i][j])*(pInput[y][x]-pInput[i][j]));
                }
            }
            pOutput[y][x] = fSum;
        }
    }
}
If I comment out the line "pOutput[y][x] = fSum;" then the function runs very quickly. Why is that?
I am calling 2-3 such functions sequentially. Would it help to write a chunk of results to the stack instead of the heap, pass it on to the next function, and then write it back to the heap buffer once that chunk is ready?
In some cases I saw that replacing pOutput[y][x] with a line buffer allocated on the stack, like float fResult[100], and using it to store results is faster for larger data sizes.
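As a hedged sketch of that line-buffer idea (the function name, the UINT32 typedef, and the per-row copy-back below are mine, not from the question):

#include <cmath>

typedef unsigned int UINT32;  // assumption: same meaning as the Windows typedef

void TestSpeedLineBuffer(float** pInput, float** pOutput)
{
    for (UINT32 y = 3; y < 100 - 3; y++)
    {
        float fResult[100];  // per-row scratch buffer on the stack
        for (UINT32 x = 3; x < 100 - 3; x++)
        {
            float fSum = 0;
            for (UINT32 i = y - 3; i <= y + 3; i++)
                for (UINT32 j = x - 3; j <= x + 3; j++)
                    fSum += pInput[y][x] * exp(-(pInput[y][x] - pInput[i][j]) * (pInput[y][x] - pInput[i][j]));
            fResult[x] = fSum;
        }
        // copy the finished row back to the heap buffer in one pass
        for (UINT32 x = 3; x < 100 - 3; x++)
            pOutput[y][x] = fResult[x];
    }
}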
Your code performs a lot of operations, and that takes time. Depending on what you are doing with the output, you may consider diagonalization or decomposition of your input matrix. Or you can look for values in your output which are n times another value, etc., and skip calculating the exponential for these.

Matrix multiplication in a cpp file for Matlab

How would I do a matrix multiplication in a cpp file that would afterwards be compiled into a MEX file?
My normal matrix multiplication in a Matlab script is as follows:
cMatrix = (1 / r) * pfMatrix * wcMatrix; %here pfMatrix is 2x3 and wcMatrix is 3x8
% Hence cMatrix is 2x8
% r is a scalar
The pfMatrix, wcMatrix and r are declared correctly in the cpp file and they have the same values as in the script. However, cMatrix doesn't give me the same results. Here is the implementation of the matrix multiplication in the cpp file:
int i, n, j;
for (i = 0; i<1; i++)
{
    for (n = 0; n<7; n++)
    {
        for (j = 0; j<2; j++)
        {
            d->cMatrix[i][n] += (d->pfMatrix[i][j]) * (d->wcMatrix[j][n]);
        }
        d->cMatrix[i][n] = (1 / d->r) * d->cMatrix[i][n];
    }
}
Edit:
I modified the loop following Ben Voigt's answer. The results in cMatrix are still not identical to the ones calculated from the Matlab script.
For example :
pfMatrix = [7937.91049469652,0,512;0,7933.81033431703,384];
wcMatrix = [-0.880633810389421,-1.04063381038942,-1.04063381038942,-0.880633810389421,-0.815633810389421,-1.10563381038942,-1.10563381038942,-0.815633810389421;-0.125,-0.125,0.125,0.125,-0.29,-0.29,0.29,0.29;100,100,100,100,100,100,100,100];
r = 100;
In this case, cMatrix(1,1) is:
(pfMatrix(1,1)*wcMatrix(1,1) + pfMatrix(1,2)*wcMatrix(2,1) + pfMatrix(1,3)*wcMatrix(3,1)) / r = 442.09
However, with the mex file the equivalent result is 959.
Edit #2:
I found the error in an element of pfMatrix that was not declared correctly (missing a division by 2). So Ben Voigt's answer is working correctly. However, there is still a slight difference between the two results (the Matlab script gives 442 and the mex gives 447; could it be a result of different data types?).
Edit #3:
Found the error, and it was not related to the matrix multiplication loop.
Using your result matrix as scratch space is not a great idea. The compiler has to worry about aliasing, which means it can't optimize.
Try an explicit working variable, which also provides a convenient place to zero it:
for (int i = 0; i < 2; ++i) {
    for (int n = 0; n < 8; ++n) {
        double accum = 0.0;
        for (int j = 0; j < 3; ++j) {
            accum += (d->pfMatrix[i][j]) * (d->wcMatrix[j][n]);
        }
        d->cMatrix[i][n] = accum / d->r;
    }
}
Your ranges were also wrong, which I've fixed.
(Also note that good performance on large matrices requires banding to get good cache behavior, however that shouldn't be an issue on a product of this size.)
A multiplication between matrices must be in this way: A[m][n] * B[n][p] = R[m][p].
The conditions that you wrote in the for loops are not correct and don't respect the matrix dimensions.
Also look at the Eigen library, which is open source and provides a simple way to do matrix multiplications.
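A hedged sketch of what that could look like for this product (the function name and the assumption that the data is stored as double are mine):

#include <Eigen/Dense>

// 2x3 * 3x8, scaled by 1/r, using fixed-size Eigen matrices.
Eigen::Matrix<double, 2, 8> multiply(const Eigen::Matrix<double, 2, 3>& pf,
                                     const Eigen::Matrix<double, 3, 8>& wc,
                                     double r)
{
    return (1.0 / r) * pf * wc;
}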

CUDA combining thread independent(??) variables during execution

Guys, I apologize if the title is confusing. I thought long and hard and couldn't come up with a proper way to phrase the question in a single line, so here's more detail. I am doing a basic image subtraction where the second image has been modified, and I need to find the ratio of how much change was done to the image. For this I used the following code. Both images are 128x1024.
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        den++;
        diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
        if(diff[i * 1024 + j] < error)
        {
            num++;
        }
    }
}
ratio = num/den;
The above code works fine on the CPU, but I want to try to do this in CUDA. For this I can set up CUDA to do the basic subtraction of the images (code below), but I can't figure out how to do the conditional if statement to get my ratio out.
__global__ void calcRatio(float *orig, float *modified, int size, float *result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if(index < size)
        result[index] = orig[index] - modified[index];
}
So, up to this point it works, but I cannot figure out how to parallelize the num and den counters in each thread to calculate the ratio at the end of all the thread executions. To me it feels like the num and den counters are independent of the threads, as every time I have tried to use them they seem to get incremented only once.
Any help will be appreciated, as I am just starting out in CUDA and none of the examples I see online ever seems to apply to what I need to do.
EDIT: Fixed my naive code. Forgot to type one of the main conditions in the code. It was a long, long day.
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        if(modified[i * 1024 + j] < 400.0) //400.0 threshold value to ignore noise
        {
            den++;
            diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
            if(diff[i * 1024 + j] < error)
            {
                num++;
            }
        }
    }
}
ratio = num/den;
The operation you need to use to perform a global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. There is a reduction kernel and a very good paper discussing the technique in the CUDA SDK; it is worth reading.
If I were writing code to do what you want, it would probably look like this:
template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size, float *result,
                          int *count, const float error)
{
    // One shared-memory slot per thread for the per-thread counts
    __shared__ volatile int buff[blocksize];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    int threadcount = 0;
    // Grid-stride loop: each thread handles several input entries
    for (int i = index; i < size; i += stride) {
        float val = orig[i] - modified[i];
        threadcount += (val < error);
        result[i] = val;
    }
    buff[threadIdx.x] = threadcount;
    __syncthreads();
    // Parallel reduction in shared memory using 1 warp
    if (threadIdx.x < warpSize) {
        for (int i = threadIdx.x + warpSize; i < blocksize; i += warpSize)
            buff[threadIdx.x] += buff[i];
        if (threadIdx.x < 16) buff[threadIdx.x] += buff[threadIdx.x + 16];
        if (threadIdx.x < 8)  buff[threadIdx.x] += buff[threadIdx.x + 8];
        if (threadIdx.x < 4)  buff[threadIdx.x] += buff[threadIdx.x + 4];
        if (threadIdx.x < 2)  buff[threadIdx.x] += buff[threadIdx.x + 2];
        if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
    }
}
The first stanza does what your serial code does: it computes a difference and a thread-local total of the elements which are less than error. Note I have written this version so that each thread is designed to process more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows; the idea is that you would use fewer blocks and threads than there are input data set entries.
The second stanza is the reduction itself, done in shared memory. It is effectively a "tree-like" operation where the set of thread-local subtotals within a single block of threads is first summed down to 32 subtotals, then those subtotals are combined until there is a single final subtotal, which is stored as the total for the block. You will wind up with a small list of subtotals in count, one for each block you launched, which can be copied back to the host, where the final result you need is calculated.
Please note I coded this in the browser and haven't compiled it, there might be errors, but it should give an idea about how an "advanced" version of what you are trying to do would work.
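A hedged sketch of that host-side finish (the function and variable names here are mine, not from the answer):

#include <cuda_runtime.h>
#include <numeric>
#include <vector>

// d_count holds one partial count per block; size is the total element count.
float finishRatio(const int *d_count, int numBlocks, int size)
{
    std::vector<int> h_count(numBlocks);
    cudaMemcpy(h_count.data(), d_count, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    int num = std::accumulate(h_count.begin(), h_count.end(), 0);
    return float(num) / float(size);
}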
The denominator is pretty simple, since it's just the size.
The numerator is more troublesome, since its value for a given thread depends on all previous values. You're going to have to do that operation serially.
The thing you're looking for is probably atomicAdd. It's very slow, though.
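For completeness, a hedged sketch of what that atomicAdd variant might look like (the kernel name and parameters are my own, not from the question):

__global__ void calcRatioAtomic(const float *orig, const float *modified,
                                int size, float *result, int *num, float error)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size) {
        float val = orig[index] - modified[index];
        result[index] = val;
        if (val < error)
            atomicAdd(num, 1);   // one global counter; slow under heavy contention
    }
}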
I think you'd find this question relevant. Your num is basically global data.
CUDA array-to-array sum
Alternatively, you could dump the results of the error check into an array. Counting the results could then be parallelized. It would be a little tricky, but I think something like this would scale up: http://tekpool.wordpress.com/2006/09/25/bit-count-parallel-counting-mit-hakmem/