Optimizing MatMult with AVX - c++

I decided to play a little bit with AVX. For this reason I wrote a simple matrix multiplication "benchmark code" and started applying some optimizations to it - just to see how fast I can make it. Below is my naive implementation, followed by the simplest AVX one I could think of:
void mmult_naive()
{
    int i, j, k = 0;
    // Traverse through each row element of Matrix A
    for (i = 0; i < SIZE; i++) {
        // Traverse through each column element of Matrix B
        for (j = 0; j < SIZE; j++) {
            for (k = 0; k < SIZE; k++) {
                matrix_C[i][j] += matrix_A[i][k] * matrix_B[k][j];
            }
        }
    }
}
AVX:
void mmult_avx_transposed()
{
    __m256 row_vector_A;
    __m256 row_vector_B;
    int i, j, k = 0;
    __m256 int_prod;
    // Transpose matrix B
    transposeMatrix();
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            int_prod = _mm256_setzero_ps();
            for (k = 0; k < (SIZE / 8); k++) {
                row_vector_A = _mm256_load_ps(&matrix_A[i][k * 8]);
                row_vector_B = _mm256_load_ps(&T_matrix[j][k * 8]);
                int_prod = _mm256_fmadd_ps(row_vector_A, row_vector_B, int_prod);
            }
            matrix_C[i][j] = hsum_single_avx(int_prod);
        }
    }
}
I chose to transpose the second matrix to make it easier to load the values from memory into the vector registers. This part works fine, gives all the nice expected speed-up, and makes me happy.
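For reference, a transposeMatrix() consistent with the arrays used above might look like this (the actual helper is not shown here, so treat it as an assumed implementation):
// Assumed helper: copy matrix_B into T_matrix with rows and columns swapped,
// so each row of T_matrix is a column of matrix_B.
void transposeMatrix()
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            T_matrix[i][j] = matrix_B[j][i];
}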
While measuring the execution time for some larger matrix sizes (NxN matrices, N > 1024), I thought the transpose might not be necessary if I found a "smarter" way to access the elements. The transpose function itself was roughly 4-5% of the execution time, so it looked like low-hanging fruit.
I replaced the second _mm256_load_ps with the following line and got rid of the transposeMatrix():
row_vector_B = _mm256_setr_ps(matrix_B[k * 8][j], matrix_B[(k * 8) + 1][j], matrix_B[(k * 8) + 2][j], matrix_B[(k * 8) + 3][j],
matrix_B[(k * 8) + 4][j], matrix_B[(k * 8) + 5][j], matrix_B[(k * 8) + 6][j], matrix_B[(k * 8) + 7][j]);
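As an aside (not part of the original measurements), the same strided column access could also be expressed with a single AVX2 gather instead of eight scalar loads. This is only a sketch of an equivalent load, and a gather still touches one cache line per element, so it is not necessarily any faster:
// Hypothetical alternative to the _mm256_setr_ps above: gather the eight column
// elements in one instruction. The byte offset of each element is index * sizeof(float).
__m256i column_idx = _mm256_setr_epi32(0, SIZE, 2 * SIZE, 3 * SIZE,
                                       4 * SIZE, 5 * SIZE, 6 * SIZE, 7 * SIZE);
row_vector_B = _mm256_i32gather_ps(&matrix_B[k * 8][j], column_idx, sizeof(float));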
But now the code runs even worse! The results I got are the following:
MMULT_NAIVE execution time: 195,499 us
MMULT_AVX_TRANSPOSED execution time: 127,802 us
MMULT_AVX_INDEXED execution time: 1,482,524 us
I wanted to see if I had better luck with clang but it only made things worse:
MMULT_NAIVE execution time: 2,027,125 us
MMULT_AVX_TRANSPOSED execution time: 125,781 us
MMULT_AVX_INDEXED execution time: 1,798,410 us
My questions are in fact two:
Why does the indexed version run slower?
What's going on with clang? Apparently, even the "slow version" is much slower.
Everything was compiled with -O3, -mavx2 and -march=native on an i7-8700 running Arch Linux. g++ was version 12.1.0 and clang was version 14.0.6.

Related

Optimizing the Vivado HLS code to reduce the latency for image processing algorithm

I am trying to implement an image processing algorithm for a gamut mapping filter in hardware using Vivado HLS. I have created a synthesizable version from a Halide code, but it is taking way too long: for a 256x512 image it takes around 135 seconds, which shouldn't be the case. I have used some optimization techniques, like pipelining the innermost loop with a target initiation interval of II=1, but the achieved II is 6. From the warnings thrown by the compiler, I understand that this is because of the accesses to ctrl_pts and weights. From the tutorials I have seen, array partitioning and array reshaping should help with faster access to the weights. I have shared the code I used for synthesis below:
//header
#include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;
typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
    float r;
    float g;
    float b;
};
struct pixel_8
{
    uc r;
    uc g;
    uc b;
};
void gamut_transform(int rows,int cols,Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts);
//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows, int cols, Stream_t& in, Stream_t& out, float ctrl_pts[3702][3], float weights[3702][3], float coefs[4][3], float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
    pixel_8 input;
    pixel_8 new_pix;
    bit_24 temp_in, temp_out;
    pixel_f buff_1, buff_2, buff_3, buff_4, buff_5;
    float dist;
    for (int i = 0; i < 256; i++)
    {
        for (int j = 0; i < 512; i++)
        {
            temp_in = in.read();
            input.r = (temp_in & 0xFF0000) >> 16;
            input.g = (temp_in & 0x00FF00) >> 8;
            input.b = (temp_in & 0x0000FF);
            buff_1.r = ((float)input.r) / 256.0;
            buff_1.g = ((float)input.g) / 256.0;
            buff_1.b = ((float)input.b) / 256.0;
            for (int idx = 0; idx < 3702; idx++)
            {
                buff_2.r = buff_1.r - ctrl_pts[idx][0];
                buff_2.g = buff_1.g - ctrl_pts[idx][1];
                buff_2.b = buff_1.b - ctrl_pts[idx][2];
                dist = sqrt((buff_2.r*buff_2.r) + (buff_2.g*buff_2.g) + (buff_2.b*buff_2.b));
                buff_3.r = buff_2.r + (weights[idx][0] * dist);
                buff_3.g = buff_2.g + (weights[idx][1] * dist);
                buff_3.b = buff_2.b + (weights[idx][2] * dist);
            }
            buff_4.r = buff_3.r + coefs[0][0] + buff_1.r * coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b * coefs[3][0];
            buff_4.g = buff_3.g + coefs[0][1] + buff_1.r * coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b * coefs[3][1];
            buff_4.b = buff_3.b + coefs[0][2] + buff_1.r * coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b * coefs[3][2];
            buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
            buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
            buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
            new_pix.r = (uc)buff_4.r;
            new_pix.g = (uc)buff_4.g;
            new_pix.b = (uc)buff_4.b;
            temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
            out << temp_out;
        }
    }
}
Even with the achieved II=6, the time taken is around 6 seconds; the given target is to have the time taken in milliseconds. I tried pipelining the second innermost loop, but I run out of resources on my board when I do that, since the innermost loop then gets fully unrolled. I am using a Zynq UltraScale board, which has a fair amount of resources. Any suggestions on optimizing the code will be highly appreciated.
Also, can anyone suggest what type of interface would be best suited for ctrl_pts, weights and coefs? For reading the image I understood that a streaming interface helps, and for reading small values like the number of rows and columns AXI-Lite is preferred. Is there a type of interface I can use for the mentioned variables so that it goes hand in hand with array partitioning and array reshaping?
Any suggestions will be highly appreciated,
Thanks in advance
Edit: I understand that a fixed-point representation can bring the latency down further, but my first goal is to get the best result with the floating-point representation and then analyze the performance with a fixed-point representation.
There are some steps you can take to optimize your design, but bear in mind that if you really need a floating-point square root operation, it will most likely have a huge latency penalty (unless properly pipelined, of course).
Your code might have a typo in the second inner loop: the index should be j right?
Data Locality
First off: ctrl_pts is read multiple times from main memory (I assume). Since it's reused 256x512 times, it would be better to store it in a local buffer on the FPGA (like a BRAM, but it can be inferred), like so:
for (int i = 0; i < 3702; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
        ctrl_pts_local[i][j] = ctrl_pts[i][j];
    }
}
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
        // ...
        buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
        // ...
Same reasoning goes for coefs and weights, just store them in a local variable before running the rest of the code.
To access the arguments you can use a master AXI4 interface (m_axi) and configure it accordingly. Once the algorithm works on the local buffers, HLS should be able to partition them automatically. If not, you can place ARRAY_PARTITION complete dim=0 pragmas to force it.
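For example, the interface pragmas could look something like this (the bundle name and which ports share a bundle are placeholders to adapt to your design):
#pragma HLS INTERFACE m_axi     port=ctrl_pts offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=weights  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=coefs    offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=rows
#pragma HLS INTERFACE s_axilite port=cols
#pragma HLS INTERFACE s_axilite port=return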
Dataflow
Because of the way your algorithm works, another thing you could try is to break the main 256x512 loops down into three smaller processes running in dataflow, and therefore in parallel (plus the setup ones, if you count those).
The whole code will look something like this (I hope it renders correctly):
[compute buff_1] --> [FIFO1] --> [compute buff_3] --> [FIFO2a] --> [compute buff_4 and buff_5 + stream out]
        |                                                                            ^
        +--------------------------------------------------------> [FIFO2b] --------+
One tricky thing would be to stream buff_1 to both the next processes.
Possible Code
I won't try this code, so there might be compilation errors along the way, but the whole accelerator code would look something like this:
for (int i = 0; i < 3702; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
        ctrl_pts_local[i][j] = ctrl_pts[i][j];
        weights_local[i][j] = weights[i][j];
    }
}
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
        coefs_local[i][j] = coefs[i][j];
    }
}
Process_1:
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
        temp_in = in.read();
        input.r = (temp_in & 0xFF0000) >> 16;
        input.g = (temp_in & 0x00FF00) >> 8;
        input.b = (temp_in & 0x0000FF);
        buff_1.r = ((float)input.r) / 256.0;
        buff_1.g = ((float)input.g) / 256.0;
        buff_1.b = ((float)input.b) / 256.0;
        fifo_1.write(buff_1); // <--- WRITE TO FIFOs
        fifo_2b.write(buff_1);
    }
}
Process_2:
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
        for (int idx = 0; idx < 3702; idx++) {
#pragma HLS LOOP_FLATTEN // <-- It shouldn't be necessary, since the if statements already help
#pragma HLS PIPELINE II=1 // <-- The PIPELINE directive can go here
            if (idx == 0) {
                buff_1 = fifo_1.read(); // <--- READ FROM FIFO
            }
            buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
            buff_2.g = buff_1.g - ctrl_pts_local[idx][1];
            buff_2.b = buff_1.b - ctrl_pts_local[idx][2];
            dist = sqrt((buff_2.r*buff_2.r) + (buff_2.g*buff_2.g) + (buff_2.b*buff_2.b));
            buff_3.r = buff_2.r + (weights_local[idx][0] * dist);
            buff_3.g = buff_2.g + (weights_local[idx][1] * dist);
            buff_3.b = buff_2.b + (weights_local[idx][2] * dist);
            if (idx == 3702 - 1) {
                fifo_2a.write(buff_3); // <-- WRITE TO FIFO
            }
        }
    }
}
Process_3:
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
        buff_3 = fifo_2a.read(); // <--- READ FROM FIFO
        buff_1 = fifo_2b.read(); // <--- READ FROM FIFO
        buff_4.r = buff_3.r + coefs_local[0][0] + buff_1.r * coefs_local[1][0] + buff_1.g * coefs_local[2][0] + buff_1.b * coefs_local[3][0];
        buff_4.g = buff_3.g + coefs_local[0][1] + buff_1.r * coefs_local[1][1] + buff_1.g * coefs_local[2][1] + buff_1.b * coefs_local[3][1];
        buff_4.b = buff_3.b + coefs_local[0][2] + buff_1.r * coefs_local[1][2] + buff_1.g * coefs_local[2][2] + buff_1.b * coefs_local[3][2];
        buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
        buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
        buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
        new_pix.r = (uc)buff_4.r;
        new_pix.g = (uc)buff_4.g;
        new_pix.b = (uc)buff_4.b;
        temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
        out << temp_out;
    }
}
Be extremely careful in sizing the depth of the FIFOs, since Process 2 (the one with the sqrt operation) might have a slower data consumption and production rates! Also, FIFO 2b needs to account for that delay. If the rates are not matching, there will be deadlocks. Make sure to have a meaningful testbench and to cosimulate your design.
(The FIFO's depth can be changed with the pragma #pragma HLS STREAM variable=fifo_1 depth=N).
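To make the structure explicit, the streams and the DATAFLOW region could be declared roughly like this (the depths are placeholders; fifo_2b in particular must be deep enough to absorb Process_2's latency):
    hls::stream<pixel_f> fifo_1("fifo_1");
    hls::stream<pixel_f> fifo_2a("fifo_2a");
    hls::stream<pixel_f> fifo_2b("fifo_2b");
#pragma HLS STREAM variable=fifo_1  depth=8
#pragma HLS STREAM variable=fifo_2a depth=8
#pragma HLS STREAM variable=fifo_2b depth=4096 // placeholder: must cover Process_2's delay
#pragma HLS DATAFLOW
    // ...setup copy loops, then Process_1, Process_2 and Process_3 as above,
    // ideally each wrapped in its own function so DATAFLOW can schedule them.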
Final Thoughts
There might be further smaller/more detailed optimizations that can be performed along the way, but I would start with the ones above, as they are the heaviest. Just bear in mind that floating-point processing is not optimal on FPGAs (as you noted) and is usually avoided.
EDIT: I tried the code with the modifications above and achieved II=1 with decent resource usage.
Since the II is now 1, the ideal number of cycles the accelerator would take is 256x512 and I'm close to that (ideal 402,653,184 versus mine 485,228,587). One crazy idea I'd now propose is to split the Process_2 innermost loop into two parallel branches (or even more than two), each feeding its own FIFO. Process_1 would supply the two branches, while an additional process/loop would alternately read the 256x512 elements from the two FIFOs and supply them in the correct order to Process_3 (a rough sketch of that merge loop follows). In this way the total number of cycles required should roughly halve, since Process_2 is the slowest process in the dataflow (and so improving it improves the whole design). One possible drawback of this approach is a higher amount of area/resources required on the FPGA.
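A rough sketch of that extra merge loop (all names here are made up for illustration; fifo_2a_even and fifo_2a_odd would be the outputs of the two Process_2 branches, and Process_1 is assumed to deal pixels to the branches alternately):
Merge:
for (int p = 0; p < 256 * 512; p++) {
#pragma HLS PIPELINE II=1
    pixel_f merged = (p % 2 == 0) ? fifo_2a_even.read() : fifo_2a_odd.read();
    fifo_merged.write(merged); // Process_3 then reads from fifo_merged instead of fifo_2a
}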
Good luck.

Weird but close fft and ifft of image in c++

I wrote a program that loads, saves, and performs the fft and ifft on black and white png images. After much debugging headache, I finally got some coherent output only to find that it distorted the original image.
[images in the original post: the input, the fft output, and the ifft output]
As far as I have tested, the pixel data in each array is stored and converted correctly. Pixels are stored in two arrays: 'data', which contains the b/w value of each pixel, and 'complex_data', which is twice as long as 'data' and stores the real (b/w) value and imaginary part of each pixel in alternating indices. My fft algorithm operates on an array structured like 'complex_data'. After code to read commands from the user, here's the code in question:
if (cmd == "fft")
{
if (height > width) size = height;
else size = width;
N = (int)pow(2.0, ceil(log((double)size)/log(2.0)));
temp_data = (double*) malloc(sizeof(double) * width * 2); //array to hold each row of the image for processing in FFT()
for (i = 0; i < (int) height; i++)
{
for (j = 0; j < (int) width; j++)
{
temp_data[j*2] = complex_data[(i*width*2)+(j*2)];
temp_data[j*2+1] = complex_data[(i*width*2)+(j*2)+1];
}
FFT(temp_data, N, 1);
for (j = 0; j < (int) width; j++)
{
complex_data[(i*width*2)+(j*2)] = temp_data[j*2];
complex_data[(i*width*2)+(j*2)+1] = temp_data[j*2+1];
}
}
transpose(complex_data, width, height); //tested
free(temp_data);
temp_data = (double*) malloc(sizeof(double) * height * 2);
for (i = 0; i < (int) width; i++)
{
for (j = 0; j < (int) height; j++)
{
temp_data[j*2] = complex_data[(i*height*2)+(j*2)];
temp_data[j*2+1] = complex_data[(i*height*2)+(j*2)+1];
}
FFT(temp_data, N, 1);
for (j = 0; j < (int) height; j++)
{
complex_data[(i*height*2)+(j*2)] = temp_data[j*2];
complex_data[(i*height*2)+(j*2)+1] = temp_data[j*2+1];
}
}
transpose(complex_data, height, width);
free(temp_data);
free(data);
data = complex_to_real(complex_data, image.size()/4); //tested
image = bw_data_to_vector(data, image.size()/4); //tested
cout << "*** fft success ***" << endl << endl;
void FFT(double* data, unsigned long nn, int f_or_b) { // f_or_b is 1 for fft, -1 for ifft
    unsigned long n, mmax, m, j, istep, i;
    double wtemp, w_real, wp_real, wp_imaginary, w_imaginary, theta;
    double temp_real, temp_imaginary;
    // reverse-binary reindexing to separate even and odd indices
    // and to allow us to compute the FFT in place
    n = nn<<1;
    j = 1;
    for (i = 1; i < n; i += 2) {
        if (j > i) {
            swap(data[j-1], data[i-1]);
            swap(data[j], data[i]);
        }
        m = nn;
        while (m >= 2 && j > m) {
            j -= m;
            m >>= 1;
        }
        j += m;
    }
    // here begins the Danielson-Lanczos section
    mmax = 2;
    while (n > mmax) {
        istep = mmax<<1;
        theta = f_or_b * (2 * M_PI/mmax);
        wtemp = sin(0.5 * theta);
        wp_real = -2.0 * wtemp * wtemp;
        wp_imaginary = sin(theta);
        w_real = 1.0;
        w_imaginary = 0.0;
        for (m = 1; m < mmax; m += 2) {
            for (i = m; i <= n; i += istep) {
                j = i + mmax;
                temp_real = w_real * data[j-1] - w_imaginary * data[j];
                temp_imaginary = w_real * data[j] + w_imaginary * data[j-1];
                data[j-1] = data[i-1] - temp_real;
                data[j] = data[i] - temp_imaginary;
                data[i-1] += temp_real;
                data[i] += temp_imaginary;
            }
            wtemp = w_real;
            w_real += w_real * wp_real - w_imaginary * wp_imaginary;
            w_imaginary += w_imaginary * wp_real + wtemp * wp_imaginary;
        }
        mmax = istep;
    }
}
My ifft is the same, only with f_or_b set to -1 instead of 1. My program calls FFT() on each row, transposes the image, calls FFT() on each row again, then transposes back. Is there maybe an error in my indexing?
Not an actual answer as this question is Debug only, so some hints instead:
your results are really bad
it should look like this:
the first line is the actual DFFT result
Re, Im, and Power are amplified by a constant, otherwise you would see a black image
the last image is the IDFFT of the original, non-amplified Re, Im result
the second line is the same, but the DFFT result is wrapped by half the image size in both x and y to match the common results in most DIP/CV texts
As you can see, if you IDFFT back the wrapped results, the result is not correct (checkerboard mask)
You have just a single image as the DFFT result
is it the power spectrum?
or did you forget to include the imaginary part? only in the view, or perhaps also in the computation somewhere?
is your 1D DFFT working? (see the impulse-test sketch after these hints)
for real data the result should be symmetric
check the links from my comment and compare the results for some sample 1D array
debug/repair your 1D FFT first and only then move to the next level
do not forget to test real and complex data ...
your IDFFT looks BW (no gray) saturated
so did you amplify the DFFT results to see the image and then use that for the IDFFT instead of the original DFFT result?
also check that you do not round to integers somewhere along the computation
beware of (I)DFFT overflows/underflows
If your image pixel intensities are big and the image resolution is too, then your computation could lose precision. I have never seen this with images, but if your image is HDR it is possible. This is a common problem with convolution computed by DFFT for big polynomials.
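As a concrete version of the "check your 1D DFFT first" hint, an impulse test is an easy sanity check. This is an assumed snippet using the question's interleaved re/im layout and the FFT() signature shown above (it needs <cstdio>):
// 8 complex samples, interleaved re/im, all zero except a unit impulse at sample 0
double test[16] = {0};
test[0] = 1.0;
FFT(test, 8, 1);
for (int s = 0; s < 8; s++)
    printf("bin %d: re=%f im=%f\n", s, test[2*s], test[2*s+1]);
// a correct forward FFT of a unit impulse prints re=1, im=0 for every bin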
Thank you everyone for your opinions. All that stuff about memory corruption, while it makes a point, is not the root of the problem. The sizes of data I'm mallocing are not overly large, and I am freeing them in the right places; I had a lot of practice with this while learning C. The problem was not the fft algorithm either, nor even my 2D implementation of it.
All I missed was the scaling by 1/(M*N) at the very end of my ifft code. Because the image is 512x512, I needed to scale my ifft output by 1/(512*512). Also, my fft looks like white noise because the pixel data was not rescaled to fit between 0 and 255.
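The missing normalization amounts to something like this at the end of the inverse pass (a sketch based on the description above; the 512x512 factor matches the image size mentioned, and complex_data holds interleaved re/im values):
const double scale = 1.0 / (512.0 * 512.0);
for (unsigned long p = 0; p < (unsigned long)width * height * 2; p++)
    complex_data[p] *= scale;   // scales both the real and imaginary part of every pixel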
Suggest you look at the article http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
Christophe has a good point, but he is wrong about it not being related to the problem: using malloc instead of new/delete does not initialise memory or select the best data type, which can result in all of the problems listed below.
Possible causes are:
The sign of a number changing somewhere. I have seen similar issues when a platform invoke has been used on a DLL and a value is passed by value instead of by reference. It is caused by memory not necessarily being empty, so when your image data enters it will have Boolean maths performed on its values. I would suggest making sure the memory is empty before you put your image data there.
Memory rotating right (ROR in assembly language) or left (ROL). This will occur if data types are used which do not necessarily match, e.g. a signed value entering an unsigned data type, or if the number of bits differs from one variable to another.
Data being lost due to an unsigned value entering a signed variable. The outcomes are one bit being lost because it will be used to determine negative or positive, or, at the extremes, if two's complement takes place the number will become inverted in meaning; look up two's complement on Wikipedia.
Also see how memory should be cleared/assigned before use. http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

Matrix multiplication in a cpp file for Matlab

How would I do a matrix multiplication in cpp format that would after be compiled into a mex file?
My normal matrix multiplication in a Matlab script is as follow:
cMatrix = (1 / r) * pfMatrix * wcMatrix; %here pfMatrix is 2x3 and wcMatrix is 3x8
% Hence cMatrix is 2x8
% r is a scalar
The pfMatrix, wcMatrix and r are declared correctly in the cpp file and they have the same values as in the script. However, cMatrix doesn't give me the same results. Here is the implementation of the matrix multiplication in the cpp file:
int i, n, j;
for (i = 0; i < 1; i++)
{
    for (n = 0; n < 7; n++)
    {
        for (j = 0; j < 2; j++)
        {
            d->cMatrix[i][n] += (d->pfMatrix[i][j]) * (d->wcMatrix[j][n]);
        }
        d->cMatrix[i][n] = (1 / d->r) * d->cMatrix[i][n];
    }
}
Edit:
I modified the loop following Ben Voigt answer. The results in cMatrix are still not identical to the one calculated from the Matlab script.
For example :
pfMatrix = [7937.91049469652,0,512;0,7933.81033431703,384];
wcMatrix = [-0.880633810389421,-1.04063381038942,-1.04063381038942,-0.880633810389421,-0.815633810389421,-1.10563381038942,-1.10563381038942,-0.815633810389421;-0.125,-0.125,0.125,0.125,-0.29,-0.29,0.29,0.29;100,100,100,100,100,100,100,100];
r = 100;
In this case, cMatrix(1,1) is :
(pfMatrix(1,1)*wcMatrix(1,1) + pfMatrix(1,2)*wcMatrix(2,1) + pfMatrix(1,3)*wcMatrix(3,1)) / r = 442.09
However, with the mex file the equivalent result is 959.
Edit #2:
I found the error: an element of pfMatrix was not declared correctly (it was missing a division by 2). So the answer of Ben Voigt works correctly. However, there is still a slight difference between the two results (the Matlab script gives 442 and the mex gives 447; could it be a result of different data types?).
Edit #3:
Found the error and it was not related with the matrix multiplication loop.
Using your result matrix as scratch space is not a great idea. The compiler has to worry about aliasing, which means it can't optimize.
Try an explicit working variable, which also provides a convenient place to zero it:
for (int i = 0; i < 2; ++i) {
    for (int n = 0; n < 8; ++n) {
        double accum = 0.0;
        for (int j = 0; j < 3; ++j) {
            accum += (d->pfMatrix[i][j]) * (d->wcMatrix[j][n]);
        }
        d->cMatrix[i][n] = accum / d->r;
    }
}
Your ranges were also wrong, which I've fixed.
(Also note that good performance on large matrices requires banding to get good cache behavior, however that shouldn't be an issue on a product of this size.)
A matrix multiplication must respect the dimensions: A[m][n] * B[n][p] = R[m][p].
The conditions you wrote in the for loops are not correct and don't respect the matrix dimensions.
Look also at the Eigen library, which is open-source and provides a simple way to do matrix multiplication.
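For example, with Eigen the whole computation collapses to a few lines. This is only a sketch assuming the fixed sizes from the question (2x3 times 3x8); copying the values from d->pfMatrix and d->wcMatrix into the Eigen matrices is left out:
#include <Eigen/Dense>
Eigen::Matrix<double, 2, 3> pf;   // fill from d->pfMatrix
Eigen::Matrix<double, 3, 8> wc;   // fill from d->wcMatrix
Eigen::Matrix<double, 2, 8> c = (1.0 / d->r) * pf * wc;  // same as the Matlab expression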

CUDA combining thread independent(??) variables during execution

I apologize if the title is confusing. I thought long and hard and couldn't come up with a proper way to phrase the question in a single line, so here's more detail. I am doing a basic image subtraction where the second image has been modified, and I need to find the ratio of how much change was done to the image. For this I used the following code. Both images are 128x1024.
for (int i = 0; i < 128; i++)
{
    for (int j = 0; j < 1024; j++)
    {
        den++;
        diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
        if (diff[i * 1024 + j] < error)
        {
            num++;
        }
    }
}
ratio = num/den;
The above code works fine on the CPU, but I want to try to do this in CUDA. For this I can set up CUDA to do the basic subtraction of the images (code below), but I can't figure out how to do the conditional if statement to get my ratio out.
__global__ void calcRatio(float *orig, float *modified, int size, float *result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size)
        result[index] = orig[index] - modified[index];
}
So, up to this point it works, but I cannot figure out how to parallelize the num and den counters in each thread to calculate the ratio at the end of all the thread executions. To me it feels like the num and den counters are independent of the threads, as every time I have tried to use them it seems they get incremented only once.
Any help will be appreciated as I am just starting out in CUDA and every example I see online never seems to apply to what I need to do.
EDIT: Fixed my naive code. I forgot to type one of the main conditions in the code. It was a long, long day.
for (int i = 0; i < 128; i++)
{
    for (int j = 0; j < 1024; j++)
    {
        if (modified[i * 1024 + j] < 400.0) //400.0 threshold value to ignore noise
        {
            den++;
            diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
            if (diff[i * 1024 + j] < error)
            {
                num++;
            }
        }
    }
}
ratio = num/den;
The operation you need to perform a global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. There is a reduction kernel and a very good paper discussing the technique in the CUDA SDK; it is worth reading.
If I were writing code to do what you want, it would probably look like this:
template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size, float *result,
                          int *count, const float error)
{
    __shared__ volatile float buff[blocksize];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    int threadcount = 0;
    // Grid-stride loop: each thread processes several elements and keeps a local tally
    for (int i = index; i < size; i += stride) {
        float val = orig[i] - modified[i];
        threadcount += (val < error);
        result[i] = val;
    }
    buff[threadIdx.x] = threadcount;
    __syncthreads();
    // Parallel reduction in shared memory using 1 warp
    if (threadIdx.x < warpSize) {
        for (int i = threadIdx.x + warpSize; i < blocksize; i += warpSize)
            buff[threadIdx.x] += buff[i];
        if (threadIdx.x < 16) buff[threadIdx.x] += buff[threadIdx.x + 16];
        if (threadIdx.x < 8)  buff[threadIdx.x] += buff[threadIdx.x + 8];
        if (threadIdx.x < 4)  buff[threadIdx.x] += buff[threadIdx.x + 4];
        if (threadIdx.x < 2)  buff[threadIdx.x] += buff[threadIdx.x + 2];
        if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
    }
}
The first stanza does what your serial code does - computes a difference and a thread local total of elements which are less than error. Note I have written this version so that each thread is designed to process more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows, and the idea is that you would use fewer blocks and threads than there were input data set entries.
The second stanza is the reduction itself, done in shared memory. It is effectively a "tree like" operation: the set of thread-local subtotals within a single block of threads is first summed down to 32 subtotals, then those subtotals are combined until there is a single final subtotal, which is stored as the total for the block. You will wind up with a small list of subtotals in count, one for each block you launched, which can be copied back to the host and the final result you need calculated there.
Please note I coded this in the browser and haven't compiled it, there might be errors, but it should give an idea about how an "advanced" version of what you are trying to do would work.
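The host-side finalization is not shown above; a minimal sketch (variable names and the launch configuration are assumptions, and the device buffers are presumed already allocated and filled) could be:
const int nthreads = 256;   // must match the blocksize template argument
const int nblocks  = 128;   // assumed launch configuration
calcRatio<nthreads><<<nblocks, nthreads>>>(d_orig, d_modified, size, d_result, d_count, error);
int h_count[nblocks];
cudaMemcpy(h_count, d_count, nblocks * sizeof(int), cudaMemcpyDeviceToHost);
long long num = 0;
for (int b = 0; b < nblocks; b++) num += h_count[b];   // finish the reduction on the CPU
float ratio = float(num) / float(size);                // the denominator is just the element count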
The denominator is pretty simple, since it's just the size.
The numerator is more troublesome, since its value for a given thread depends on all previous values. You're going to have to do that operation serially.
The thing you're looking for is probably atomicAdd. It's very slow, though.
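For completeness, a minimal sketch of the atomicAdd route (the kernel name and the single-int counter num_d are made up for illustration; num_d must be zeroed before the launch):
__global__ void calcRatioAtomic(const float *orig, const float *modified,
                                int size, float *result, int *num_d, float error)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size) {
        float val = orig[index] - modified[index];
        result[index] = val;
        if (val < error)
            atomicAdd(num_d, 1);   // every matching thread serializes on this one counter
    }
}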
I think you'd find this question relevant. Your num is basically global data.
CUDA array-to-array sum
Alternatively, you could dump the results of the error check into an array. Counting the results could then be parallelized. It would be a little tricky, but I think something like this would scale up: http://tekpool.wordpress.com/2006/09/25/bit-count-parallel-counting-mit-hakmem/

weird performance in C++ (VC 2010)

I have this loop written in C++, that compiled with MSVC2010 takes a long time to run. (300ms)
for (int i = 0; i < h; i++) {
    for (int j = 0; j < w; j++) {
        if (buf[i*w+j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            float val = 0;
            for (int k = sy; k < ey; k++) {
                for (int m = sx; m < ex; m++) {
                    val += original[k*w + m] * ds[k - i + hr][m - j + hr];
                }
            }
            heat_map[i*w + j] = val;
        }
    }
}
It seemed a bit strange to me, so I did some tests, then changed a few bits to inline assembly (specifically, the code that sums val):
for (int i = 0; i < h; i++) {
    for (int j = 0; j < w; j++) {
        if (buf[i*w+j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            __asm {
                fldz
            }
            for (int k = sy; k < ey; k++) {
                for (int m = sx; m < ex; m++) {
                    float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
                    __asm {
                        fld val
                        fadd
                    }
                }
            }
            float val1;
            __asm {
                fstp val1
            }
            heat_map[i*w + j] = val1;
        }
    }
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?
I suggest you try different floating-point calculation models supported by the compiler - precise, strict or fast (see /fp option) - with your original code before making any conclusions. I suspect that your original code was compiled with some overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at the first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with precise model, the intermediate results would have to be rounded to the precision of float type, even if the variable val was optimized away and the internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which is what could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast mode and see how their performances compare in that case.
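For example, from a Visual Studio command prompt the two models can be compared with something like this (the source file name is a placeholder):
cl /O2 /fp:precise heatmap.cpp
cl /O2 /fp:fast    heatmap.cpp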
A few things to check out:
You need to check that it actually is the same code. As in, are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val to zero, the second is the extra variable val1 (unlikely to matter, since it will most likely just change the constant subtracted from the stack pointer), and the third is that your inline assembly version may not put the interim results back into val.
You need to make sure your sample space is large. You didn't mention whether you did only one run of each version or a hundred runs, but the more runs the better, so as to remove the effect of "noise" from your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.
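One way to get CPU time rather than wall-clock time on Windows is GetProcessTimes; a rough sketch (error handling omitted):
#include <windows.h>

// Returns the user-mode CPU time of the current process in 100-nanosecond units.
ULONGLONG user_cpu_time_100ns()
{
    FILETIME creation, exitTime, kernel, user;
    GetProcessTimes(GetCurrentProcess(), &creation, &exitTime, &kernel, &user);
    return (ULONGLONG(user.dwHighDateTime) << 32) | user.dwLowDateTime;
}
// Call it before and after the loop and subtract; divide by 10,000 for milliseconds.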