Accessing memory allocated on CUDA [closed] - c++

I don't really have any experience with CUDA. I have a C++ script that looks like the following:
for (int i = 0; i < n; ++i) {
    // out_data here is a pointer to some chunk of memory on a CPU
    out_data[i] = manipulate_out_data_val(out_data[i]);
}
This is currently set up for CPUs. I would like to adapt this to work with GPU-allocated arrays, i.e., if out_data was allocated on the GPU, how do I write the above loop?
I tried porting it over as is with a GPU-allocated array, and the program segfaults.
I'm not sure if this is relevant, but manipulate_out_data_val applies a constant scaling factor to the input value and then adds a constant to the resulting scaled value.

First, I will convert your function into a CUDA kernel, which looks something like this:
__global__ void manipulate_out_data_val(int *array)
{
    // Assuming `20` is just a scaling factor.
    array[threadIdx.x] *= 20;
}
Please note that the for loop is no longer needed, because CUDA provides the threadIdx variable. The thread's index replaces the i from your for loop. Please refer to the CUDA documentation to learn more about CUDA's threading model.
Let's assume the array can store up to 100 integers.
int n = 100;
int bytes = n * sizeof(int);
Initialise an array on the CPU first.
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for (int i = 0; i < n; i++) {
    arr_cpu[i] = i;
}
Allocate some memory on the GPU
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
Now you can copy your CPU array to this allocated GPU memory using the cudaMemcpy function. Note that Host indicates the CPU and Device indicates the GPU.
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
Finally you can run your kernel. Note the number 1 in kernel syntax is number of blocks and n is number of threads per block.
manipulate_out_data_val<<<1, n>>>(arr_gpu);
Wait until the kernel has finished running.
cudaDeviceSynchronize();
Finally, you can move the array from GPU back to CPU
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
Please find the whole code here:
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
__global__ void manipulate_out_data_val(int *array)
{
    // Can add your constant scaling logic here
    array[threadIdx.x] *= 20;
}
int main(int argc, char **argv)
{
    int n = 100;
    int bytes = n * sizeof(int);
    // Initialise the array on the CPU
    int *arr_cpu = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) {
        arr_cpu[i] = i;
    }
    // Allocate memory on the GPU and copy the input over
    int *arr_gpu;
    cudaMalloc((void **)&arr_gpu, bytes);
    printf("Copying to device..\n");
    cudaMemcpy(arr_gpu, arr_cpu, bytes, cudaMemcpyHostToDevice);
    // One block of n threads; n must not exceed the per-block thread limit
    manipulate_out_data_val<<<1, n>>>(arr_gpu);
    cudaDeviceSynchronize();
    // Copy the result back and print it
    cudaMemcpy(arr_cpu, arr_gpu, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        printf("%d,", arr_cpu[i]);
    }
    printf("\n");
    cudaFree(arr_gpu);
    free(arr_cpu);
    return 0;
}
Build and run using:
# is the file containing the code
nvcc -o program
# Run
The above code has been tested on CUDA 11.4.
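One caveat about the launch above: <<<1, n>>> only works while n fits in a single block (at most 1024 threads per block on current GPUs). For larger arrays the usual pattern is to launch several blocks, compute a global index from the block and thread indices, and guard against running past the end of the array. The sketch below emulates that indexing on the host in plain C++ so it can be tried without a GPU. blocks_for and scale_emulated are hypothetical helper names, not CUDA API, and the two loops stand in for blockIdx.x and threadIdx.x.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n
// (ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host-side emulation of a 1-D launch: every (block, thread) pair computes the
// same global index a kernel would derive from blockIdx.x, blockDim.x and
// threadIdx.x.
void scale_emulated(std::vector<int>& data, int factor, int threads_per_block) {
    const int n = static_cast<int>(data.size());
    const int blocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            const int i = block * threads_per_block + thread;
            if (i < n) {  // guard: the last block may be only partly used
                data[i] *= factor;
            }
        }
    }
}
```

In a real kernel the two loops disappear: the index becomes blockIdx.x * blockDim.x + threadIdx.x, and the launch would be written along the lines of <<<blocks_for(n, 256), 256>>>, where 256 is just an assumed block size.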


Why is this simple CUDA kernel getting a wrong result?

I am a newbie with CUDA. I'm learning some basic things because I want to use CUDA in another project. I wrote this code to add up all the elements of an 8x8 square matrix that has been filled with 1's, so the result must be 64.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
const int SIZE = 64;
__global__ void add_matrix_values(int* matrix, int sum, int c)
int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.x * blockDim.x;
sum += matrix[i*c+j];
int main()
int* device_matrix;
int* host_matrix;
int c = 8; //Squared matrix cxc
int device_c = 8;
int device_sum = 0;
int host_sum = 0;
//Allocate host memory
host_matrix = (int*)malloc(sizeof(int)*SIZE);
//Fill the matrix values with 1's
for(auto i = 0; i < SIZE; i++)
host_matrix[i] = 1;
//Allocate device memory
cudaMalloc((void**) &device_matrix,sizeof(int)*SIZE);
cudaMalloc((void**) &device_sum, sizeof(int));
cudaMalloc((void**) &device_c,sizeof(int));
//Fill device_matrix with host_matrix values
//Initialize device_sum with a 0
//Initialize device_c with the correct value
//4 blocks with 16 threads every single block ¿Is this correct?
add_matrix_values<<<4,16>>>(device_matrix, device_sum,device_c);
std::cout<<"The value is: "<<host_sum<<std::endl;
return 0;
The result must be 64 but I'm getting wrong numbers.
migue@migue ~/Escritorio $ ./program
The value is: 32762
migue@migue ~/Escritorio $ ./program
The value is: 32608
migue@migue ~/Escritorio $ ./program
The value is: 32559
I don't know what I'm doing wrong. Could it be the gridSize and the blockSize? Or could it be the i and j computation in the CUDA kernel?
I don't understand those terms very well.
There are a number of issues:
You are creating a 1-D grid (grid configuration, block configuration) so your 2-D indexing in kernel code (i,j, or x,y) doesn't make any sense
You are passing sum by value. You cannot retrieve a result that way. Changes in the kernel to sum won't be reflected in the calling environment. This is a C++ concept, not specific to CUDA. Use a properly allocated pointer instead.
In a CUDA multithreading environment, you cannot have multiple threads update the same location/value without any control. CUDA does not sort out that kind of access for you. You must use a parallel reduction technique, and a simplistic approach here could be to use atomics. You can find many questions here on the cuda tag discussing parallel reductions.
You're generally confusing pass by value and pass by pointer. Items passed by value can be ordinary host variables. You generally don't need a cudaMalloc allocation for those. You also don't use cudaMalloc on any kind of variable except a pointer.
Your use of cudaMemcpy is incorrect. There is no need to take the address of the pointers.
The following code has the above items addressed:
$ cat
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
const int SIZE = 64;
__global__ void add_matrix_values(int* matrix, int *sum, int c)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    atomicAdd(sum, matrix[i]);
}
int main()
{
    int* device_matrix;
    int* host_matrix;
    int device_c = 8;
    int *device_sum;
    int host_sum = 0;
    //Allocate host memory
    host_matrix = (int*)malloc(sizeof(int)*SIZE);
    //Fill the matrix values with 1's
    for(auto i = 0; i < SIZE; i++)
        host_matrix[i] = 1;
    //Allocate device memory
    cudaMalloc((void**) &device_matrix,sizeof(int)*SIZE);
    cudaMalloc((void**) &device_sum, sizeof(int));
    //Fill device_matrix with host_matrix values
    cudaMemcpy(device_matrix, host_matrix, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
    //Initialize device_sum with a 0
    cudaMemset(device_sum, 0, sizeof(int));
    //4 blocks with 16 threads every single block. Is this correct?
    add_matrix_values<<<4,16>>>(device_matrix, device_sum,device_c);
    //Copy the result back to the host
    cudaMemcpy(&host_sum, device_sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout<<"The value is: "<<host_sum<<std::endl;
    return 0;
}
$ nvcc -o t135
$ cuda-memcheck ./t135
The value is: 64
========= ERROR SUMMARY: 0 errors
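The point about passing sum by value is plain C++ and can be checked without any GPU. A minimal sketch, with made-up function names, showing that a by-value parameter is a private copy while a pointer parameter lets the callee update the caller's variable:

```cpp
#include <cassert>

// Receives a copy of `sum`; the caller's variable is left untouched.
void add_by_value(int sum, int x) {
    sum += x;
}

// Receives the address of the caller's variable and updates it in place.
void add_by_pointer(int* sum, int x) {
    assert(sum != nullptr);
    *sum += x;
}
```

The kernel in the question behaves like add_by_value: each thread increments its own copy of sum and the result is discarded. The corrected kernel behaves like add_by_pointer, with the extra requirement that the pointer must refer to device memory.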

CUDA kernel returns nothing

I'm using CUDA Toolkit 8 with Visual Studio Community 2015. When I try simple vector addition from NVidia's PDF manual (minus error checking which I don't have the *.h's for) it always comes back as undefined values, which means the output array was never filled. When I pre-fill it with 0's, that's all I get at the end.
Others have had this problem and some people are saying it's caused by compiling for the wrong compute capability. However, I am using an NVidia GTX 750 Ti, which is supposed to be Compute Capability 5. I have tried compiling for Compute Capability 2.0 (the minimum for my SDK) and 5.0.
I also cannot make any of the precompiled examples work, such as vectoradd.exe which says, "Failed to allocate device vector A (error code initialization error)!" And oceanfft.exe says, "Error unable to find GLSL vertex and fragment shaders!" which doesn't make sense because GLSL and fragment shading are very basic features.
My driver version is 361.43 and other apps such as Blender Cycles in CUDA mode and Stellarium work perfectly.
Here is the code that should work:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <algorithm>
#define N 10
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x; // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    // allocate the memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }
    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy(dev_a, a, N * sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int),cudaMemcpyHostToDevice);
    // launch N blocks of one thread each
    add<<<N, 1>>>(dev_a, dev_b, dev_c);
    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy(c, dev_c, N * sizeof(int),cudaMemcpyDeviceToHost);
    // display the results
    for (int i = 0; i<N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
I'm trying to develop CUDA apps so any help would be greatly appreciated.
This was apparently caused by using an incompatible driver version with the CUDA 8 toolkit. Installing the driver distributed with the version 8 toolkit solved the problem.
[Answer assembled from comments and added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]

CUDA First-chance Exception Stack overflow Error

CUDA/C++ noob here.
The error I receive on attempting to debug my CUDA project is:
First-chance exception at 0x000000013F889467 in simple6.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000223000).
The program '[2668] simple6.exe' has exited with code 0 (0x0).
From research on the web, it seems that I have some large variables that are too large for the "stack" and need to be moved to the "heap".
Can someone please provide me the appropriate code modifications?
My code is below. The point of this kernel is to use h_S and h_TM to create a bunch of values and write these values into h_F at the very end. This is why h_F is never copied into the GPU.
int main()
int blockSize= 1024;
int gridSize = 1;
const int reps = 1024;
const int iterations = 18000;
int h_F [reps * iterations] = {0};
int h_S [reps] = {0}; // not actually zeros in my code this just simplifies things
int h_TM [2592] = {0}; // not actually zeros in my code this just simplifies things
// Device input vectors
float *d_F;
double *d_S;
float *d_TM;
//Select GPU
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F, iterations * reps * sizeof(float));
cudaMalloc((void**)&d_S, reps * sizeof(double));
cudaMalloc((void**)&d_TM, 2592 * sizeof(float));
// Copy host vectors to device
cudaMemcpy( d_S, h_S, reps * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( d_TM, h_TM, 2592 * sizeof(float), cudaMemcpyHostToDevice);
// Execute the kernel
myKern<<<gridSize, blockSize>>>(d_TM, d_F, d_S, reps);
// Copy array back to host
cudaMemcpy( h_F, d_F, iterations * reps * sizeof(float), cudaMemcpyDeviceToHost );
// Release device memory
return 0;
Also, related, but would making these huge input arrays "shared" variables solve my problem?
Many thanks.
I read through your code, and it seems that only one of those three arrays is actually going to cause the stack overflow: h_F. (This assumes your reps doesn't get too big.) All you have to do is declare h_F so that it is placed on the heap instead of the stack, as you said.
This is literally a one-line change.
Simply declare h_F like this, and release it with delete[] h_F; when you are done:
float *h_F = new float[(reps * iterations)];
Good luck!
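For scale: reps * iterations is 1024 * 18000 = 18,432,000 elements, roughly 70 MiB at 4 bytes each, which is far more than a typical default stack (commonly around 1 MB with MSVC). If you would rather not manage new/delete by hand, a std::vector gives the same heap storage with automatic cleanup. A small sketch, where make_host_buffer is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a zero-initialised buffer of reps * iterations floats.
// The elements live on the heap; only the small vector object is on the stack.
std::vector<float> make_host_buffer(std::size_t reps, std::size_t iterations) {
    assert(reps > 0 && iterations > 0);
    return std::vector<float>(reps * iterations, 0.0f);
}
```

The raw pointer needed by cudaMemcpy is then available as h_F.data(), and the byte count as h_F.size() * sizeof(float).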

Unable to find simple sum of 1 to 100 numbers in CUDA?

I am working on an image processing algorithm using CUDA. In my algorithm I want to find the sum of all pixels of an image using a CUDA kernel, so I wrote a kernel to compute the sum of all pixels of a 16-bit grayscale image, but I got the wrong answer.
So I wrote a simple CUDA program to find the sum of the numbers 1 to 100; my code is below.
With this code the GPU does not give the exact sum of the numbers 1 to 100, but the CPU does. What did I do wrong in this code?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <malloc.h>
#include <limits>
#include <math.h>
using namespace std;
__global__ void computeMeanValue1(double *pixels,double *sum){
int x = threadIdx.x;
sum[0] = sum[0] + (pixels[(x)]);
int main(int argc, char **argv)
double *data;
double *dev_data;
double *dev_total;
double *total;
data=new double[(100) * sizeof(double)];
total=new double[(1) * sizeof(double)];
double cpuSum=0.0;
for(int i=0;i<100;i++){
cout<<"CPU total = "<<cpuSum<<std::endl;
cudaMalloc( (void**)&dev_data, 100 * sizeof(double));
cudaMalloc( (void**)&dev_total, 1 * sizeof(double));
cudaMemcpy(dev_data, data, 100 * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(total, dev_total, 1* sizeof(double), cudaMemcpyDeviceToHost);
cout<<"GPU total = "<<total[0]<<std::endl;
return 0;
All your threads are writing to the same memory location at the same time.
sum[0] = sum[0] + (pixels[(x)]);
You can't do this and expect to get the correct result. Your kernel needs to take a different approach to avoid writing to the same memory from different threads. The pattern usually employed for this is a reduction. Simply put, in a reduction each thread is responsible for summing a block of elements within the array and then storing the result. By employing a series of these reduction operations, it's possible to sum the entire contents of the array.
__global__ void block_sum(const float *input,
                          float *per_block_results,
                          const size_t n)
{
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // load input into __shared__ memory
    float x = 0;
    if(i < n)
    {
        x = input[i];
    }
    sdata[threadIdx.x] = x;
    // wait until every thread has stored its element
    __syncthreads();
    // contiguous range pattern
    for(int offset = blockDim.x / 2;
        offset > 0;
        offset >>= 1)
    {
        if(threadIdx.x < offset)
        {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }
    // thread 0 writes the final result
    if(threadIdx.x == 0)
    {
        per_block_results[blockIdx.x] = sdata[0];
    }
}
Each thread writes to a different location, sdata[threadIdx.x], so there is no race condition. Threads are free to access other elements in sdata because they only read from them. Note the first call to __syncthreads(), which ensures that the loads into sdata are complete before any thread starts to read the data, and the second call, inside the loop, which ensures that all summation operations of one step have completed before the next step begins and before the final result is copied from sdata[0]. Only thread 0 writes its result to per_block_results[blockIdx.x], so there is no race condition there either.
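To see why no two threads touch the same element within one step of the halving loop, here is a sequential host-side emulation of the in-block reduction in plain C++. It is only a sketch: block_sum_emulated is a made-up name, the vector plays the role of sdata, and the block size is assumed to be a power of two, which the offset-halving loop relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of the in-block tree reduction.
// `block` plays the role of sdata; its size must be a power of two.
float block_sum_emulated(std::vector<float> block) {
    const std::size_t n = block.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t offset = n / 2; offset > 0; offset >>= 1) {
        // On the GPU these iterations run in parallel, one per thread.
        // "Thread" t writes block[t] (t < offset) and reads block[t + offset]
        // (index >= offset), so writes and reads never overlap in one step.
        for (std::size_t t = 0; t < offset; ++t) {
            block[t] += block[t + offset];
        }
        // The end of the inner loop corresponds to __syncthreads().
    }
    return block[0];
}
```

For eight elements 1..8 the steps produce {6, 8, 10, 12}, then {16, 20}, then 36, so the block needs log2(8) = 3 synchronised steps instead of 7 serial additions.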
You can find the complete sample code for the above on Google Code (I did not write this). This slide deck has a reasonable summary of reductions in CUDA. It includes diagrams which really help in understanding how the interleaved memory reads and writes do not conflict with each other.
You can find lots of other material on efficient implementations of reduction on GPUs. Ensuring that your implementation makes most efficient use of memory is key to getting the best performance out of a memory bound operation like reduction.
In GPU code, we have multiple threads executing in parallel. If all of those threads attempt to update the same location in memory, we have undefined behavior, unless we use special operations, called atomics to do the update.
In your case, since sum is updated by all threads, and sum is a double quantity, we can use the special custom atomic function described in the programming guide to accomplish this.
If I replace your kernel code with the following:
// Only needed on devices of compute capability below 6.0,
// which lack a native atomicAdd for double.
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    // retry if another thread changed the value in the meantime
    } while (assumed != old);
    return __longlong_as_double(old);
}
__global__ void computeMeanValue1(double *pixels,double *sum){
    int x = threadIdx.x;
    atomicAdd(sum, pixels[x]);
}
And initialize the sum value to zero before the kernel:
double gpuSum = 0.0;
cudaMemcpy(dev_total, &gpuSum, sizeof(double), cudaMemcpyHostToDevice);
Then I think you'll get matching results.
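The retry loop in that atomicAdd is a general compare-and-swap pattern rather than anything specific to CUDA. As a rough host-side analogue (an illustration only, not the device code), the same idea can be written in standard C++ with std::atomic<double>. atomic_add_cas and sum_with_cas are hypothetical names, and the summing loop here is sequential purely to keep the sketch short.

```cpp
#include <atomic>
#include <vector>

// Adds val to target with a compare-and-swap retry loop and returns the value
// that was stored before the addition.
double atomic_add_cas(std::atomic<double>& target, double val) {
    double old = target.load();
    // On failure compare_exchange_weak reloads `old` with the current value,
    // so the next attempt uses fresh data, just like `assumed = old` above.
    while (!target.compare_exchange_weak(old, old + val)) {
    }
    return old;
}

// Accumulates all values into one shared location through the CAS loop.
double sum_with_cas(const std::vector<double>& values) {
    std::atomic<double> sum(0.0);
    for (double v : values) {
        atomic_add_cas(sum, v);
    }
    return sum.load();
}
```

Because every update goes through the same location, this serialises the additions; that is why the reduction approach is usually faster for large inputs.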
As @AdeMiller pointed out, the faster way to perform parallel sums like this is via classical parallel reduction.
There is a CUDA sample code that demonstrates this and an accompanying presentation that covers the methodology.

Fastest way to calculate minimum euclidean distance between two matrices containing high dimensional vectors

I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. Matrix a is 2782x128 and Matrix b is 4000x128, both unsigned char values. The values are stored in a single array. For each vector in a, I need the index of the vector in b with the closest euclidean distance.
Ok, now my code to achieve this:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <iostream>
#include <fstream>
#include "main.h"
using namespace std;
void main(int argc, char* argv[])
int a_size;
unsigned char* a = NULL;
read_matrix(&a, a_size,"matrixa");
int b_size;
unsigned char* b = NULL;
read_matrix(&b, b_size,"matrixb");
QueryPerformanceFrequency( &liPerfFreq );
QueryPerformanceCounter( &liStart );
int* indexes = NULL;
min_distance_loop(&indexes, b, b_size, a, a_size);
QueryPerformanceCounter( &liEnd );
cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl;
if (a)
if (b)
if (indexes)
void read_matrix(unsigned char** matrix, int& matrix_size, char* matrixPath)
ofstream myfile;
float f;
FILE * pFile;
pFile = fopen (matrixPath,"r");
fscanf (pFile, "%d", &matrix_size);
*matrix = new unsigned char[matrix_size*128];
for (int i=0; i<matrix_size*128; ++i)
unsigned int matPtr;
fscanf (pFile, "%u", &matPtr);
matrix[i]=(unsigned char)matPtr;
fclose (pFile);
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
const int descrSize = 128;
*indexes = (int*)malloc(a_size*sizeof(int));
int dataIndex=0;
int vocIndex=0;
int min_distance;
int distance;
int multiply;
unsigned char* dataPtr;
unsigned char* vocPtr;
for (int i=0; i<a_size; ++i)
min_distance = LONG_MAX;
for (int j=0; j<b_size; ++j)
dataPtr = &a[dataIndex];
vocPtr = &b[vocIndex];
for (int k=0; k<descrSize; ++k)
multiply = *dataPtr++-*vocPtr++;
distance += multiply*multiply;
// If the distance is greater than the previously calculated, exit
if (distance>min_distance)
// if distance smaller
if (distance<min_distance)
min_distance = distance;
(*indexes)[i] = j;
And attached are the files with sample matrices.
I am using windows.h only to measure the elapsed time, so if you want to test the code on a platform other than Windows, just replace the windows.h header and change the way the elapsed time is measured.
On my computer this code takes about 0.5 seconds. The problem is that I have other code in Matlab that does the same thing in 0.05 seconds. In my experiments, I receive several matrices like matrix a every second, so 0.5 seconds is too much.
Now the Matlab code to calculate this:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);
OK. The Matlab code uses the identity (a-b)^2 = a^2 + b^2 - 2ab.
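For vectors the same expansion reads |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, which is what turns the all-pairs distance computation into one big matrix product a*b'. A small C++ sketch (function names are made up for illustration) that computes the squared distance both ways, so the identity can be checked on concrete numbers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared distance computed directly: sum over k of (a_k - b_k)^2.
long long dist2_direct(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long long d = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const long long diff = static_cast<long long>(a[k]) - b[k];
        d += diff * diff;
    }
    return d;
}

// Same quantity via the expansion |a|^2 + |b|^2 - 2 a.b.
long long dist2_expanded(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long long aa = 0, bb = 0, ab = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        aa += static_cast<long long>(a[k]) * a[k];
        bb += static_cast<long long>(b[k]) * b[k];
        ab += static_cast<long long>(a[k]) * b[k];
    }
    return aa + bb - 2 * ab;
}
```

With a = (1, 2, 3) and b = (4, 0, 8), both give 38: directly 9 + 4 + 25, and expanded 14 + 80 - 2*28. In the all-pairs setting |a|^2 and |b|^2 are computed once per row, and only the a.b terms need the matrix product.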
So my next attempt was to do the same thing. I replaced my own code with a version that makes the same calculations, but it took approximately 1.2 seconds.
Then, I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128;
MatrixXi a(a_size, descrSize);
MatrixXi b(b_size, descrSize);
MatrixXi ab(a_size, b_size);
unsigned char* dataPtr = matrixa;
for (int i=0; i<nframes; ++i)
for (int j=0; j<descrSize; ++j)
unsigned char* vocPtr = matrixb;
for (int i=0; i<vocabulary_size; ++i)
for (int j=0; j<descrSize; ++j)
b(i,j)=(int)*vocPtr ++;
ab = a*b.transpose();
MatrixXi aa = a.rowwise().sum();
MatrixXi bb = b.rowwise().sum();
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs2();
int* index = NULL;
index = (int*)malloc(nframes*sizeof(int));
for (int i=0; i<nframes; ++i)
This Eigen code takes approximately 1.2 seconds for just the line ab = a*b.transpose();
Similar code using OpenCV was also tried, and the cost of ab = a*b.transpose(); was 0.65 seconds.
So it is really annoying that Matlab is able to do this same thing so quickly and I am not able to in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what is really annoying me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome: any external library (free if possible), loop unrolling, templates, SSE instructions (I know they exist), cache optimizations. As I said, my main purpose is to increase my knowledge so that I can code things like this with better performance.
Thanks in advance
EDIT: more code, as suggested by David Hammen. I cast the arrays to int before making any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
const int descrSize = 128;
int* a_int;
int* b_int;
QueryPerformanceFrequency( &liPerfFreq );
QueryPerformanceCounter( &liStart );
a_int = (int*)malloc(a_size*descrSize*sizeof(int));
b_int = (int*)malloc(b_size*descrSize*sizeof(int));
for(int i=0; i<descrSize*a_size; ++i)
for(int i=0; i<descrSize*b_size; ++i)
QueryPerformanceCounter( &liEnd );
cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl;
*indexes = (int*)malloc(a_size*sizeof(int));
int dataIndex=0;
int vocIndex=0;
int min_distance;
int distance;
int multiply;
/*unsigned char* dataPtr;
unsigned char* vocPtr;*/
int* dataPtr;
int* vocPtr;
for (int i=0; i<a_size; ++i)
min_distance = LONG_MAX;
for (int j=0; j<b_size; ++j)
dataPtr = &a_int[dataIndex];
vocPtr = &b_int[vocIndex];
for (int k=0; k<descrSize; ++k)
multiply = *dataPtr++-*vocPtr++;
distance += multiply*multiply;
// If the distance is greater than the previously calculated, exit
if (distance>min_distance)
// if distance smaller
if (distance<min_distance)
min_distance = distance;
(*indexes)[i] = j;
The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?
EDIT2: Anything about Eigen? When I look for external libraries, people always talk about Eigen and its speed. Did I do something wrong? Here is some simple code using Eigen that shows it is not so fast. Maybe I am missing some config or some flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X;
This code takes about 0.9 seconds.
As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. This represents a rate of 57 GFLOPS, showing that it is using not only vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5-year-old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best out of Eigen:
compile in 64bits mode
use floats (doubles are twice as slow owing to vectorization)
enable OpenMP
if your CPU has hyper-threading, then either disable it or define the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)
if you have other task running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
use the most recent compiler that you can, GCC, clang and ICC are best, MSVC is usually slower.
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.
Another problem might be your use of int versus long. I don't work on Windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 2^23.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints, and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a and b matrices into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into one of a modern Intel chip's two L1 caches. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break; back in, but do so at a chunk level rather than at the byte-by-byte level.
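The chunking idea can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: nearest_rows_chunked is a hypothetical name, the matrices are row-major with dim values per row, and the chunk size (for example 250 rows of b) is a parameter to experiment with. The innermost loop has no branch, and each chunk of b is reused against every row of a before moving on.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// For each row of a, finds the index of the closest row of b (squared
// Euclidean distance). b is processed chunk_rows rows at a time so that one
// chunk can stay cache-resident while all rows of a are compared against it.
// Assumes dim is small enough that the squared distance fits in an int
// (true for dim = 128 with unsigned char data).
std::vector<int> nearest_rows_chunked(const std::vector<unsigned char>& a, int a_rows,
                                      const std::vector<unsigned char>& b, int b_rows,
                                      int dim, int chunk_rows) {
    assert(dim > 0 && chunk_rows > 0);
    assert(a.size() == static_cast<std::size_t>(a_rows) * dim);
    assert(b.size() == static_cast<std::size_t>(b_rows) * dim);
    std::vector<int> best_index(a_rows, -1);
    std::vector<int> best_dist(a_rows, INT_MAX);
    for (int start = 0; start < b_rows; start += chunk_rows) {
        const int stop = std::min(start + chunk_rows, b_rows);
        for (int i = 0; i < a_rows; ++i) {
            const unsigned char* pa = a.data() + static_cast<std::size_t>(i) * dim;
            for (int j = start; j < stop; ++j) {
                const unsigned char* pb = b.data() + static_cast<std::size_t>(j) * dim;
                int dist = 0;
                for (int k = 0; k < dim; ++k) {  // branch-free inner loop
                    const int diff = static_cast<int>(pa[k]) - static_cast<int>(pb[k]);
                    dist += diff * diff;
                }
                if (dist < best_dist[i]) {  // strict '<' keeps the first minimum
                    best_dist[i] = dist;
                    best_index[i] = j;
                }
            }
        }
    }
    return best_index;
}
```

The result does not depend on chunk_rows; only the memory access pattern changes, so the function can be timed with different chunk sizes to find what suits a given machine.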
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, you are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.
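A minimal sketch of such a loop, assuming both matrices are row-major with one descriptor per row (multiply_a_bt is a made-up name). Because C(i, j) is the dot product of row i of A with row j of B, both operands are read contiguously and B is never physically transposed:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B^T for row-major A (m x k) and row-major B (n x k).
// The result C is m x n, also row-major.
std::vector<int> multiply_a_bt(const std::vector<int>& A, int m,
                               const std::vector<int>& B, int n, int k) {
    assert(A.size() == static_cast<std::size_t>(m) * k);
    assert(B.size() == static_cast<std::size_t>(n) * k);
    std::vector<int> C(static_cast<std::size_t>(m) * n, 0);
    for (int i = 0; i < m; ++i) {
        const int* a_row = A.data() + static_cast<std::size_t>(i) * k;
        for (int j = 0; j < n; ++j) {
            const int* b_row = B.data() + static_cast<std::size_t>(j) * k;
            int acc = 0;
            for (int p = 0; p < k; ++p) {  // contiguous reads from both rows
                acc += a_row[p] * b_row[p];
            }
            C[static_cast<std::size_t>(i) * n + j] = acc;
        }
    }
    return C;
}
```

The simple unit-stride inner loop is the kind of code compilers can usually auto-vectorize when optimization is enabled; whether that happens depends on the compiler and flags, so it is worth checking the generated code or timing it.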