Fastest way to calculate the abs()-values of a complex array - c++

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for(int i = 0; i < N; i++)
b[i] = cabs(a[i]);
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.

Given that all loop iterations are independent, you can use the following code for parallelization:
#pragma omp parallel for
for(int i = 0; i < N; i++)
b[i] = cabs(a[i]);
Of course, to use this you must enable OpenMP support when compiling your code (for example with the /openmp flag in MSVC, -fopenmp in GCC and Clang, or the equivalent project option).
You can find several examples of OpenMP usage in wiki.

Or use Concurrency::parallel_for (from the Microsoft Parallel Patterns Library, header <ppl.h>) like this:
Concurrency::parallel_for(0, N, [&a, &b](int i) {
    b[i] = cabs(a[i]);
});

Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0
(#2.8 in by adding
vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math
operations implemented using SIMD ISA extensions (e.g. SSE or AVX for
x86_64). They take packed vector arguments, perform the operation on
each element of the packed vector argument, and return a packed vector
result. Using vector math functions is faster than repeatedly calling
the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
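A minimal sketch of the OpenMP SIMD route, assuming the real and imaginary parts are kept in two separate arrays. The function name abs_split is hypothetical. It computes sqrt(re*re + im*im) directly, which maps onto plain SIMD multiply, add and square-root instructions rather than onto a libmvec routine, and it does not have the overflow protection of hypot(). Whether the loop is actually vectorized depends on the compiler and flags; if OpenMP SIMD is not enabled the pragma is simply ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: |z| for n complex numbers stored as split arrays.
// re[i] and im[i] hold the real and imaginary part of element i.
void abs_split(const float* re, const float* im, float* out, std::size_t n) {
    assert(re != nullptr && im != nullptr && out != nullptr);
    // Ask the compiler to vectorize; ignored if OpenMP SIMD is not enabled.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sqrt(re[i] * re[i] + im[i] * im[i]);
    }
}
```

With GCC, something like -O2 -fopenmp-simd is the usual way to enable the pragma without pulling in the OpenMP threading runtime.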
If you're running on Solaris, you can explicitly use vhypot() from the math vector library to operate on a vector of complex numbers to obtain the absolute value of each:
These functions evaluate the function hypot(x, y) for an entire vector
of values at once. ...
The source code for libmvec, including the vhypot() code specifically, is publicly available. I don't recall whether Sun Microsystems ever provided a Linux version.

Using #pragma simd (even with -Ofast) or relying on the compiler's auto-vectorization is yet another example of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. To use SIMD efficiently here you need an array of struct of arrays. For example, for single-precision floats with a SIMD width of 4 you could use
//struct of arrays of four complex numbers
struct c4 {
    float x[4]; // real values of four complex numbers
    float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>
#define N 10
struct c4 {
    float x[4];
    float y[4];
};
static inline void cabs_soa4(struct c4 *a, float *b) {
    __m128 x4 = _mm_loadu_ps(a->x);
    __m128 y4 = _mm_loadu_ps(a->y);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
    _mm_storeu_ps(b, b4);
}
int main(void)
{
    int n4 = ((N+3)&-4)/4; //choose next multiple of 4 and divide by 4
    printf("%d\n", n4);
    struct c4 a[n4]; //array of struct of arrays
    for(int i=0; i<n4; i++) {
        for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
    }
    float b[4*n4];
    for(int i=0; i<n4; i++) {
        cabs_soa4(&a[i], &b[4*i]);
    }
    for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}
It may help to unroll the loop a few times. In any case all this is moot for large N because the operation is memory bandwidth bound. For large N (meaning when the memory usage is much larger than the last level cache), although #pragma omp parallel may help some, the best solution is not to do this for large N. Instead do this in chunks which fit in the lowest level cache along with other compute operations. I mean something like this
for(int i = 0; i < nchunks; i++) {
    for(int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive.
}
I did not implement an array of struct of array here but it should be easy to adjust the code for that.
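The chunking idea can be sketched in C++ as follows. This is only an illustration under assumptions: consume_chunk is a hypothetical stand-in for the computationally intensive foo(), and abs_then_consume is a made-up name. Unlike the loop above, the sketch reuses one small scratch buffer per chunk instead of filling a full-size b, which is one possible way to keep the working set inside the cache.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the "computationally intensive" foo():
// it just sums the chunk so the example has an observable result.
double consume_chunk(const float* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j) sum += b[j];
    return sum;
}

// Take magnitudes of one chunk, then immediately run the follow-up work
// on that chunk while it is still likely to be in cache.
double abs_then_consume(const std::vector<std::complex<float>>& a,
                        std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<float> b(chunk_size);  // scratch buffer reused for every chunk
    double total = 0.0;
    for (std::size_t start = 0; start < a.size(); start += chunk_size) {
        const std::size_t n = std::min(chunk_size, a.size() - start);
        for (std::size_t j = 0; j < n; ++j) b[j] = std::abs(a[start + j]);
        total += consume_chunk(b.data(), n);
    }
    return total;
}
```

A sensible chunk_size has to be found by measurement; it depends on the cache sizes and on how much other data foo() touches.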

If you are using a compiler with Cilk Plus support (GCC 5, for example; support was removed again in GCC 8), you can use Cilk Plus, which gives you a nice array notation, automatic use of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>
cilk_for(int i = 0; i < N; i++)
b[i] = cabs(a[i]);
or if you want to test SIMD:
#pragma simd
for(int i = 0; i < N; i++)
b[i] = cabs(a[i]);
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:])
In this case, the compiler and the runtime environment will decide to which level it should be SIMDed and what should be parallelised (the optimal way is applying SIMD on large-ish chunks in parallel).
Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing a near optimal scheduling, and that it should be able to make an optimal use of the cache.

Also, you can use std::future and std::async (both part of C++11); this may be a clearer way of expressing what you want to do:
#include <cmath>
#include <future>
int main()
{
    // Create async calculations
    std::future<void> *futures = new std::future<void>[N];
    for (int i = 0; i < N; ++i)
        futures[i] = std::async([&a, &b, i]
        {
            b[i] = std::sqrt(a[i]);
        });
    // Wait for calculation of all async procedures
    for (int i = 0; i < N; ++i)
        futures[i].get();
    delete[] futures;
    return 0;
}
We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I don't know what cabs is. I'm sure it doesn't matter.


Why do I get a larger relative speedup vs. scalar from SIMD intrinsics with larger arrays?

I want to learn SIMD programming, and I have run into something interesting in my code.
I just want to measure the run time of my code. I apply a basic function to an array of a particular size.
First I use a version of the function written with SIMD instructions, and then the usual scalar approach, and I compare the times of these two implementations of the same function.
I define performance as (time without sse) / (time using sse).
When the size is 8, the performance is 1.3; when the size is 512, it is 3; when the size is 1000, it is 4; and when the size is 4000, it is 5.
I don't understand why the performance increases as the size of the array increases.
My code
void init(double* v, size_t size) {
    for (int i = 0; i < size; ++i) {
        v[i] = i / 10.0;
    }
}
void sub_func_sse(double* v, int start_idx) {
    __m256d vector = _mm256_loadu_pd(v + start_idx);
    __m256d base = _mm256_set_pd(2.0, 2.0, 2.0, 2.0);
    for (int i = 0; i < 128; ++i) {
        vector = _mm256_mul_pd(vector, base);
    }
    _mm256_storeu_pd(v + start_idx, vector);
}
void sub_func(double& item) {
    for (int k = 0; k < 128; ++k) {
        item *= 2.0;
    }
}
int main() {
    const size_t size = 8;
    double* v = new double[size];
    init(v, size);
    const int num_repeat = 2000; // repeat the measurement
    // because I want the average time, which is more informative
    double total_time_sse = 0;
    for (int p = 0; p < num_repeat; ++p) {
        init(v, size);
        TimerHc t;
        for (int i = 0; i < size; i += 8) {
            sub_func_sse(v, i);
        }
        total_time_sse += t.toc();
    }
    double total_time = 0;
    for (int p = 0; p < num_repeat; ++p) {
        init(v, size);
        TimerHc t;
        for (int i = 0; i < size; ++i) {
            sub_func(v[i]);
        }
        total_time += t.toc();
    }
    std::cout << "time using sse = " << total_time_sse / num_repeat << std::endl <<
        "time without sse = " << total_time / num_repeat << std::endl;
}
I defined performance like (time without sse) / (time using sse).
What you measure is speedup.
The speedup you can expect from applying parallelizations is modelled by Amdahl's law. It relates the savings in those parts that can be made faster (by parallelization or other means) to the total speedup. Amdahl's law can be rather intimidating, because it basically says that making parts faster will not always gain you a total speedup. The limit in achievable speedup is determined by the relative fraction of the workload that can be parallelized.
Gustafson's law takes a different point of view. In a nutshell, it states that you just have to increase the workload to make efficient use of parallelization. A larger total workload typically shrinks the relative share of parallelization overhead and of the non-parallel part of the computation, hence (according to Amdahl's law) results in more efficient use of parallelism.
...and in some sense, that's what you are observing here. The bigger your array, the more impact parallelization has.
PS: This is just some handwaving to explain why the effect you see is not too surprising. Luckily there is another answer which addresses your specific benchmark in more detail.
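To make the handwaving a little more concrete, Amdahl's law says that if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch, with a hypothetical function name and purely illustrative numbers (not measurements from the question):

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction p of the run time
// (0 <= p <= 1) is accelerated by a factor s and the rest is unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, with s = 4 the formula gives a speedup of 1.6 when only half of the time is in the accelerated part (p = 0.5), and it approaches 4 as p approaches 1. A larger array makes fixed overheads a smaller share of the total, which pushes p up.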
You're probably a victim of CPU frequency scaling; for stable results you should disable dynamic frequency scaling and turbo boost, or at least warm up the CPU before starting the measurement.
Since you start by measuring SSE performance and then proceed to regular performance, the CPU frequency is low in the beginning, so SSE performance appears worse.
Having said that, there are a few other issues with your approach:
The overhead of the clock calls (e.g. std::chrono::high_resolution_clock::now()) is high compared to the work being measured; move the time measurement outside the for (... num_repeat ...) loop, i.e. time the entire loop rather than individual iterations (then optionally divide the measured time by the number of iterations).
The results of the computation are never used; the compiler is free to optimize the work out entirely. Make sure to "use" the result, e.g. by printing it.
It is quite inefficient to multiply a double by 2.0. Indeed, the non-SSE version is optimized to an ADD instead (item *= 2.0 ==> vaddsd xmm0, xmm0, xmm0). So your hand-made SSE version is losing out.
An optimizing compiler will probably auto-vectorize your non-SSE code. To be sure, always check the generated assembly (for example on Compiler Explorer, godbolt.org).
Use a benchmarking framework like Google Benchmark; it will help you avoid many pitfalls associated with code benchmarking.
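Two of the points above (time the whole repeat loop, and make sure the result is used) can be sketched like this. The names BenchResult and bench are hypothetical, and this is not a substitute for a real benchmarking framework: an aggressive optimizer can still transform the kernel, so checking the generated assembly remains necessary.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct BenchResult {
    double seconds_per_repeat;  // average wall time of one repeat
    double checksum;            // makes the computed values observable
};

// Hypothetical harness: times all repeats with a single pair of clock reads.
template <typename Kernel>
BenchResult bench(Kernel kernel, std::vector<double>& v, int repeats) {
    assert(repeats > 0);
    double checksum = 0.0;
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        kernel(v);
        checksum += v.empty() ? 0.0 : v[0];  // consume a result each repeat
    }
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = t1 - t0;
    return {elapsed.count() / repeats, checksum};
}
```

Printing the checksum after the measurement is what actually "uses" the result from the compiler's point of view.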

How to parallelize a loop?

I'm using OpenMP in C++ and I want to parallelize a very simple loop, but I can't do it correctly: I always get the wrong result.
A[i,j] =A[i-2,j] +A[i,j-2];
int const N = 10;
int arr[N][N];
#pragma omp parallel for
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
arr[i][j] = 1;
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
arr[i][j] = arr[i-2][j] +arr[i][j-2];
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
printf_s("%d ",arr[i][j]);
Do you have any suggestions how I can do it? Thank you!
Serial and parallel runs will give different results because in
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
arr[i][j] = arr[i-2][j] +arr[i][j-2];
you update arr[i] while other threads may be reading it, so you change data used by another thread. This leads to a data race between reads and writes!
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
arr[i][j] = arr[i-2][j] +arr[i][j-2];
is always going to be a source of grief and unpredictable output. The OpenMP runtime is going to hand each thread a range of values for i and leave them to it. There will be no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2,3,4,5,...,100 (or whatever) and thread 2 is updating elements with i = 102,103,104,...,200, the program does not determine whether thread 1 updates arr[100,:] before or after thread 2 wants to use the updated values in arr. You have written code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
for ...
old = 0
new = 1
arr[i][j][new] = arr[i-2][j][old] +arr[i][j-2][old];
old = 1
new = 0
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
If any other options occur to me I'll edit them in.
To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
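A simplified sketch of the idea for this particular recurrence, under assumptions: it sweeps anti-diagonals k = i + j only, without the full (k,l) change of coordinates and without tiling, and the function name is hypothetical. Every element on diagonal k reads only elements on diagonal k - 2, so the elements of one diagonal can be computed independently. Opening a parallel region per diagonal is expensive, so for small arrays this is unlikely to beat the serial loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fill an n x n row-major grid using arr[i][j] = arr[i-2][j] + arr[i][j-2],
// processing anti-diagonals k = i + j one after another.
void wavefront_fill(std::vector<long long>& arr, int n) {
    assert(n >= 0 && arr.size() == static_cast<std::size_t>(n) * n);
    for (int k = 4; k <= 2 * (n - 1); ++k) {      // diagonals, in order
        const int i_lo = std::max(2, k - (n - 1));
        const int i_hi = std::min(n - 1, k - 2);
        #pragma omp parallel for
        for (int i = i_lo; i <= i_hi; ++i) {      // independent within a diagonal
            const int j = k - i;
            arr[i * n + j] = arr[(i - 2) * n + j] + arr[i * n + (j - 2)];
        }
    }
}
```

The result is identical to the serial double loop because each diagonal is finished before the next one starts.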
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if( too small to be worth parallelizing ) {
    do serially
} else {
    // Recursively:
    Do upper left quadrant
    Do lower left and upper right quadrants in parallel
    Do lower right quadrant
}
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.
If you ask about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively modest vector parallelism (something like a 2x speedup or so) without changing the data structure and codebase. This is possible using the OpenMP 4.0 or Cilk Plus pragma simd or similar (with safelen/vectorlength(2)).
You really do have a data dependence (in both the inner and outer loops), but it is a loop-carried dependence with a distance of 2 (each element reads values that were written two iterations earlier). That is a blocker for using «omp parallel for» as is, but not necessarily a problem for «pragma omp simd» loops restricted to a safe length of 2.
To make this work you will need an x86 compiler that supports pragma simd either via OpenMP 4 or via Cilk Plus (a very recent gcc or the Intel compiler).

How to hint OpenMP Stride?

I am trying to understand the conceptual reason why OpenMP breaks loop vectorization. Also any suggestions for fixing this would be helpful. I am considering manually parallelizing this to fix this issue, but that would certainly not be elegant and result in a massive amount of code bloat, as my code consists of several such sections that lend themselves to vectorization and parallelization.
I am using
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.60315.1 for x64
With OpenMP:
info C5002: loop not vectorized due to reason '502'
Without OpenMP:
info C5001: loop vectorized
The VS vectorization page says this error happens when:
Induction variable is stepped in some manner other than a simple +1
Can I force it to step in stride 1?
The loop
#pragma omp parallel for
for (int j = 0; j < H*W; j++)//A,B,C,D,IN are __restricted
{
    float Gs = D[j]-B[j];
    float Gc = A[j]-C[j];
    in[j] = atan2f(Gs,Gc);
}
Best Effort(?)
#pragma omp parallel
{// This seems to vectorize, but it still requires quite a lot of boilerplate code
    int middle = H*W/2;
    #pragma omp sections nowait
    {
        #pragma omp section
        for (int j = 0; j < middle; j++)
        {
            float Gs = D[j]-B[j];
            float Gc = A[j]-C[j];
            in[j] = atan2f(Gs,Gc);
        }
        #pragma omp section
        for (int j = middle; j < H*W; j++)
        {
            float Gs = D[j]-B[j];
            float Gc = A[j]-C[j];
            in[j] = atan2f(Gs,Gc);
        }
    }
}
I recommend that you do the vectorization manually. One reason is that auto-vectorization does not seem to handle carried loop dependencies well (loop unrolling).
To avoid code bloat and arcane intrinsics I use Agner Fog's vectorclass. In my experience it's just as fast as using intrinsics and it automatically takes advantage of SSE2-AVX2 (AVX2 is tested on a Intel emulator) depending on how you compile. I have written GEMM code using the vectorclass that works on SSE2 up to AVX2 and when I run on a system with AVX my code is already faster than Eigen which only uses SSE. Here is your function with the vectorclass (I did not try unrolling the loop).
#include "omp.h"
#include "math.h"
#include "vectorclass.h"
#include "vectormath.h"
void loop(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float* in) {
    #pragma omp parallel for
    for (int j = 0; j < H*W; j+=8)//A,B,C,D,IN are __restricted, W*H must be a multiple of 8
    {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan(Gs, Gc);
        invec.store(&in[j]);
    }
}
When doing the vectorization yourself you have to be careful with array bounds. In the function above H*W needs to be a multiple of 8. There are several solutions for that, but the easiest and most efficient one is to make the arrays (A,B,C,D,in) a bit larger (at most 7 floats larger) if necessary, so that their length is a multiple of 8. Another solution is to use the following code, which does not require H*W to be a multiple of 8 but is not as pretty.
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
void loop_fix(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float* in) {
    #pragma omp parallel for
    for (int j = 0; j < ROUND_DOWN(H*W,8); j+=8)//A,B,C,D,IN are __restricted
    {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan(Gs, Gc);
        invec.store(&in[j]);
    }
    // scalar loop for the remaining (at most 7) elements
    for(int j=ROUND_DOWN(H*W,8); j<H*W; j++) {
        float Gs = D[j]-B[j];
        float Gc = A[j]-C[j];
        in[j] = atan2f(Gs,Gc);
    }
}
One challenge with doing the vectorization yourself is finding a SIMD math library (e.g. for atan2f). The vectorclass supports 3 options. Non-SIMD, LIBM by AMD, and SVML by Intel (I used the non-SIMD option in the code above).
SIMD math libraries for SSE and AVX
Some last comments you might want to consider. Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in release mode). You can try this instead of OpenMP to reduce code bloat.
Additionally, Microsoft has the parallel patterns library. It's worth looking into since Microsoft's OpenMP support is limited. It's nearly as easy as OpenMP to use. It's possible that one of these options works better with auto-vectorization (though I doubt it). Like I said, I would do the vectorization manually with the vectorclass.
You may try loop unrolling instead of sections:
#pragma omp parallel for
for (int j = 0; j < H*W; j += outer_stride)//A,B,C,D,IN are __restricted
{
    for (int ii = 0; ii < outer_stride; ii++) {
        float Gs = D[j+ii]-B[j+ii];
        float Gc = A[j+ii]-C[j+ii];
        in[j+ii] = atan2f(Gs,Gc);
    }
}
where outer_stride is a suitable multiple of your SIMD line. Also, you may find this answer useful.

Matrix Multiplication optimization via matrix transpose

I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get 8x improvement. However, I am only getting 2x ... what might I be doing wrong?
Full Source on GitHub
void transpose(int size, matrix m) {
    int i, j;
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
            std::swap(m.element[i][j], m.element[j][i]);
}
void mm(matrix a, matrix b, matrix result) {
    int i, j, k;
    int size = a.size;
    long long before, after;
    before = wall_clock_time();
    // Do the multiplication
    transpose(size, b); // transpose the matrix to reduce cache miss
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++) {
            int tmp = 0; // save memory writes
            for(k = 0; k < size; k++)
                tmp += a.element[i][k] * b.element[j][k];
            result.element[i][j] = tmp;
        }
    after = wall_clock_time();
    fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}
Am I doing things right so far?
FYI: The next optimization I need to do is use SIMD/Intel SSE3
Am I doing things right so far?
No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations, it is always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.
Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls mm(a, a, a_squared) to get the square of matrix a. You will get burned even worse if someone expects mm(a, a, a) to do an in-place computation of a^2.
Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.
Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.
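One possible shape for such a contiguous representation, as a minimal sketch. The class name Matrix and its interface are hypothetical, not taken from the assignment code. A std::vector owns the storage, so the implicitly generated copy operations make deep copies, which also avoids the naked-pointer problem described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix: one contiguous row-major buffer, indexed as (r, c).
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    double operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;  // owns the storage; copies are deep
};
```

Because rows and columns are stored separately, the same class also handles non-square matrices.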
If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWO times, resulting in no transpose at all. The j-loop should not read
j=0; j<size; j++
but
j=0; j<i; j++
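A self-contained sketch of the corrected loop bounds, using a vector of rows rather than the matrix struct from the question (the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// In-place transpose of a square matrix stored as rows of equal length.
// Only the strictly lower triangle is visited (j < i), so each pair
// (i, j) / (j, i) is swapped exactly once.
void transpose_in_place(std::vector<std::vector<int>>& m) {
    const std::size_t size = m.size();
    for (std::size_t i = 0; i < size; ++i) {
        assert(m[i].size() == size);  // must be square
        for (std::size_t j = 0; j < i; ++j) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}
```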
Transposing is not necessary to avoid processing the elements of one of the factor-matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance-tuning, the basic loop-structure should be:
for (int i=0; i<size; i++)
    for (int k=0; k<size; k++) {
        double tmp = a[i][k];
        for (int j=0; j<size; j++)
            result[i][j] += tmp * b[k][j];
    }
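A runnable version of that loop structure, as a sketch that assumes square matrices stored in contiguous row-major vectors (the function name is hypothetical). Note that the result has to start at zero, because the loop accumulates into it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// result = a * b for n x n row-major matrices, using i-k-j loop order so the
// innermost loop walks both b and result along contiguous rows.
std::vector<double> multiply_ikj(const std::vector<double>& a,
                                 const std::vector<double>& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> result(n * n, 0.0);  // must start at zero: we accumulate
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double tmp = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                result[i * n + j] += tmp * b[k * n + j];
            }
        }
    }
    return result;
}
```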

Fastest way to calculate minimum euclidean distance between two matrices containing high dimensional vectors

I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. Matrix a is 2782x128 and Matrix b is 4000x128, both unsigned char values. The values are stored in a single array. For each vector in a, I need the index of the vector in b with the closest euclidean distance.
Ok, now my code to achieve this:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <iostream>
#include <fstream>
#include "main.h"
using namespace std;
int main(int argc, char* argv[])
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size,"matrixa");
    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size,"matrixb");
    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );
    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);
    QueryPerformanceCounter( &liEnd );
    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl;
    if (a)
        delete[] a;
    if (b)
        delete[] b;
    if (indexes)
        free(indexes); // allocated with malloc in min_distance_loop
    return 0;
}
void read_matrix(unsigned char** matrix, int& matrix_size, char* matrixPath)
{
    ofstream myfile;
    float f;
    FILE * pFile;
    pFile = fopen (matrixPath,"r");
    fscanf (pFile, "%d", &matrix_size);
    *matrix = new unsigned char[matrix_size*128];
    for (int i=0; i<matrix_size*128; ++i)
    {
        unsigned int matPtr;
        fscanf (pFile, "%u", &matPtr);
        (*matrix)[i]=(unsigned char)matPtr;
    }
    fclose (pFile);
}
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex=0;
    int vocIndex=0;
    int min_distance;
    int distance;
    int multiply;
    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = LONG_MAX;
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++-*vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance>min_distance)
                    break;
            }
            // if distance smaller
            if (distance<min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex+=descrSize;
        }
        dataIndex+=descrSize;
        vocIndex=0;
    }
}
And attached are the files with sample matrices.
I am using windows.h just to calculate the consuming time, so if you want to test the code in another platform than windows, just change windows.h header and change the way of calculating the consuming time.
This code in my computer is about 0.5 seconds. The problem is that I have another code in Matlab that makes this same thing in 0.05 seconds. In my experiments, I am receiving several matrices like matrix a every second, so 0.5 seconds is too much.
Now the matlab code to calculate this:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);
OK. The Matlab code uses the identity (a-b)^2 = a^2 + b^2 - 2ab.
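For vectors this identity reads ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. A small sketch that checks it numerically (the function names are hypothetical and not part of the code in this post):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared Euclidean distance via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
long long dist2_expanded(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long long aa = 0, bb = 0, ab = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        aa += 1LL * a[k] * a[k];
        bb += 1LL * b[k] * b[k];
        ab += 1LL * a[k] * b[k];
    }
    return aa + bb - 2 * ab;
}

// Direct form, for comparison: sum of squared differences.
long long dist2_direct(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long long sum = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const long long d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}
```

The point of the expanded form is that the squared norms can be computed once per row, so the remaining work for all pairs is the single matrix product a*b'.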
So my next attempt was to do the same thing. I rewrote my own code to make the same calculations, but it took approximately 1.2 seconds.
Then, I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128;
MatrixXi a(a_size, descrSize);
MatrixXi b(b_size, descrSize);
MatrixXi ab(a_size, b_size);
unsigned char* dataPtr = matrixa;
for (int i=0; i<nframes; ++i)
    for (int j=0; j<descrSize; ++j)
        a(i,j)=(int)*dataPtr++;
unsigned char* vocPtr = matrixb;
for (int i=0; i<vocabulary_size; ++i)
    for (int j=0; j<descrSize; ++j)
        b(i,j)=(int)*vocPtr ++;
ab = a*b.transpose();
MatrixXi aa = a.rowwise().sum();
MatrixXi bb = b.rowwise().sum();
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs2();
int* index = NULL;
index = (int*)malloc(nframes*sizeof(int));
for (int i=0; i<nframes; ++i)
This Eigen code takes approximately 1.2 seconds for just the line ab = a*b.transpose();
Similar code using OpenCV was also tried, and there the cost of ab = a*b.transpose(); was 0.65 seconds.
So, it is really annoying that Matlab is able to do this same thing so quickly and I am not able to in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what is really annoying me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome: any external library (free if possible), loop unrolling, templates, SSE instructions (I know they exist), cache optimizations. As I said, my main purpose is to increase my knowledge so that I can code things like this with better performance.
Thanks in advance
EDIT: more code suggested by David Hammen. I casted the arrays to int before making any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    int* a_int;
    int* b_int;
    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );
    a_int = (int*)malloc(a_size*descrSize*sizeof(int));
    b_int = (int*)malloc(b_size*descrSize*sizeof(int));
    for(int i=0; i<descrSize*a_size; ++i)
        a_int[i]=(int)a[i];
    for(int i=0; i<descrSize*b_size; ++i)
        b_int[i]=(int)b[i];
    QueryPerformanceCounter( &liEnd );
    cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl;
    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex=0;
    int vocIndex=0;
    int min_distance;
    int distance;
    int multiply;
    /*unsigned char* dataPtr;
    unsigned char* vocPtr;*/
    int* dataPtr;
    int* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = LONG_MAX;
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a_int[dataIndex];
            vocPtr = &b_int[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++-*vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance>min_distance)
                    break;
            }
            // if distance smaller
            if (distance<min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex+=descrSize;
        }
        dataIndex+=descrSize;
        vocIndex=0;
    }
}
The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?
EDIT2: Anything about Eigen? When I look for external libraries, people always talk about Eigen and its speed. Did I do something wrong? Here is simple code using Eigen that shows it is not so fast. Maybe I am missing some configuration or some flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X;
This code is about 0.9 seconds.
As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. This represents a rate of 57 GFLOPS, showing that it is using not only vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5-year-old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best out of Eigen:
compile in 64bits mode
use floats (doubles are twice as slow owing to vectorization)
enable OpenMP
if your CPU has hyper-threading, then either disable it or define the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)
if you have other task running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
use the most recent compiler that you can, GCC, clang and ICC are best, MSVC is usually slower.
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.
Another problem might be your use of int versus long. I don't work on Windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 2^23.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a and b matrix into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into one of the two L1 caches of a modern Intel chip. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break; back in, but do so at a chunk level rather than at the byte by byte level.
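A sketch of what such blocking could look like, under assumptions: the function name and the block_rows parameter are hypothetical, the early-exit test is left out entirely, and the distance accumulator is a plain int, which is enough for 128-dimensional unsigned char vectors as discussed above. Whether this is actually faster than the original loop has to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// For each row of a, find the index of the closest row of b (squared Euclidean
// distance), visiting b in blocks of block_rows rows so that one block can
// stay in cache while every row of a is compared against it.
std::vector<int> nearest_blocked(const std::vector<unsigned char>& a,
                                 const std::vector<unsigned char>& b,
                                 std::size_t dim, std::size_t block_rows) {
    assert(dim > 0 && block_rows > 0);
    assert(a.size() % dim == 0 && b.size() % dim == 0);
    const std::size_t a_rows = a.size() / dim, b_rows = b.size() / dim;
    std::vector<int> best_index(a_rows, -1);
    std::vector<int> best_dist(a_rows, INT_MAX);
    for (std::size_t j0 = 0; j0 < b_rows; j0 += block_rows) {  // one block of b
        const std::size_t j1 = std::min(j0 + block_rows, b_rows);
        for (std::size_t i = 0; i < a_rows; ++i) {             // all rows of a
            for (std::size_t j = j0; j < j1; ++j) {
                int dist = 0;                                  // no branch inside
                for (std::size_t k = 0; k < dim; ++k) {
                    const int d = int(a[i * dim + k]) - int(b[j * dim + k]);
                    dist += d * d;
                }
                if (dist < best_dist[i]) {
                    best_dist[i] = dist;
                    best_index[i] = static_cast<int>(j);
                }
            }
        }
    }
    return best_index;
}
```

For the sizes in the question this would be called with dim = 128 and a block_rows value of 250 or so.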
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, you are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.
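A minimal sketch of such a loop, assuming both matrices are stored row-major with one vector per row, as in the question. The function name is hypothetical and int is used only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ab(i, j) = dot(row i of a, row j of b), which equals a * b^T without ever
// materialising b^T. a is a_rows x dim and b is b_rows x dim, both row-major.
std::vector<int> multiply_rows(const std::vector<int>& a,
                               const std::vector<int>& b, std::size_t dim) {
    assert(dim > 0 && a.size() % dim == 0 && b.size() % dim == 0);
    const std::size_t a_rows = a.size() / dim, b_rows = b.size() / dim;
    std::vector<int> ab(a_rows * b_rows);
    for (std::size_t i = 0; i < a_rows; ++i) {
        for (std::size_t j = 0; j < b_rows; ++j) {
            int dot = 0;
            for (std::size_t k = 0; k < dim; ++k) {
                dot += a[i * dim + k] * b[j * dim + k];  // both reads contiguous in k
            }
            ab[i * b_rows + j] = dot;
        }
    }
    return ab;
}
```

The innermost loop reads both operands sequentially, which is the access pattern an auto-vectorizer handles best.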