Slow std::vector vs [] in C++ - Why?

I am a bit rusty with C++ - having used it 20 years ago. I am trying to understand why std::vector is so much slower than native arrays in the following code. Can anyone explain it to me? I would much prefer using the standard libraries but not at the cost of this performance penalty:
#include <chrono>
#include <iostream>
#include <vector>

const int grid_e_rows = 50;
const int grid_e_cols = 50;

int H(std::vector<std::vector<int>> &sigma) {
    int h = 0;
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            h += 1 * sigma[r][c] * sigma[r][c2] + 1 * sigma[r][c] * sigma[r2][c];
        }
    }
    return -h;
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::vector<int>> sigma_a(grid_e_rows, std::vector<int>(grid_e_cols));
    for (int i=0;i<600000;i++)
        H(sigma_a);
    auto end = std::chrono::steady_clock::now();
    std::cout << "Calculation completed in " << std::chrono::duration_cast<std::chrono::seconds>(end - start).count()
              << " seconds";
    return 0;
}
Output is:
Calculation completed in 23 seconds
#include <chrono>
#include <iostream>

const int grid_e_rows = 50;
const int grid_e_cols = 50;
typedef int (*Sigma)[grid_e_rows][grid_e_cols];

int H(Sigma sigma) {
    int h = 0;
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            h += 1 * (*sigma)[r][c] * (*sigma)[r][c2] + 1 * (*sigma)[r][c] * (*sigma)[r2][c];
        }
    }
    return -h;
}

int main() {
    auto start = std::chrono::steady_clock::now();
    int sigma_a[grid_e_rows][grid_e_cols];
    for (int i=0;i<600000;i++)
        H(&sigma_a);
    auto end = std::chrono::steady_clock::now();
    std::cout << "Calculation completed in " << std::chrono::duration_cast<std::chrono::seconds>(end - start).count()
              << " seconds";
    return 0;
}
Output is:
Calculation completed in 6 seconds
Any help would be appreciated.

First, you're timing the initialization. For the array case, there is none (the array is completely uninitialized). In the vector case, the vector is initialized to zero and then copied into each row.
But the primary reason is cache locality. The array case is a single block of 50*50 integers which are all contiguous in memory, and they can trivially fit in L1D cache. In the vector case, each row is allocated dynamically, which means their addresses are almost certainly not contiguous and are instead spread all over the program's address space. Accessing one does not pull the adjacent rows into the cache.
Also, because the rows are relatively small, cache space is wasted on adjacent unrelated data, meaning even after you've touched everything to pull it into memory it may not fit in L1 anymore. And lastly, the access pattern is a lot less linear, and it may be beyond the capability of a hardware prefetcher to predict.
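One way to keep std::vector and still get the array's single contiguous block is to store the grid in one flat vector and compute the index by hand. The following is only a sketch under that assumption; the name energy_flat is made up for illustration and the row-by-row layout is my choice, not something from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int grid_e_rows = 50;
constexpr int grid_e_cols = 50;

// Same sum as H(), but sigma is one contiguous block of rows*cols ints
// stored row by row (hypothetical helper, not from the question).
int energy_flat(const std::vector<int>& sigma) {
    assert(sigma.size() == static_cast<std::size_t>(grid_e_rows) * grid_e_cols);
    int h = 0;
    for (int r = 0; r < grid_e_rows; ++r) {
        const int r2 = (r + 1) % grid_e_rows;
        const int* row = sigma.data() + r * grid_e_cols;        // current row
        const int* next_row = sigma.data() + r2 * grid_e_cols;  // periodic neighbour row
        for (int c = 0; c < grid_e_cols; ++c) {
            const int c2 = (c + 1) % grid_e_cols;
            h += row[c] * row[c2] + row[c] * next_row[c];
        }
    }
    return -h;
}
```

Called as energy_flat(std::vector<int>(grid_e_rows * grid_e_cols)), this touches 2500 adjacent ints instead of 50 separately allocated rows. Whether it closes the whole gap depends on the compiler flags, so measure it rather than take my word for it.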

You are not compiling with optimizations.
With vector of vector
With array
To give you a small taste of what the optimizer might be doing for you, consider the following modification to your H() function for the vector of vector case.
int H(std::vector<std::vector<int>> &arg) {
    int h = 0;
    auto sigma = arg.data();
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        auto sr = sigma[r].data();
        auto sr2 = sigma[r2].data();
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            h += 1 * sr[c] * sr[c2] + 1 * sr[c] * sr2[c];
        }
    }
    return -h;
}
You will find that without optimizations, this version will run closer to the performance of your array version.


OpenCV - Accessing Mat data using for loop

I'm trying to create a convolution function but I'm having trouble during the access to the kernel data (cv::Mat).
I create the 3x3 kernel:
cv::Mat krn(3, 3, CV_32FC1);
krn = krn/9;
And I try to loop over it. Next the image Mat will be the image to which I want to apply the convolution operator and output will be the result of convolution:
for (int r = 0; r < image.rows - krn.rows; ++r) {
    for (int c = 0; c < image.cols - krn.cols; ++c) {
        int sum = 0;
        for (int rs = 0; rs < krn.rows; ++rs) {
            for (int cs = 0; cs < krn.cols; ++cs) {
                sum += krn.data[rs * krn.cols + cs] * image.data[(r + rs) * image.cols + c + cs];
            }
        }
        output.data[(r+1)*src.cols + c + 1] = sum; // assuming 3x3 kernel
    }
}
However, the output is not as desired (only random black and white pixels).
However, if I change my code this way:
for (int r = 0; r < image.rows - krn.rows; ++r) {
    for (int c = 0; c < image.cols - krn.cols; ++c) {
        int sum = 0;
        for (int rs = 0; rs < krn.rows; ++rs) {
            for (int cs = 0; cs < krn.cols; ++cs) {
                sum += 0.11 * image.data[(r + rs) * image.cols + c + cs]; // CHANGE HERE
            }
        }
        output.data[(r+1)*src.cols + c + 1] = sum; // assuming 3x3 kernel
    }
}
Using 0.11 instead of the kernel values seems to give the correct output.
For this reason I think I'm doing something wrong accessing the kernel's data.
P.S: I cannot use krn.at<float>(rs,cs).
Instead of needlessly using memcpy, you can just cast the pointer. I'll use a C-style cast because why not.
cv::Mat krn = 1 / (cv::Mat_<float>(3,3) <<
1, 2, 3,
4, 5, 6,
7, 8, 9);
for (int i = 0; i < krn.rows; i += 1) {
    for (int j = 0; j < krn.cols; j += 1) {
        // to see clearly what's happening
        uint8_t *byteptr = krn.data + krn.step[0] * i + krn.step[1] * j;
        float *floatptr = (float*) byteptr;
        // or in one step:
        // float *floatptr = (float*) (krn.data + krn.step[0] * i + krn.step[1] * j);
        cout << "krn.at<float>(" << i << "," << j << ") = " << (*floatptr) << endl;
    }
}
krn.at<float>(0,0) = 1
krn.at<float>(0,1) = 0.5
krn.at<float>(0,2) = 0.333333
krn.at<float>(1,0) = 0.25
krn.at<float>(1,1) = 0.2
krn.at<float>(1,2) = 0.166667
krn.at<float>(2,0) = 0.142857
krn.at<float>(2,1) = 0.125
krn.at<float>(2,2) = 0.111111
Note that pointer arithmetic may not be obvious. If you have a uint8_t*, adding 1 moves it by one uint8_t (one byte), and if you have a float*, adding 1 moves it by one float, which is four bytes. The step[] array contains offsets expressed in bytes.
Consult the documentation for details, which include information on the step[] array that contains the strides/steps to calculate the offset given a tuple of indices into the matrix.
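To make the byte-versus-element distinction concrete without pulling in OpenCV, here is a small sketch on a plain float array. The helper name element_ptr and the parameters step0/step1 are mine; they only mimic the role of step[0] and step[1] described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Address of element (i, j) in a row-major float matrix.
// step0 = bytes per row (padding included), step1 = bytes per element.
// Hypothetical helper that mirrors the byte arithmetic shown above.
const float* element_ptr(const std::uint8_t* data, std::size_t step0,
                         std::size_t step1, int i, int j) {
    assert(i >= 0 && j >= 0);
    // data is a byte pointer, so this addition moves in bytes, not floats
    const std::uint8_t* byteptr = data + step0 * i + step1 * j;
    return reinterpret_cast<const float*>(byteptr);
}
```

With a float m[2][4] buffer viewed as bytes, element_ptr(bytes, sizeof(m[0]), sizeof(float), 1, 2) points at m[1][2]. Doing the same arithmetic on a float* would overshoot by a factor of four.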
cv::Mat::data is a pointer of type uchar*.
With data[y * cols + x] you access a single byte of the float values stored in krn. To get the full float values, use the at method template: krn.at<float>(rs,cs)
Consider changing the type of the sum variable to a floating-point type. Otherwise you lose the fractional part of the partial results when calculating the convolution.
So, if you cannot use at, just read 4 bytes from the data pointer:
float v = 0.0;
memcpy(&v, krn.data + (rs * krn.step + cs * sizeof(float)), 4);
step is the total number of bytes occupied by one row of the Mat.
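The remark about the type of sum is easy to check in isolation. A minimal sketch, with made-up function names and a plain float array standing in for the pixel data, comparing an int accumulator with a float one:

```cpp
#include <cassert>

// Mean of n samples accumulated in an int: the running total is
// converted back to int after every addition, dropping the fraction.
int mean_int_accumulator(const float* px, int n) {
    assert(n > 0);
    int sum = 0;
    for (int k = 0; k < n; ++k)
        sum += px[k] / n;  // fraction lost at each step
    return sum;
}

// Same loop with a float accumulator: the fractions are kept.
float mean_float_accumulator(const float* px, int n) {
    assert(n > 0);
    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += px[k] / n;
    return sum;
}
```

For nine samples that are all 10, the int version ends at 9 while the float version ends at roughly 10. In a 3x3 box filter that is the kind of systematic loss the int sum introduces.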

How is numpy so fast?

I'm trying to understand how numpy can be so fast, based on my shocking comparison with optimized C/C++ code which is still far from reproducing numpy's speed.
Consider the following example:
Given a 2D array with shape=(N, N) and dtype=float32, which represents a list of N vectors of N dimensions, I am computing the pairwise differences between every pair of vectors. Using numpy broadcasting, this simply writes as:
def pairwise_sub_numpy( X ):
    return X - X[:, None, :]
Using timeit I can measure the performance for N=512: it takes 88 ms per call on my laptop.
Now, in C/C++ a naive implementation writes as:
#define X(i, j) _X[(i)*N + (j)]
#define res(i, j, k) _res[((i)*N + (j))*N + (k)]
float* pairwise_sub_naive( const float* _X, int N )
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++)
                res(i,j,k) = X(i,k) - X(j,k);
        }
    }
    return _res;
}
Compiling using gcc 7.3.0 with -O3 flag, I get 195 ms per call for pairwise_sub_naive(X), which is not too bad given the simplicity of the code, but about 2 times slower than numpy.
Now I start getting serious and add some small optimizations, by indexing the row vectors directly:
float* pairwise_sub_better( const float* _X, int N )
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
    for (int i = 0; i < N; i++) {
        const float* xi = & X(i,0);
        for (int j = 0; j < N; j++) {
            const float* xj = & X(j,0);
            float* r = &res(i,j,0);
            for (int k = 0; k < N; k++)
                r[k] = xi[k] - xj[k];
        }
    }
    return _res;
}
The speed stays the same at 195 ms, which means that the compiler was able to figure that much. Let's now use SIMD vector instructions:
float* pairwise_sub_simd( const float* _X, int N )
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
    // create caches for row vectors which are memory-aligned
    float* xi = (float*)aligned_alloc(32, N * sizeof(float));
    float* xj = (float*)aligned_alloc(32, N * sizeof(float));
    for (int i = 0; i < N; i++) {
        memcpy(xi, & X(i,0), N*sizeof(float));
        for (int j = 0; j < N; j++) {
            memcpy(xj, & X(j,0), N*sizeof(float));
            float* r = &res(i,j,0);
            for (int k = 0; k < N; k += 256/sizeof(float)) {
                const __m256 A = _mm256_load_ps(xi+k);
                const __m256 B = _mm256_load_ps(xj+k);
                _mm256_store_ps(r+k, _mm256_sub_ps( A, B ));
            }
        }
    }
    return _res;
}
This only yields a small boost (178 ms instead of 194 ms per function call).
Then I was wondering if a "block-wise" approach, like what is used to optimize dot-products, could be beneficial:
float* pairwise_sub_blocks( const float* _X, int N )
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
    #define B 8
    float cache1[B*B], cache2[B*B];
    for (int bi = 0; bi < N; bi+=B)
     for (int bj = 0; bj < N; bj+=B)
      for (int bk = 0; bk < N; bk+=B) {
        // load first 8x8 block in the cache
        for (int i = 0; i < B; i++)
         for (int k = 0; k < B; k++)
          cache1[B*i + k] = X(bi+i, bk+k);
        // load second 8x8 block in the cache
        for (int j = 0; j < B; j++)
         for (int k = 0; k < B; k++)
          cache2[B*j + k] = X(bj+j, bk+k);
        // compute local operations on the caches
        for (int i = 0; i < B; i++)
         for (int j = 0; j < B; j++)
          for (int k = 0; k < B; k++)
           res(bi+i,bj+j,bk+k) = cache1[B*i + k] - cache2[B*j + k];
      }
    return _res;
}
And surprisingly, this is the slowest method so far (258 ms per function call).
To summarize, despite some efforts with some optimized C++ code, I can't come anywhere close to the 88 ms / call that numpy achieves effortlessly. Any idea why?
Note: By the way, I am disabling numpy multi-threading and anyway, this kind of operation is not multi-threaded.
Edit: Exact code to benchmark the numpy code:
import numpy as np
def pairwise_sub_numpy( X ):
    return X - X[:, None, :]
N = 512
X = np.random.rand(N,N).astype(np.float32)
import timeit
times = timeit.repeat('pairwise_sub_numpy( X )', globals=globals(), number=1, repeat=5)
print(f">> best of 5 = {1000*min(times):.3f} ms")
Full benchmark for C code:
#include <stdio.h>
#include <string.h>
#include <xmmintrin.h> // compile with -mavx -msse4.1
#include <pmmintrin.h>
#include <immintrin.h>
#include <time.h>
#define X(i, j) _x[(i)*N + (j)]
#define res(i, j, k) _res[((i)*N + (j))*N + (k)]
float* pairwise_sub_naive( const float* _x, int N )
float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++)
res(i,j,k) = X(i,k) - X(j,k);
return _res;
float* pairwise_sub_better( const float* _x, int N )
float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
for (int i = 0; i < N; i++) {
const float* xi = & X(i,0);
for (int j = 0; j < N; j++) {
const float* xj = & X(j,0);
float* r = &res(i,j,0);
for (int k = 0; k < N; k++)
r[k] = xi[k] - xj[k];
return _res;
float* pairwise_sub_simd( const float* _x, int N )
float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
// create caches for row vectors which are memory-aligned
float* xi = (float*)aligned_alloc(32, N * sizeof(float));
float* xj = (float*)aligned_alloc(32, N * sizeof(float));
for (int i = 0; i < N; i++) {
memcpy(xi, & X(i,0), N*sizeof(float));
for (int j = 0; j < N; j++) {
memcpy(xj, & X(j,0), N*sizeof(float));
float* r = &res(i,j,0);
for (int k = 0; k < N; k += 256/sizeof(float)) {
const __m256 A = _mm256_load_ps(xi+k);
const __m256 B = _mm256_load_ps(xj+k);
_mm256_store_ps(r+k, _mm256_sub_ps( A, B ));
return _res;
float* pairwise_sub_blocks( const float* _x, int N )
float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
#define B 8
float cache1[B*B], cache2[B*B];
for (int bi = 0; bi < N; bi+=B)
for (int bj = 0; bj < N; bj+=B)
for (int bk = 0; bk < N; bk+=B) {
// load first 8x8 block in the cache
for (int i = 0; i < B; i++)
for (int k = 0; k < B; k++)
cache1[B*i + k] = X(bi+i, bk+k);
// load second 8x8 block in the cache
for (int j = 0; j < B; j++)
for (int k = 0; k < B; k++)
cache2[B*j + k] = X(bj+j, bk+k);
// compute local operations on the caches
for (int i = 0; i < B; i++)
for (int j = 0; j < B; j++)
for (int k = 0; k < B; k++)
res(bi+i,bj+j,bk+k) = cache1[B*i + k] - cache2[B*j + k];
return _res;
int main()
const int N = 512;
float* _x = (float*) malloc( N * N * sizeof(float) );
for( int i = 0; i < N; i++)
for( int j = 0; j < N; j++)
X(i,j) = ((i+j*j+17*i+101) % N) / float(N);
double best = 9e9;
for( int i = 0; i < 5; i++)
struct timespec start, stop;
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
//float* res = pairwise_sub_naive( _x, N );
//float* res = pairwise_sub_better( _x, N );
//float* res = pairwise_sub_simd( _x, N );
float* res = pairwise_sub_blocks( _x, N );
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop);
double t = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_nsec - start.tv_nsec) / 1e3; // in microseconds
if (t < best) best = t;
free( res );
printf("Best of 5 = %f ms\n", best / 1000);
free( _x );
return 0;
Compiled using gcc 7.3.0 gcc -Wall -O3 -mavx -msse4.1 -o test_simd test_simd.c
Summary of timings on my machine:
Implementation            Time per call
numpy                     88 ms
C++ naive                 194 ms
C++ better                195 ms
C++ SIMD                  178 ms
C++ blocked               258 ms
C++ blocked (gcc 8.3.1)   217 ms
As pointed out in some of the comments, numpy uses SIMD in its implementation and it does not allocate memory at the point of computation. If I eliminate the memory allocation from your implementation by pre-allocating all the buffers ahead of the computation, I get a better time than numpy even with the scalar version (that is, the one without any optimizations).
As for SIMD, your implementation does not perform much better than the scalar one because your memory access patterns are not ideal for SIMD usage: you do a memcpy and you load into SIMD registers from locations that are far apart from each other (e.g. you fill vectors from line 0 and line 511), which might not play well with the cache or with the prefetcher.
There is also a mistake in how you load the SIMD registers (if I understood correctly what you're trying to compute): a 256-bit SIMD register holds 8 single-precision floating-point numbers (8 * 32 = 256), but in your loop you advance k by "256/sizeof(float)", which is 256/4 = 64. _x and _res are float pointers and the SIMD intrinsics also expect float pointers as arguments, so instead of processing every group of 8 floats in those lines you only process one group of 8 out of every 64 floats.
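The stride arithmetic can be checked without any intrinsics. Below is a scalar stand-in (the names kLanes and subtract_in_chunks are hypothetical) that handles 8 floats per iteration the way one 256-bit load/sub/store would, and reports how many outputs it actually wrote for a given stride. It assumes a 4-byte float.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kRegisterBits = 256;
// Floats per 256-bit register: 256 / 32 = 8 when float is 4 bytes.
constexpr std::size_t kLanes = kRegisterBits / (8 * sizeof(float));

// Scalar stand-in for a SIMD loop: each iteration handles kLanes floats,
// then advances by `stride` elements. Returns the number of outputs written.
std::size_t subtract_in_chunks(const std::vector<float>& a, const std::vector<float>& b,
                               std::vector<float>& out, std::size_t stride) {
    assert(a.size() == b.size() && a.size() == out.size());
    assert(a.size() % kLanes == 0);
    assert(stride > 0 && stride % kLanes == 0);
    std::size_t written = 0;
    for (std::size_t k = 0; k < a.size(); k += stride) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[k + lane] = a[k + lane] - b[k + lane];
            ++written;
        }
    }
    return written;
}
```

For rows of 512 floats, a stride of kLanes writes all 512 results, while a stride of 256/sizeof(float) writes only 64 of them and leaves the rest of the row untouched.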
The computation can be optimized further by changing the access patterns but also by observing that you repeat some computations: e.g. when iterating with line0 as a base you compute line0 - line1 but at some future time, when iterating with line1 as a base, you need to compute line1 - line0 which is basically -(line0 - line1), that is for each line after line0 a lot of results could be reused from previous computations.
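A rough sketch of that reuse idea, using std::vector for brevity and a made-up function name. It computes each difference only for j > i and writes the mirrored entry as the negation. I have not benchmarked it: the number of stores is unchanged, so on a memory-bound kernel like this the gain may well be small.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// out[(i*n + j)*n + k] = x[i*n + k] - x[j*n + k].
// Only pairs with j > i are computed; the (j, i) entry is the negation.
std::vector<float> pairwise_sub_antisymmetric(const std::vector<float>& x, std::size_t n) {
    assert(x.size() == n * n);
    std::vector<float> out(n * n * n, 0.0f);  // the i == j blocks stay zero
    for (std::size_t i = 0; i < n; ++i) {
        const float* xi = x.data() + i * n;
        for (std::size_t j = i + 1; j < n; ++j) {
            const float* xj = x.data() + j * n;
            float* upper = out.data() + (i * n + j) * n;  // row_i - row_j
            float* lower = out.data() + (j * n + i) * n;  // row_j - row_i
            for (std::size_t k = 0; k < n; ++k) {
                const float d = xi[k] - xj[k];
                upper[k] = d;
                lower[k] = -d;
            }
        }
    }
    return out;
}
```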
A lot of times SIMD usage or parallelization requires one to change how data is accessed or reasoned about in order to provide meaningful improvements.
Here is what I have done as a first step, based on your initial implementation, and it is faster than numpy (don't mind the OpenMP stuff, as it's not how it's supposed to be done; I just wanted to see how it behaves when trying the naive way).
Time scaler version: 55 ms
Time SIMD version: 53 ms
Time SIMD 2 version: 33 ms
Time SIMD 3 version: 168 ms
Time OpenMP version: 59 ms
Python numpy
>> best of 5 = 88.794 ms
#include <cstdlib>
#include <xmmintrin.h> // compile with -mavx -msse4.1
#include <pmmintrin.h>
#include <immintrin.h>
#include <numeric>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <cstring>
using namespace std;
float* pairwise_sub_naive (const float* input, float* output, int n)
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
for (int k = 0; k < n; k++)
output[(i * n + j) * n + k] = input[i * n + k] - input[j * n + k];
return output;
float* pairwise_sub_simd (const float* input, float* output, int n)
for (int i = 0; i < n; i++)
const int idxi = i * n;
for (int j = 0; j < n; j++)
const int idxj = j * n;
const int outidx = idxi + j;
for (int k = 0; k < n; k += 8)
__m256 A = _mm256_load_ps(input + idxi + k);
__m256 B = _mm256_load_ps(input + idxj + k);
_mm256_store_ps(output + outidx * n + k, _mm256_sub_ps( A, B ));
return output;
float* pairwise_sub_simd_2 (const float* input, float* output, int n)
float* line_buffer = (float*) aligned_alloc(32, n * sizeof(float));
for (int i = 0; i < n; i++)
const int idxi = i * n;
for (int j = 0; j < n; j++)
const int idxj = j * n;
const int outidx = idxi + j;
for (int k = 0; k < n; k += 8)
__m256 A = _mm256_load_ps(input + idxi + k);
__m256 B = _mm256_load_ps(input + idxj + k);
_mm256_store_ps(line_buffer + k, _mm256_sub_ps( A, B ));
memcpy(output + outidx * n, line_buffer, n);
return output;
float* pairwise_sub_simd_3 (const float* input, float* output, int n)
for (int i = 0; i < n; i++)
const int idxi = i * n;
for (int k = 0; k < n; k += 8)
__m256 A = _mm256_load_ps(input + idxi + k);
for (int j = 0; j < n; j++)
const int idxj = j * n;
const int outidx = (idxi + j) * n;
__m256 B = _mm256_load_ps(input + idxj + k);
_mm256_store_ps(output + outidx + k, _mm256_sub_ps( A, B ));
return output;
float* pairwise_sub_openmp (const float* input, float* output, int n)
int i, j;
#pragma omp parallel for private(j)
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
const int idxi = i * n;
const int idxj = j * n;
const int outidx = idxi + j;
for (int k = 0; k < n; k += 8)
__m256 A = _mm256_load_ps(input + idxi + k);
__m256 B = _mm256_load_ps(input + idxj + k);
_mm256_store_ps(output + outidx * n + k, _mm256_sub_ps( A, B ));
/*for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
for (int k = 0; k < n; k++)
output[(i * n + j) * n + k] = input[i * n + k] - input[j * n + k];
return output;
int main ()
constexpr size_t n = 512;
constexpr size_t input_size = n * n;
constexpr size_t output_size = n * n * n;
float* input = (float*) aligned_alloc(32, input_size * sizeof(float));
float* output = (float*) aligned_alloc(32, output_size * sizeof(float));
float* input_simd = (float*) aligned_alloc(32, input_size * sizeof(float));
float* output_simd = (float*) aligned_alloc(32, output_size * sizeof(float));
float* input_par = (float*) aligned_alloc(32, input_size * sizeof(float));
float* output_par = (float*) aligned_alloc(32, output_size * sizeof(float));
iota(input, input + input_size, float(0.0));
fill(output, output + output_size, float(0.0));
iota(input_simd, input_simd + input_size, float(0.0));
fill(output_simd, output_simd + output_size, float(0.0));
iota(input_par, input_par + input_size, float(0.0));
fill(output_par, output_par + output_size, float(0.0));
std::chrono::milliseconds best_scaler{100000};
for (int i = 0; i < 5; ++i)
auto start = chrono::high_resolution_clock::now();
pairwise_sub_naive(input, output, n);
auto stop = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
if (duration < best_scaler)
best_scaler = duration;
cout << "Time scaler version: " << best_scaler.count() << " ms\n";
std::chrono::milliseconds best_simd{100000};
for (int i = 0; i < 5; ++i)
auto start = chrono::high_resolution_clock::now();
pairwise_sub_simd(input_simd, output_simd, n);
auto stop = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
if (duration < best_simd)
best_simd = duration;
cout << "Time SIMD version: " << best_simd.count() << " ms\n";
std::chrono::milliseconds best_simd_2{100000};
for (int i = 0; i < 5; ++i)
auto start = chrono::high_resolution_clock::now();
pairwise_sub_simd_2(input_simd, output_simd, n);
auto stop = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
if (duration < best_simd_2)
best_simd_2 = duration;
cout << "Time SIMD 2 version: " << best_simd_2.count() << " ms\n";
std::chrono::milliseconds best_simd_3{100000};
for (int i = 0; i < 5; ++i)
auto start = chrono::high_resolution_clock::now();
pairwise_sub_simd_3(input_simd, output_simd, n);
auto stop = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
if (duration < best_simd_3)
best_simd_3 = duration;
cout << "Time SIMD 3 version: " << best_simd_3.count() << " ms\n";
std::chrono::milliseconds best_par{100000};
for (int i = 0; i < 5; ++i)
auto start = chrono::high_resolution_clock::now();
pairwise_sub_openmp(input_par, output_par, n);
auto stop = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
if (duration < best_par)
best_par = duration;
cout << "Time OpenMP version: " << best_par.count() << " ms\n";
cout << "Verification\n";
if (equal(output, output + output_size, output_simd))
cout << "PASSED\n";
cout << "FAILED\n";
return 0;
Edit: Small correction as there was a wrong call related to the second version of SIMD implementation.
As you can see now, the second implementation is the fastest, as it behaves best with respect to cache locality of reference. SIMD examples 2 and 3 are there to illustrate how changing memory access patterns influences the performance of your SIMD optimizations.
To summarize (knowing that my advice is far from complete): be mindful of your memory access patterns and of the loads and stores to/from the SIMD unit. The SIMD unit is a different hardware unit inside the processor's core, so there is a penalty in shuffling data back and forth; hence, when you load a register from memory, try to do as many operations as possible with that data and do not be too eager to store it back (of course, in your example that might be all you need to do with the data). Be mindful also that there is a limited number of SIMD registers available, and if you use too many they will "spill", that is, they will be stored back to temporary locations in main memory behind the scenes, killing all your gains. SIMD optimization is a true balancing act!
There is some effort to put a cross-platform intrinsics wrapper into the standard(I developed myself a closed source one in my glorious past) and even it's far from being complete, it's worth taking a look at(read the accompanying papers if you're truly interested to learn how SIMD works).
This is a complement to the answer posted by @celakev.
I think I finally got to understand what exactly was the issue. The issue was not about allocating the memory in the main function that does the computation.
What was actually taking time was accessing new (fresh) memory. I believe that the malloc call returns pages of memory which are virtual, i.e. that do not correspond to actual physical memory until they are explicitly accessed. What actually takes time is the process of allocating physical memory on the fly (which I think happens at the OS level) when it is accessed in the function code.
Here is a proof. Consider the two following trivial functions:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
float* just_alloc( size_t N )
{
    return (float*) aligned_alloc( 32, sizeof(float)*N );
}
void just_fill( float* _arr, size_t N )
{
    for (size_t i = 0; i < N; i++)
        _arr[i] = 1;
}
#define Time( code_to_benchmark, cleanup_code ) \
do { \
double best = 9e9; \
for( int i = 0; i < 5; i++) { \
struct timespec start, stop; \
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start); \
code_to_benchmark; \
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop); \
double t = (stop.tv_sec - start.tv_sec) * 1e3 + (stop.tv_nsec - start.tv_nsec) / 1e6; \
printf("Time[%d] = %f ms\n", i, t); \
if (t < best) best = t; \
cleanup_code; \
} \
printf("Best of 5 for '" #code_to_benchmark "' = %f ms\n\n", best); \
} while(0)
int main()
{
    const size_t N = 512;
    Time( float* arr = just_alloc(N*N*N), free(arr) );
    float* arr = just_alloc(N*N*N);
    Time( just_fill(arr, N*N*N), ; );
    return 0;
}
I get the following timings, which I now detail for each of the calls:
Time[0] = 0.000931 ms
Time[1] = 0.000540 ms
Time[2] = 0.000523 ms
Time[3] = 0.000524 ms
Time[4] = 0.000521 ms
Best of 5 for 'float* arr = just_alloc(N*N*N)' = 0.000521 ms
Time[0] = 189.822237 ms
Time[1] = 45.041083 ms
Time[2] = 46.331428 ms
Time[3] = 44.729433 ms
Time[4] = 42.241279 ms
Best of 5 for 'just_fill(arr, N*N*N)' = 42.241279 ms
As you can see, allocating memory is blazingly fast, but the first time that the memory is accessed, it is about 5 times slower than the other times. So, basically, the reason my code was slow was that I was reallocating fresh memory each time, and that memory had no physical pages behind it yet. (Correct me if I'm wrong, but I think that's the gist of it!)
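If that is right, one way to keep the first-touch cost out of a timed section is to write to every page of the buffer once before starting the clock. A minimal sketch, assuming 4096-byte pages (the real page size is system dependent) and using a made-up helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Write one float per assumed page so that the whole range is backed by
// physical memory before the timed code runs. Returns the number of writes.
// page_bytes = 4096 is an assumption, not something queried from the OS.
std::size_t pretouch(float* buf, std::size_t count, std::size_t page_bytes = 4096) {
    assert(page_bytes >= sizeof(float));
    const std::size_t step = page_bytes / sizeof(float);
    std::size_t touched = 0;
    for (std::size_t i = 0; i < count; i += step) {
        buf[i] = 0.0f;
        ++touched;
    }
    return touched;
}
```

The intended use is on a buffer that just came back from malloc or aligned_alloc, e.g. pretouch(arr, N*N*N) right after just_alloc. Allocating once and reusing the same buffer across calls has the same effect, because only the first pass pays for the page faults.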
A bit late to the party, but I wanted to add a pairwise method with Eigen, which is supposed to give C++ a high-level algebra manipulation capability and use SIMD under the hood. Just like numpy.
Here is the implementation
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <Eigen/Dense>
auto pairwise_eigen(const Eigen::MatrixXf &input, std::vector<Eigen::MatrixXf> &output) {
    for (int k = 0; k < input.cols(); ++k)
        output[k] = input
            // subtract matrix with repeated k-th column
            - input.col(k) * Eigen::RowVectorXf::Ones(input.cols());
}
int main() {
    constexpr size_t n = 512;
    // allocate input and output
    Eigen::MatrixXf input = Eigen::MatrixXf::Random(n, n);
    std::vector<Eigen::MatrixXf> output(n);
    std::chrono::milliseconds best_eigen{100000};
    for (int i = 0; i < 5; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        pairwise_eigen(input, output);
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end-start);
        if (duration < best_eigen)
            best_eigen = duration;
    }
    std::cout << "Time Eigen version: " << best_eigen.count() << " ms\n";
    return 0;
}
The full benchmark results for the tests suggested by @celavek, on my system, are
Time scaler version: 57 ms
Time SIMD version: 58 ms
Time SIMD 2 version: 40 ms
Time SIMD 3 version: 58 ms
Time OpenMP version: 58 ms
Time Eigen version: 76 ms
Numpy >> best of 5 = 118.489 ms
With Eigen there is still a noticeable improvement with respect to Numpy, but it is not as impressive as the "raw" implementations (there is certainly some overhead).
An extra optimization is to allocate the output vector with copies of the input and then subtract directly from each vector entry, simply replacing the following lines
// inside the pairwise method
for (int k = 0; k < input.cols(); ++k)
output[k] -= input.col(k) * Eigen::RowVectorXf::Ones(input.cols());
// at allocation time
std::vector<Eigen::MatrixXf> output(n, input);
This pushes the best of 5 down to 60 ms.

keep the signed value that has minimal absolute value in two matrix in OpenCV

In OpenCV, I have two matrices, One and Two, which are the same size. For each position, I want to find the signed value that has the minimal absolute value across both matrices and keep it in matrix One. For this, I use the following code:
for (int i = 0; i < One.rows; ++i) {
    p = One.ptr<float>(i);
    p_two = Two.ptr<float>(i);
    for (int j = 0; j < One.cols; ++j)
        if (fabs(p_two[j]) < fabs(p[j]))
            p[j] = p_two[j];
}
This code seems to be the bottleneck in my program. Does anyone know how to improve the performance? Thanks a lot!
Your code is not the bottleneck of your program. It's indeed very fast. You need to profile your code to see where the actual bottleneck is.
You can optimize it a little in case your matrices are continuous (which is very often the case in practice), like:
int rows = one.rows;
int cols = one.cols;
if (one.isContinuous() && two.isContinuous())
{
    cols = rows * cols;
    rows = 1;
}
for (int r = 0; r < rows; ++r)
{
    float* pone = one.ptr<float>(r);
    float* ptwo = two.ptr<float>(r);
    for (int c = 0; c < cols; ++c)
    {
        if (fabs(ptwo[c]) < fabs(pone[c]))
            pone[c] = ptwo[c];
    }
}
Here is a small evaluation, also against the good alternative method proposed by @s1h in the comments:
two.copyTo(one, abs(two) < abs(one));
Time (in ms)
Size: Yuanhao s1h Miki
[3 x 3] 0.000366543 0.117294 0.000366543
[10 x 10] 0.00109963 0.0157614 0.00109963
[100 x 100] 0.0964009 0.139653 0.112529
[1280 x 720] 8.70577 11.0267 8.65372
[1000 x 1000] 9.66538 13.5068 9.02026
[1920 x 1080] 16.5681 26.9706 15.7412
[4096 x 3112] 104.423 135.629 102.595
[5000 x 5000] 196.124 277.457 187.203
You can see that your method is very fast. Mine is a little bit faster. The one from @s1h is slower, but more concise and easier to read.
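Stripped of the OpenCV types, the continuous case boils down to one pass over two flat float buffers. A sketch with std::vector standing in for the matrix data (keep_min_abs is a made-up name):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// For each index, keep in `one` whichever of one[i] and two[i] has the
// smaller magnitude, sign included. Ties leave `one` unchanged.
void keep_min_abs(std::vector<float>& one, const std::vector<float>& two) {
    assert(one.size() == two.size());
    for (std::size_t i = 0; i < one.size(); ++i) {
        if (std::fabs(two[i]) < std::fabs(one[i]))
            one[i] = two[i];
    }
}
```

This is the same comparison as in the loops above, just without the per-row pointer setup, which is all the isContinuous() branch saves.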
You can evaluate the results on your PC with this:
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;
int main()
{
    vector<Size> sizes{ Size(3, 3), Size(10, 10), Size(100, 100), Size(1280, 720), Size(1000, 1000), Size(1920, 1080), Size(4096, 3112), Size(5000, 5000) };
    cout << "Size: \t\tYuanhao \ts1h \t\tMiki" << endl;
    for (int is = 0; is < sizes.size(); ++is)
    {
        Size sz = sizes[is];
        cout << sz << "\t";
        Mat1f img1(sz);
        randu(img1, Scalar(-100), Scalar(100));
        Mat1f img2(sz);
        randu(img2, Scalar(-100), Scalar(100));
        {
            // Yuanhao: row-by-row loop
            Mat1f one = img1.clone();
            Mat1f two = img2.clone();
            double tic = double(getTickCount());
            for (int r = 0; r < one.rows; ++r)
            {
                float* pone = one.ptr<float>(r);
                float* ptwo = two.ptr<float>(r);
                for (int c = 0; c < one.cols; ++c)
                {
                    if (fabs(ptwo[c]) < fabs(pone[c]))
                        pone[c] = ptwo[c];
                }
            }
            double toc = (double(getTickCount()) - tic) * 1000. / getTickFrequency();
            cout << toc << " \t";
        }
        {
            // s1h: masked copy
            Mat1f one = img1.clone();
            Mat1f two = img2.clone();
            double tic = double(getTickCount());
            two.copyTo(one, abs(two) < abs(one));
            double toc = (double(getTickCount()) - tic) * 1000. / getTickFrequency();
            cout << toc << " \t";
        }
        {
            // Miki: single flattened loop when both matrices are continuous
            Mat1f one = img1.clone();
            Mat1f two = img2.clone();
            double tic = double(getTickCount());
            int rows = one.rows;
            int cols = one.cols;
            if (one.isContinuous() && two.isContinuous())
            {
                cols = rows * cols;
                rows = 1;
            }
            for (int r = 0; r < rows; ++r)
            {
                float* pone = one.ptr<float>(r);
                float* ptwo = two.ptr<float>(r);
                for (int c = 0; c < cols; ++c)
                {
                    if (fabs(ptwo[c]) < fabs(pone[c]))
                        pone[c] = ptwo[c];
                }
            }
            double toc = (double(getTickCount()) - tic) * 1000. / getTickFrequency();
            cout << toc << " \t";
        }
        cout << endl;
    }
    return 0;
}

Seeking knowledge on array of arrays memory performance

Context: Multichannel real time digital audio processing.
Access pattern: "Column-major", like so:
for (int sample = 0; sample < size; ++sample)
for (int channel = 0; channel < size; ++channel)
auto data = arr[channel][sample];
// do some computations
I'm seeking advice on how to make the life easier for the CPU and memory, in general. I realize interleaving the data would be better, but it's not possible.
My theory is, that as long as you sequentially access memory for a while, the CPU will prefetch it - will this hold for N (channel) buffers? What about size of the buffers, any "breaking points"?
Will it be very beneficial to have the channels in contiguous memory (increasing locality), or does that only hold for very small buffers (like, size of cache lines)? We could be talking buffersizes > 100 kb apart.
I guess there would also be a point where the time of the computational part makes memory optimizations negligible - ?
Is this a case, where manual prefetching makes sense?
I could test/profile my own system, but I only have that one system, so any design choices I make may only positively affect that particular system. Any knowledge on these matters is appreciated: links, literature, platform-specific knowledge, etc.
Let me know if the question is too vague, I primarily thought it would be nice to have some wiki-ish experience / info on this area.
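To make the "channels in contiguous memory" option concrete, here is a sketch of what I mean: one allocation holding all channels back to back (planar, not interleaved), with the same sample-major access pattern as above. The class name ChannelBuffer and the function sum_all are only for illustration, and whether this layout actually helps is exactly what I am asking.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All channels live in one allocation, channel after channel.
class ChannelBuffer {
public:
    ChannelBuffer(std::size_t channels, std::size_t samples)
        : channels_(channels), samples_(samples), storage_(channels * samples, 0.0f) {}

    // Pointer to the first sample of channel c.
    float* channel(std::size_t c) {
        assert(c < channels_);
        return storage_.data() + c * samples_;
    }
    std::size_t channels() const { return channels_; }
    std::size_t samples() const { return samples_; }

private:
    std::size_t channels_;
    std::size_t samples_;
    std::vector<float> storage_;
};

// The access pattern from the question: for each sample, visit every channel.
float sum_all(ChannelBuffer& buf) {
    float total = 0.0f;
    for (std::size_t s = 0; s < buf.samples(); ++s)
        for (std::size_t c = 0; c < buf.channels(); ++c)
            total += buf.channel(c)[s];
    return total;
}
```

With this layout, consecutive channels are exactly samples() floats apart, so the stride between the streams is fixed and known instead of depending on where each separate allocation happened to land.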
I created a program that tests the three cases I mentioned (distant, adjacent and contiguous, listed in supposedly increasing order of performance) on small and big data sets. Maybe people will run it and report anomalies.
#include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric> // std::accumulate
const int b = 196000;
const int s = 64 / sizeof(float);
const int extra_it = 16;
float sbuf1[s];
float bbuf1[b];
int main()
float sbuf2[s];
float bbuf2[b];
float * sbuf3 = new float[s];
float * bbuf3 = new float[b];
float * sbuf4 = new float[s * 3];
float * bbuf4 = new float[b * 3];
float use = 0;
while (1)
using namespace std;
int c;
bool sorb;
cout << "small or big test (0/1)? ";
if (!(cin >> sorb))
return -1;
cout << endl << "test distant buffers (0), contiguous access (1) or adjacent access (2)? ";
if (!(cin >> c))
return -1;
auto t = std::chrono::high_resolution_clock::now();
if (c == 0)
// "worst case scenario", 3 distant buffers constantly touched
if (sorb)
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s; ++i)
sbuf1[i] = k; // static memory
sbuf2[i] = k; // stack memory
sbuf3[i] = k; // heap memory
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b; ++i)
bbuf1[i] = k; // static memory
bbuf2[i] = k; // stack memory
bbuf3[i] = k; // heap memory
else if (c == 1)
// "best case scenario", only contiguous memory touched, interleaved
if (sorb)
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s * 3; i += 3)
sbuf4[i] = k;
sbuf4[i + 1] = k;
sbuf4[i + 2] = k;
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b * 3; i += 3)
bbuf4[i] = k;
bbuf4[i + 1] = k;
bbuf4[i + 2] = k;
else if (c == 2)
// "compromise", adjacent memory buffers touched
if (sorb)
auto b1 = sbuf4;
auto b2 = sbuf4 + s;
auto b3 = sbuf4 + s * 2;
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s; ++i)
b1[i] = k;
b2[i] = k;
b3[i] = k;
auto b1 = bbuf4;
auto b2 = bbuf4 + b;
auto b3 = bbuf4 + b * 2;
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b; ++i)
b1[i] = k;
b2[i] = k;
b3[i] = k;
cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - t).count() << " ms" << endl;
// basically just touching the buffers, avoiding clever optimizations
use += std::accumulate(sbuf1, sbuf1 + s, 0);
use += std::accumulate(sbuf2, sbuf2 + s, 0);
use += std::accumulate(sbuf3, sbuf3 + s, 0);
use += std::accumulate(sbuf4, sbuf4 + s * 3, 0);
use -= std::accumulate(bbuf1, bbuf1 + b, 0);
use -= std::accumulate(bbuf2, bbuf2 + b, 0);
use -= std::accumulate(bbuf3, bbuf3 + b, 0);
use -= std::accumulate(bbuf4, bbuf4 + b * 3, 0);
std::cout << use;
On my Intel i7-3740QM, surprisingly, distant buffers consistently outperform the more locality-friendly tests. It is close, however.

fftshift/ifftshift C/C++ source code [closed]

Does anyone know if there is any free and open source library that has implemented these two functions the way they are defined in matlab?
FFTSHIFT / IFFTSHIFT is a fancy way of doing CIRCSHIFT.
You can verify that FFTSHIFT can be rewritten as CIRCSHIFT as follows.
You can define macros in C/C++ to punt FFTSHIFT to CIRCSHIFT.
A = rand(m, n);
mm = floor(m / 2);
nn = floor(n / 2);
% All three of the following should provide zeros.
circshift(A,[mm, nn]) - fftshift(A)
circshift(A,[mm, 0]) - fftshift(A, 1)
circshift(A,[ 0, nn]) - fftshift(A, 2)
Similar equivalents can be found for IFFTSHIFT.
Circular shift can be implemented very simply with the following code (it can be improved with parallel versions, of course).
template<class ty>
void circshift(ty *out, const ty *in, int xdim, int ydim, int xshift, int yshift)
{
    for (int i = 0; i < xdim; i++) {
        int ii = (i + xshift) % xdim;
        for (int j = 0; j < ydim; j++) {
            int jj = (j + yshift) % ydim;
            out[ii * ydim + jj] = in[i * ydim + j];
        }
    }
}
And then
#define fftshift(out, in, x, y) circshift(out, in, x, y, (x/2), (y/2))
#define ifftshift(out, in, x, y) circshift(out, in, x, y, ((x+1)/2), ((y+1)/2))
This was done a bit impromptu. Bear with me if there are any formatting / syntactical problems.
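As a quick sanity check of the shift amounts, the sketch below applies the same index mapping to a 3x3 matrix holding 1..9 in row-major order. It uses small functions instead of the macros and std::vector<int> instead of raw pointers; the names circshift2d, fftshift2d and ifftshift2d are made up for this illustration.

```cpp
#include <vector>

// Same index mapping as circshift, for a row-major matrix of xdim rows and ydim columns.
inline std::vector<int> circshift2d(const std::vector<int>& in,
                                    int xdim, int ydim, int xshift, int yshift)
{
    std::vector<int> out(in.size());
    for (int i = 0; i < xdim; ++i) {
        int ii = (i + xshift) % xdim;
        for (int j = 0; j < ydim; ++j) {
            int jj = (j + yshift) % ydim;
            out[ii * ydim + jj] = in[i * ydim + j];
        }
    }
    return out;
}

// fftshift shifts by floor(n/2) in each dimension.
inline std::vector<int> fftshift2d(const std::vector<int>& in, int x, int y)
{
    return circshift2d(in, x, y, x / 2, y / 2);
}

// ifftshift shifts by ceil(n/2) in each dimension.
inline std::vector<int> ifftshift2d(const std::vector<int>& in, int x, int y)
{
    return circshift2d(in, x, y, (x + 1) / 2, (y + 1) / 2);
}

// For A = {1,2,3,4,5,6,7,8,9} as a 3x3 matrix:
//   fftshift2d(A, 3, 3)  gives {9,7,8, 3,1,2, 6,4,5}
//   ifftshift2d(A, 3, 3) gives {5,6,4, 8,9,7, 2,3,1}
```

The two shifts add up to the full dimension (floor(n/2) + ceil(n/2) = n), which is why applying one after the other restores the input for odd sizes as well.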
Possibly this code will help. It performs fftshift/ifftshift only for a 1D array, in place within one buffer. For an even number of elements, the forward and backward fftshift algorithms are identical.
void swap(complex *v1, complex *v2)
{
    complex tmp = *v1;
    *v1 = *v2;
    *v2 = tmp;
}

void fftshift(complex *data, int count)
{
    int k = 0;
    int c = (int) floor((float)count/2);
    // Odd and even element counts need different algorithms
    if (count % 2 == 0)
    {
        for (k = 0; k < c; k++)
            swap(&data[k], &data[k+c]);
    }
    else
    {
        complex tmp = data[0];
        for (k = 0; k < c; k++)
        {
            data[k] = data[c + k + 1];
            data[c + k + 1] = data[k + 1];
        }
        data[c] = tmp;
    }
}

void ifftshift(complex *data, int count)
{
    int k = 0;
    int c = (int) floor((float)count/2);
    if (count % 2 == 0)
    {
        for (k = 0; k < c; k++)
            swap(&data[k], &data[k+c]);
    }
    else
    {
        complex tmp = data[count - 1];
        for (k = c-1; k >= 0; k--)
        {
            data[c + k + 1] = data[k];
            data[k] = data[c + k];
        }
        data[c] = tmp;
    }
}
An FFT library (including fftshift operations) for an arbitrary number of points can also be found in Optolithium (under OptolithiumC/libs/fourier).
Normally, centering the FFT is done with v(k)=v(k)*(-1)**k in
the time domain. Shifting in the frequency domain is a poor substitute, for
mathematical reasons and for computational efficiency.
See pp 27 of:
I am not sure why Matlab documentation does it the way they do,
they give no technical reference.
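The equivalence behind the (-1)**k trick can be checked numerically. The sketch below uses a naive O(N^2) DFT as a stand-in for a real FFT library (naive_dft and modulate_alternating are names invented here) and assumes an even number of samples N: multiplying sample k by (-1)^k before the transform moves bin m to bin (m + N/2) mod N, which is what fftshift does to the untouched spectrum. For odd N the factor (-1)^k corresponds to a shift by half a bin, so the two results no longer coincide.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(N^2) DFT, used here only as a stand-in for a real FFT.
inline cvec naive_dft(const cvec& x)
{
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    cvec X(n);
    for (std::size_t m = 0; m < n; ++m) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t k = 0; k < n; ++k) {
            // Reduce m*k modulo n to keep the angle small and accurate.
            const double angle = -two_pi * static_cast<double>((m * k) % n)
                                 / static_cast<double>(n);
            acc += x[k] * std::polar(1.0, angle);
        }
        X[m] = acc;
    }
    return X;
}

// Multiply sample k by (-1)^k, i.e. negate every odd-indexed sample.
inline cvec modulate_alternating(cvec x)
{
    for (std::size_t k = 1; k < x.size(); k += 2)
        x[k] = -x[k];
    return x;
}

// For even N and any input x:
//   naive_dft(modulate_alternating(x))[m] == naive_dft(x)[(m + N/2) % N]
// up to rounding error.
```

The modulation costs one pass of sign flips and needs no extra buffer, which is the efficiency argument made above.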
Or you can do it yourself by typing type fftshift and recoding that in C++. It's not that complicated of Matlab code.
Edit: I've noticed that this answer has been down-voted a few times recently and commented on in a negative way. I recall a time when type fftshift was more revealing than the current implementation, but I could be wrong. If I could delete the answer, I would as it seems no longer relevant.
Here is a version (courtesy of Octave) that implements it without
I tested the code provided here and made an example project to test them. For 1D code one can simply use std::rotate
template <typename _Real>
static inline
void rotshift(complex<_Real> * complexVector, const size_t count)
{
    int center = (int) floor((float)count/2);
    if (count % 2 != 0) {
        // odd counts rotate about ceil(count/2), as the example below requires
        center++;
    }
    // odd: 012 34 changes to 34 012
    std::rotate(complexVector,complexVector + center,complexVector + count);
}

template <typename _Real>
static inline
void irotshift(complex<_Real> * complexVector, const size_t count)
{
    int center = (int) floor((float)count/2);
    // odd: 01 234 changes to 234 01
    std::rotate(complexVector,complexVector +center,complexVector + count);
}
I prefer using std::rotate over the code from Alexei due to its simplicity.
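To see the rotation points without the complex type getting in the way, here is a minimal sketch of the same idea on a std::vector of any element type. The names fftshift1d and ifftshift1d are invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// fftshift: rotate so that the element at index ceil(n/2) becomes the first one.
template <typename T>
void fftshift1d(std::vector<T>& v)
{
    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>((v.size() + 1) / 2);
    std::rotate(v.begin(), v.begin() + mid, v.end());
}

// ifftshift: rotate so that the element at index floor(n/2) becomes the first one.
template <typename T>
void ifftshift1d(std::vector<T>& v)
{
    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>(v.size() / 2);
    std::rotate(v.begin(), v.begin() + mid, v.end());
}

// {0,1,2,3,4} -> fftshift1d  -> {3,4,0,1,2}
// {0,1,2,3,4} -> ifftshift1d -> {2,3,4,0,1}
// For even sizes both functions perform the same rotation.
```

std::rotate works in place and only needs forward iterators, so the same two calls apply unchanged to raw pointers or std::array.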
For 2D it gets more complicated. For even sizes it basically swaps the left and right halves and the top and bottom halves. For odd sizes it is the circshift algorithm:
// A =
// 1 2 3
// 4 5 6
// 7 8 9
// fftshift2D(A)
// 9 | 7 8
// --------------
// 3 | 1 2
// 6 | 4 5
// ifftshift2D(A)
// 5 6 | 4
// 8 9 | 7
// --------------
// 2 3 | 1
Here I implemented the circshift code with an interface using only one array for in and output. For even numbers only a single array is required, for odd numbers a second array is temporarily created and copied back to the input array. This causes a performance decrease because of the additional time for copying the array.
template<class _Real>
static inline
void fftshift2D(complex<_Real> *data, size_t xdim, size_t ydim)
{
    size_t xshift = xdim / 2;
    size_t yshift = ydim / 2;
    // the swap branch is only valid when both dimensions are even
    if ((xdim % 2 != 0) || (ydim % 2 != 0)) {
        // temp output array
        std::vector<complex<_Real> > out;
        out.resize(xdim * ydim);
        for (size_t x = 0; x < xdim; x++) {
            size_t outX = (x + xshift) % xdim;
            for (size_t y = 0; y < ydim; y++) {
                size_t outY = (y + yshift) % ydim;
                // row-major order
                out[outX + xdim * outY] = data[x + xdim * y];
            }
        }
        // copy out back to data
        copy(out.begin(), out.end(), &data[0]);
    }
    else {
        // in and output array are the same,
        // values are exchanged using swap
        for (size_t x = 0; x < xdim; x++) {
            size_t outX = (x + xshift) % xdim;
            for (size_t y = 0; y < yshift; y++) {
                size_t outY = (y + yshift) % ydim;
                // row-major order
                swap(data[outX + xdim * outY], data[x + xdim * y]);
            }
        }
    }
}
template<class _Real>
static inline
void ifftshift2D(complex<_Real> *data, size_t xdim, size_t ydim)
{
    size_t xshift = xdim / 2;
    if (xdim % 2 != 0) {
        // odd: the inverse shift is ceil(xdim/2)
        xshift++;
    }
    size_t yshift = ydim / 2;
    if (ydim % 2 != 0) {
        // odd: the inverse shift is ceil(ydim/2)
        yshift++;
    }
    // the swap branch is only valid when both dimensions are even
    if ((xdim % 2 != 0) || (ydim % 2 != 0)) {
        // temp output array
        std::vector<complex<_Real> > out;
        out.resize(xdim * ydim);
        for (size_t x = 0; x < xdim; x++) {
            size_t outX = (x + xshift) % xdim;
            for (size_t y = 0; y < ydim; y++) {
                size_t outY = (y + yshift) % ydim;
                // row-major order
                out[outX + xdim * outY] = data[x + xdim * y];
            }
        }
        // copy out back to data
        copy(out.begin(), out.end(), &data[0]);
    }
    else {
        // in and output array are the same,
        // values are exchanged using swap
        for (size_t x = 0; x < xdim; x++) {
            size_t outX = (x + xshift) % xdim;
            for (size_t y = 0; y < yshift; y++) {
                size_t outY = (y + yshift) % ydim;
                // row-major order
                swap(data[outX + xdim * outY], data[x + xdim * y]);
            }
        }
    }
}
Notice: There are better answers provided, I just keep this here for a while for... I do not know what.
Try this:
template<class T> void ifftShift(T *out, const T* in, size_t nx, size_t ny)
{
    const size_t hlen1 = (ny+1)/2;
    const size_t hlen2 = ny/2;
    const size_t shft1 = ((nx+1)/2)*ny + hlen1;
    const size_t shft2 = (nx/2)*ny + hlen2;
    const T* src = in;
    for(T* tgt = out; tgt < out + shft1 - hlen1; tgt += ny, src += ny) { // (nx+1)/2 times
        copy(src, src+hlen1, tgt + shft2); //1->4
        copy(src+hlen1, src+ny, tgt+shft2-hlen2); //2->3
    }
    src = in;
    for(T* tgt = out; tgt < out + shft2 - hlen2; tgt += ny, src += ny ){ // nx/2 times
        copy(src+shft1, src+shft1+hlen2, tgt); //4->1
        copy(src+shft1-hlen1, src+shft1, tgt+hlen2); //3->2
    }
}
For matrices with even dimensions you can do it in-place, just passing the same pointer into in and out parameters.
Also note that for 1D arrays fftshift is just std::rotate.
You could also use arrayfire's shift function as replacement for Matlab's circshift and re-implement the rest of the code. This could be useful if you are interested in any of the other features of AF anyway (such as portability to GPU by simply changing a linker flag).
However if all your code is meant to be run on the CPU and is quite sophisticated or you don't want to use any other data format (AF requires af::arrays) stick with one of the other options.
I ended up switching to AF because otherwise I would have had to re-implement fftshift as an OpenCL kernel at the time.
This gives a result equivalent to ifftshift in MATLAB:
ifftshift(vector< vector <double> > Hlow,int RowLineSpace, int ColumnLineSpace)
int pivotRow=floor(RowLineSpace/2);
int pivotCol=floor(ColumnLineSpace/2);
for(int i=pivotRow;i<RowLineSpace;i++){
for(int j=0;j<ColumnLineSpace;j++){
for(int i=0;i<pivotRow;i++){
for(int j=0;j<ColumnLineSpace;j++){
double** arr = new double*[RowLineSpace];
for(int i = 0; i < RowLineSpace; ++i)
arr[i] = new double[ColumnLineSpace];
int i1=0,j1=0;
for(int j=pivotCol;j<ColumnLineSpace;j++){
for(int i=0;i<RowLineSpace;i++){
for(int j=0;j<pivotCol;j++){
for(int i=0;i<RowLineSpace;i++){
for(int i=0;i<RowLineSpace;i++){
for(int j=0;j<ColumnLineSpace;j++){
double value=arr[i][j];
return ifftShiftLow;
Octave uses fftw to implement (i)fftshift.
You can use kissfft. It's reasonably fast, extremely simple to use, and free. Arranging the output the way you want it only requires you to:
a) shift by (-dim_x/2, -dim_y/2, ...), with periodic boundary conditions
b) FFT or IFFT
c) shift back by (dim_x/2, dim_y/2, ...) , with periodic boundary conditions
d) scale ? (according to your needs IFFT*FFT will scale the function by dim_x*dim_y*... by default)
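Steps a) to c) can be sketched in 1D as follows. This is only an illustration of the shift, transform, shift-back order: the transform is a naive O(N^2) DFT standing in for the library call (it is not the kissfft API), and dft_reference and centered_dft are names made up for this example. No scaling (step d) is applied.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(N^2) forward DFT; a placeholder for the real FFT call in step b).
inline cvec dft_reference(const cvec& x)
{
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    cvec X(n);
    for (std::size_t m = 0; m < n; ++m) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t k = 0; k < n; ++k) {
            const double angle = -two_pi * static_cast<double>((m * k) % n)
                                 / static_cast<double>(n);
            acc += x[k] * std::polar(1.0, angle);
        }
        X[m] = acc;
    }
    return X;
}

// Centered transform of a 1D signal whose origin sits at index n/2.
inline cvec centered_dft(cvec x)
{
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(x.size() / 2);
    // a) circular shift by -n/2: the sample at index n/2 moves to index 0
    std::rotate(x.begin(), x.begin() + half, x.end());
    // b) transform
    cvec X = dft_reference(x);
    // c) circular shift by +n/2: bin 0 moves to index n/2
    std::rotate(X.begin(), X.end() - half, X.end());
    return X;
}
```

For an impulse placed at index n/2 (the centered origin), centered_dft returns a flat spectrum of ones with zero phase, whereas transforming the same input without step a) yields bins that alternate in sign.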