Corrupt heap after allocating vector on esp32 - c++

I am trying to compute optical flow (Lucas-Kanade based) on an ESP32-CAM.
I tried to save memory by operating on only two small buffers, but I still get a corrupt heap error:
bfore allocate out conv
after allocate out conv
bfore allocate out conv
after allocate out conv
bfore allocate out conv
after allocate out conv
bfore allocate out conv
CORRUPT HEAP: multi_heap.c:432 detected at 0x3fff7114 abort() was
called at PC 0x40090a7f on core 0
Here is my code, composed of a 1D convolution and a transpose to perform the equivalent separable 2D convolution:
template<typename T>
void conv(uint8_t *in, const std::vector<T> &g, const int nf) {
    //int const nf = f.size();
    int const ng = g.size();
    int const n = nf + ng - 1;
    uint8_t *f = in;
    Serial.println("bfore allocate out conv");
    std::vector<T> out(n, T()); // memory leak CORRUPT HEAP
    Serial.println("after allocate out conv");
    for(auto i(0); i < n; ++i) {
        int const jmn = (i >= ng - 1)? i - (ng - 1) : 0;
        int const jmx = (i < nf - 1)? i : nf - 1;
        for(auto j(jmn); j <= jmx; ++j) {
            out[i] += (f[j] * g[i - j]);
        }
    }
    out.erase(out.begin(), out.begin() + ng / 2 + 1);
    // Rescale to 0..255
    auto max = *std::max_element(out.begin(), out.end());
    auto min = *std::min_element(out.begin(), out.end());
    float x;
    for(auto v : out) {
        x = (v - min) * 255.0 / max;
        *(f++) = (uint8_t)x;
    }
}

void transpose(uint8_t *f, int w, int h) {
    for(auto i(0); i < h; ++i)
        for(auto j(0); j < w; ++j)
            std::swap(f[w * i + j], f[w * j + i]);
}

void LK_optical_flow(uint8_t *src1, uint8_t *src2, uint8_t *output, int w, int h)
{
    std::vector<float> Kernel_Dy = {1, 2, 1};
    std::vector<float> Kernel_Dx = {-1, 0, 1};
    std::vector<float> Kernel_Dt = {1/3.0, 1/3.0, 1/3.0};
    uint8_t *fx = src1;
    uint8_t *fy = new uint8_t[w * h];
    uint8_t *ft = src2;
    memcpy(fy, fx, w * h * sizeof(uint8_t));
    // Sobel Dx
    conv(fx, Kernel_Dx, w*h);
    transpose(fx, w, h);
    conv(fx, Kernel_Dy, w*h);
    transpose(fx, w, h);
    // Sobel Dy
    conv(fy, Kernel_Dy, w*h);
    transpose(fy, w, h);
    conv(fy, Kernel_Dx, w*h); // memory leak
    transpose(fy, w, h);
    // Dt
    //conv(src2, Kernel_Dt, w*h);
}
Apparently the problem comes from the second buffer I allocated (pointed to by fy): it shows up during the second call of conv(fy, ...), when out is allocated as a vector.
What am I doing wrong?

With w and h not being the same, transpose will read and write out-of-bounds memory.
From your comment, you have w at 96 and h at about 48. The second argument to swap in transpose accesses indices up to f[w * (w - 1) + (h - 1)], which is far past the w * h elements you've allocated. This modifies memory that does not belong to the buffer, and in your case it corrupts the data the heap allocator uses to keep track of allocated memory (which is only detected during an allocation or a free, so it may not be detected right away).
The solution involves rewriting transpose to properly transpose a rectangular matrix. (This involves swapping w and h for the returned matrix.)
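As a rough illustration of what a rectangular transpose could look like, here is a minimal sketch. The name transpose_rect is hypothetical, and the sketch assumes there is room for one temporary buffer of w * h bytes, which may or may not be acceptable on an ESP32. Note also that the original loop visits every (i, j) pair, so even on a square image each pair would be swapped twice and end up where it started.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transpose a row-major image of h rows by w columns.
// After the call the buffer holds w rows of h pixels, so the caller
// must swap w and h for any later processing.
// Assumption of this sketch: a temporary buffer of w * h bytes is affordable.
void transpose_rect(uint8_t *f, int w, int h) {
    assert(f != nullptr && w > 0 && h > 0);
    std::vector<uint8_t> tmp(static_cast<std::size_t>(w) * h);
    for (int i = 0; i < h; ++i) {        // source row
        for (int j = 0; j < w; ++j) {    // source column
            tmp[static_cast<std::size_t>(j) * h + i] =
                f[static_cast<std::size_t>(i) * w + j];
        }
    }
    std::copy(tmp.begin(), tmp.end(), f);
}
```

Calling transpose_rect(fx, w, h) and later transpose_rect(fx, h, w) restores the original layout, and every index stays below w * h.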


How to access 1D data (or reshape it) from pointer into something like multidimensional array in C++?

I have a pointer that points to the beginning of a 1000+ elements array that is initialized as below:
int numElements = 1200;
auto data = std::unique_ptr<float>{new float[numElements]};
Now I want to 'reshape' it into something like a (20,30,20) tensor, so I can access it the way I want (I can still read while it's 1-D as well but it feels weird). I want to access like this:
data[1][10][12] = 1337.0f;
Is there an efficient way of doing this (fast and short code)?
In the meantime, this is how I do it...
#include <iostream>
using std::cout;
using std::endl;
#include <vector>
using std::vector;
size_t get_index(const size_t x, const size_t y, const size_t z, const size_t x_res, const size_t y_res, const size_t z_res)
{
    return z * y_res * x_res + y * x_res + x;
}

int main(void)
{
    const size_t x_res = 10;
    const size_t y_res = 10;
    const size_t z_res = 10;
    // Use new[] to allocate, and memset to clear
    //float* vf = new float[x_res * y_res * z_res];
    //memset(vf, 0, sizeof(float) * x_res * y_res * z_res);
    // Better yet, use a vector
    vector<float> vf(x_res*y_res*z_res, 0.0f);
    for (size_t x = 0; x < x_res; x++)
    {
        for (size_t y = 0; y < y_res; y++)
        {
            for (size_t z = 0; z < z_res; z++)
            {
                size_t index = get_index(x, y, z, x_res, y_res, z_res);
                // Do stuff with vf[index] here...
            }
        }
    }
    // Make sure to deallocate memory
    // delete[] vf;
    return 0;
}
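The same index arithmetic can be wrapped in a small non-owning view so that call sites read like tensor accesses. The sketch below is only one possible shape for such a helper: Tensor3View is a hypothetical name, it uses operator() instead of chained [] to avoid proxy objects, and it assumes a row-major layout with the last index varying fastest. Two side notes on the question's snippet: a (20,30,20) tensor needs 12000 elements rather than 1200, and an array allocated with new[] should be held in std::unique_ptr<float[]> so that delete[] is used.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Non-owning 3-D view over a flat float buffer (hypothetical helper).
class Tensor3View {
public:
    Tensor3View(float *data, std::size_t d0, std::size_t d1, std::size_t d2)
        : data_(data), d0_(d0), d1_(d1), d2_(d2) {}

    // Element (i, j, k); k varies fastest in memory.
    float &operator()(std::size_t i, std::size_t j, std::size_t k) {
        assert(i < d0_ && j < d1_ && k < d2_);
        return data_[(i * d1_ + j) * d2_ + k];
    }

    // Total number of elements covered by the view.
    std::size_t size() const { return d0_ * d1_ * d2_; }

private:
    float *data_;
    std::size_t d0_, d1_, d2_;
};
```

With std::unique_ptr<float[]> data(new float[12000]()); and Tensor3View t(data.get(), 20, 30, 20);, the access from the question becomes t(1, 10, 12) = 1337.0f;.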

Can MATLAB C generation coder generate C-code that fits embedded system?

I need to convert this code into C code.
Will MATLAB Coder generate C code that is memory safe, i.e. code that does not use calloc or malloc? The MISRA C standard does not allow dynamic memory allocation, because it is dangerous for embedded systems due to memory leaks.
Will MATLAB Coder generate C code that takes dynamically sized matrices as arguments, e.g. functions such as foo(float* A, int m, int n) or foo(int m, int n, float A[m][n]), or is a fixed size, for example foo(float A[3][5]), the only option?
Will MATLAB Coder generate C code that fits into an embedded system? How about the built-in functions used in the .m files, such as horzcat, size and vertcat? Will they become 100% portable C code?
Will MATLAB Coder generate functions that use call by reference, for example foo(float* input, float* output) instead of float* output = foo(float* input)?
function [U] = mpc (A, B, C, x, N, r, lb)
  ## Find matrix
  PHI = phiMat(A, C, N);
  GAMMA = gammaMat(A, B, C, N);
  ## Solve first with no constraints
  U = solve(PHI, GAMMA, x, N, r, 0, 0, false);
  ## Then use the last U as upper bound
  U = solve(PHI, GAMMA, x, N, r, lb, U(end), true);
end

function U = solve(PHI, GAMMA, x, N, r, lb, ub, constraints)
  ## Set U
  U = zeros(N, 1);
  ## Iterate Gaussian Elimination
  for i = 1:N
    ## Solve u
    if(i == 1)
      u = (r - PHI(i,:)*x)/GAMMA(i,i)
    else
      u = (r - PHI(i,:)*x - GAMMA(i,1:i-1)*U(1:i-1) )/GAMMA(i,i)
    end
    ## Constraints
    if(constraints == true)
      if(u > ub)
        u = ub;
      elseif(u < lb)
        u = lb;
      end
    end
    ## Save u
    U(i) = u
  end
end

function PHI = phiMat(A, C, N)
  ## Create the special Observabillity matrix
  PHI = [];
  for i = 1:N
    PHI = vertcat(PHI, C*A^i);
  end
end

function GAMMA = gammaMat(A, B, C, N)
  ## Create the lower triangular toeplitz matrix
  GAMMA = [];
  for i = 1:N
    GAMMA = horzcat(GAMMA, vertcat(zeros((i-1)*size(C*A*B, 1), size(C*A*B, 2)),cabMat(A, B, C, N-i+1)));
  end
end

function CAB = cabMat(A, B, C, N)
  ## Create the column for the GAMMA matrix
  CAB = [];
  for i = 0:N-1
    CAB = vertcat(CAB, C*A^i*B);
  end
end
My C code. Yes, it's working!
* Generalized_Predictive_Control.c
* Created on:
* Author:
#include "Generalized_Predictive_Control.h"
* Parameters
int adim;
int ydim;
int rdim;
int horizon;
* Declaration
static void obsv(float* PHI, const float* A, const float* C);
static void kalman(float* x, const float* A, const float* B, float* u, const float* K, float* y, const float* C);
static void mul(float* A, float* B, float* C, int row_a, int column_a, int column_b);
static void tran(float* A, int row, int column);
static void CAB(float* GAMMA, float* PHI, const float* A, const float* B, const float* C);
static void solve(float* GAMMA, float* PHI, float* x, float* u, float* r, float lb, float ub, int constraintsON);
static void print(float* A, int row, int column);
void GPC(int adim_, int ydim_, int rdim_, int horizon_, const float* A, const float* B, const float* C, const float* D, const float* K, float* u, float* r, float* y, float* x){
* Set the dimensions
adim = adim_;
ydim = ydim_;
rdim = rdim_;
horizon = horizon_;
* Identify the model - Extended Least Square
int n = 5;
float* phi;
float* theta;
//els(phi, theta, n, y, u, P);
* Create a state space model with Observable canonical form
* Create the extended observability matrix
float PHI[horizon*ydim*adim];
memset(PHI, 0, horizon*ydim*adim*sizeof(float));
obsv(PHI, A, C);
* Create the lower triangular toeplitz matrix
float GAMMA[horizon*rdim*horizon*ydim];
memset(GAMMA, 0, horizon*rdim*horizon*ydim*sizeof(float));
* Solve the best input value
solve(GAMMA, PHI, x, u, r, 0, 0, 0);
solve(GAMMA, PHI, x, u, r, 0, *(u), 1);
* Estimate the state vector
kalman(x, A, B, u, K, y, C);
* Identify the model
static void els(float* P, float* phi, float* theta, int polyLength, int totalPolyLength, float* y, float* u, float* e){
* move phi with the inputs, outputs, errors one step to right
for(int i = 0; i < polyLength; i++){
*(phi + i+1 + totalPolyLength*0) = *(phi + i + totalPolyLength*0); // Move one to right for the y's
*(phi + i+1 + totalPolyLength*1) = *(phi + i + totalPolyLength*1); // Move one to right for the u's
*(phi + i+1 + totalPolyLength*2) = *(phi + i + totalPolyLength*2); // Move one to right for the e's
* Add the current y, u and e
*(phi + totalPolyLength*0) = -*(y + 0); // Need to be negative!
*(phi + totalPolyLength*1) = *(u + 0);
*(phi + totalPolyLength*2) = *(e + 0);
* phi'*theta
float y_est = 0;
for(int i = 0; i < totalPolyLength; i++){
y_est += *(phi + i) * *(theta + i);
float epsilon = *(y + 0) - y_est; // In this case, y is only one element array
* phi*epsilon
float phi_epsilon[totalPolyLength];
memset(phi_epsilon, 0, totalPolyLength*sizeof(float));
for(int i = 0; i < totalPolyLength; i++){
*(phi_epsilon + i) = *(phi + i) * epsilon;
* P_vec = P*phi_epsilon
float P_vec[totalPolyLength];
memset(P_vec, 0, totalPolyLength*sizeof(float));
mul(P, phi_epsilon, P_vec, totalPolyLength, totalPolyLength, 1);
* Update our estimated vector theta = theta + P_vec
for(int i = 0; i < totalPolyLength; i++){
*(theta + i) = *(theta + i) + *(P_vec + i);
* Update P = P - (P*phi*phi'*P)/(1 + phi'*P*phi)
// Create phi'
float phiT[totalPolyLength];
memset(phiT, 0, totalPolyLength*sizeof(float));
memcpy(phiT, phi, totalPolyLength*sizeof(float));
tran(phiT, totalPolyLength, 1);
// phi'*P
float phiT_P[totalPolyLength];
memset(phiT_P, 0, totalPolyLength*sizeof(float));
mul(phiT, P, phiT_P, 1, totalPolyLength, totalPolyLength);
// phi*phi'*P
float phi_phiT_P[totalPolyLength*totalPolyLength];
memset(phi_phiT_P, 0, totalPolyLength*totalPolyLength*sizeof(float));
mul(phi, phiT_P, phi_phiT_P, totalPolyLength, 1, totalPolyLength);
// P*phi*phi'*P
float P_phi_phiT_P[totalPolyLength*totalPolyLength];
memset(P_phi_phiT_P, 0, totalPolyLength*totalPolyLength*sizeof(float));
mul(P, phi_phiT_P, P_phi_phiT_P, totalPolyLength, totalPolyLength, totalPolyLength);
// P*phi
float P_phi[totalPolyLength];
memset(P_phi, 0, totalPolyLength*sizeof(float));
mul(P, phi, P_phi, totalPolyLength, totalPolyLength, 1);
// phi'*P*phi
float phiT_P_phi[1];
memset(phiT_P_phi, 0, 1*sizeof(float));
mul(phiT, P_phi, phiT_P_phi, 1, totalPolyLength, 1);
// P = P - (P_phi_phiT_P) / (1+phi'*P*phi)
for(int i = 0; i < totalPolyLength*totalPolyLength; i++){
*(P + i) = *(P + i) - *(P_phi_phiT_P + i) / (1 + *(phiT_P_phi));
* This will solve if GAMMA is square!
static void solve(float* GAMMA, float* PHI, float* x, float* u, float* r, float lb, float ub, int constraintsON){
* Now we are going to solve on the form
* Ax=b, where b = (R*r-PHI*x) and A = GAMMA and x = U
* R_vec = R*r
float R_vec[horizon*ydim];
memset(R_vec, 0, horizon*ydim*sizeof(float));
for(int i = 0; i < horizon*ydim; i++){
for (int j = 0; j < rdim; j++) {
*(R_vec + i + j) = *(r + j);
i += rdim-1;
* PHI_vec = PHI*x
float PHI_vec[horizon*ydim];
memset(PHI_vec, 0, horizon * ydim * sizeof(float));
mul(PHI, x, PHI_vec, horizon*ydim, adim, 1);
* Solve now (R_vec - PHI_vec) = GAMMA*U
* Notice that this is ONLY for Square GAMMA with lower triangular toeplitz matrix e.g SISO case
* This using Gaussian Elimination backward substitution
float U[horizon];
float sum = 0.0;
memset(U, 0, horizon*sizeof(float));
for(int i = 0; i < horizon; i++){
for(int j = 0; j < i; j++){
sum += *(GAMMA + i*horizon + j) * *(U + j);
float newU = (*(R_vec + i) - *(PHI_vec + i) - sum) / (*(GAMMA + i*horizon + i));
if(constraintsON == 1){
if(newU > ub)
newU = ub;
if(newU < lb)
newU = lb;
*(U + i) = newU;
sum = 0.0;
//print(U, horizon, 1);
* Set last U to u
if(constraintsON == 0){
*(u + 0) = *(U + horizon - 1);
*(u + 0) = *(U + 0);
* Lower triangular toeplitz of extended observability matrix
static void CAB(float* GAMMA, float* PHI, const float* A, const float* B, const float* C){
* First create the initial C*A^0*B == C*I*B == C*B
float CB[ydim*rdim];
memset(CB, 0, ydim*rdim*sizeof(float));
mul((float*)C, (float*)B, CB, ydim, adim, rdim);
* Take the transpose of CB so it will have dimension rdim*ydim instead
tran(CB, ydim, rdim);
* Create the CAB matrix from PHI*B
float PHIB[horizon*ydim*rdim];
mul(PHI, (float*) B, PHIB, horizon*ydim, adim, rdim); // CAB = PHI*B
tran(PHIB, horizon*ydim, rdim);
* We insert GAMMA = [CB PHI;
* 0 CB PHI;
* 0 0 CB PHI;
* 0 0 0 CB PHI] from left to right
for(int i = 0; i < horizon; i++) {
for(int j = 0; j < rdim; j++) {
memcpy(GAMMA + horizon*ydim*(i*rdim+j) + ydim*i, CB + ydim*j, ydim*sizeof(float)); // Add CB
memcpy(GAMMA + horizon*ydim*(i*rdim+j) + ydim*i + ydim, PHIB + horizon*ydim*j, (horizon-i-1)*ydim*sizeof(float)); // Add PHI*B
* Transpose of gamma
tran(GAMMA, horizon*rdim, horizon*ydim);
//print(CB, rdim, ydim);
//print(PHIB, rdim, horizon*ydim);
//print(GAMMA, horizon*ydim, horizon*rdim);
* Transpose
static void tran(float* A, int row, int column) {
float B[row*column];
float* transpose;
float* ptr_A = A;
for (int i = 0; i < row; i++) {
transpose = &B[i];
for (int j = 0; j < column; j++) {
*transpose = *ptr_A;
transpose += row;
// Copy!
memcpy(A, B, row*column*sizeof(float));
* [C*A^1; C*A^2; C*A^3; ... ; C*A^horizon] % Extended observability matrix
static void obsv(float* PHI, const float* A, const float* C){
* This matrix will A^(i+1) all the time
float A_pow[adim*adim];
memset(A_pow, 0, adim * adim * sizeof(float));
float A_copy[adim*adim];
memcpy(A_copy, (float*) A, adim * adim * sizeof(float));
* Temporary matrix
float T[ydim*adim];
memset(T, 0, ydim * adim * sizeof(float));
* Regular T = C*A^(1+i)
mul((float*) C, (float*) A, T, ydim, adim, adim);
* Insert temporary T into PHI
memcpy(PHI, T, ydim*adim*sizeof(float));
* Do the rest C*A^(i+1) because we have already done i = 0
for(int i = 1; i < horizon; i++){
mul((float*) A, A_copy, A_pow, adim, adim, adim); // Matrix power A_pow = A*A_copy
mul((float*) C, A_pow, T, ydim, adim, adim); // T = C*A^(1+i)
memcpy(PHI + i*ydim*adim, T, ydim*adim*sizeof(float)); // Insert temporary T into PHI
memcpy(A_copy, A_pow, adim * adim * sizeof(float)); // A_copy <- A_pow
* x = Ax - KCx + Bu + Ky % Kalman filter
static void kalman(float* x, const float* A, const float* B, float* u, const float* K, float* y, const float* C) {
* Compute the vector A_vec = A*x
float A_vec[adim*1];
memset(A_vec, 0, adim*sizeof(float));
mul((float*) A, x, A_vec, adim, adim, 1);
* Compute the vector B_vec = B*u
float B_vec[adim*1];
memset(B_vec, 0, adim*sizeof(float));
mul((float*) B, u, B_vec, adim, rdim, 1);
* Compute the vector C_vec = C*x
float C_vec[ydim*1];
memset(C_vec, 0, ydim*sizeof(float));
mul((float*) C, x, C_vec, ydim, adim, 1);
* Compute the vector KC_vec = K*C_vec
float KC_vec[adim*1];
memset(KC_vec, 0, adim*sizeof(float));
mul((float*) K, C_vec, KC_vec, adim, ydim, 1);
* Compute the vector Ky_vec = K*y
float Ky_vec[adim*1];
memset(Ky_vec, 0, adim*sizeof(float));
mul((float*) K, y, Ky_vec, adim, ydim, 1);
* Now add x = A_vec - KC_vec + B_vec + Ky_vec
for(int i = 0; i < adim; i++){
*(x + i) = *(A_vec + i) - *(KC_vec + i) + *(B_vec + i) + *(Ky_vec + i);
* C = A*B
static void mul(float* A, float* B, float* C, int row_a, int column_a, int column_b) {
// Data matrix
float* data_a = A;
float* data_b = B;
for (int i = 0; i < row_a; i++) {
// Then we go through every column of b
for (int j = 0; j < column_b; j++) {
data_a = &A[i * column_a];
data_b = &B[j];
*C = 0; // Reset
// And we multiply rows from a with columns of b
for (int k = 0; k < column_a; k++) {
*C += *data_a * *data_b;
data_b += column_b;
C++; // ;)
* Print matrix or vector - Just for error check
static void print(float* A, int row, int column) {
for (int i = 0; i < row; i++) {
for (int j = 0; j < column; j++) {
printf("%0.18f ", *(A++));
Disclaimer: I work on MATLAB Coder
There is a configuration setting to tell MATLAB Coder to generate code without using dynamically allocated memory or issue an error telling you why it can't do so.
cfg = coder.config('lib');
cfg.DynamicMemoryAllocation = 'Off';
codegen -config cfg ...
MATLAB Coder supports generating code with fixed-size arrays, variable-sized arrays, and dynamically allocated arrays. The various generated signature formats are shown in the documentation. For non-dynamically allocated variable-sized arrays, a common signature is something like: foo(x_data[100], x_size[2])
Yes, the generated code is generally portable and independent of MATLAB for the hardware you specify when generating code. The full list of functions and classes supported for code generation is given in the documentation. In a very small number of cases, the generated code needs to depend on libraries from MATLAB. Those cases are called out in the documentation. Fundamental operations like horzcat and vertcat produce portable code that is independent of MATLAB.
Yes. For array outputs and MATLAB functions with multiple outputs, the generated code will return outputs by reference. It also supports passing an argument by reference in some cases when the corresponding MATLAB function has the same variable as an input and output: function A = foo(A,B) with a call like: y = foo(y,z); can produce something like void foo(double A[100], const double B[20]); where A is an input and output.
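To make the two signature ideas above more concrete, here is a small hand-written sketch. It is not MATLAB Coder output: the names scale_in_place and kMaxElems and the capacity of 100 are assumptions chosen for illustration. It shows a fixed-capacity data buffer paired with a size array, and the same buffer serving as both input and output, with no heap allocation.

```cpp
#include <cassert>

// Hypothetical upper bound on the number of elements, chosen for this sketch.
const int kMaxElems = 100;

// Scales an x_size[0]-by-x_size[1] matrix in place.
// x_data is both input and output; only the first
// x_size[0] * x_size[1] entries of the fixed-capacity buffer are used.
void scale_in_place(double x_data[kMaxElems], const int x_size[2], double factor) {
    const int n = x_size[0] * x_size[1];
    assert(n >= 0 && n <= kMaxElems);
    for (int i = 0; i < n; ++i) {
        x_data[i] *= factor;
    }
}
```

The caller owns a statically sized buffer such as double buf[kMaxElems] and passes the current dimensions separately, so the memory footprint is known at compile time.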

How to MPI_Gather a 2d array of structs with C++?

I am trying to render a fractal calculated using MPI. I used the answer to the following question as reference: sending blocks of 2D array in C using MPI
My problem is that merging the data calculated by all the processes via MPI_Gatherv does not seem to work properly, because my main process always renders a black screen.
I have the following struct defined:
typedef struct Point {
float r,g,b,x,y;
} Point;
In my main I try to create an MPI_Datatype for the struct:
MPI_Datatype struct_type;
MPI_Datatype struct_members[1] = {MPI_FLOAT};
MPI_Aint offsets[1] = {0};
int struct_blengths[1] = {5};
int struct_items = 1;
MPI_Type_create_struct(struct_items, struct_blengths, offsets, struct_members, &struct_type);
I have a global variable for the calculation result:
Point **mandelbrot;
The variable is allocated thusly before each frame is being recalculated:
if (proc_id == root) {
//Just a check if this is the first frame that is being rendered
if (s > 0) {
s = W;
Point *p = (Point *) malloc(W * H * sizeof(Point));
mandelbrot = (Point **) malloc(W*sizeof(Point *));
for (int i = 0; i < W; i++) {
mandelbrot[i] = &(p[i*H]);
Here I try to create an array subtype using the Point struct (following the referenced answer as best I can):
//Width of the fractal to render
W = width;
//Height of the fractal
H = height;
//Chunk of width each process is responsible for [width / number of processes]
int segmentSize = (int) W / ntasks;
MPI_Datatype type, resizedtype;
int sizes[2] = {W,H}; /* size of global array */
int subsizes[2] = {segmentSize, H}; /* size of sub-region */
int starts[2] = {0,0};
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, struct_type, &type);
MPI_Type_create_resized(type, 0, H*sizeof(Point), &resizedtype);
Calculate the displacements and counts of blocks to send and allocate memory for the process' subarray:
int sendcounts[segmentSize*H];
int displs[segmentSize*H];
if (proc_id == root) {
for (int i=0; i<segmentSize*H; i++) sendcounts[i] = 1;
int disp = 0;
for (int i=0; i<segmentSize; i++) {
for (int j=0; j<H; j++) {
displs[i*H+j] = disp;
disp += 1;
disp += ((W/segmentSize)-1)*H;
Point *p = (Point *) malloc(segmentSize * H * sizeof(Point));
Point **segment;
segment = (Point **) malloc(segmentSize * sizeof(Point*));
for (int i = 0; i < segmentSize; i++) {
segment[i] = &(p[i*H]);
Following that I calculate the color of the Mandelbrot set for each point in the chunk:
int i;
float c[3], dX, dY;
for ( x = 0; x < segmentSize; x++) {
for ( y = 0; y < H; y++) {
//Iterate over the point
i = iterateMandelbrot(rM + x * dR, iM - y * dI);
// Get decimal coordinates for rendering <0,1>
dX = (x + segmentSize * proc_id) / W;
dY = y / H;
//Calculate color using Bernoulli Polynomials
makeColor(i, maxIterations, c);
segment[x][y].x = (float) dX;
segment[x][y].y = (float) dY;
segment[x][y].r = (float) c[0];
segment[x][y].g = (float) c[1];
segment[x][y].b = (float) c[2];
Lastly I try to gather the chunks into the mandelbrot variable for the root process to render:
int buffsize = (int) segmentSize * H;
MPI_Gatherv(&(segment[0][0]), W*H/(buffsize), struct_type,
&(mandelbrot[0][0]), sendcounts, displs, resizedtype,
Ok so the problem is now that no data seems to be written into the mandelbrot variable as my main process renders a black screen. Without using MPI the code works so the problem lies somewhere in the MPI_Gatherv call or maybe the way I am allocating the arrays. I realize there might be some memory leak associated with the mandelbrot set or the local segment arrays but that is not my main concern at the moment. Can you see what I am doing wrong here? Any help is appreciated!
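One detail worth checking against the code above: MPI_Gatherv expects its receive-count and displacement arrays to have one entry per rank in the communicator, not one per element, and the displacements are measured in units of the receive datatype. Since mandelbrot is backed by a single contiguous block in which column i starts at p[i*H], a rank that owns whole consecutive columns contributes one contiguous run. The following is a minimal sketch of computing such a per-rank layout in plain C++ (no MPI calls). GatherLayout and make_layout are hypothetical names, and handing the remainder columns to the first ranks is a choice made for this sketch. It also assumes the element datatype describes exactly one Point and has been committed with MPI_Type_commit.

```cpp
#include <cassert>
#include <vector>

// Per-rank receive layout for a gather with varying counts.
// counts[r] = number of elements contributed by rank r,
// displs[r] = offset (in elements) where rank r's block starts.
struct GatherLayout {
    std::vector<int> counts;
    std::vector<int> displs;
};

// Split `width` columns of `height` points over `ntasks` ranks.
// The first (width % ntasks) ranks receive one extra column.
GatherLayout make_layout(int width, int height, int ntasks) {
    assert(width > 0 && height > 0 && ntasks > 0);
    GatherLayout g;
    g.counts.resize(ntasks);
    g.displs.resize(ntasks);
    const int base = width / ntasks;
    const int extra = width % ntasks;
    int offset = 0;
    for (int r = 0; r < ntasks; ++r) {
        const int cols = base + (r < extra ? 1 : 0);
        g.counts[r] = cols * height;
        g.displs[r] = offset;
        offset += g.counts[r];
    }
    return g;
}
```

On the root, counts.data() and displs.data() would be the candidates for the receive-count and displacement arguments, while each rank sends its own counts[rank] elements.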

OpenCV fast mat element and neighbour access

I use OpenCV (C++) Mat for my matrix and want to access single Mat elements as fast as possible. From the OpenCV tutorial, I found code for efficient access:
for( i = 0; i < nRows; ++i)
{
    p = I.ptr<uchar>(i);
    for ( j = 0; j < nCols; ++j)
    {
        p[j] = table[p[j]];
    }
}
For my problem, I need to access a Mat element and its neighbours (i-1,j-1) for a calculation. How can I adapt the given code to access a single Mat element AND its surrounding elements? Since speed matters, I want to avoid at<>().
What is the most efficient way to access a Mat value and its neighbour values?
The pixel and its neighbours can be described by a cv::Rect; then you can simply use:
cv::Mat mat = ...;
cv::Rect roi= ...; // define it properly based on the neighbors definition
cv::Mat sub_mat = mat(roi);
In case your neighbor definition is not regular, i.e. the neighbors cannot form a rectangular area, use a mask instead.
You can also access Mat::data directly:
template<class T, int N>
T GetPixel(const cv::Mat &img, int x, int y) {
    // Assumes a continuous matrix (no padding between rows)
    int k = (y * img.cols + x) * N;
    T pixel;
    for(int i=0;i<N;i++)
        pixel[i] = *(img.data + k + i);
    return pixel;
}

template<class T,int N>
void SetPixel(const cv::Mat &img, int x, int y, T t) {
    int k = (y * img.cols + x) * N;
    for(int i=0;i<N;i++)
        *(img.data + k + i) = t[i];
}

// Specializations for single-channel 8-bit images
template<>
unsigned char GetPixel<unsigned char, 1>(const cv::Mat &img, int x, int y) {
    return *(img.data + y * img.cols + x);
}

template<>
void SetPixel<unsigned char, 1>(const cv::Mat &img, int x, int y, unsigned char p) {
    *(img.data + y * img.cols + x) = p;
}

int main() {
    unsigned char r,g,b;
    int channels = 3;
    cv::Mat img = cv::Mat::zeros(256,256, CV_8UC3);
    for(int x=0;x<img.cols;x+=2)
        for(int y=0;y<img.rows;y+=2)
            SetPixel<cv::Vec3b, 3>(img, x, y, cv::Vec3b(255,255,255));
    cv::Mat imgGray = cv::Mat::zeros(256,256, CV_8UC1);
    for(int x=0;x<imgGray.cols;x+=4)
        for(int y=0;y<imgGray.rows;y+=4)
            SetPixel<unsigned char, 1>(imgGray, x, y, (unsigned char)255);
    cv::imwrite("out.jpg", img);
    cv::imwrite("outGray.jpg", imgGray);
    return 0;
}
That is pretty fast I think.
For any future readers: instead of relying only on the answers here, please read this blog post for a benchmark-based analysis of this functionality, as some of the answers are a bit off.
From that post you can see that the fastest way to access pixels is using the forEach C++ Mat function. If you want the neighborhood it depends of the size; if you're looking for the usual squared 3x3 neighborhood, use pointers like this:
Mat img = Mat(100,100,CV_8U, Scalar(124)); // sample mat
uchar *up, *row, *down; // Pointers to rows
uchar n[9]; // neighborhood
for (int y = 1 ; y < (img.rows - 1) ; y++) {
    up = img.ptr(y - 1);
    row = img.ptr(y);
    down = img.ptr(y + 1);
    for (int x = 1 ; x < (img.cols - 1) ; x++) {
        // Examples of how to access any pixel in the 8-connected neighborhood
        n[0] = up[x - 1];
        n[1] = up[x];
        n[2] = up[x + 1];
        n[3] = row[x - 1];
        n[4] = row[x];
        n[5] = row[x + 1];
        n[6] = down[x - 1];
        n[7] = down[x];
        n[8] = down[x + 1];
    }
}
This code can still be optimized but the idea of using row pointers is what I was trying to convey; this is just a bit faster than using the .at() function and you might have to do benchmarking to notice the difference (in versions of OpenCV 3+). You might want to use .at() before deciding to optimize pixel access.
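For readers who want to try the row-pointer idea without OpenCV, the sketch below applies it to a plain row-major 8-bit buffer. It is an analogue rather than OpenCV code: mean3x3 is a hypothetical name, and the stride parameter (bytes between row starts) stands in for the row padding that Mat::ptr accounts for. Border pixels are simply left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 mean over the interior of a row-major 8-bit image.
// `stride` is the number of bytes between consecutive row starts (>= cols).
// Border pixels of dst are not written.
void mean3x3(const uint8_t *src, uint8_t *dst, int rows, int cols, std::size_t stride) {
    assert(src != nullptr && dst != nullptr);
    assert(rows >= 0 && cols >= 0 && stride >= static_cast<std::size_t>(cols));
    for (int y = 1; y + 1 < rows; ++y) {
        // Three row pointers are fetched once per row, not once per pixel.
        const uint8_t *up = src + (y - 1) * stride;
        const uint8_t *row = src + y * stride;
        const uint8_t *down = src + (y + 1) * stride;
        uint8_t *out = dst + y * stride;
        for (int x = 1; x + 1 < cols; ++x) {
            const int sum = up[x - 1] + up[x] + up[x + 1]
                          + row[x - 1] + row[x] + row[x + 1]
                          + down[x - 1] + down[x] + down[x + 1];
            out[x] = static_cast<uint8_t>(sum / 9);
        }
    }
}
```

The inner loop only does pointer-relative reads, which is the property the answer above relies on for speed.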

What is the best way to create multi-dimensional array?

I have a very basic question regarding STL containers.
My requirement is that I want to store double values in the form of a multi-dimensional array. I will be performing various algebraic operations directly on them, e.g.
myvector[4] = myvector[3] - 2 * myvector[2];
For this I am iterating with for loops and using the [] operator. I am not using STL iterators. I found 2 basic approaches here.
I prefer speed over memory efficiency. Since I am accessing these variables frequently, I think vector would be slow for me.
So what is your opinion on this matter?
I know that the answers would be based on your previous experience, which is why I am asking this question. I am sorry if this question is too basic to be discussed here.
The link you gave lists 2 methods, both of which create "real" 2D arrays. In general, such 2D arrays are not that efficient, because they require one allocation per row. Instead, you can use a faked 2D array:
// Array of length L and width W
type* array1 = new type[L * W]; // raw pointers
std::vector<type> array2(L * W); // STL Vector
// Accessing a value. You have to use a convention for indices, and follow it.
// Here the convention is: lines are contiguous (index = x + y * W)
type value = array1[x + y * W]; // same expression works for array2
Here is a simple benchmark (Windows only, unless you change the timer part):
#include <vector>
#include <ctime>
#include <iostream>
#include <stdlib.h>
#include <Windows.h>

typedef LARGE_INTEGER clock_int;

void start_timer(clock_int& v)
{
    QueryPerformanceCounter(&v);
}

void end_timer(clock_int v, const char* str)
{
    clock_int e;
    clock_int freq;
    QueryPerformanceCounter(&e);
    QueryPerformanceFrequency(&freq);
    std::cout << str << 1000.0 * ((double)(e.QuadPart-v.QuadPart) / freq.QuadPart) << " ms\n";
}

void test_2d_vector(unsigned int w, unsigned int h)
{
    std::vector<std::vector<double> > a;
    for(unsigned int t = 0; t < h; t++)
        a.push_back(std::vector<double>(w));
    clock_int clock;
    // Benchmark random write access
    start_timer(clock);
    for(unsigned int t = 0; t < w * h; t++)
        a[rand() % h][rand() % w] = 0.0f;
    end_timer(clock,"[2D] Random write (STL) : ");
    // Benchmark contiguous write access
    start_timer(clock);
    for(unsigned int y = 0; y < h; y++)
        for(unsigned int x = 0; x < w; x++)
            a[y][x] = 0.0f;
    end_timer(clock,"[2D] Contiguous write (STL) : ");
}

void test_2d_raw(unsigned int w, unsigned int h)
{
    double** a = new double*[h];
    for(unsigned int t = 0; t < h; t++)
        a[t] = new double[w];
    clock_int clock;
    // Benchmark random write access
    start_timer(clock);
    for(unsigned int t = 0; t < w * h; t++)
        a[rand() % h][rand() % w] = 0.0f;
    end_timer(clock,"[2D] Random write (RAW) : ");
    // Benchmark contiguous write access
    start_timer(clock);
    for(unsigned int y = 0; y < h; y++)
        for(unsigned int x = 0; x < w; x++)
            a[y][x] = 0.0f;
    end_timer(clock,"[2D] Contiguous write (RAW) : ");
}

void test_1d_raw(unsigned int w, unsigned int h)
{
    double* a = new double[h * w];
    clock_int clock;
    // Benchmark random write access
    start_timer(clock);
    for(unsigned int t = 0; t < w * h; t++)
        a[(rand() % h) * w + (rand() % w)] = 0.0f;
    end_timer(clock,"[1D] Random write (RAW) : ");
    // Benchmark contiguous write access
    start_timer(clock);
    for(unsigned int y = 0; y < h; y++)
        for(unsigned int x = 0; x < w; x++)
            a[x + y * w] = 0.0f;
    end_timer(clock,"[1D] Contiguous write (RAW) : ");
}

void test_1d_vector(unsigned int w, unsigned int h)
{
    std::vector<double> a(h * w);
    clock_int clock;
    // Benchmark random write access
    start_timer(clock);
    for(unsigned int t = 0; t < w * h; t++)
        a[(rand() % h) * w + (rand() % w)] = 0.0f;
    end_timer(clock,"[1D] Random write (STL) : ");
    // Benchmark contiguous write access
    start_timer(clock);
    for(unsigned int y = 0; y < h; y++)
        for(unsigned int x = 0; x < w; x++)
            a[x + y * w] = 0.0f;
    end_timer(clock,"[1D] Contiguous write (STL) : ");
}

int main()
{
    int w=1000,h=1000;
    test_2d_vector(w,h);
    test_2d_raw(w,h);
    test_1d_vector(w,h);
    test_1d_raw(w,h);
    return 0;
}
Compiled with msvc2010, release /Ox /Ot, it outputs for me (Win7 x64, Intel Core i7 2600K):
[2D] Random write (STL) : 32.3436 ms
[2D] Contiguous write (STL) : 0.480035 ms
[2D] Random write (RAW) : 32.3477 ms
[2D] Contiguous write (RAW) : 0.688771 ms
[1D] Random write (STL) : 32.1296 ms
[1D] Contiguous write (STL) : 0.23534 ms
[1D] Random write (RAW) : 32.883 ms
[1D] Contiguous write (RAW) : 0.220138 ms
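You can see that the STL is equivalent to raw pointers. For contiguous writes, 1D is roughly two to three times faster than 2D, while random writes take about the same time in all four cases.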
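Since the benchmark's timer is Windows-specific, here is a minimal sketch of a portable replacement built on std::chrono::steady_clock, together with the two contiguous-write kernels. The names time_ms, fill_1d and fill_2d are hypothetical, and no timings are claimed for it; results will depend on the compiler, flags, and hardware.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Milliseconds taken by a callable, measured with a monotonic clock.
template <typename F>
double time_ms(F &&work) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Contiguous write into a flat w*h vector (index = x + y * w).
void fill_1d(std::vector<double> &a, std::size_t w, std::size_t h, double v) {
    assert(a.size() == w * h);
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            a[x + y * w] = v;
}

// Contiguous write into a vector of rows (one allocation per row).
void fill_2d(std::vector<std::vector<double>> &a, double v) {
    for (auto &row : a)
        for (double &e : row)
            e = v;
}
```

A measurement then looks like double ms = time_ms([&] { fill_1d(a, w, h, 0.0); });, and the same wrapper can time fill_2d on a vector of rows for comparison.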