Optimizing 1D Convolution - c++

Is there a way to speed up this 1D convolution? I tried to make the dy pass cache-efficient,
but compiling with g++ and -O3 gave worse performance.
I am convolving with [-1, 0, 1] in both directions.
This is not homework.
#include <cstdlib>
#include <iostream>
#include <sys/time.h>
void print_matrix( int height, int width, float *matrix){
    for (int j=0; j < height; j++){
        for (int i=0; i < width; i++){
            std::cout << matrix[j * width + i] << ",";
        }
        std::cout << std::endl;
    }
}
void fill_matrix( int height, int width, float *matrix){
    for (int j=0; j < height; j++){
        for (int i=0; i < width; i++){
            matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;
        }
    }
}
#define RESTRICT __restrict__
void dx_matrix( int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
    //init min,max
    *min = *max = -1.F * in_matrix[0] + in_matrix[1];
    for (int j=0; j < height; j++){
        float* row = in_matrix + j * width;
        for (int i=1; i < width-1; i++){
            float res = -1.F * row[i-1] + row[i+1]; /* -1.F * value + 0.F * value + 1.F * value; */
            if (res > *max ) *max = res;
            if (res < *min ) *min = res;
            out_matrix[j * width + i] = res;
        }
    }
}
void dy_matrix( int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
    //init min,max
    *min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1];
    for (int j=1; j < height-1; j++){
        for (int i=0; i < width; i++){
            float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
            if (res > *max ) *max = res;
            if (res < *min ) *min = res;
            out_matrix[j * width + i] = res;
        }
    }
}
double now (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
int main(int argc, char **argv){
    int width, height;
    float *in_matrix;
    float *out_matrix;
    if(argc < 3){
        std::cout << argv[0] << "usage: width height " << std::endl;
        return -1;
    }
    width = atoi(argv[1]);
    height = atoi(argv[2]);
    std::cout << "Width:"<< width << " Height:" << height << std::endl;
    if (width < 3){
        std::cout << "Width too short " << std::endl;
        return -1;
    }
    if (height < 3){
        std::cout << "Height too short " << std::endl;
        return -1;
    }
    in_matrix = (float *) malloc( height * width * sizeof(float));
    out_matrix = (float *) malloc( height * width * sizeof(float));
    fill_matrix(height, width, in_matrix);
    //print_matrix(height, width, in_matrix);
    float min, max;
    double a = now();
    dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
    std::cout << "dx min:" << min << " max:" << max << std::endl;
    dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
    double b = now();
    std::cout << "dy min:" << min << " max:" << max << std::endl;
    std::cout << "time: " << b-a << " sec" << std::endl;
    return 0;
}

Use local variables for computing the min and max. Every time you do this:
if (res > *max ) *max = res;
if (res < *min ) *min = res;
*max and *min may have to be written to memory each time. Adding restrict to the min and max pointers would help (indicating the writes are independent), but an even better way would be something like:
float tempMin = ...
float tempMax = ...
// Inner loop
tempMin = (res < tempMin) ? res : tempMin;
tempMax = (res > tempMax) ? res : tempMax;
// End
*min = tempMin;
*max = tempMax;
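For illustration, here is a minimal sketch of the dx pass with the running min and max kept in locals and stored through the pointers once at the end. The function name dx_matrix_local is made up for this example, and seeding the extremes with the first value the loop computes is my own choice, not something from the question.

```cpp
#include <cassert>

// Horizontal [-1, 0, 1] pass over a row-major height x width matrix.
// The running min/max live in local variables and are written through
// the output pointers only once, after the loops.
// Assumes width >= 3 and height >= 1; border columns of out are not written.
void dx_matrix_local(int height, int width, const float* in, float* out,
                     float* min, float* max) {
    assert(width >= 3 && height >= 1);
    // Seed with the first value the loop actually computes (row 0, column 1).
    float tempMin = in[2] - in[0];
    float tempMax = tempMin;
    for (int j = 0; j < height; j++) {
        const float* row = in + j * width;
        float* out_row = out + j * width;
        for (int i = 1; i < width - 1; i++) {
            float res = row[i + 1] - row[i - 1];
            tempMin = (res < tempMin) ? res : tempMin;
            tempMax = (res > tempMax) ? res : tempMax;
            out_row[i] = res;
        }
    }
    *min = tempMin;
    *max = tempMax;
}
```

Whether this beats the pointer version depends on the compiler and flags, so it is worth timing both.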

First of all, I would rewrite the dy loop to get rid of "[ (j-1) * width + i]" and "in_matrix[ (j+1) * width + i]", and do something like:
float *p, *q, *out;
p = &in_matrix[(j-1)*width];
q = &in_matrix[(j+1)*width];
out = &out_matrix[j*width];
for (int i=0; i < width; i++){
    float res = -1.F * p[i] + q[i] ;
    if (res > *max ) *max = res;
    if (res < *min ) *min = res;
    out[i] = res;
}
But that is a trivial optimization that the compiler may already be doing for you.
It will be slightly faster to do "q[i]-p[i]" instead of "-1.f*p[i]+q[i]", but, again, the compiler may be smart enough to do that behind your back.
The whole thing would benefit considerably from SSE2 and multithreading. I'd bet on at least a 3x speedup from SSE2 right away. Multithreading can be added using OpenMP and it will only take a few lines of code.
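Putting the row-pointer idea and the plain subtraction together, a scalar sketch of the dy pass could look like the following. The name dy_matrix_rows and the local lo/hi variables are my own; SSE2 and OpenMP are deliberately left out, so this only shows the indexing.

```cpp
#include <cassert>

// Vertical [-1, 0, 1] pass over a row-major height x width matrix.
// p and q point at the rows above and below the current one, so the
// inner loop needs no multiplications for indexing.
// Assumes height >= 3 and width >= 1; the first and last rows of out are not written.
void dy_matrix_rows(int height, int width, const float* in, float* out,
                    float* min, float* max) {
    assert(height >= 3 && width >= 1);
    // Seed with the first value the loop computes (row 1, column 0).
    float lo = in[2 * width] - in[0];
    float hi = lo;
    for (int j = 1; j < height - 1; j++) {
        const float* p = in + (j - 1) * width;  // row above
        const float* q = in + (j + 1) * width;  // row below
        float* o = out + j * width;             // output row
        for (int i = 0; i < width; i++) {
            float res = q[i] - p[i];
            if (res < lo) lo = res;
            if (res > hi) hi = res;
            o[i] = res;
        }
    }
    *min = lo;
    *max = hi;
}
```

The inner loop touches three contiguous rows and has no loop-carried dependency apart from the min/max, which is the shape a vectorizer generally handles well.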

The compiler might notice this but you are creating/freeing a lot of variables on the stack as you go in and out of the scope operators {}. Instead of:
for (int j=0; j < height; j++){
float* row = in_matrix + j * width;
for (int i=1; i < width-1; i++){
float res = -1.F * row[i-1] + row[i+1];
How about:
int i, j;
float *row;
float res;
for (j=0; j < height; j++){
row = in_matrix + j * width;
for (i=1; i < width-1; i++){
res = -1.F * row[i-1] + row[i+1];

Well, the compiler might be taking care of these, but here are a couple of small things:
a) Why are you multiplying by -1.F? Why not just subtract? For instance:
float res = -1.F * row[i-1] + row[i+1];
could just be:
float res = row[i+1] - row[i-1];
b) This:
if (res > *max ) *max = res;
if (res < *min ) *min = res;
can be made into
if (res > *max ) *max = res;
else if (res < *min ) *min = res;
and in other places. If the first is true, the second can't be so let's not check it.
Here's another thing. To minimize your multiplications, change
for (int j=1; j < height-1; j++){
    for (int i=0; i < width; i++){
        float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
to
int h = 0;
int width2 = 2 * width;
for (int j=1; j < height-1; j++){
    for (int i=h; i < h + width; i++){
        float res = in_matrix[i + width2] - in_matrix[i];
and at the end of the inner loop body
        out_matrix[i + width] = res;
with h advanced once per row, after the inner loop has finished (so that i starts at the row above the current one):
    h += width;
You can do similar things in other places, but hopefully you get the idea. Also, there is a minor bug:
*min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1 ];
should end with in_matrix[ 2 * width ], so that min and max start from the first value the loop actually computes (j=1, i=0).

Profiling this with -O3 and -O2 using versions of both the clang and g++ compilers on OS X, I found that:
30% of the time was spent filling the initial matrix:
matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;
40% of the time was spent in dx_matrix on the line:
out_matrix[j * width + i] = row[i+1] -row[i-1];
About 9% of the time was spent in the conditionals in dx_matrix. I moved them into a separate loop to see if that helped, but it didn't change much.
Shark suggested that this could be improved through the use of SSE instructions.
Interestingly, only about 19% of the time was spent in the dy_matrix routine.
This was run on a 10k by 10k matrix (about 1.6 seconds).
Note that your results may differ if you're using a different compiler, a different OS, etc.


Problem in converting the "for loop" in CUDA

I am trying to extract patches from an image in parallel, with a pixel shift between patches (so the patches overlap). I have written the CPU version of the code, but I have not been able to convert the for loop that increments by the pixel shift. The CreatePatchDataSet function contains that loop. Please help me convert this function to CUDA. The code is below.
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <iostream>
#include <fstream>
#include <sstream>
#include <random>
#include <vector>
#include <omp.h>
using namespace std;
using namespace cv;
#define PATCH_SIZE (5)
#define PIXEL_SHIFT (2)
void ConvertMat2DoubleArray(cv::Mat input, double* output)
for (int i = 0; i < input.rows; i++)
double *src = input.ptr<double>(i);
for (int j = 0; j < input.cols; j++)
output[input.cols * input.channels() * i + input.channels() * j + 0] = src[j];
void GetNumOfPatch(const int width, const int height, const int patch_size, const int pixel_shift, int* num_of_patch, int* num_of_patch_col, int* num_of_patch_row) {
*num_of_patch_col = 0;
int len_nb = 0;
while (len_nb < width) {
if (len_nb != 0) {
len_nb += patch_size - (patch_size - pixel_shift);
else {
len_nb += patch_size;
len_nb = 0;
*num_of_patch_row = 0;
while (len_nb < height) {
if (len_nb != 0) {
len_nb += patch_size - (patch_size - pixel_shift);
else {
len_nb += patch_size;
*num_of_patch = (*num_of_patch_col) * (*num_of_patch_row);
void CreatePatchDataSet(double *original_data, double* patch_data, const int width, const int height, const int pixel_shift, const int patch_size, const int num_of_patch_col, const int num_of_patch_row) {
int counter_row = 0;
int num_of_patch_image = num_of_patch_row * num_of_patch_col;
for (int i = 0; i < height; i += pixel_shift) {
int counter_col = 0;
for (int j = 0; j < width; j += pixel_shift) {
//Get Low Resolution Image
for (int ii = 0; ii < patch_size; ii++) {
for (int jj = 0; jj < patch_size; jj++) {
if ((i + ii) < height && (j + jj) < width) {
patch_data[num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col*counter_row + counter_col] = original_data[width*(i + ii) + (j + jj)];
else {
patch_data[num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col*counter_row + counter_col] = 0.;
if (counter_col == num_of_patch_col) {
if (counter_row == num_of_patch_row) {
int main()
int ratio=2;
cv::Mat image = cv::imread("input_b2_128.tif", CV_LOAD_IMAGE_UNCHANGED);
cv::Mat imageH = cv::Mat(image.rows * ratio, image.cols * ratio, CV_8UC1);
cv::resize(image, imageH, cv::Size(imageH.cols, imageH.rows), 0, 0, cv::INTER_LANCZOS4);
double* orgimageH = (double*)calloc(imageH.cols*imageH.rows*image.channels(), sizeof(double));
ConvertMat2DoubleArray(imageH, orgimageH);
int widthH = imageH.cols;
int heightH = imageH.rows;
int dimH = (int)PATCH_SIZE * (int)PATCH_SIZE* (int)image.channels();
int dimL = (int)PATCH_SIZE/ratio* (int)PATCH_SIZE/ratio * (int)image.channels();
//3. Create training data set=========================
int num_of_patch_image = 0;
int num_of_patch_col = 0;
int num_of_patch_row = 0;
GetNumOfPatch(widthH, heightH, (int)PATCH_SIZE, (int)PIXEL_SHIFT, &num_of_patch_image, &num_of_patch_col, &num_of_patch_row);
cout<<"patch numbers: \n " << num_of_patch_image << endl;
double* FY = (double*)calloc(dimH * num_of_patch_image, sizeof(double));
CreatePatchDataSet(orgimageH, FY, widthH, heightH, (int)PIXEL_SHIFT, (int)PATCH_SIZE, num_of_patch_col, num_of_patch_row);
return 0;
The results I got for first 10 values in CPU version:
patch numbers:
I have tried to convert this function to a kernel function using CUDA, but it goes into an infinite loop. As I am very new to CUDA, could you please help me find the problem in the code?
__global__ void CreatePatchDataSet(double *original_data, double* patch_data, const int width, const int height, const int pixel_shift, const int patch_size, const int num_of_patch_col, const int num_of_patch_row) {
int num_of_patch_image = num_of_patch_row * num_of_patch_col;
int i = threadIdx.x + (blockDim.x*blockIdx.x);
int j = threadIdx.y + (blockDim.y*blockIdx.y);
while (i<height && j< width)
int counter_row = 0;
int counter_col = 0;
//Get Low Resolution Image
for (int ii = 0; ii < patch_size; ii++) {
for (int jj = 0; jj < patch_size; jj++) {
if ((i + ii) < height && (j + jj) < width) {
patch_data[num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col*counter_row + counter_col] = original_data[width*(i + ii) + (j + jj)];
else {
patch_data[num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col*counter_row + counter_col] = 0.;
if (counter_col == num_of_patch_col) {
if (counter_row == num_of_patch_row) {
i+= blockDim.x*gridDim.x;
j+= blockDim.y*gridDim.y;
int main()
int ratio=2;
cv::Mat image = cv::imread("input_b2_128.tif", CV_LOAD_IMAGE_UNCHANGED);
cv::Mat imageH = cv::Mat(image.rows * ratio, image.cols * ratio, CV_8UC1);
cv::resize(image, imageH, cv::Size(imageH.cols, imageH.rows), 0, 0, cv::INTER_LANCZOS4);
double *orgimageH = (double*)calloc(imageH.cols*imageH.rows*image.channels(), sizeof(double));
ConvertMat2DoubleArray(imageH, orgimageH);
int widthH = imageH.cols;
int heightH = imageH.rows;
int dimH = (int)PATCH_SIZE * (int)PATCH_SIZE* (int)image.channels();
int dimL = (int)PATCH_SIZE/ratio* (int)PATCH_SIZE/ratio * (int)image.channels();
//3. Create training data set=========================
int num_of_patch_image = 0;
int num_of_patch_col = 0;
int num_of_patch_row = 0;
GetNumOfPatch(widthH, heightH, (int)PATCH_SIZE, (int)PIXEL_SHIFT, &num_of_patch_image, &num_of_patch_col, &num_of_patch_row);
cout<<"patch numbers: \n " << num_of_patch_image << endl;
double* FY = (double*)calloc(dimH * num_of_patch_image, sizeof(double));
double *d_orgimageH;
gpuErrchk(cudaMalloc ((void**)&d_orgimageH, sizeof(double)*widthH*heightH));
double *d_FY;
gpuErrchk(cudaMalloc ((void**)&d_FY, sizeof(double)* dimH * num_of_patch_image));
gpuErrchk(cudaMemcpy(d_orgimageH , orgimageH , sizeof(double)*widthH*heightH, cudaMemcpyHostToDevice));
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (widthH + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (heightH + dimBlock.y - 1) / dimBlock.y;
CreatePatchDataSet<<<dimGrid,dimBlock>>>(d_orgimageH, d_FY, widthH, heightH, (int)PIXEL_SHIFT, (int)PATCH_SIZE, num_of_patch_col, num_of_patch_row);
gpuErrchk(cudaMemcpy(FY,d_FY, sizeof(double)*dimH * num_of_patch_image, cudaMemcpyDeviceToHost));
// cout<<"Hello world";
return 0;
Image I have used: https://i.stack.imgur.com/Ywg7p.png
i+= blockDim.x*gridDim.x;
j+= blockDim.y*gridDim.y;
is outside the while loop in your kernel. As i and j never change inside the while loop, it never terminates. There could be more problems here, but this is the most prominent one.
EDIT: Another problem I found is that you have only one while loop over both i and j instead of one for each. You should probably use for loops like in your CPU code:
for (i = pixel_shift * (threadIdx.x + (blockDim.x*blockIdx.x));
i < height;
i += pixel_shift * blockDim.x * gridDim.x) {
for (j = ...; j < ...; j += ...) {
/* ... */
I could imagine this to be a good idea:
for (counter_row = threadIdx.y + blockDim.y * blockIdx.y;
counter_row < num_of_patch_row;
counter_row += blockDim.y * gridDim.y) {
i = counter_row * pixel_shift;
if (i > height)
for (counter_col = threadIdx.x + blockDim.x * blockIdx.x;
counter_col < num_of_patch_col;
counter_col += blockDim.x * gridDim.x) {
j = counter_col * pixel_shift;
if (j > width)
/* ... */
I have also exchanged the x/y fields of the execution parameters between the inner and the outer loop, as that seemed more appropriate given that threads with consecutive x indices sit next to each other within a warp (which benefits memory access).
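To make the patch-index idea concrete, here is a serial C++ sketch (not CUDA) of the work a single thread would do for one (counter_row, counter_col) pair. The function name copy_patch and its last two parameters are hypothetical; the indexing into patch_data follows the CPU code in the question. In a kernel, patch_row and patch_col would come from the grid-stride loops over the y and x thread indices.

```cpp
#include <cassert>

// Copies one patch into the column-per-patch layout used in the question.
// (patch_row, patch_col) identifies the patch; its top-left pixel is at
// (patch_row * pixel_shift, patch_col * pixel_shift). Pixels that fall
// outside the image are written as zero.
void copy_patch(const double* original_data, double* patch_data,
                int width, int height, int pixel_shift, int patch_size,
                int num_of_patch_col, int num_of_patch_row,
                int patch_row, int patch_col) {
    assert(patch_row >= 0 && patch_row < num_of_patch_row);
    assert(patch_col >= 0 && patch_col < num_of_patch_col);
    const int num_of_patch_image = num_of_patch_row * num_of_patch_col;
    const int i = patch_row * pixel_shift;  // top pixel row of the patch
    const int j = patch_col * pixel_shift;  // left pixel column of the patch
    const int patch_index = num_of_patch_col * patch_row + patch_col;
    for (int ii = 0; ii < patch_size; ii++) {
        for (int jj = 0; jj < patch_size; jj++) {
            const bool inside = (i + ii) < height && (j + jj) < width;
            patch_data[num_of_patch_image * (patch_size * ii + jj) + patch_index] =
                inside ? original_data[width * (i + ii) + (j + jj)] : 0.0;
        }
    }
}
```

Because every patch writes to its own set of output slots, no two threads would write the same element, so no synchronization should be needed for this step.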

Issue with convolution kernels that add up to zero

I'm making an image editing program in C++ using SFML and tried to add image filters using:
int clamp(int value, int min, int max)
if (value < min)
return min;
if (value > max)
return max;
return value;
void MyImage::applyKernel(std::vector<std::vector<int>> kernel)
int index(0), tempx(0), tempy(0);
int wr(0), wg(0), wb(0), wa(0), sum(0);
auto newPixels = new sf::Uint8[this->size_y * this->size_x * 4];
// Calculate the sum of the kernel
for (int i = 0; i < kernel.size(); i++) {
for (int j = 0; j < kernel[i].size(); j++) {
sum += kernel[i][j];
for (int y = 0; y < this->size_y; y++) {
for (int x = 0; x < this->size_x; x++) {
// Calculate weighted sum from kernel
wr = wg = wb = wa = 0;
for (int i = 0; i < kernel.size(); i++) {
for (int j = 0; j < kernel[i].size(); j++) {
// Calculates the coordinates of the kernel relative to the pixel we are changing
tempx = x + (j - floor(kernel[i].size() / 2));
tempy = y + (i - floor(kernel.size() / 2));
//std::cout << "kernel=(" << j << ", " << i << "), pixel=(" << x << ", " << y << ") tempPos=(" << tempx << ", " << tempy << ")\n";
// This code below should have the effect of mirroring the image in the case the kernel coordinate is out of bounds (along the edge of the image)
tempx = (tempx < 0) ? -1 * tempx : tempx;
tempy = (tempy < 0) ? -1 * tempy : tempy;
tempx = (tempx > this->size_x) ? x - (j - floor(kernel[i].size() / 2)) : tempx;
tempy = (tempy > this->size_y) ? y - (i - floor(kernel.size() / 2)) : tempy;
if (tempx >= 0 && tempx < this->size_x && tempy >= 0 && tempy < this->size_y) {
index = (((tempy * this->size_x) - tempy) + (tempx)) * 4;
wr += kernel[i][j] * this->pixels[index];
wg += kernel[i][j] * this->pixels[index + 1];
wb += kernel[i][j] * this->pixels[index + 2];
wa += kernel[i][j] * this->pixels[index + 3];
if (sum) {
wr /= sum;
wg /= sum;
wb /= sum;
wa /= sum;
index = (((y * this->size_x) - y) + (x)) * 4;
newPixels[index] = clamp(wr, 0, 255); // Red
newPixels[index + 1] = clamp(wg, 0, 255); // Green
newPixels[index + 2] = clamp(wb, 0, 255); // Blue
newPixels[index + 3] = clamp(wa, 0, 255); // Alpha
this->pixels = newPixels;
// Copies the data from our sf::Uint8 array to the image object to be displayed => Removes the overhead of calling setPixel(x,y,color) for every pixel {As a side note setPixel() should always be avoided}
this->im->create(this->size_x, this->size_y, this->pixels);
I was trying to use [-1,-1,-1], [-1,8,-1], [-1,-1,-1] for edge detection but just ended up with a white image except for some pixels near the bottom. I've tried different images and kernels, but any kernel whose entries add up to 0 doesn't work. For example, if I take the edge detection kernel above and change the 8 to a 9, it gives the expected result. Is there something wrong with my understanding of how convolution kernels work, or is it just a bug in my code?
Thank you.
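One possibility worth checking, offered as a guess rather than a confirmed diagnosis: the kernel is applied to the alpha channel as well, and the `if (sum)` guard skips normalisation when the kernel sums to 0. On a fully opaque image every alpha sample is 255, so a zero-sum kernel gives alpha 0 (fully transparent), while the variant with 9 in the centre gives 255. The small sketch below reproduces just that arithmetic; convolve_constant and clamp_int are hypothetical helper names, not part of SFML or of the code above.

```cpp
#include <cassert>
#include <vector>

// Same clamping rule as the clamp() in the question.
int clamp_int(int value, int min, int max) {
    assert(min <= max);
    if (value < min) return min;
    if (value > max) return max;
    return value;
}

// Applies the kernel to a neighbourhood in which every sample has the
// same value (e.g. alpha = 255 everywhere), using the same
// "divide only if the kernel sum is non-zero" rule as applyKernel.
int convolve_constant(const std::vector<std::vector<int>>& kernel, int value) {
    int sum = 0;
    int weighted = 0;
    for (const auto& row : kernel) {
        for (int k : row) {
            sum += k;
            weighted += k * value;
        }
    }
    if (sum) weighted /= sum;
    return clamp_int(weighted, 0, 255);
}
```

If that is what is happening, leaving the alpha channel untouched (copying it from the source pixel) would be one way to keep the edge image visible.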

2D array CUDA problems

I'm currently struggling to work properly with 2D arrays within my CUDA kernel. 1D was fine, but so far I have had no luck moving on to 2D. Here are my host function and kernel:
__global__ void add_d2D(double *x, double *y,double *z, int n, int m){
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x){
for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < m; j += blockDim.y * gridDim.y){
z[i*m + j] = x[i*m + j] + y[i*m + j];
__host__ void add2D(double *a, double *b, double *result, int N, int M){
double *a_d, *b_d, *c_d;
size_t pitcha;
size_t pitchb;
size_t pitchc;
cudaErrchk(cudaMallocPitch(&a_d,&pitcha, M*sizeof(double),N));
cudaErrchk(cudaMallocPitch(&b_d,&pitchb, M*sizeof(double),N));
cudaErrchk(cudaMallocPitch(&c_d,&pitchc, M*sizeof(double),N));
cudaErrchk(cudaMemcpy2D(a_d,M*sizeof(double), a,pitcha, M*sizeof(double),N, cudaMemcpyHostToDevice));
cudaErrchk(cudaMemcpy2D(b_d,M*sizeof(double), b,pitchb, M*sizeof(double),N, cudaMemcpyHostToDevice));
dim3 threadsPerBlock(2, 2);
dim3 numBlocks(N/threadsPerBlock.x, M/threadsPerBlock.y);
add_d2D<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d , N, M);
cudaErrchk(cudaMemcpy2D(result,M*sizeof(double), c_d,pitchc, M*sizeof(double),N, cudaMemcpyDeviceToHost));
And below is my example to test it. It prints out the first 10 values of c correctly, but all others remain 0. I believe the problem is within the kernel, where it can't find the correct values due to the pitch, but I'm not sure how to solve it correctly.
double a[4][10];
double b[4][10];
double c[4][10];
for (int i = 0; i < 4; i ++){
for (int j = 0; j < 10; j ++){
a[i][j] = 0 + rand() % 10;
b[i][j] = 0 + rand() % 10;
ertiscuda::add2D((double *)a, (double *)b, (double *)c, 4, 10);
for (int i = 0; i < 4; i ++){
for (int j = 0; j < 10; j ++){
std::cout << a[i][j] << " " << b[i][j] << " " << c[i][j] << std::endl;
You have two mistakes:
1. Each thread in the kernel should perform one operation rather than all the operations. (For memory reasons you might want to do more, but we will keep this example simple.)
2. You had the destination and source pitches switched when copying the data onto the device.
Here is a working version:
#include <cuda_runtime.h>
#include <stdlib.h>
#include <cmath>
#include <iostream>
#include <sstream>
#include <stdexcept>
#define CUDASAFECALL( err ) cuda_safe_call(err, __FILE__, __LINE__ )
void cuda_safe_call(const cudaError err, const char *file, const int line)
{
    if (cudaSuccess != err)
    {
        std::stringstream error_msg;
        error_msg << "cuda_safe_call() failed at " << file << ":" << line << ":" << cudaGetErrorString(err);
        const auto error_msg_str = error_msg.str();
        std::cout << error_msg_str << std::endl;
        throw std::runtime_error(error_msg_str);
    }
}
__global__ void add_d2D(const double *x, const double *y, double *z, int n, int m, int m_pitch_elements)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row< n && col <m )
    {
        // rows are m_pitch_elements apart in device memory, not m
        auto idx = row*m_pitch_elements + col;
        z[idx] = x[idx] + y[idx];
        //z[idx] = idx;
    }
}
__host__ void add2D(const double *a,const double *b, double *result, int N, int M) {
    double *a_d, *b_d, *c_d;
    size_t pitcha,pitchb,pitchc;
    CUDASAFECALL(cudaMallocPitch(&a_d, &pitcha, M * sizeof(double), N));
    CUDASAFECALL(cudaMallocPitch(&b_d, &pitchb, M * sizeof(double), N));
    CUDASAFECALL(cudaMallocPitch(&c_d, &pitchc, M * sizeof(double), N));
    // destination pitch first (device), then source pitch (host rows are packed)
    CUDASAFECALL(cudaMemcpy2D(a_d, pitcha, a, M * sizeof(double), M * sizeof(double), N, cudaMemcpyHostToDevice));
    CUDASAFECALL(cudaMemcpy2D(b_d, pitchb, b, M * sizeof(double), M * sizeof(double), N, cudaMemcpyHostToDevice));
    dim3 threadsPerBlock(2, 2);
    auto safediv = [](auto a, auto b) {return static_cast<unsigned int>(ceil(a / (b*1.0))); };
    dim3 numBlocks(safediv(N, threadsPerBlock.x), safediv( M, threadsPerBlock.y));
    //all the pitches should be the same
    auto pitch_elements = pitcha / sizeof(double);
    add_d2D<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d, N, M, pitch_elements);
    CUDASAFECALL(cudaMemcpy2D(result, M * sizeof(double), c_d, pitchc, M * sizeof(double), N, cudaMemcpyDeviceToHost));
    // release the device buffers
    CUDASAFECALL(cudaFree(a_d));
    CUDASAFECALL(cudaFree(b_d));
    CUDASAFECALL(cudaFree(c_d));
}
int main()
{
    double a[4][10];
    double b[4][10];
    double c[4][10];
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 10; j++) {
            a[i][j] = 0 + rand() % 10;
            b[i][j] = 0 + rand() % 10;
        }
    }
    add2D((double *)a, (double *)b, (double *)c, 4, 10);
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 10; j++) {
            std::cout << a[i][j] << " " << b[i][j] << " " << c[i][j]<< "|"<< a[i][j]+ b[i][j] << std::endl;
        }
    }
    return 0;
}
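To see why the kernel has to index with the pitch rather than with m, here is a host-only C++ sketch that imitates a pitched layout in an ordinary vector. Nothing here calls the CUDA runtime: pitched_index and copy_packed_to_pitched are hypothetical helpers, and the pitch is whatever the caller passes in, whereas on a real device the value comes from cudaMallocPitch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of element (row, col) in a buffer whose rows start pitch_bytes apart.
inline std::size_t pitched_index(std::size_t row, std::size_t col,
                                 std::size_t pitch_bytes) {
    assert(pitch_bytes % sizeof(double) == 0);
    return row * (pitch_bytes / sizeof(double)) + col;
}

// Host-side imitation of copying packed rows (cols doubles each) into a
// pitched buffer: each row is followed by padding up to pitch_bytes.
void copy_packed_to_pitched(const double* src, std::size_t rows, std::size_t cols,
                            std::vector<double>& dst, std::size_t pitch_bytes) {
    assert(pitch_bytes >= cols * sizeof(double));
    dst.assign(rows * (pitch_bytes / sizeof(double)), 0.0);
    for (std::size_t r = 0; r < rows; r++) {
        for (std::size_t c = 0; c < cols; c++) {
            dst[pitched_index(r, c, pitch_bytes)] = src[r * cols + c];
        }
    }
}
```

With 10 doubles per row and a pitch of 16 doubles, for example, element (1, 0) sits at offset 16, while offset 10 is padding. Indexing with i*m + j would read that padding, which fits the symptom of only the first row coming out right.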

APPCRASH (not when debugging) and Segmentation fault using QtCreator (C/C++)

I'm using QtCreator to code an algorithm that I have already implemented in Matlab.
While coding this program, I have run into two errors. The first one (APPCRASH) appears only when I build and run the program normally, not when I try to debug it (a Heisenbug), and it occurs in the function 'matriceA'. I tried making the variables volatile and moving the formulas for the terms of matrix A into another function, hoping that would stop the compiler optimization (which I think might be causing the problem), but I have not been able to solve it. I have not tried compiling the project with the -O0 option because my professor (it's a university project) has to be able to compile it normally (without specific options).
The second one is a SIGSEGV segmentation fault. It happens when the code reaches "DestroyFloatArray(&b, width);" in InpaintingColor.
And here is the code:
clanu_process.cpp (it's a little messy because I've tried a lot of things...)
#include "clanu_process.h"
#include "iomanip"
void InpaintingColor(float **Rout, float **Gout, float **Bout, float **Rin, float **Gin, float **Bin, float **Mask, int width, int height, double param)
cout << "1" << endl;
float alphak = 0, bethak = 0, res = 0;
float **b = 0, **xk = 0, **dk = 0, **rk = 0, **Ark = 0, **tmp1 = 0,**tmp2 = 0,**tmp3 = 0;
Ark = AllocateFloatArray( width, height);
tmp1 = AllocateFloatArray( width, height);
tmp2 = AllocateFloatArray( width, height);
tmp3 = AllocateFloatArray( width, height);
xk = AllocateFloatArray( width, height);
dk = AllocateFloatArray( width, height);
rk = AllocateFloatArray( width, height);
b = AllocateFloatArray( width, height);
cout << "2" << endl;
res = 1e8;
matrixDuplicate(xk, b, width, height);
// APPCRASH error
//More code
// SIGSEGV error
DestroyFloatArray(&b, width);
DestroyFloatArray(&xk, width);
DestroyFloatArray(&dk, width);
DestroyFloatArray(&rk, width);
DestroyFloatArray(&Ark, width);
DestroyFloatArray(&tmp1, width);
DestroyFloatArray(&tmp2, width);
DestroyFloatArray(&tmp3, width);
float** matriceA(float **A, float **I, float **Masque, int N2, int N1){
volatile bool bool_iplus = false, bool_imoins = false, bool_jmoins = false, bool_jplus = false;
volatile int iplus = 0, imoins = 0, jplus = 0, jmoins = 0;
for(int i = 1; i <= N1; i++){
bool_iplus = i<N1;
iplus = i+1 < N1 ? i+1 : N1;
bool_imoins = i>1;
imoins = i-1 > 1 ? i-1 : 1;
for(int j = 1; j <= N2; j++){
bool_jplus = j<N2;
jplus = j+1 < N2 ? j+1 : N2;
bool_jmoins = j>1;
jmoins = j -1 > 1 ? j-1 : 1;
//cout << "if - " << i << ", " << j<< endl;
A[i-1][j-1] = (1.0/36)*(16*I[i-1][j-1]
+ 4*(
+ (bool_imoins?I[imoins-1][j-1]:0)
+ (bool_jplus?I[i-1][jplus-1]:0)
+ (bool_jmoins?I[i-1][jmoins-1]:0)
+ (bool_imoins&&bool_jplus?I[imoins-1][jplus-1]:0)
+ (bool_imoins&&bool_jmoins?I[imoins-1][jmoins-1]:0))
+ (bool_iplus&&bool_jmoins?I[iplus-1][jmoins-1]:0));
//cout << "else - " << i << ", " << j << endl;
+ I[iplus-1][j-1]
+ I[imoins-1][j-1]
+ I[i-1][jplus-1]
+ I[i-1][jmoins-1]
+ I[iplus-1][jplus-1]
+ I[imoins-1][jplus-1]
+ I[imoins-1][jmoins-1]
+ I[iplus-1][jmoins-1]);
return A;
The functions AllocateFloatArray and DestroyFloatArray:
float ** AllocateFloatArray(int width, int height)
{
    float ** r = new float*[width];
    for(int i=0; i<width; i++)
        r[i] = new float[height];
    return r;
}
void DestroyFloatArray(float ***a, int width)
{
    if( *a == 0 ) return;
    for(int i=0; i<width; i++)
        delete[] a[0][i];
    delete[] *a;
    *a = 0;
}
Thank you for your time.
I'm not sure that it's the cause of your problem, but...
Your "matrix operations" functions (sum(), matrixSubstraction(), matrixAddition(), matrixProductByElement(), matrixProductByScalar(), and matrixDuplicate()) range the first index from zero to width and the second one from zero to height.
If I'm not wrong, this is correct and consistent with the allocation/deallocation (AllocateFloatArray() and DestroyFloatArray()).
But look at the two matriceA() functions; they are defined as
float** matriceA(float **A, float **I, int N2, int N1)
float** matriceA(float **A, float **I, float **Masque, int N2, int N1)
In both functions the first index ranges from zero to N1 and the second one from zero to N2; for example
for(int i = 1; i <= N1; i++){
// ...
for(int j = 1; j <= N2; j++){
// ...
A[i-1][j-1] = (1.0/36)*(16*I[i-1][j-1] // ...
Good. But you call matriceA() in this way
Briefly: you allocate your matrices as width * height matrices; your "matrix operations" use them as width * height matrices, but your matriceA() functions use them as height * width.
A wonderful way to corrupt memory.
I suppose the solution could be
1) switch N1 and N2 in the matriceA() definition
2) or switch width and height in the matriceA() call
p.s.: sorry for my bad English.
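As a debugging aid for this kind of swapped-dimension mistake, here is a small sketch of a bounds-checked grid. FloatGrid is a hypothetical type, not part of the project, and it stores the data in one contiguous vector instead of the float** layout above. The point is only that a checked accessor turns a swapped (height, width) access into an exception instead of silent memory corruption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// width x height grid: the first index runs over width and the second over
// height, matching the convention of AllocateFloatArray(width, height).
struct FloatGrid {
    int width;
    int height;
    std::vector<float> data;

    FloatGrid(int w, int h)
        : width(w), height(h),
          data(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 0.0f) {
        assert(w > 0 && h > 0);
    }

    // Checked access: throws if either index is outside its own dimension.
    float& at(int i, int j) {
        if (i < 0 || i >= width || j < 0 || j >= height) {
            throw std::out_of_range("FloatGrid::at: index outside width x height");
        }
        return data[static_cast<std::size_t>(i) * height + j];
    }
};
```

For a 3 x 5 grid, at(2, 4) is valid while at(4, 2) throws, which is exactly the access pattern that results from looping with the two dimensions exchanged.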

How to implement midpoint displacement

I'm trying to implement procedural generation in my game. I want to really grasp and understand all of the algorithms necessary rather than simply copying/pasting existing code. In order to do this I've attempted to implement 1D midpoint displacement on my own. I've used the information here to write and guide my code. Below is my completed code; it doesn't throw an error, but the results don't appear correct.
const int lineLength = 65;
float range = 1.0;
float displacedLine[lineLength];
for (int i = 0; i < lineLength; i++)
displacedLine[i] = 0.0;
for (int p = 0; p < 100; p++)
int segments = 1;
for (int i = 0; i < (lineLength / pow(2, 2)); i++)
int segs = segments;
for (int j = 0; j < segs; j++)
int x = floor(lineLength / segs);
int start = (j * x) + 1;
int end = start + x;
if (i == 0)
float lo = -range;
float hi = +range;
float change = lo + static_cast <float> (rand()) / (static_cast <float> (RAND_MAX / (hi - lo)));
int center = ((end - start) / 2) + start;
displacedLine[center - 1] += change;
range /= 2;
Where exactly have I made mistakes and how might I correct them?
I'm getting results like this:
But I was expecting results like this:
The answer is very simple and by the way I'm impressed you managed to debug all the potential off-by-one errors in your code. The following line is wrong:
displacedLine[center - 1] += change;
You correctly compute the center index and change amount but you missed that the change should be applied to the midpoint in terms of height. That is:
displacedLine[center - 1] = (displacedLine[start] + displacedLine[end]) / 2;
displacedLine[center - 1] += change;
I'm sure you get the idea.
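For reference, here is a compact sketch of the usual 1D midpoint displacement scheme, written independently of the code in the question: each midpoint is set to the average of its segment's endpoints plus a random offset, and the offset range is halved after every pass. The function name midpoint_displace and the unit_noise callback (expected to return values in [-1, 1]) are assumptions made for this example, and the line length is assumed to be a power of two plus one, as with 65.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// In-place 1D midpoint displacement.
// line.size() must be 2^k + 1 (e.g. 65); line.front() and line.back() keep
// their values and every other point is overwritten.
void midpoint_displace(std::vector<float>& line, float range,
                       const std::function<float()>& unit_noise) {
    const int n = static_cast<int>(line.size());
    assert(n >= 3 && ((n - 1) & (n - 2)) == 0);  // n - 1 is a power of two
    for (int step = n - 1; step > 1; step /= 2) {
        const int half = step / 2;
        for (int start = 0; start + step < n; start += step) {
            const int end = start + step;
            const float average = 0.5f * (line[start] + line[end]);
            // Midpoint = average of the endpoints plus a scaled random offset.
            line[start + half] = average + range * unit_noise();
        }
        range *= 0.5f;  // smaller displacements at finer scales
    }
}
```

With a noise function that always returns 0, the result is a straight line between the two endpoints, which makes a handy sanity check before plugging in rand().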
The problem seems to be that you are changing only the midpoint of each line segment, rather than changing the rest of the line segment in proportion to its distance from each end to the midpoint. The following code appears to give you something more like what you're looking for:
#include <iostream>
#include <cstdlib>
#include <math.h>
#include <algorithm>
using namespace std;
void displaceMidPt (float dline[], int len, float disp) {
int midPt = len/2;
float fmidPt = float(midPt);
for (int i = 1; i <= midPt; i++) {
float ptDisp = disp * float(i)/fmidPt;
dline[i] += ptDisp;
dline[len-i] += ptDisp;
void displace (float displacedLine[], int lineLength, float range) {
for (int p = 0; p < 100; p++) {
int segs = pow(p, 2);
for (int j = 0; j < segs; j++) {
float lo = -range;
float hi = +range;
float change = lo + static_cast <float> (rand()) / (static_cast <float> (RAND_MAX / (hi - lo)));
int start = int(float(j)/float(segs)*float(lineLength));
int end = int(float(j+1)/float(segs)*float(lineLength));
displaceMidPt (displacedLine+start,end-start,change);
range /= 2;
void plot1D (float x[], int len, int ht = 10) {
float minX = *min_element(x,x+len);
float maxX = *max_element(x,x+len);
int xi[len];
for (int i = 0; i < len; i++) {
xi[i] = int(ht*(x[i] - minX)/(maxX - minX) + 0.5);
char s[len+1];
s[len] = '\0';
for (int j = ht; j >= 0; j--) {
for (int i = 0; i < len; i++) {
if (xi[i] == j) {
s[i] = '*';
} else {
s[i] = ' ';
cout << s << endl;
int main () {
const int lineLength = 65;
float range = 1.0;
float displacedLine[lineLength];
for (int i = 0; i < lineLength; i++) {
displacedLine[i] = 0.0;
displace (displacedLine,lineLength,range);
plot1D (displacedLine,lineLength);
return 0;
When run this way, it produces the following result:
$ c++ -lm displace.cpp
$ ./a
* *
* ***
* * * *
* ** **** * **
* *** **** * * * ** *
* * ** ** *** * * * *
** ** *
* * * ***
** ***