Compiling smallpt with OpenMP causes infinite loop at runtime - c++

I'm currently looking at the smallpt code by Kevin Beason. I compiled it as instructed, using g++ -O3 -fopenmp smallpt.cpp, and I'm running into what seems to be either an infinite loop or a deadlock.
Compiling the code using just g++ -O3 smallpt.cpp produces the images seen on his page, but I can't get the OpenMP parallelization to work at all.
For reference, I'm compiling on a Windows 7 64-bit machine using Cygwin with GCC 4.5.0. The author himself has stated he's run the same exact code and has run into no issues whatsoever, but I can't get the program to actually exit when it's done tracing the image.
Could this be an issue with my particular compiler and environment, or am I doing something wrong here? Here's the particular snippet of code that's parallelized using OpenMP. I've only modified it with some minor formatting to make it more readable.
int main(int argc, char *argv[])
{
int w=1024, h=768, samps = argc==2 ? atoi(argv[1])/4 : 1;
Ray cam(Vec(50,52,295.6), Vec(0,-0.042612,-1).norm()); // cam pos, dir
Vec cx=Vec(w*.5135/h);
Vec cy=(cx%cam.d).norm()*.5135, r, *c=new Vec[w*h];
#pragma omp parallel for schedule(dynamic, 1) private(r) // OpenMP
for (int y=0; y<h; y++) // Loop over image rows
{
fprintf(stderr,"\rRendering (%d spp) %5.2f%%",samps*4,100.*y/(h-1));
for (unsigned short x=0, Xi[3]={0,0,y*y*y}; x<w; x++) // Loop cols
{
for (int sy=0, i=(h-y-1)*w+x; sy<2; sy++) // 2x2 subpixel rows
{
for (int sx=0; sx<2; sx++, r=Vec()) // 2x2 subpixel cols
{
for (int s=0; s<samps; s++)
{
double r1=2*erand48(Xi), dx=r1<1 ? sqrt(r1)-1: 1-sqrt(2-r1);
double r2=2*erand48(Xi), dy=r2<1 ? sqrt(r2)-1: 1-sqrt(2-r2);
Vec d = cx*( ( (sx+.5 + dx)/2 + x)/w - .5) +
cy*( ( (sy+.5 + dy)/2 + y)/h - .5) + cam.d;
r = r + radiance(Ray(cam.o+d*140,d.norm()),0,Xi)*(1./samps);
} // Camera rays are pushed ^^^^^ forward to start in interior
c[i] = c[i] + Vec(clamp(r.x),clamp(r.y),clamp(r.z))*.25;
}
}
}
}
/* PROBLEM HERE!
The code never seems to reach here
PROBLEM HERE!
*/
FILE *f = fopen("image.ppm", "w"); // Write image to PPM file.
fprintf(f, "P3\n%d %d\n%d\n", w, h, 255);
for (int i=0; i<w*h; i++)
fprintf(f,"%d %d %d ", toInt(c[i].x), toInt(c[i].y), toInt(c[i].z));
}
Here's the output that the program produces, when it runs to completion:
$ time ./a
Rendering (4 spp) 100.00%spp) spp) 00..0026%%
The following is the most basic code that can reproduce the above behavior
#include <cstdio>
#include <cstdlib>
#include <cmath>
struct Vector
{
double x, y, z;
Vector() : x(0), y(0), z(0) {}
};
int toInt(double x)
{
return (int)(255 * x);
}
double clamp(double x)
{
if (x < 0) return 0;
if (x > 1) return 1;
return x;
}
int main(int argc, char *argv[])
{
int w = 1024;
int h = 768;
int samples = 1;
Vector r, *c = new Vector[w * h];
#pragma omp parallel for schedule(dynamic, 1) private(r)
for (int y = 0; y < h; y++)
{
fprintf(stderr,"\rRendering (%d spp) %5.2f%%",samples * 4, 100. * y / (h - 1));
for (unsigned short x = 0, Xi[3]= {0, 0, y*y*y}; x < w; x++)
{
for (int sy = 0, i = (h - y - 1) * w + x; sy < 2; sy++)
{
for (int sx = 0; sx < 2; sx++, r = Vector())
{
for (int s = 0; s < samples; s++)
{
double r1 = 2 * erand48(Xi), dx = r1 < 1 ? sqrt(r1) - 1 : 1 - sqrt(2 - r1);
double r2 = 2 * erand48(Xi), dy = r2 < 1 ? sqrt(r2) - 1 : 1 - sqrt(2 - r2);
r.x += r1;
r.y += r2;
}
c[i].x += clamp(r.x) / 4;
c[i].y += clamp(r.y) / 4;
}
}
}
}
FILE *f = fopen("image.ppm", "w"); // Write image to PPM file.
fprintf(f, "P3\n%d %d\n%d\n", w, h, 255);
for (int i=0; i<w*h; i++)
fprintf(f,"%d %d %d ", toInt(c[i].x), toInt(c[i].y), toInt(c[i].z));
}
This is the output obtained from the sample program above:
$ g++ test.cpp
$ ./a
Rendering (4 spp) 100.00%
$ g++ test.cpp -fopenmp
$ ./a
Rendering (4 spp) 100.00%spp) spp) 00..0052%%

fprintf is not guarded by a critical section or a #pragma omp single/master. I wouldn't be surprised if on Windows this thing messes up the console.
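As a minimal sketch of the guarding suggested above, the progress output can be wrapped in a named critical section so that only one thread writes to stderr at a time. The names traceRow and renderWithGuardedProgress are hypothetical stand-ins, not part of smallpt, and this only addresses the garbled console output; it is an assumption, not a confirmed fix, that it has any effect on the hang.

```cpp
#include <cstdio>

// Hypothetical stand-in for the per-row tracing work (not smallpt code).
static double traceRow(int y) {
    double acc = 0.0;
    for (int i = 0; i < 1000; ++i) acc += (y + i) * 1e-6;
    return acc;
}

// Runs the row loop in parallel and serializes the progress output.
// Returns the number of rows processed.
int renderWithGuardedProgress(int h, double *rowSums) {
    int done = 0;  // shared counter, only touched inside the critical section
    #pragma omp parallel for schedule(dynamic, 1)
    for (int y = 0; y < h; ++y) {
        rowSums[y] = traceRow(y);
        #pragma omp critical(progress)
        {
            ++done;
            std::fprintf(stderr, "\rRendering %5.2f%%", 100.0 * done / h);
        }
    }
    std::fprintf(stderr, "\n");
    return done;
}
```

Counting finished rows instead of printing the row index also keeps the percentage monotonic when rows complete out of order under dynamic scheduling.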

Related

C++ omp no significant improvement

I am on MSVC 2019 with the default compiler. The code I am working on renders a Mandelbrot image. The relevant bits of my code look like this:
#pragma omp parallel for
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
All of the variables outside of the loop are constexpr, eliminating any dependencies. The mandel function does about 1000 iterations with each call. I would expect the outer loop to run on several threads but my msvc records each run at about 5-6 seconds with or without the omp directive.
Edit (The mandel function):
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = (z_x * z_x) - (z_y * z_y) + x;
z_y = 2 * temp * z_y + y;
if ((z_x * z_x + z_y * z_y) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Your mandel function has a vastly differing runtime cost depending on when the if condition within the loop is met, so each iteration of your outer loop takes a different amount of time. Without a schedule clause, OpenMP implementations typically use static scheduling (i.e. they break the loop into N equal partitions, one per thread). That is a poor fit here, because the work is unevenly distributed across rows. See what happens when you use dynamic scheduling:
#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
Also time to rule out the really dumb stuff.....
Have you included omp.h at least once in your program?
Have you enabled omp in the project settings?
IIRC, if you haven't done those two things, omp will be disabled under MSVC.
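One quick way to check whether OpenMP is actually enabled is the _OPENMP macro, which a conforming compiler defines only when OpenMP support is switched on (/openmp for MSVC, -fopenmp for GCC). The function names below are made up for this sketch.

```cpp
#include <cstdio>

// Returns the value of _OPENMP (a yyyymm date code) or 0 when the
// compiler was invoked without OpenMP support.
long openmpVersion() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Prints a one-line diagnostic; call it once at program start.
void reportOpenMP() {
    const long v = openmpVersion();
    if (v != 0)
        std::printf("OpenMP enabled, _OPENMP = %ld\n", v);
    else
        std::printf("OpenMP disabled: pragmas are being ignored\n");
}
```

If this reports that OpenMP is disabled, the parallel for pragma is silently ignored and the loop runs on a single thread, which would match identical timings with and without the directive.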
This is not an answer, but please do this:
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
long double z_x_squared = 0;
long double z_y_squared = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = z_x_squared - z_y_squared + x;
z_y = 2 * temp * z_y + y;
z_x_squared = z_x * z_x;
z_y_squared = z_y * z_y;
if ((z_x_squared + z_y_squared) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Also, try inverting the order of your two for loops.

OpenMP Fractal Generator

I've been trying to create an OpenMP variant of the Julia set, but I'm unable to produce a coherent image when running more than one thread. I've been trying to track down what looks like a race condition, but I cannot find the error.
The offending output looks like the expected output with "scanlines" across the entire picture.
I've attached the code as well in case that's not clear enough.
#include <iostream>
#include <math.h>
#include <fstream>
#include <sstream>
#include <omp.h>
#include <QtWidgets>
#include <QElapsedTimer>
using namespace std;
double newReal(int x, int imageWidth){
return 1.5*(x - imageWidth / 2)/(0.5 * imageWidth);
}
double newImaginary(int y, int imageHeight){
return (y - imageHeight / 2) / (0.5 * imageHeight);
}
int julia(double& newReal, double& newImaginary, double& oldReal, double& oldImaginary, double cRe, double cIm,int maxIterations){
int i;
for(i = 0; i < maxIterations; i++){
oldReal = newReal;
oldImaginary = newImaginary;
newReal = oldReal * oldReal - oldImaginary * oldImaginary + cRe;
newImaginary = 2 * oldReal * oldImaginary + cIm;
if((newReal * newReal + newImaginary * newImaginary) > 4) break;
}
return i;
}
int main(int argc, char *argv[])
{
int fnum=atoi(argv[1]);
int numThr=atoi(argv[2]);
// int imageHeight=atoi(argv[3]);
// int imageWidth=atoi(arg[4]);
// int maxIterations=atoi(argv[5]);
// double cRe=atof(argv[3]);
// double cIm=atof(argv[4]);
//double cRe, cIm;
int imageWidth=10000, imageHeight=10000, maxIterations=3000;
double newRe, newIm, oldRe, oldIm,cRe,cIm;
cRe = -0.7;
cIm = 0.27015;
string fname;
QElapsedTimer time;
QImage img(imageHeight, imageWidth, QImage::Format_RGB888);//Qimagetesting
img.fill(QColor(Qt::black).rgb());//Qimagetesting
time.start();
int i,x,y;
int r, gr, b;
#pragma omp parallel for shared(imageHeight,imageWidth,newRe,newIm) private(x,y,i) num_threads(3)
for(y = 0; y < imageHeight; y++)
{
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrt(i) % 256);
b = (i % 256);
img.setPixel(x, y, qRgb(r, gr, b));
}
}
//stringstream s;
//s << fnum;
//fname= "julia" + s.str();
//fname+=".png";
//img.save(fname.c_str(),"PNG", 100);
img.save("julia.png","PNG", 100);
cout<< "Finished"<<endl;
cout<<time.elapsed()/1000.00<<" seconds"<<endl;
}
As pointed out in the comments, you have three main problems:
newRe and newIm are shared, but should not be
r, gr and b have no data-sharing clause, so they are shared by default (as are oldRe and oldIm)
There are concurrent calls to QImage::setPixel
To correct the first two, do not hesitate to nest an omp for loop inside an omp parallel block and declare the private variables inside that block, just before the for loop.
To prevent concurrent calls to QImage::setPixel, since this function is not thread safe, you can put it in a critical region with #pragma omp critical.
int main(int argc, char *argv[])
{
int imageWidth=1000, imageHeight=1000, maxIterations=3000;
double cRe = -0.7;
double cIm = 0.27015;
QElapsedTimer time;
QImage img(imageHeight, imageWidth, QImage::Format_RGB888);//Qimagetesting
img.fill(Qt::black);
time.start();
#pragma omp parallel
{
/* all following values will be private */
int i,x,y;
int r, gr, b;
double newRe, newIm, oldRe, oldIm;
#pragma omp for
for(y = 0; y < imageHeight; y++)
{
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrtf(i) % 256);
b = (i % 256);
#pragma omp critical
img.setPixel(x, y, qRgb(r, gr, b));
}
}
}
img.save("julia.png","PNG", 100);
cout<<time.elapsed()/1000.00<<" seconds"<<endl;
return 0;
}
To go further, you can save some cpu time replacing ::setPixel by ::scanLine:
#pragma omp for
for(y = 0; y < imageHeight; y++)
{
uchar *line = img.scanLine(y);
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrtf(i) % 256);
b = (i % 256);
*line++ = r;
*line++ = gr;
*line++ = b;
}
}
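The same idea can be tried without Qt. The sketch below is an assumption-laden stand-in, not the answer's code: it writes into a plain std::vector of RGB bytes instead of a QImage, declares every per-pixel variable inside the loop body so each thread gets its own copy, and needs no critical section because each row is written by exactly one thread. The names juliaIterations and renderJulia are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Escape-time iteration count for z -> z*z + c, starting at (re, im).
static int juliaIterations(double re, double im, double cRe, double cIm,
                           int maxIterations) {
    int i = 0;
    for (; i < maxIterations; ++i) {
        const double nextRe = re * re - im * im + cRe;
        const double nextIm = 2.0 * re * im + cIm;
        re = nextRe;
        im = nextIm;
        if (re * re + im * im > 4.0) break;
    }
    return i;
}

// Renders the set into a tightly packed RGB buffer (3 bytes per pixel).
std::vector<unsigned char> renderJulia(int width, int height, int maxIterations) {
    const double cRe = -0.7, cIm = 0.27015;
    std::vector<unsigned char> rgb(static_cast<std::size_t>(width) * height * 3);
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        // Each row starts at its own offset, so threads never share a byte.
        unsigned char *line = rgb.data() + static_cast<std::size_t>(y) * width * 3;
        for (int x = 0; x < width; ++x) {
            const double re = 1.5 * (x - width / 2) / (0.5 * width);
            const double im = (y - height / 2) / (0.5 * height);
            const int i = juliaIterations(re, im, cRe, cIm, maxIterations);
            *line++ = static_cast<unsigned char>(3 * i % 256);
            *line++ = static_cast<unsigned char>(
                2 * static_cast<int>(std::sqrt(static_cast<double>(i))) % 256);
            *line++ = static_cast<unsigned char>(i % 256);
        }
    }
    return rgb;
}
```

Because all mutable state lives inside the loop body, no private clause is needed, and the result should be identical with one thread or many.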
EDIT:
Since the Julia set seems to have a central symmetry around the (0,0) point, you can perform only half of the computation:
int half_heigt = imageHeight / 2;
#pragma omp for
// compute only for first half of image
for(y = 0; y < half_heigt; y++)
{
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrtf(i) % 256);
b = (i % 256);
#pragma omp critical
{
// set the point
img.setPixel(x, y, qRgb(r, gr, b));
// set the symetric point
img.setPixel(imageWidth-1-x, imageHeight-1-y, qRgb(r, gr, b));
}
}
}

How to call existing host function from device function in cuda [closed]

I have seen a similar question here.
However, I could not get an exact answer there, and it was written in 2012.
I am trying to call cublasStatus_t cublasSgbmv(...) function, which is defined in "cublas_v2.h", in a __global__ function. However, I could not use the dynamic parallelism feature. I only have 1 source.cu file. However, I have read that I should compile it in a dynamic way so that it separates device and host functions, then I can link these outputs.
Is there anyone who knows how to do it, or a good source to explain it?
Thanks in advance
edit: if you downvote, please at least explain the reason so that I can learn from my mistake.
edit2 :
my specific problem is, I'm using the following code in my Source.cu :
#include <iostream>
#include <vector>
#include <cuda.h>
#include <cstdio>
#include <stdio.h>
#include <device_launch_parameters.h>
#include <stdlib.h> //srand(), rand()
#include <time.h>
#include <builtin_types.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#define IDX2C(i ,j , ld ) ((( j )*( ld ))+( i ))
#define HEIGHT 4
#define WIDTH 4
#define V 4
#define KL 2
#define KU 1
#define THREADS_PER_BLOCK 512
#pragma comment(lib, "cublas")
//#pragma comment(lib, "helper_cuda")
using namespace std;
void create_Matrix(int* matrix, int width, int height){
int i, len;
len = height * width;
srand(time(NULL));
for (i = 0; i < len; i++){
matrix[i] = rand() % 10 + 1; //generates number between 1-10
}
}
template <typename T>
void print_vector(T* vector, int len){
for (int i = 0; i < len; i++)
cout << vector[i] << " ";
cout << endl;
}
template <typename T>
void creating_bandedMatrix(T* bandedMatrix, int height, int width, int ku, int kl){
//fill matrix with zeros at the beginning
int i, len;
len = height * width;
for (i = 0; i < len; i++){
bandedMatrix[i] = 0; //zero-initialize every element
}
srand(time(NULL));
//filling banded diagonal
int start, end;
for (int i = 0; i < height; i++){
start = i - kl;
if (start < 0)
start = 0;
end = i + ku + 1;
if (end > width)
end = width;
for (int j = start; j < end; j++){
*(bandedMatrix + (i*width) + j) = (float)(rand() % (10) + 1); //rand() / (T)RAND_MAX;;
}
}
}
template <typename T>
void print_matrix(T* matrix, int width, int height){
int len = width*height;
cout << "asdsffffff" << endl;
for (int i = 0; i < len; i++){
if (!(i%width))
cout << endl;
cout << i << ":" <<matrix[i] << " ";
}
cout << endl;
}
template <typename T>
void computeMatrixVectorMultiplication(T* bandedMatrix, T* vector2){
T row_sum = 0;
T* bandedHostResult = (T*)malloc(WIDTH * sizeof(T));
for (int i = 0; i < HEIGHT; i++){
row_sum = 0;
for (int j = 0; j < WIDTH; j++){
row_sum += (*(bandedMatrix + i*WIDTH + j)) * vector2[j];
}
bandedHostResult[i] = row_sum;
}
//printing the result
cout << "\n\nBanded Host Result...\n";
print_vector(bandedHostResult, WIDTH);
}
template <typename T>
void fillLapackMatrix(T* lapack_matrix, T* bandedMatrix, int kl, int ku, int banded_w, int banded_h, int lapack_w, int lapack_h){
int i, j, lapack_i;
int len = lapack_h * lapack_w;
for (i = 0; i < len; i++){
lapack_matrix[i] = 0; //zero-initialize every element
}
for (i = 0; i < banded_w; i++){
for (j = 0; j < banded_h; j++){
lapack_i = ku + i - j;
*(lapack_matrix + lapack_i*lapack_w + j) = *(bandedMatrix + i*banded_w + j);
//lapack_matrix[lapack_i*lapack_w + j] = bandedMatrix[i*bandedMatrix + j];
}
}
}
__global__ void device_cublasSgbmv(int m,int n,int kl, int ku,float* alpha, float* A, int lda ,float* B,int ldb,float*R, int ldr, float* beta){
int index = blockIdx.x * blockDim.x + threadIdx.x;
cublasHandle_t handle;
cublasCreate(&handle);
cublasOperation_t trans = CUBLAS_OP_N;
float* dev_x;
cudaMalloc((void**)&dev_x,sizeof(float) * n);
if(index < ldr){
cublasSgbmv(handle, trans,m, n, kl, ku, alpha, A, m, B+index*n, 1, beta, R+index*n, 1);
index = 0;
}
}
void fillNormalMatrix(float* B,int h,int w){
for(int i = 0; i < h;i++){
for(int j = 0; j < w;j++){
B[i*w + j] = 1;
}
}
}
int main()
{
cublasStatus_t status;
float *A;
float *x, *y;
float *dev_x, *dev_y;
int incx, incy;
float *dev_A = 0;
float alpha = 1.0f;
float beta = 0.0f;
int matrixSize = WIDTH * HEIGHT;
int i, j;
cublasHandle_t handle;
/* Initialize CUBLAS */
status = cublasCreate(&handle);
if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! CUBLAS initialization error\n");
return EXIT_FAILURE;
}
//Allocate host memory for the matrices
A = (float *)malloc(matrixSize* sizeof(float));
//Allocate memory for host vectors
x = (float *)malloc(WIDTH * sizeof(float));
y = (float*)malloc(WIDTH * sizeof(float));
// Fill the matrices with test data
creating_bandedMatrix(A, WIDTH, HEIGHT, KU, KL);
cout << "Banded Matrix\n";
print_matrix(A, WIDTH, HEIGHT);
//Fill the vectors with random data
for (i = 0; i < WIDTH; i++){
x[i] = 1;// (float)(rand() % (10) + 1);:
y[i] = (float)(rand() % (10) + 1);
}
cout << "\nvector x...\n";
print_vector(x, WIDTH);
//cout << "\nvector y...\n";
//print_vector(y, WIDTH);
//Allocate device memory for the matrix
if (cudaMalloc((void **)&dev_A, matrixSize * sizeof(float)) != cudaSuccess)
{
fprintf(stderr, "!!!! device memory allocation error (allocate A)\n");
return EXIT_FAILURE;
}
//Allocate device memory for vectors
if (cudaMalloc((void**)&dev_x, WIDTH * sizeof(float)) != cudaSuccess){
fprintf(stderr, "Device Vector Allocation PROBLEM\n");
return EXIT_FAILURE;
}
if (cudaMalloc((void**)&dev_y, WIDTH * sizeof(float)) != cudaSuccess){
fprintf(stderr, "Device Vector Allocation PROBLEM\n");
return EXIT_FAILURE;
}
// Initialize the device vectors with the host vectors
status = cublasSetVector(WIDTH, sizeof(float), x, 1, dev_x, 1);
if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! device access error (write x vector)\n");
return EXIT_FAILURE;
}
status = cublasSetVector(WIDTH, sizeof(float), y, 1, dev_y, 1);
if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! device access error (write y vector)\n");
return EXIT_FAILURE;
}
//initialize matrix with lapack format
int lapack_width = WIDTH > HEIGHT ? HEIGHT : WIDTH;
int lapack_height = KL + KU + 1;
int lapackSize = lapack_height * lapack_width;
float* lapack_matrix = (float*)malloc(lapackSize * sizeof(float));
fillLapackMatrix(lapack_matrix, A, KL, KU, WIDTH, HEIGHT, lapack_width, lapack_height);
cout << "\n\nLAPACK MAtrix\n";
print_matrix(lapack_matrix, lapack_width, lapack_height);
//convert to column column matrix
float* col = (float*)malloc(lapackSize * sizeof(float));
for (i = 0; i < WIDTH; i++){
for (j = 0; j < HEIGHT; j++){
col[i + WIDTH*j] = lapack_matrix[WIDTH*i + j];
}
}
cout << "Lapack Column Based Matrix\n";
print_matrix(col,HEIGHT-1,WIDTH);
//status = cublasSetVector(lapackSize, sizeof(float), A, 1, dev_A, 1);
cublasSetMatrix(HEIGHT, WIDTH, sizeof(float), col, HEIGHT, dev_A, HEIGHT);
cublasOperation_t trans = CUBLAS_OP_N;
incy = incx = 1;
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////// Banded Matrix Matrix Multipllicatio ///////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
float* B,*dev_B,*dev_R,*R;
B = (float*)malloc(WIDTH*HEIGHT*sizeof(float));
R = (float*)malloc(WIDTH*HEIGHT*sizeof(float));
fillNormalMatrix(B,WIDTH,HEIGHT);
cudaMalloc((void**)&dev_B,matrixSize*sizeof(*B));
cudaMalloc((void**)&dev_R,matrixSize*sizeof(*R));
cublasSetMatrix(HEIGHT, WIDTH, sizeof(*B), B, HEIGHT, dev_B, HEIGHT);
cout << "Matrix B\n";
print_matrix(B,HEIGHT,WIDTH);
cout << "gfsdf\n";
device_cublasSgbmv<<<1,4>>>(HEIGHT, WIDTH, KL, KU, &alpha, dev_A, WIDTH, dev_B, HEIGHT, dev_R, HEIGHT,&beta);
cout << "after\n";
cublasGetMatrix(HEIGHT,WIDTH, sizeof (*R) ,dev_R ,WIDTH,R,WIDTH);
getchar();
return 0;
}
and compile it like :
nvcc -gencode=arch=compute_35,code=sm_35 -lcublas -lcudadevrt -O3 Source.cu -o Source.o -dc
g++ Source.o -lcublas -lcudart
then, I get the following :
In function `__sti____cudaRegisterAll_48_tmpxft_00001f1e_00000000_6_Source_cpp1_ii_ebe2258a()':
tmpxft_00001f1e_00000000-3_lapack_vector.cudafe1.cpp:(.text.startup+0x575): undefined reference to `__cudaRegisterLinkedBinary_48_tmpxft_00001f1e_00000000_6_Source_cpp1_ii_ebe2258a'
collect2: error: ld returned 1 exit status
You can compile and link the code you have now shown with a single command like this:
nvcc -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o test Source.cu
You may get some warnings like this:
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/..//lib64/libcublas_device.a:maxwell_sgemm.asm.o'
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/..//lib64/libcublas_device.a:maxwell_sm50_sgemm.o'
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/..//lib64/libcublas_device.a:maxwell_sm50_ssyrk.o'
Those can be safely ignored.

C++ Segmentation Fault OpenCV

The idea in the following code is to have a bunch of "wanderer" objects that slowly "paint" an image onto a canvas. The problem is that this code only seems to work on square images, not rectangular ones, which is mysterious to me. (In the code, the image is called "hidden" because it is unveiled by the "painters"; it is loaded from the file "UncoverTest.png".) I get a segmentation fault when trying to work with anything but a square. As far as I can tell, the segmentation fault occurs when I enter the loop that iterates through the vector of type Agent (at the line for (vector<Agent>::iterator iter = agents.begin(); iter != agents.end();++iter)).
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <vector>
using namespace std;
using namespace cv;
//#define WINDOW_SIZE 500
#define STEP_SIZE 10.0
#define NUM_AGENTS 100
/********************/
/* Agent class definition and class prototypes */
/********************/
class Agent {
public:
Agent();
int * GetLocation(void);
void Move(void);
void Draw(Mat image);
int * GetSize(void);
private:
double UnifRand(void);
int * location;
int * GetReveal(void);
Mat hidden;
};
int * Agent::GetSize(void) {
int * size = new int[2];
size[0] = hidden.cols;
size[1] = hidden.rows;
return (size);
}
int * Agent::GetReveal(void) {
int * BGR = new int[3];
location = GetLocation();
for (int i = 0; i < 3; i++) {
BGR[i] = hidden.data[hidden.step[0]*location[0] + hidden.step[1]*location[1] + i];
}
return (BGR);
}
void Agent::Draw(Mat image) {
int * location = GetLocation();
int * color = GetReveal();
for (int i = 0;i < 3;i++) {
image.data[image.step[0]*location[0] + image.step[1]*location[1] + i] = color[i];
}
}
void Agent::Move(void) {
int dx = (int)(STEP_SIZE*UnifRand() - STEP_SIZE/2);
int dy = (int)(STEP_SIZE*UnifRand() - STEP_SIZE/2);
location[0] += (((location[0] + dx >= 0) & (location[0] + dx < hidden.cols)) ? dx : 0);
location[1] += (((location[1] + dy >= 0) & (location[1] + dy < hidden.rows)) ? dy : 0);
}
Agent::Agent() {
location = new int[2];
hidden = imread("UncoverTest.png",1);
location[0] = (int)(UnifRand()*hidden.cols);
location[1] = (int)(UnifRand()*hidden.rows);
}
double Agent::UnifRand(void) {
return (rand()/(double(RAND_MAX)));
}
int * Agent::GetLocation(void) {
return (location);
}
/********************/
/* Function prototypes unrelated to the Agent class */
/********************/
void DrawAgents(void);
/********************/
/* Main function */
/********************/
int main(void) {
DrawAgents();
return (0);
}
void DrawAgents(void) {
vector<Agent> agents;
int * size = new int[2];
Mat image;
for (int i = 0; i < NUM_AGENTS; i++) {
Agent * a = new Agent();
agents.push_back(* a);
if (i == 0) {
size = (* a).GetSize();
}
}
// cout << size[0] << " " << size[1] << endl;
image = Mat::zeros(size[0],size[1],CV_8UC3);
cvNamedWindow("Agent Example",CV_WINDOW_AUTOSIZE);
cvMoveWindow("Agent Example",100,100);
for (int stop = 1;stop != 27;stop = cvWaitKey(41)) {
for (vector<Agent>::iterator iter = agents.begin(); iter != agents.end();++iter) {
(* iter).Move();
(* iter).Draw(image);
} imshow("Agent Example",image);
}
}
Can anyone explain to me how this error arises with non-square images only and how the problem might be fixed?
I don't fully understand your code, but as per your last comment about "stepping off the canvas", I think I can see a couple of situations where you might be trying to access data out of range in both your "hidden" and "image" mats.
I'm afraid I can only offer suggestions.
void Agent::Draw(Mat image) {
int * location = GetLocation();
int * color = GetReveal();
for (int i = 0;i < 3;i++) {
image.data[image.step[0]*location[0] + image.step[1]*location[1] + i] = color[i];
}
}
Here you're accessing GetLocation, whose value was initialized during construction of the Agent from a random number times the size of the hidden mat. I would worry that you're going to get an "index out of bounds" type error when accessing the image.data matrix, so this might be the first thing to check.
Likewise in
int * Agent::GetReveal(void) {
int * BGR = new int[3];
location = GetLocation();
for (int i = 0; i < 3; i++) {
BGR[i] = hidden.data[hidden.step[0]*location[0] + hidden.step[1]*location[1] + i];
}
return (BGR);
}
you're using GetLocation(), which may return a point outside the bounds of the hidden image, so I'm pretty sure you get an error here as well. location should be derived from hidden.cols and hidden.rows.
I only had a glancing look, but I would definitely put some checks around the values GetLocation() is returning, and verify that such a value is accessible in the Mat matrices.
Additionally, although I'm not entirely sure, as I think you're using location in two different ways, if location is a point somewhere in your Draw(image), then you would need to adjust the following:
location[0] += (((location[0] + dx >= 0) & (location[0] + dx < hidden.cols)) ? dx : 0);
location[1] += (((location[1] + dy >= 0) & (location[1] + dy < hidden.rows)) ? dy : 0);
and take into account the width of the hidden image, something like
location[0] += (((location[0] + dx >= 0) & (location[0] + dx + hidden.cols < maxWidth)) ? dx : 0);
location[1] += (((location[1] + dy >= 0) & (location[1] + dy + hidden.rows < maxHeight)) ? dy : 0);
where maxWidth and maxHeight are the width and height of your image.
Hope that gets you on the right track.
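The checks suggested above can be sketched without OpenCV. In a cv::Mat, step[0] is the number of bytes per row and step[1] the number of bytes per pixel, so the index multiplied by step[0] has to be a row (bounded by rows) and the one multiplied by step[1] a column (bounded by cols). It may be worth checking which of location[0] and location[1] plays which role in the code above, and that Mat::zeros receives rows first and cols second. The helper name pixelOffset is hypothetical.

```cpp
#include <cstddef>

// Computes the byte offset of the first channel of pixel (row, col) in a
// row-major image with rowStep bytes per row and pixelStep bytes per pixel.
// Returns false, leaving offset untouched, when the pixel is out of bounds.
bool pixelOffset(int row, int col, int rows, int cols,
                 std::size_t rowStep, std::size_t pixelStep,
                 std::size_t &offset) {
    if (row < 0 || row >= rows || col < 0 || col >= cols)
        return false;
    offset = static_cast<std::size_t>(row) * rowStep +
             static_cast<std::size_t>(col) * pixelStep;
    return true;
}
```

For a 4x5 three-channel image stored without padding (rowStep 15, pixelStep 3), pixel (2, 3) lands at byte 39, while row 4 or column 5 is rejected instead of reading past the buffer.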
If you are in a Linux environment, you can use valgrind to find out exactly where the segmentation fault is happening. Just type valgrind before the name of the program, or the way you execute your program. For example, if you execute your program with the following command:
hello -print
issue the following command instead:
valgrind hello -print

Performance loss from parallelization

I've modified a raytracer I wrote a while ago for educational purposes to take advantage of multiprocessing using OpenMP. However, I'm not seeing any profit from the parallelization.
I've tried 3 different approaches: a task-pooled environment (the draw_pooled() function), a standard OMP parallel nested for loop with image row-level parallelism (draw_parallel_for()), and another OMP parallel for with pixel-level parallelism (draw_parallel_for2()). The original, serial drawing routine is also included for reference (draw_serial()).
I'm running a 2560x1920 render on an Intel Core 2 Duo E6750 (2 cores @ 2.67 GHz each w/Hyper-Threading) and 4GB of RAM under Linux, binary compiled by gcc with libgomp. The scene takes an average of:
120 seconds to render in series,
but 196 seconds (sic!) to do so in parallel in 2 threads (the default - number of CPU cores), regardless of which of the three particular methods above I choose,
if I override OMP's default thread number with 4 to take HT into account, the parallel render times drop to 177 seconds.
Why is this happening? I can't see any obvious bottlenecks in the parallel code.
EDIT: Just to clarify - the task pool is only one of the implementations, please do read the question - scroll down to see the parallel fors. Thing is, they are just as slow as the task pool!
void draw_parallel_for(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
for (int y = 0; y < h; ++y) {
#pragma omp parallel for num_threads(4)
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
void draw_parallel_for2(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
int x, y;
#pragma omp parallel for private(x, y) num_threads(4)
for (int xy = 0; xy < w * h; ++xy) {
x = xy % w;
y = xy / w;
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
void draw_parallel_for3(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
#pragma omp parallel for num_threads(4)
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
void draw_serial(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
std::queue< std::pair<int, int> * > task_queue;
void draw_pooled(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
bool tasks_issued = false;
#pragma omp parallel shared(buf, tasks_issued, w, h) num_threads(4)
{
#pragma omp master
{
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
task_queue.push(new std::pair<int, int>(x, y));
}
tasks_issued = true;
}
while (true) {
std::pair<int, int> *coords;
#pragma omp critical(task_fetch)
{
if (task_queue.size() > 0) {
coords = task_queue.front();
task_queue.pop();
} else
coords = NULL;
}
if (coords != NULL) {
Scene::GetInstance().RenderPixel(coords->first, coords->second,
buf + (coords->second * w + coords->first) * 3);
delete coords;
} else {
#pragma omp flush(tasks_issued)
if (tasks_issued)
break;
}
}
}
write_png(buf, w, h, fname);
delete [] buf;
}
You have a critical section inside your innermost loop. In other words, you're hitting a synchronization primitive per pixel. That's going to kill performance.
Better split the scene in tiles and work one on each thread. That way, you have a longer time (a whole tile's worth of processing) between synchronizations.
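A rough sketch of the tiling idea, assuming pixels are independent and tile is greater than zero: the image is cut into square tiles and each loop iteration renders one whole tile, so the only scheduling cost is paid once per tile rather than once per pixel. The function shade is a made-up placeholder for Scene::RenderPixel, and renderTiled is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel work standing in for Scene::RenderPixel.
static unsigned char shade(int x, int y) {
    return static_cast<unsigned char>((x ^ y) & 0xFF);
}

// Renders a w x h single-channel image tile by tile.
std::vector<unsigned char> renderTiled(int w, int h, int tile) {
    std::vector<unsigned char> buf(static_cast<std::size_t>(w) * h);
    const int tilesX = (w + tile - 1) / tile;  // round up to cover the edges
    const int tilesY = (h + tile - 1) / tile;
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < tilesX * tilesY; ++t) {
        const int x0 = (t % tilesX) * tile;
        const int y0 = (t / tilesX) * tile;
        const int x1 = std::min(x0 + tile, w);
        const int y1 = std::min(y0 + tile, h);
        // Tiles do not overlap, so no locking is needed inside.
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                buf[static_cast<std::size_t>(y) * w + x] = shade(x, y);
    }
    return buf;
}
```

Dynamic scheduling hands out tiles as threads become free, which tends to help when some regions of the scene are much more expensive than others.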
If the pixels are independent you don't actually need any locking. You can just divide up the image into rows or columns and let the threads work on their own. For example, you could have each thread operate on every nth row (pseudocode):
for(int y = THREAD_NUM; y < h; y += THREAD_COUNT)
for(int x = 0; x < w; ++x)
render_pixel(x,y);
Where THREAD_NUM is a unique number for each thread such that 0 <= THREAD_NUM < THREAD_COUNT. Then after you join your threadpool, perform the png conversion.
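The pseudocode above could look roughly like this with std::thread, under the assumption that rows are independent. Instead of rendering, the sketch records which thread handled each pixel, so the interleaving is easy to inspect; renderStridedRows and renderInterleaved are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Worker: handles rows threadNum, threadNum + threadCount, ...
static void renderStridedRows(int threadNum, int threadCount, int w, int h,
                              std::vector<int> &owner) {
    for (int y = threadNum; y < h; y += threadCount)
        for (int x = 0; x < w; ++x)
            owner[static_cast<std::size_t>(y) * w + x] = threadNum;
}

// Starts threadCount workers, joins them, and returns the ownership map.
std::vector<int> renderInterleaved(int w, int h, int threadCount) {
    std::vector<int> owner(static_cast<std::size_t>(w) * h, -1);
    std::vector<std::thread> pool;
    for (int t = 0; t < threadCount; ++t)
        pool.emplace_back(renderStridedRows, t, threadCount, w, h,
                          std::ref(owner));
    for (std::thread &th : pool)
        th.join();  // all rows are finished once every worker has joined
    return owner;
}
```

No lock is needed because every row belongs to exactly one worker, and the buffer is only read after the joins.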
There is always a performance overhead when creating threads. An omp parallel inside a for loop will obviously generate a lot of overhead. For example, in your code
void draw_parallel_for(int w, int h, const char *fname) {
for (int y = 0; y < h; ++y) {
// Here There is a lot of overhead
#pragma omp parallel for num_threads(4)
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
}
It can be re-written as
void draw_parallel_for(int w, int h, const char *fname) {
#pragma omp parallel for num_threads(4)
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
}
or
void draw_parallel_for(int w, int h, const char *fname) {
#pragma omp parallel num_threads(4)
for (int y = 0; y < h; ++y) {
#pragma omp for
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
}
This way, you will eliminate most of that overhead.