Performance loss from parallelization - c++

I've modified a raytracer I wrote a while ago for educational purposes to take advantage of multiprocessing using OpenMP. However, I'm not seeing any profit from the parallelization.
I've tried 3 different approaches: a task-pooled environment (the draw_pooled() function), a standard OMP parallel nested for loop with image row-level parallelism (draw_parallel_for()), and another OMP parallel for with pixel-level parallelism (draw_parallel_for2()). The original, serial drawing routine is also included for reference (draw_serial()).
I'm running a 2560x1920 render on an Intel Core 2 Duo E6750 (2 cores @ 2.67 GHz each w/Hyper-Threading) and 4GB of RAM under Linux, binary compiled by gcc with libgomp. The scene takes an average of:
120 seconds to render in series,
but 196 seconds (sic!) to do so in parallel in 2 threads (the default - number of CPU cores), regardless of which of the three particular methods above I choose,
if I override OMP's default thread number with 4 to take HT into account, the parallel render times drop to 177 seconds.
Why is this happening? I can't see any obvious bottlenecks in the parallel code.
EDIT: Just to clarify - the task pool is only one of the implementations, please do read the question - scroll down to see the parallel fors. Thing is, they are just as slow as the task pool!
void draw_parallel_for(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
for (int y = 0; y < h; ++y) {
#pragma omp parallel for num_threads(4)
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
void draw_parallel_for2(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
int x, y;
#pragma omp parallel for private(x, y) num_threads(4)
for (int xy = 0; xy < w * h; ++xy) {
x = xy % w;
y = xy / w;
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
void draw_parallel_for3(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
#pragma omp parallel for num_threads(4)
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
void draw_serial(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
write_png(buf, w, h, fname);
delete [] buf;
}
std::queue< std::pair<int, int> * > task_queue;
void draw_pooled(int w, int h, const char *fname) {
unsigned char *buf;
buf = new unsigned char[w * h * 3];
Scene::GetInstance().PrepareRender(w, h);
bool tasks_issued = false;
#pragma omp parallel shared(buf, tasks_issued, w, h) num_threads(4)
{
#pragma omp master
{
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
task_queue.push(new std::pair<int, int>(x, y));
}
tasks_issued = true;
}
while (true) {
std::pair<int, int> *coords;
#pragma omp critical(task_fetch)
{
if (task_queue.size() > 0) {
coords = task_queue.front();
task_queue.pop();
} else
coords = NULL;
}
if (coords != NULL) {
Scene::GetInstance().RenderPixel(coords->first, coords->second,
buf + (coords->second * w + coords->first) * 3);
delete coords;
} else {
#pragma omp flush(tasks_issued)
if (tasks_issued)
break;
}
}
}
write_png(buf, w, h, fname);
delete [] buf;
}

You have a critical section inside your innermost loop. In other words, you're hitting a synchronization primitive once per pixel. That's going to kill performance.
Better to split the scene into tiles and have each thread work on one tile at a time. That way, there is a longer stretch (a whole tile's worth of processing) between synchronizations.
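A minimal sketch of the tile idea, assuming the pixels are independent. The names render_tiled and render_pixel are hypothetical (render_pixel stands in for Scene::GetInstance().RenderPixel), and the tile size of 32 is an arbitrary choice, not a tuned value:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Scene::GetInstance().RenderPixel(): writes one RGB pixel.
static void render_pixel(int x, int y, unsigned char *rgb) {
    rgb[0] = static_cast<unsigned char>(x & 0xFF);
    rgb[1] = static_cast<unsigned char>(y & 0xFF);
    rgb[2] = 0;
}

// Renders a w x h image in square tiles; each tile is one unit of work.
// Returns the RGB buffer (3 bytes per pixel, row-major).
std::vector<unsigned char> render_tiled(int w, int h, int tile = 32) {
    std::vector<unsigned char> buf(static_cast<std::size_t>(w) * h * 3);
    unsigned char *out = buf.data();
    const int tiles_x = (w + tile - 1) / tile;  // round up so edge tiles are included
    const int tiles_y = (h + tile - 1) / tile;

    // The only synchronization is the scheduler handing out the next tile,
    // so it happens once per tile rather than once per pixel.
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < tiles_x * tiles_y; ++t) {
        const int x0 = (t % tiles_x) * tile;
        const int y0 = (t / tiles_x) * tile;
        const int x1 = std::min(x0 + tile, w);  // clip partial tiles at the border
        const int y1 = std::min(y0 + tile, h);
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                render_pixel(x, y, out + (static_cast<std::size_t>(y) * w + x) * 3);
    }
    return buf;
}
```

Calling render_tiled(2560, 1920) would produce a buffer that can be handed to the PNG writer after the loop. Built without OpenMP, the pragma is ignored and the same code runs serially.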

If the pixels are independent you don't actually need any locking. You can just divide up the image into rows or columns and let the threads work on their own. For example, you could have each thread operate on every nth row (pseudocode):
for(int y = THREAD_NUM; y < h; y += THREAD_COUNT)
for(int x = 0; x < w; ++x)
render_pixel(x,y);
Where THREAD_NUM is a unique number for each thread such that 0 <= THREAD_NUM < THREAD_COUNT. Then after you join your threadpool, perform the png conversion.
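With OpenMP, the pseudocode above could be sketched as follows, taking THREAD_NUM and THREAD_COUNT from omp_get_thread_num() and omp_get_num_threads(). The function names render_interleaved_rows and mark_pixel are hypothetical, and mark_pixel is only a placeholder for the real per-pixel work:

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the real per-pixel work: marks the pixel as rendered.
static void mark_pixel(int x, int y, int w, unsigned char *buf) {
    buf[static_cast<std::size_t>(y) * w + x] = 1;
}

// Each thread renders rows thread_num, thread_num + thread_count, and so on.
// No two threads touch the same row, so no locking is needed.
std::vector<unsigned char> render_interleaved_rows(int w, int h) {
    std::vector<unsigned char> buf(static_cast<std::size_t>(w) * h, 0);
    unsigned char *out = buf.data();

    #pragma omp parallel
    {
#ifdef _OPENMP
        const int thread_num = omp_get_thread_num();
        const int thread_count = omp_get_num_threads();
#else
        const int thread_num = 0;    // built without OpenMP: one thread does all rows
        const int thread_count = 1;
#endif
        for (int y = thread_num; y < h; y += thread_count)
            for (int x = 0; x < w; ++x)
                mark_pixel(x, y, w, out);
    }
    // The parallel region has ended here, so all rows are complete
    // and the image conversion can run on the full buffer.
    return buf;
}
```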

There is always a performance overhead when creating threads. An OMP parallel region placed inside a for loop is entered again on every iteration, which generates a lot of overhead. For example, in your code
void draw_parallel_for(int w, int h, const char *fname) {
for (int y = 0; y < h; ++y) {
// Here There is a lot of overhead
#pragma omp parallel for num_threads(4)
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
}
It can be re-written as
void draw_parallel_for(int w, int h, const char *fname) {
#pragma omp parallel for num_threads(4)
for (int y = 0; y < h; ++y) {
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
}
or
void draw_parallel_for(int w, int h, const char *fname) {
#pragma omp parallel num_threads(4)
for (int y = 0; y < h; ++y) {
#pragma omp for
for (int x = 0; x < w; ++x)
Scene::GetInstance().RenderPixel(x, y, buf + (y * w + x) * 3);
}
}
This way, the parallel region is entered once instead of once per row, which eliminates that overhead.

Related

C++ omp no significant improvement

I am on MSVC 2019 with the default compiler. The code I am working on is a Mandelbrot image. Relevant bits of my code looks like:
#pragma omp parallel for
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
All of the variables outside of the loop are constexpr, eliminating any dependencies. The mandel function does about 1000 iterations with each call. I would expect the outer loop to run on several threads but my msvc records each run at about 5-6 seconds with or without the omp directive.
Edit (The mandel function):
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = (z_x * z_x) - (z_y * z_y) + x;
z_y = 2 * temp * z_y + y;
if ((z_x * z_x + z_y * z_y) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Your mandel function has a vastly differing runtime cost depending on when the if condition within the loop is met. As a result, the iterations of your loop take different amounts of time. OpenMP implementations typically default to static scheduling (i.e. the loop is broken into N partitions up front). This is kinda bad, because your workload is uneven and doesn't fit static scheduling: some threads finish early and sit idle. See what happens when you use dynamic scheduling.
#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
Also time to rule out the really dumb stuff.....
Have you included omp.h at least once in your program?
Have you enabled omp in the project settings?
IIRC, if you haven't done those two things, omp will be disabled under MSVC.
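One way to check whether OpenMP is actually enabled in a build is to test the standard _OPENMP macro, which conforming compilers define only when OpenMP support is switched on (for MSVC that means the /openmp option). A small sketch, with the hypothetical helper names openmp_max_threads and report_openmp:

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the maximum number of threads OpenMP would use for a parallel
// region, or 0 if this translation unit was compiled without OpenMP.
int openmp_max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 0;
#endif
}

// Prints a one-line diagnostic; call it once at program start.
void report_openmp() {
    const int n = openmp_max_threads();
    if (n == 0)
        std::printf("OpenMP is disabled: #pragma omp lines are being ignored\n");
    else
        std::printf("OpenMP is enabled, up to %d threads\n", n);
}
```

If this reports that OpenMP is disabled, identical timings with and without the directive are expected, since the pragma has no effect.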
This is not an answer, but please do this:
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
long double z_x_squared = 0;
long double z_y_squared = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = z_x_squared - z_y_squared + x;
z_y = 2 * temp * z_y + y;
z_x_squared = z_x * z_x;
z_y_squared = z_y * z_y;
if ((z_x_squared + z_y_squared) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Also, try inverting the order of your two for loops.

OpenMP Fractal Generator

I've been trying to create an OpenMP variant of the Julia set, but I'm unable to produce a coherent image when running more than one thread. I've been trying to track down what looks like a race condition, but cannot find the error.
The offending output looks like the required output, but with "scanlines" across the entirety of the picture.
I've attached the code as well in case that's not clear enough.
#include <iostream>
#include <math.h>
#include <fstream>
#include <sstream>
#include <omp.h>
#include <QtWidgets>
#include <QElapsedTimer>
using namespace std;
double newReal(int x, int imageWidth){
return 1.5*(x - imageWidth / 2)/(0.5 * imageWidth);
}
double newImaginary(int y, int imageHeight){
return (y - imageHeight / 2) / (0.5 * imageHeight);
}
int julia(double& newReal, double& newImaginary, double& oldReal, double& oldImaginary, double cRe, double cIm,int maxIterations){
int i;
for(i = 0; i < maxIterations; i++){
oldReal = newReal;
oldImaginary = newImaginary;
newReal = oldReal * oldReal - oldImaginary * oldImaginary + cRe;
newImaginary = 2 * oldReal * oldImaginary + cIm;
if((newReal * newReal + newImaginary * newImaginary) > 4) break;
}
return i;
}
int main(int argc, char *argv[])
{
int fnum=atoi(argv[1]);
int numThr=atoi(argv[2]);
// int imageHeight=atoi(argv[3]);
// int imageWidth=atoi(arg[4]);
// int maxIterations=atoi(argv[5]);
// double cRe=atof(argv[3]);
// double cIm=atof(argv[4]);
//double cRe, cIm;
int imageWidth=10000, imageHeight=10000, maxIterations=3000;
double newRe, newIm, oldRe, oldIm,cRe,cIm;
cRe = -0.7;
cIm = 0.27015;
string fname;
QElapsedTimer time;
QImage img(imageHeight, imageWidth, QImage::Format_RGB888);//Qimagetesting
img.fill(QColor(Qt::black).rgb());//Qimagetesting
time.start();
int i,x,y;
int r, gr, b;
#pragma omp parallel for shared(imageHeight,imageWidth,newRe,newIm) private(x,y,i) num_threads(3)
for(y = 0; y < imageHeight; y++)
{
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrt(i) % 256);
b = (i % 256);
img.setPixel(x, y, qRgb(r, gr, b));
}
}
//stringstream s;
//s << fnum;
//fname= "julia" + s.str();
//fname+=".png";
//img.save(fname.c_str(),"PNG", 100);
img.save("julia.png","PNG", 100);
cout<< "Finished"<<endl;
cout<<time.elapsed()/1000.00<<" seconds"<<endl;
}
As pointed out in the comments, you have three main problems:
newRe and newIm are shared, but should not be.
The data-sharing of r, gr and b is not specified (they are shared by default, since they are declared outside the parallel region).
There are concurrent calls to QImage::setPixel.
To correct the first two, do not hesitate to nest an omp for loop inside an omp parallel block, and declare the variables at the top of that block, just before the for loop, which makes them private.
To prevent concurrent calls to QImage::setPixel, since this function is not thread safe, you can put it in a critical region with #pragma omp critical:
int main(int argc, char *argv[])
{
int imageWidth=1000, imageHeight=1000, maxIterations=3000;
double cRe = -0.7;
double cIm = 0.27015;
QElapsedTimer time;
QImage img(imageHeight, imageWidth, QImage::Format_RGB888);//Qimagetesting
img.fill(Qt::black);
time.start();
#pragma omp parallel
{
/* all following values will be private */
int i,x,y;
int r, gr, b;
double newRe, newIm, oldRe, oldIm;
#pragma omp for
for(y = 0; y < imageHeight; y++)
{
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrtf(i) % 256);
b = (i % 256);
#pragma omp critical
img.setPixel(x, y, qRgb(r, gr, b));
}
}
}
img.save("julia.png","PNG", 100);
cout<<time.elapsed()/1000.00<<" seconds"<<endl;
return 0;
}
To go further, you can save some cpu time replacing ::setPixel by ::scanLine:
#pragma omp for
for(y = 0; y < imageHeight; y++)
{
uchar *line = img.scanLine(y);
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrtf(i) % 256);
b = (i % 256);
*line++ = r;
*line++ = gr;
*line++ = b;
}
}
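A Qt-free sketch of the same row-per-thread idea, assuming a plain buffer of iteration counts in place of QImage. The names julia_iterations and julia_image are hypothetical. All per-pixel state is declared inside the loop body, so it is private to each thread, and each row is written by exactly one thread, so no critical region is needed:

```cpp
#include <cstddef>
#include <vector>

// Iteration count for one point. All state is local to the call,
// so calls from different threads are independent.
int julia_iterations(double re, double im, double cRe, double cIm, int maxIterations) {
    int i = 0;
    for (; i < maxIterations; ++i) {
        const double oldRe = re, oldIm = im;
        re = oldRe * oldRe - oldIm * oldIm + cRe;
        im = 2 * oldRe * oldIm + cIm;
        if (re * re + im * im > 4) break;
    }
    return i;
}

// Computes the iteration count of every pixel, one row per loop iteration.
std::vector<int> julia_image(int width, int height, int maxIterations) {
    const double cRe = -0.7, cIm = 0.27015;  // same constants as the question
    std::vector<int> iters(static_cast<std::size_t>(width) * height);
    int *out = iters.data();

    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Declared inside the loop body, hence private to the executing thread.
            const double re = 1.5 * (x - width / 2) / (0.5 * width);
            const double im = (y - height / 2) / (0.5 * height);
            out[static_cast<std::size_t>(y) * width + x] =
                julia_iterations(re, im, cRe, cIm, maxIterations);
        }
    }
    return iters;
}
```

The counts can then be mapped to colors and copied into the image serially, or written through scanLine as shown above.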
EDIT:
Since the Julia set seems to have a central symmetry around the point (0,0), you can perform only half of the computation:
int half_heigt = imageHeight / 2;
#pragma omp for
// compute only for first half of image
for(y = 0; y < half_heigt; y++)
{
for(x = 0; x < imageWidth; x++)
{
newRe = newReal(x,imageWidth);
newIm = newImaginary(y,imageHeight);
i= julia(newRe, newIm, oldRe, oldIm, cRe, cIm, maxIterations);
r = (3*i % 256);
gr = (2*(int)sqrtf(i) % 256);
b = (i % 256);
#pragma omp critical
{
// set the point
img.setPixel(x, y, qRgb(r, gr, b));
// set the symmetric point
img.setPixel(imageWidth-1-x, imageHeight-1-y, qRgb(r, gr, b));
}
}
}

3d array as double pointer, freeing memory after usage

First off, let me say that everything works like a charm, except the memory deallocation. Maybe it's pure coincidence that I do not have any other problems and my basic understanding is wrong.
Also, please refrain from telling me "Use std::vector instead" or the like, as I want to learn and understand the basics first!
Now to the code:
class Test
{
public:
Test(const unsigned int xDim, const unsigned int yDim, const unsigned int zDim);
~Test();
private:
unsigned int xDim, yDim, zDim;
TestStructure **testStructure;
void init();
int to1D(int x, int y, int z) { return x + (y * xDim) + (z * xDim * yDim); }
};
cpp:
Test::Test(const unsigned int xDim, const unsigned int yDim, const unsigned int zDim)
{
this->xDim = xDim;
this->yDim = yDim;
this->zDim = zDim;
this->testStructure = new TestStructure *[xDim * yDim * zDim];
init();
}
void Test::init()
{
for(int x = 0; x < xDim; x++)
for(int y = 0; y < yDim; y++)
for(int z = 0; z < zDim; z++)
{
this->testStructure[to1D(x, y, z)] = new TestStructure(
x - xDim / 2,
y - yDim / 2,
z - zDim / 2);
...
}
Now for the destructor I tried two ways:
Test::~Test()
{
for(int x = 0; x < xDim; x++)
for(int y = 0; y < yDim; y++)
for(int z = 0; z < zDim; z++) {
free(this->testStructure[to1D(x, y, z)]);
}
free(this->testStructure);
}
and
Test::~Test()
{
for(int a = 0; a < xDim * yDim * zDim; a++) {
free(this->testStructure[a]);
}
free(this->testStructure);
}
I am pretty sure the second is equivalent to the first, but in both cases I get:
==3854==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new vs free) on 0x62100008e900
Where are my mistakes? Throw them at my face and let me learn!
edit: Basic mistake: use delete (for new) instead of free (for malloc).
Replacing the frees above with deletes leads to:
==4076==ERROR: AddressSanitizer: new-delete-type-mismatch on 0x60600002c300 in thread T0:
edit2:
Pardon me, with delete[] for the array part it leads to:
==4138==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62100008e8f8 at pc 0x000000480b3d bp 0x7ffc5bcbc0f0 sp 0x7ffc5bcbc0e0
edit3/SOLUTION: Did it the wrong way. This seems to work:
Test::~Test()
{
for(int a = 0; a < xDim * yDim * zDim; a++) {
delete(this->testStructure[a]);
}
delete[] this->testStructure;
}
Big thanks @SomeProgrammerDude
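The pairing rule behind the fix can be sketched in a reduced form: memory from new is released with delete, and memory from new[] with delete[]. The class Grid and the element type Cell below are hypothetical stand-ins for Test and TestStructure, and the live counter exists only to make the bookkeeping visible:

```cpp
#include <cstddef>

// Hypothetical element type standing in for TestStructure.
struct Cell {
    static int live;  // number of Cells currently allocated (illustration only)
    int x, y, z;
    Cell(int x_, int y_, int z_) : x(x_), y(y_), z(z_) { ++live; }
    ~Cell() { --live; }
};
int Cell::live = 0;

class Grid {
public:
    Grid(std::size_t xDim, std::size_t yDim, std::size_t zDim)
        : xDim_(xDim), yDim_(yDim), zDim_(zDim),
          cells_(new Cell *[xDim * yDim * zDim]) {  // array of pointers: new[]
        for (std::size_t z = 0; z < zDim_; ++z)
            for (std::size_t y = 0; y < yDim_; ++y)
                for (std::size_t x = 0; x < xDim_; ++x)
                    cells_[to1D(x, y, z)] = new Cell(  // single object: new
                        static_cast<int>(x), static_cast<int>(y), static_cast<int>(z));
    }

    ~Grid() {
        for (std::size_t i = 0; i < xDim_ * yDim_ * zDim_; ++i)
            delete cells_[i];  // matches the scalar new above
        delete[] cells_;       // matches the new[] in the constructor
    }

    // Copying would make two Grids delete the same pointers, so forbid it.
    Grid(const Grid &) = delete;
    Grid &operator=(const Grid &) = delete;

    const Cell &at(std::size_t x, std::size_t y, std::size_t z) const {
        return *cells_[to1D(x, y, z)];
    }

private:
    std::size_t to1D(std::size_t x, std::size_t y, std::size_t z) const {
        return x + y * xDim_ + z * xDim_ * yDim_;
    }
    std::size_t xDim_, yDim_, zDim_;
    Cell **cells_;
};
```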

Increment shared loop counter in OpenMP for progress reporting

I want to keep track of total pixels and rays processed by a long running raytracing process. If I update the shared variables every iteration, the process will slow down noticeably because of synchronization. I'd like to keep track of the progress and still get accurate count results at the end. Is there a way to do this with OpenMP for loops?
Here's some code of the loop in question:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
#pragma omp parallel for schedule(dynamic, 4096)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount); // will increment sharedRayCount
renderTarget.setPixel(x, y, rayToScene.color.clamped());
++sharedPixelCount;
}
}
Since you have a chunk size of 4096 for your dynamically scheduled parallel-for loop, why not use that as the granularity for amortizing the counter updates?
For example, something like the following might work. I didn't test this code and you probably need to add some bookkeeping for totalPixelCount%4096!=0.
Unlike the previous answer, this does not add a branch to your loop, other than the one implied by the loop itself, for which many processors have optimized instructions. It also does not require any extra variables or arithmetic.
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
#pragma omp parallel for schedule(dynamic, 1)
for (int j = 0; j < totalPixelCount; j+=4096)
{
for (int i = j; i < (j+4096); ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
}
sharedPixelCount += 4096;
}
}
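The bookkeeping for a total that is not a multiple of the block size could look roughly like this. It is a reduced sketch with the hypothetical name count_in_blocks; the per-item work is left as a placeholder, and it assumes, as the question's code does, that std::atomic behaves as expected across OpenMP threads:

```cpp
#include <algorithm>
#include <atomic>

// Processes `total` items in blocks of `block` items and adds each finished
// block to sharedCount, so the atomic is touched once per block.
// Returns the value of sharedCount after the loop.
int count_in_blocks(int total, int block, std::atomic<int> &sharedCount) {
    #pragma omp parallel for schedule(dynamic, 1)
    for (int j = 0; j < total; j += block) {
        const int end = std::min(j + block, total);  // last block may be shorter
        int done = 0;                                // thread-local, no synchronization
        for (int i = j; i < end; ++i) {
            // ... per-item work (e.g. tracing pixel i) would go here ...
            ++done;
        }
        sharedCount += done;  // one atomic update per block
    }
    return sharedCount.load();
}
```

Because the upper bound is clamped with std::min, the final count matches the total exactly even when the last block is partial.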
It's not really clear why sharedPixelCount needs to be updated inside of this loop at all, since it is not referenced in the loop body. If this is correct, I suggest the following instead.
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
int reducePixelCount = 0;
#pragma omp parallel for schedule(dynamic, 4096) \
reduction(+:reducePixelCount)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
++reducePixelCount; /* thread-local operation, not atomic */
}
/* The interoperability of C++11 atomics and OpenMP is not defined yet,
* so this should just be avoided until OpenMP 5 at the earliest.
* It is sufficient to reduce over a non-atomic type and
* do the assignment here. */
sharedPixelCount = reducePixelCount;
}
Here's an example on how to do it:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
int rayCount = 0;
int previousRayCount = 0;
#pragma omp parallel for schedule(dynamic, 1000) reduction(+:rayCount) firstprivate(previousRayCount)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, rayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
if ((i + 1) % 100 == 0)
{
sharedPixelCount += 100;
sharedRayCount += (rayCount - previousRayCount);
previousRayCount = rayCount;
}
}
sharedPixelCount = totalPixelCount;
sharedRayCount = rayCount;
}
It won't be 100% accurate while the loop is running, but the error is negligible. At the end exact values will be reported.

Compiling smallpt with OpenMP causes infinite loop at runtime

I'm currently looking at the smallpt code by Kevin Beason. I compiled the code with what it says on the tin using g++ -O3 -fopenmp smallpt.cpp, and I'm running into what seems like either an infinite loop or a deadlock.
Compiling the code using just g++ -O3 smallpt.cpp produces the images seen on his page, but I can't get the OpenMP parallelization to work at all.
For reference, I'm compiling on a Windows 7 64-bit machine using Cygwin with GCC 4.5.0. The author himself has stated he's run the same exact code and has run into no issues whatsoever, but I can't get the program to actually exit when it's done tracing the image.
Could this be an issue with my particular compiler and environment, or am I doing something wrong here? Here's the particular snippet of code that's parallelized using OpenMP. I've only modified it with some minor formatting to make it more readable.
int main(int argc, char *argv[])
{
int w=1024, h=768, samps = argc==2 ? atoi(argv[1])/4 : 1;
Ray cam(Vec(50,52,295.6), Vec(0,-0.042612,-1).norm()); // cam pos, dir
Vec cx=Vec(w*.5135/h);
Vec cy=(cx%cam.d).norm()*.5135, r, *c=new Vec[w*h];
#pragma omp parallel for schedule(dynamic, 1) private(r) // OpenMP
for (int y=0; y<h; y++) // Loop over image rows
{
fprintf(stderr,"\rRendering (%d spp) %5.2f%%",samps*4,100.*y/(h-1));
for (unsigned short x=0, Xi[3]={0,0,y*y*y}; x<w; x++) // Loop cols
{
for (int sy=0, i=(h-y-1)*w+x; sy<2; sy++) // 2x2 subpixel rows
{
for (int sx=0; sx<2; sx++, r=Vec()) // 2x2 subpixel cols
{
for (int s=0; s<samps; s++)
{
double r1=2*erand48(Xi), dx=r1<1 ? sqrt(r1)-1: 1-sqrt(2-r1);
double r2=2*erand48(Xi), dy=r2<1 ? sqrt(r2)-1: 1-sqrt(2-r2);
Vec d = cx*( ( (sx+.5 + dx)/2 + x)/w - .5) +
cy*( ( (sy+.5 + dy)/2 + y)/h - .5) + cam.d;
r = r + radiance(Ray(cam.o+d*140,d.norm()),0,Xi)*(1./samps);
} // Camera rays are pushed ^^^^^ forward to start in interior
c[i] = c[i] + Vec(clamp(r.x),clamp(r.y),clamp(r.z))*.25;
}
}
}
}
/* PROBLEM HERE!
The code never seems to reach here
PROBLEM HERE!
*/
FILE *f = fopen("image.ppm", "w"); // Write image to PPM file.
fprintf(f, "P3\n%d %d\n%d\n", w, h, 255);
for (int i=0; i<w*h; i++)
fprintf(f,"%d %d %d ", toInt(c[i].x), toInt(c[i].y), toInt(c[i].z));
}
Here's the output that the program produces, when it runs to completion:
$ time ./a
Rendering (4 spp) 100.00%spp) spp) 00..0026%%
The following is the most basic code that can reproduce the above behavior
#include <cstdio>
#include <cstdlib>
#include <cmath>
struct Vector
{
double x, y, z;
Vector() : x(0), y(0), z(0) {}
};
int toInt(double x)
{
return (int)(255 * x);
}
double clamp(double x)
{
if (x < 0) return 0;
if (x > 1) return 1;
return x;
}
int main(int argc, char *argv[])
{
int w = 1024;
int h = 768;
int samples = 1;
Vector r, *c = new Vector[w * h];
#pragma omp parallel for schedule(dynamic, 1) private(r)
for (int y = 0; y < h; y++)
{
fprintf(stderr,"\rRendering (%d spp) %5.2f%%",samples * 4, 100. * y / (h - 1));
for (unsigned short x = 0, Xi[3]= {0, 0, y*y*y}; x < w; x++)
{
for (int sy = 0, i = (h - y - 1) * w + x; sy < 2; sy++)
{
for (int sx = 0; sx < 2; sx++, r = Vector())
{
for (int s = 0; s < samples; s++)
{
double r1 = 2 * erand48(Xi), dx = r1 < 1 ? sqrt(r1) - 1 : 1 - sqrt(2 - r1);
double r2 = 2 * erand48(Xi), dy = r2 < 1 ? sqrt(r2) - 1 : 1 - sqrt(2 - r2);
r.x += r1;
r.y += r2;
}
c[i].x += clamp(r.x) / 4;
c[i].y += clamp(r.y) / 4;
}
}
}
}
FILE *f = fopen("image.ppm", "w"); // Write image to PPM file.
fprintf(f, "P3\n%d %d\n%d\n", w, h, 255);
for (int i=0; i<w*h; i++)
fprintf(f,"%d %d %d ", toInt(c[i].x), toInt(c[i].y), toInt(c[i].z));
}
This is the output obtained from the sample program above:
$ g++ test.cpp
$ ./a
Rendering (4 spp) 100.00%
$ g++ test.cpp -fopenmp
$ ./a
Rendering (4 spp) 100.00%spp) spp) 00..0052%%
fprintf is not guarded by a critical section or a #pragma omp single/master, so several threads write to stderr at the same time. I wouldn't be surprised if on Windows this thing messes up the console.
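A sketch of guarding the progress output with a named critical section, using the hypothetical function name render_rows_with_progress and a placeholder for the per-row work. This addresses the interleaved console output; whether it also cures the hang described in the question is an assumption that would need testing in that environment:

```cpp
#include <cstdio>

// Renders h rows (placeholder work) and prints progress from inside a
// critical section so that only one thread writes to stderr at a time.
// Returns the number of rows completed.
int render_rows_with_progress(int h) {
    int rows_done = 0;

    #pragma omp parallel for schedule(dynamic, 1)
    for (int y = 0; y < h; ++y) {
        // ... per-row rendering would go here ...

        #pragma omp critical(progress)
        {
            ++rows_done;  // protected by the same critical section
            std::fprintf(stderr, "\rRendering %5.2f%%", 100.0 * rows_done / h);
        }
    }
    std::fprintf(stderr, "\n");
    return rows_done;
}
```

Counting completed rows inside the critical section also makes the percentage increase monotonically, which the row index y does not guarantee under dynamic scheduling.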