I am writing a function to change the values of the pixels in an image. It works by splitting the task of shading the pixels across multiple threads. For example, if there are 4 threads then each one shades every 4th pixel. What I find strange is that the threaded approach is about 1/10 of a second slower than doing it in a single loop. I can't figure out why, since I have a quad-core CPU and there is no real synchronization involved between the threads. I would expect it to be about 4x faster minus a bit of overhead. Am I doing something wrong here?
Note that I set nthreads=1 to measure the single loop approach.
FYI raster is a pointer in the class which points to a dynamic array of pixels.
void RGBImage::shade(Shader sh, size_t sx, size_t sy, size_t ex, size_t ey)
{
validate();
if(ex == 0)
ex = width;
if(ey == 0)
ey = height;
if(sx < 0 || sx >= width || sx >= ex || ex > width || sy < 0 || sy >= height || sy >= ey
|| ey > height)
throw std::invalid_argument("Bounds Invalid");
size_t w = ex - sx;
size_t h = ey - sy;
size_t nthreads = std::thread::hardware_concurrency();
if(nthreads > MAX_THREADS)
nthreads = MAX_THREADS;
else if(nthreads < 1)
nthreads = 1;
size_t load_per_thread = w * h / nthreads;
if(load_per_thread < MIN_THREAD_LOAD)
nthreads = (w * h) / MIN_THREAD_LOAD;
clock_t start = clock();
if(nthreads > 1)
{
std::unique_ptr<std::thread[]> threads(new std::thread[nthreads]);
for(size_t i = 0; i < nthreads; i++)
threads[i] = std::thread([=]()
{
for(size_t p = i; p < (w * h); p += nthreads)
{
size_t x = sx + p % w;
size_t y = sy + p / w;
sh(raster[y * width + x], x, y);
}
});
for(size_t i = 0; i < nthreads; i++)
threads[i].join();
}
else
{
for(size_t p = 0; p < (w * h); ++p)
{
size_t x = sx + p % w;
size_t y = sy + p / w;
sh(raster[y * width + x], x, y);
}
}
std::cout << ((float)(clock() - start) / CLOCKS_PER_SEC) << std::endl;
}
I took some of the advice and changed up my function.
void RGBImage::shade(Shader sh, bool threads)
{
validate();
clock_t c = clock();
if(threads)
{
int nthreads = std::thread::hardware_concurrency();
size_t pix = width * height;
if(nthreads < 1)
nthreads = 1;
else if(nthreads > MAX_THREADS)
nthreads = MAX_THREADS;
if(pix / nthreads < MIN_THREAD_LOAD)
nthreads = pix / MIN_THREAD_LOAD;
size_t pix_per_threads = pix / nthreads;
std::unique_ptr<std::thread[]> t(new std::thread[nthreads]);
for(int i = 0; i < nthreads; i++)
{
t[i] = std::thread([=]()
{
size_t offset = i * pix_per_threads;
size_t x = offset % width;
size_t y = offset / width;
sh(raster + offset, *this, x, y,
i == nthreads - 1 ? pix_per_threads + (width * height) % nthreads : pix_per_threads);
});
}
for(int i = 0; i < nthreads; i++)
t[i].join();
}
else
{
sh(raster, *this, 0, 0, width * height);
}
std::cout << ((float)(clock() - c) / CLOCKS_PER_SEC) << std::endl;
}
Now it runs about 10x faster but the threaded version is still slower.
What you have done is maximized contention between threads.
You want to minimize it.
Threads should work on a scanline at a time (or more). Divide your image into n blocks of roughly equal number of scanlines (left of image to right), and tell each thread to work on the nth block of scanlines.
std::vector<std::thread> threads;
threads.reserve(nthreads);
for(size_t i = 0; i < nthreads; i++) {
size_t v_start = (h*i)/nthreads;
size_t v_end = (h*(i+1))/nthreads;
threads.push_back(std::thread([=]()
{
for(size_t y = v_start; y < v_end; ++y)
{
for (size_t x = 0; x < w; ++x) {
sh(raster[y * width + x], x, y);
}
}
}));
}
for(auto&& thread:threads)
thread.join();
Another approach is to grab the PPL (Parallel Patterns Library) and use it. It will dynamically balance the number of threads based on the current load and hardware specs, and might use thread pooling to reduce thread startup costs.
A serious concern is your Shader sh. You do not want to be calling anything as expensive as a function pointer (or even more expensive, a std::function) on a per-pixel basis.
My general rule is that I write a "for each pixel" function that takes the pixel operation as a F&&, and pass it to a "for each scanline" function after wrapping the pixel-shader in an (in-header-file) scanline based operation. The cost of indirection is then reduced to once per scanline. In addition, the compiler may be able to optimize between pixel operations (say, doing SIMD), while a per-pixel call cannot be optimized this way.
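A rough sketch of the "for each pixel" over "for each scanline" layering described above. The Pixel struct and both function names are made up for illustration (the question does not show its pixel type), and threading is left out so that only the indirection structure is visible. A threaded driver would call for_each_scanline once per block of rows.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pixel type; the question's real pixel layout is not shown.
struct Pixel { std::uint8_t r, g, b; };

// Calls op once per scanline in [y_begin, y_end). This is the only place
// where an indirect call would happen if op were type-erased.
template <class ScanlineOp>
void for_each_scanline(Pixel* raster, std::size_t width,
                       std::size_t y_begin, std::size_t y_end, ScanlineOp&& op)
{
    for (std::size_t y = y_begin; y < y_end; ++y)
        op(raster + y * width, y);
}

// Wraps a per-pixel operation in a scanline operation. Because the pixel
// operation is a template parameter, the inner loop can be inlined and is
// a candidate for vectorization.
template <class PixelOp>
void for_each_pixel(Pixel* raster, std::size_t width, std::size_t height,
                    PixelOp&& op)
{
    for_each_scanline(raster, width, 0, height,
        [&](Pixel* row, std::size_t y) {
            for (std::size_t x = 0; x < width; ++x)
                op(row[x], x, y);
        });
}
```

Passing a lambda as the pixel operation keeps the per-pixel call visible to the optimizer; passing a std::function or a function pointer would compile too, but would bring the per-pixel indirection back.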
A final problem with your "interleave" solution is it makes it impossible for the compiler to vectorize your code. Vectorization can easily give a 3-4x speedup.
Well, the answer is pretty simple. The threaded solution was in fact faster. It was just consuming more clock() time, and the clock() function is not good for timing threads.
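To make the difference visible, one option is to record wall-clock time with std::chrono::steady_clock next to std::clock(). This is a minimal sketch; the Timing struct and time_call helper are names invented here. It assumes an implementation where clock() reports processor time accumulated over all threads, which is common on POSIX systems but not universal.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper type: both measurements for one call.
struct Timing {
    double wall_seconds; // elapsed real time
    double cpu_seconds;  // processor time as reported by std::clock()
};

// Runs work() once and measures it both ways.
template <class F>
Timing time_call(F&& work)
{
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(wall_end - wall_start).count(),
             static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC };
}
```

With four busy threads on such a platform, cpu_seconds can come out at roughly four times wall_seconds, which is the effect that made the threaded version look slower.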
In C++ you can take advantage of those cores using parallelism, or use C++ AMP (Accelerated Massive Parallelism). My vote would be for the latter.
sample amp project: http://austin.codeplex.com/
https://msdn.microsoft.com/en-us/library/hh265137.aspx
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/08/30/learn-c-amp.aspx
I am on MSVC 2019 with the default compiler. The code I am working on renders a Mandelbrot image. The relevant bits of my code look like:
#pragma omp parallel for
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
All of the variables outside of the loop are constexpr, eliminating any dependencies. The mandel function does about 1000 iterations with each call. I would expect the outer loop to run on several threads but my msvc records each run at about 5-6 seconds with or without the omp directive.
Edit (The mandel function):
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = (z_x * z_x) - (z_y * z_y) + x;
z_y = 2 * temp * z_y + y;
if ((z_x * z_x + z_y * z_y) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Your mandel function has a vastly differing runtime cost depending on whether the if condition within the loop has been met. As a result, each iteration of your loop will run in a different time. The default schedule is implementation-defined, but most OpenMP implementations default to static scheduling (i.e. break the loop into N equal partitions). This is kinda bad, because you don't have a workload that fits static scheduling. See what happens when you use dynamic scheduling.
#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
Also time to rule out the really dumb stuff.....
Have you included omp.h at least once in your program?
Have you enabled omp in the project settings?
IIRC, if you haven't done those two things, omp will be disabled under MSVC.
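A quick way to check whether OpenMP is actually switched on for a build is the standard _OPENMP macro, which conforming compilers define only when OpenMP support is enabled (for MSVC that is the /openmp switch). The function name below is made up for this sketch.

```cpp
// Returns the value of _OPENMP (a yyyymm date identifying the supported
// OpenMP version) when OpenMP is enabled for this build, and 0 otherwise.
inline long openmp_version()
{
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}
```

If this returns 0, the #pragma omp lines are being ignored and the loop runs on a single thread no matter which schedule is requested.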
This is not an answer, but please do this:
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
long double z_x_squared = 0;
long double z_y_squared = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = z_x_squared - z_y_squared + x;
z_y = 2 * temp * z_y + y;
z_x_squared = z_x * z_x;
z_y_squared = z_y * z_y;
if ((z_x_squared + z_y_squared) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Also, try inverting the order of your two for loops.
I am looking at switching from NVIDIA to AMD for my compute card because I want double precision support. Before doing this I decided to learn OpenCL on my NVIDIA card to see if I like it. I want to convert the following code from CUDA to OpenCL. I am using the curand library to generate uniformly and normally distributed random numbers. Each thread needs to be able to create a different sequence of random numbers and generate a few million per thread. Here is the code. How would I go about this in OpenCL? Everything I have read online seems to imply that I should generate a buffer of random numbers and then use that on the GPU, but this is not practical for me.
template<int NArgs, typename OptimizationFunctor>
__global__
void statistical_solver_kernel(float* args_lbounds,
float* args_ubounds,
int trials,
int initial_temp,
unsigned long long seed,
float* results,
OptimizationFunctor f)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx >= trials)
return;
curandState rand;
curand_init(seed, idx, 0, &rand);
float x[NArgs];
for(int i = 0; i < NArgs; i++)
{
x[i] = curand_uniform(&rand) * (args_ubounds[i]- args_lbounds[i]) + args_lbounds[i];
}
float y = f(x);
for(int t = initial_temp - 1; t > 0; t--)
{
float t_percent = (float)t / initial_temp;
float x_prime[NArgs];
for(int i = 0; i < NArgs; i++)
{
x_prime[i] = curand_normal(&rand) * (args_ubounds[i] - args_lbounds[i]) * t_percent + x[i];
x_prime[i] = fmaxf(args_lbounds[i], x_prime[i]);
x_prime[i] = fminf(args_ubounds[i], x_prime[i]);
}
float y_prime = f(x_prime);
if(y_prime < y || (y_prime - y) / y_prime < t_percent)
{
y = y_prime;
for(int i = 0; i < NArgs; i++)
{
x[i] = x_prime[i];
}
}
}
float* rptr = results + idx * (NArgs + 1);
rptr[0] = y;
for(int i = 1; i <= NArgs; i++)
rptr[i] = x[i - 1];
}
The VexCL library provides an implementation of counter-based generators. You can use those inside larger expressions, see this slide for an example.
EDIT: Take this with a grain of salt, as I am the author of VexCL :).
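For readers unfamiliar with the idea, here is a toy illustration of what "counter-based" means, written as plain C++ rather than OpenCL C. This is not VexCL's API and not one of the vetted generators such as Philox or Threefry; it simply reuses the SplitMix64 finalizer as a mixing function, and the names mix64, random_bits and uniform01 are invented here. The point is only that each value is a pure function of (seed, thread index, counter), so no per-thread state or pre-filled buffer is needed.

```cpp
#include <cstdint>

// SplitMix64 output function, used here purely as a bit mixer
// (an illustrative choice, not a statistically vetted GPU generator).
inline std::uint64_t mix64(std::uint64_t z)
{
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Stateless generator: the result depends only on its three arguments.
inline std::uint64_t random_bits(std::uint64_t seed, std::uint64_t thread_idx,
                                 std::uint64_t counter)
{
    return mix64(mix64(seed ^ mix64(thread_idx)) + counter);
}

// Maps the top 24 bits to a float in [0, 1).
inline float uniform01(std::uint64_t seed, std::uint64_t thread_idx,
                       std::uint64_t counter)
{
    const std::uint64_t bits = random_bits(seed, thread_idx, counter);
    return static_cast<float>(bits >> 40) * (1.0f / 16777216.0f);
}
```

In a kernel, each work item would pass its global id as thread_idx and bump a local counter for every draw. For real work, a tested counter-based generator is the safer choice.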
I have implemented two different functions to round a double figure to integer.
Here is the first function
static inline int round_v1(double value)
{
int t;
__asm
{
fld value;
fistp t;
}
return t;
}
Here is the second function
static inline int round_v2(double value)
{
double intpart, fractpart;
fractpart = modf(value, &intpart);
if ((fabs(fractpart) != 0.5) || ((((int)intpart) % 2) != 0))
return (int)(value + (value >= 0 ? 0.5 : -0.5));
else
return (int)intpart;
}
Both functions work well in a single thread, but the second one does not work in multi-threaded code (using OpenMP). The program just crashes when I use the second one.
Here is the main code where the round_v1 or round_v2 function is called.
void
BilateralFilter_Invoker::doFilter() const
{
if (!src || !dst) return;
int i, j;
int src_width = width + (radius << 1);
omp_set_num_threads(2);
#pragma omp parallel for
for (i = 0; i < height; ++i)
{
unsigned char* pSrc = src + (i+radius)*src_step + radius;
unsigned char* pDst = dst + i*dst_step;
for (j = 0; j < width; ++j)
{
float sum = 0.f, wsum = 0.f;
int val0 = pSrc[j];
for (int k = 0; k < maxk; ++k)
{
int val = pSrc[j + space_offset[k]];
float w = space_weight[k] * color_weight[std::abs(val-val0)];
sum += val * w;
wsum += w;
}
//pDst[j] = (unsigned char)round_v2(sum / wsum);
pDst[j] = (unsigned char)round_v1(sum / wsum);
}
}
}
the variables src, dst, height, width, src_step, dst_step, radius, maxk, space_offset, space_weight, color_weight are member variables of class BilateralFilter_Invoker.
I called round_v1 and round_v2 in turn for testing, and the program crashes only when round_v2 is called. I wonder whether the modf(double, double*) function may cause this problem. As a further test, I commented out this line
fractpart = modf(value, &intpart);
and replaced it with
fractpart = intpart = value;
I ran the program again and it did not crash. I have no idea whether modf(double, double*) causes this problem, or whether something else in my code is wrong.
Note that the operating system I use is Windows 7 and the compiler is VC10.
You have made the most common mistake with OpenMP on SO. The iterator of your inner loop needs to be made private. You can either do
#pragma omp parallel for private(j)
or use loop initial declarations
for (int j = 0; j < width; ++j)
In fact, since you never use i or j outside of the loops they apply to there is no reason to declare them C89 style outside of the loops.
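A small self-contained sketch of the loop-scoped style, using a made-up row_sums function rather than the bilateral filter. Every variable written inside the parallel loop is declared inside it, so each thread gets its own copy without any private clause.

```cpp
#include <vector>

// Sums each row of a row-major height x width matrix, one row per iteration
// of the parallel loop. Compiles and gives the same result without OpenMP.
std::vector<long> row_sums(const std::vector<int>& data, int height, int width)
{
    std::vector<long> sums(height, 0);
    #pragma omp parallel for
    for (int i = 0; i < height; ++i)      // parallel loop variable: private
    {
        long acc = 0;                     // declared in the loop body: private
        for (int j = 0; j < width; ++j)   // declared in the for statement: private
            acc += data[i * width + j];
        sums[i] = acc;                    // each thread writes a distinct element
    }
    return sums;
}
```

Had j been declared before the pragma, as in the question, all threads would share one j and step on each other's inner loop.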
Hi everyone, I am trying to implement pattern matching with FFT, but I am not sure what the result should be. (I think I am missing something, even though I have read a lot about the problem and tried a lot of different implementations; this one is the best so far.) Here is my FFT correlation function.
void fft2d(fftw_complex**& a, int rows, int cols, bool forward = true)
{
fftw_plan p;
for (int i = 0; i < rows; ++i)
{
p = fftw_plan_dft_1d(cols, a[i], a[i], forward ? FFTW_FORWARD : FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(p);
fftw_destroy_plan(p); // release the plan; otherwise one plan leaks per row
}
fftw_complex* t = (fftw_complex*)fftw_malloc(rows * sizeof(fftw_complex));
for (int j = 0; j < cols; ++j)
{
for (int i = 0; i < rows; ++i)
{
t[i][0] = a[i][j][0];
t[i][1] = a[i][j][1];
}
p = fftw_plan_dft_1d(rows, t, t, forward ? FFTW_FORWARD : FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(p);
fftw_destroy_plan(p); // release the plan; otherwise one plan leaks per column
for (int i = 0; i < rows; ++i)
{
a[i][j][0] = t[i][0];
a[i][j][1] = t[i][1];
}
}
fftw_free(t);
}
int findCorrelation(int argc, char* argv[])
{
BMP bigImage;
BMP keyImage;
BMP result;
RGBApixel blackPixel = { 0, 0, 0, 1 };
const bool swapQuadrants = (argc == 4);
if (argc < 3 || argc > 4) {
cout << "correlation img1.bmp img2.bmp" << endl;
return 1;
}
if (!keyImage.ReadFromFile(argv[1])) {
return 1;
}
if (!bigImage.ReadFromFile(argv[2])) {
return 1;
}
//Preparations
const int maxWidth = std::max(bigImage.TellWidth(), keyImage.TellWidth());
const int maxHeight = std::max(bigImage.TellHeight(), keyImage.TellHeight());
const int rowsCount = maxHeight;
const int colsCount = maxWidth;
BMP bigTemp = bigImage;
BMP keyTemp = keyImage;
keyImage.SetSize(maxWidth, maxHeight);
bigImage.SetSize(maxWidth, maxHeight);
for (int i = 0; i < rowsCount; ++i)
for (int j = 0; j < colsCount; ++j) {
RGBApixel p1;
if (i < bigTemp.TellHeight() && j < bigTemp.TellWidth()) {
p1 = bigTemp.GetPixel(j, i);
} else {
p1 = blackPixel;
}
bigImage.SetPixel(j, i, p1);
RGBApixel p2;
if (i < keyTemp.TellHeight() && j < keyTemp.TellWidth()) {
p2 = keyTemp.GetPixel(j, i);
} else {
p2 = blackPixel;
}
keyImage.SetPixel(j, i, p2);
}
//Here is where the transforms begin
fftw_complex **a = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
fftw_complex **b = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
fftw_complex **c = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
for (int i = 0; i < rowsCount; ++i) {
a[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
b[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
c[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
for (int j = 0; j < colsCount; ++j) {
RGBApixel p1;
p1 = bigImage.GetPixel(j, i);
a[i][j][0] = (0.299*p1.Red + 0.587*p1.Green + 0.114*p1.Blue);
a[i][j][1] = 0.0;
RGBApixel p2;
p2 = keyImage.GetPixel(j, i);
b[i][j][0] = (0.299*p2.Red + 0.587*p2.Green + 0.114*p2.Blue);
b[i][j][1] = 0.0;
}
}
fft2d(a, rowsCount, colsCount);
fft2d(b, rowsCount, colsCount);
result.SetSize(maxWidth, maxHeight);
for (int i = 0; i < rowsCount; ++i)
for (int j = 0; j < colsCount; ++j) {
fftw_complex& y = a[i][j];
fftw_complex& x = b[i][j];
double u = x[0], v = x[1];
double m = y[0], n = y[1];
c[i][j][0] = u*m + n*v;
c[i][j][1] = v*m - u*n;
int fx = j;
if (fx>(colsCount / 2)) fx -= colsCount;
int fy = i;
if (fy>(rowsCount / 2)) fy -= rowsCount;
float r2 = (fx*fx + fy*fy);
const double cuttoffCoef = (maxWidth * maxHeight) / 37992.;
if (r2<128 * 128 * cuttoffCoef)
c[i][j][0] = c[i][j][1] = 0;
}
fft2d(c, rowsCount, colsCount, false);
const int halfCols = colsCount / 2;
const int halfRows = rowsCount / 2;
if (swapQuadrants) {
for (int i = 0; i < halfRows; ++i)
for (int j = 0; j < halfCols; ++j) {
std::swap(c[i][j][0], c[i + halfRows][j + halfCols][0]);
std::swap(c[i][j][1], c[i + halfRows][j + halfCols][1]);
}
for (int i = halfRows; i < rowsCount; ++i)
for (int j = 0; j < halfCols; ++j) {
std::swap(c[i][j][0], c[i - halfRows][j + halfCols][0]);
std::swap(c[i][j][1], c[i - halfRows][j + halfCols][1]);
}
}
for (int i = 0; i < rowsCount; ++i)
for (int j = 0; j < colsCount; ++j) {
const double& g = c[i][j][0];
RGBApixel pixel;
pixel.Alpha = 0;
int gInt = 255 - static_cast<int>(std::floor(g + 0.5));
pixel.Red = gInt;
pixel.Green = gInt;
pixel.Blue = gInt;
result.SetPixel(j, i, pixel);
}
BMP res;
res.SetSize(maxWidth, maxHeight);
result.WriteToFile("result.bmp");
return 0;
}
Sample output
This question would probably be more appropriately posted on another site like cross validated (metaoptimize.com used to also be a good one, but it appears to be gone)
That said:
There are two similar operations you can perform with FFT: convolution and correlation. Convolution is used for determining how two signals interact with each other, whereas correlation can be used to express how similar two signals are to each other. Make sure you're doing the right operation, as they're both commonly implemented through a DFT.
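In the frequency domain the two differ by a single complex conjugate: convolution multiplies the two spectra directly, while correlation multiplies one spectrum by the conjugate of the other. A tiny 1D sketch with a naive O(n^2) DFT (fine for toy sizes only; all function names are invented here, and both inputs are assumed to have the same length):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive DFT; the inverse transform divides by n.
std::vector<cd> dft(const std::vector<cd>& in, bool forward)
{
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    const double sign = forward ? -1.0 : 1.0;
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum = 0.0;
        for (std::size_t t = 0; t < n; ++t)
            sum += in[t] * std::polar(1.0, sign * 2.0 * pi * double(k * t) / double(n));
        out[k] = forward ? sum : sum / double(n);
    }
    return out;
}

// Circular cross-correlation of two equal-length real signals.
// Dropping std::conj below would give circular convolution instead.
std::vector<double> cross_correlate(const std::vector<double>& a,
                                    const std::vector<double>& b)
{
    const std::vector<cd> A = dft(std::vector<cd>(a.begin(), a.end()), true);
    const std::vector<cd> B = dft(std::vector<cd>(b.begin(), b.end()), true);
    std::vector<cd> C(A.size());
    for (std::size_t k = 0; k < A.size(); ++k)
        C[k] = B[k] * std::conj(A[k]);
    const std::vector<cd> c = dft(C, false);
    std::vector<double> out(c.size());
    for (std::size_t i = 0; i < c.size(); ++i)
        out[i] = c[i].real();
    return out;
}

// Index of the largest value, i.e. the best-matching shift.
std::size_t peak_index(const std::vector<double>& v)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] > v[best]) best = i;
    return best;
}
```

If b is a copy of a circularly shifted by s samples, the peak of cross_correlate(a, b) lands at index s. The question's c[i][j] computation has the same conjugate-product form, applied in 2D.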
For this type of application of DFTs you usually wouldn't extract any useful information in the fourier spectrum unless you were looking for frequencies common to both data sources or whatever (eg, if you were comparing two bridges to see if their supports are spaced similarly).
Your 3rd image looks a lot like the power domain; normally I see the correlation output entirely grey except where overlap occurred. Your code definitely appears to be computing the inverse DFT, so unless I'm missing something the only other explanation I've come up with for the fuzzy look could be some of the "fudge factor" code in there like:
if (r2<128 * 128 * cuttoffCoef)
c[i][j][0] = c[i][j][1] = 0;
As for what you should expect: wherever there are common elements between the two images you'll see a peak. The larger the peak, the more similar the two images are near that region.
Some comments and/or recommended changes:
1) Convolution & correlation are not scale invariant operations. In other words, the size of your pattern image can make a significant difference in your output.
2) Normalize your images before correlation.
When you get the image data ready for the forward DFT pass:
a[i][j][0] = (0.299*p1.Red + 0.587*p1.Green + 0.114*p1.Blue);
a[i][j][1] = 0.0;
/* ... */
How you grayscale the image is your business (though I would've picked something like sqrt( r*r + b*b + g*g )). However, I don't see you doing anything to normalize the image.
The word "normalize" can take on a few different meanings in this context. Two common types:
normalize the range of values between 0.0 and 1.0
normalize the "whiteness" of the images
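A minimal sketch of both kinds of normalization on a flat array of grayscale values. The function names are made up, and reading "whiteness" as the mean brightness is my interpretation for this example, not something the list above pins down.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Rescales the values linearly so the smallest becomes 0.0 and the largest 1.0.
// A constant image maps to all zeros.
std::vector<double> normalize_range(std::vector<double> v)
{
    if (v.empty()) return v;
    const auto mm = std::minmax_element(v.begin(), v.end());
    const double lo = *mm.first;          // copy before modifying v
    const double span = *mm.second - lo;
    for (double& x : v)
        x = span > 0.0 ? (x - lo) / span : 0.0;
    return v;
}

// Subtracts the mean so that overall brightness does not dominate the
// correlation peak (one possible reading of normalizing "whiteness").
std::vector<double> remove_mean(std::vector<double> v)
{
    if (v.empty()) return v;
    const double mean = std::accumulate(v.begin(), v.end(), 0.0) / double(v.size());
    for (double& x : v)
        x -= mean;
    return v;
}
```

Either function would be applied to both the big image and the pattern before filling the a and b arrays for the forward DFT.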
3) Run your pattern image through an edge enhancement filter. I've personally made use of canny, sobel, and I think I messed with a few others. As I recall, canny was "quick'n dirty", sobel was more expensive, but I got comparable results when it came time to do correlation. See chapter 24 of the "dsp guide" book that's freely available online. The whole book is worth your time, but if you're low on time then at a minimum chapter 24 will help a lot.
4) Re-scale the output image between [0, 255]; if you want to implement thresholds, do it after this step because the thresholding step is lossy.
My memory on this one is hazy, but as I recall (edited for clarity):
You can scale the final image pixels (before rescaling) between [-1.0, 1.0] by dividing off the largest power spectrum value from the entire power spectrum
The largest power spectrum value is, conveniently enough, the center-most value in the power spectrum (corresponding to the lowest frequency)
If you divide it off the power spectrum, you'll end up doing twice the work; since FFTs are linear, you can delay the division until after the inverse DFT pass to when you're re-scaling the pixels between [0..255].
If after rescaling most of your values end up so black you can't see them, you can use a solution to the ODE y' = y(1 - y) (one example is the sigmoid f(x) = 1 / (1 + exp(-c*x) ), for some scaling factor c that gives better gradations). This has more to do with improving your ability to interpret the results visually than anything you might use to programmatically find peaks.
Edit: I said [0, 255] above. I suggest you rescale to [128, 255] or some other lower bound that is gray rather than black.
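The rescaling step might look roughly like this. It is a sketch with invented names; the default bounds follow the [128, 255] suggestion, and sigmoid is the optional contrast curve mentioned earlier, to be applied to the values before rescaling if the result is too dark.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Logistic curve with scaling factor c; for c == 1 it solves y' = y(1 - y).
inline double sigmoid(double x, double c)
{
    return 1.0 / (1.0 + std::exp(-c * x));
}

// Linearly maps the values onto [lo_out, hi_out] and rounds to bytes.
// Assumes 0 <= lo_out <= hi_out <= 255.
std::vector<std::uint8_t> rescale_to_bytes(const std::vector<double>& v,
                                           double lo_out = 128.0,
                                           double hi_out = 255.0)
{
    std::vector<std::uint8_t> out(v.size());
    if (v.empty()) return out;
    const auto mm = std::minmax_element(v.begin(), v.end());
    const double lo = *mm.first;
    const double span = *mm.second - lo;
    for (std::size_t i = 0; i < v.size(); ++i) {
        const double t = span > 0.0 ? (v[i] - lo) / span : 0.0; // in [0, 1]
        out[i] = static_cast<std::uint8_t>(std::lround(lo_out + t * (hi_out - lo_out)));
    }
    return out;
}
```

Any thresholding would then operate on the returned bytes, after the rescale, as recommended in point 4.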
I want to keep track of total pixels and rays processed by a long running raytracing process. If I update the shared variables every iteration, the process will slow down noticeably because of synchronization. I'd like to keep track of the progress and still get accurate count results at the end. Is there a way to do this with OpenMP for loops?
Here's some code of the loop in question:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
#pragma omp parallel for schedule(dynamic, 4096)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount); // will increment sharedRayCount
renderTarget.setPixel(x, y, rayToScene.color.clamped());
++sharedPixelCount;
}
}
Since you have a chunk size of 4096 for your dynamically scheduled parallel-for loop, why not use that as the granularity for amortizing the counter updates?
For example, something like the following might work. I didn't test this code and you probably need to add some bookkeeping for totalPixelCount%4096!=0.
Unlike the previous answer, this does not add a branch to your loop, other than the one implied by the loop itself, for which many processors have optimized instructions. It also does not require any extra variables or arithmetic.
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
#pragma omp parallel for schedule(dynamic, 1)
for (int j = 0; j < totalPixelCount; j+=4096)
{
for (int i = j; i < (j+4096); ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
}
sharedPixelCount += 4096;
}
}
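The bookkeeping for a final partial chunk could be handled by clamping the inner loop's end, roughly as below. The raytracer types are not available here, so this sketch takes the per-pixel work as a hypothetical callable, and trace_in_chunks is an invented name. It keeps std::atomic<int> as in the question.

```cpp
#include <algorithm>
#include <atomic>

// Processes pixels [0, totalPixelCount) in chunks of `chunk` pixels and
// updates the shared counter once per chunk instead of once per pixel.
template <class PixelWork>
void trace_in_chunks(int totalPixelCount, int chunk,
                     std::atomic<int>& sharedPixelCount, PixelWork&& work)
{
    #pragma omp parallel for schedule(dynamic, 1)
    for (int j = 0; j < totalPixelCount; j += chunk)
    {
        // Clamp so the last chunk stops at the end of the image.
        const int end = std::min(j + chunk, totalPixelCount);
        for (int i = j; i < end; ++i)
            work(i);
        sharedPixelCount += end - j; // exact even for the partial chunk
    }
}
```

With the clamp in place, the counter ends at exactly totalPixelCount whether or not the image size is a multiple of the chunk size.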
It's not really clear why sharedPixelCount needs to be updated inside of this loop at all, since it is not referenced in the loop body. If this is correct, I suggest the following instead.
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
int reducePixelCount = 0;
#pragma omp parallel for schedule(dynamic, 4096) \
reduction(+:reducePixelCount)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
++reducePixelCount; /* thread-local operation, not atomic */
}
/* The interoperability of C++11 atomics and OpenMP is not defined yet,
* so this should just be avoided until OpenMP 5 at the earliest.
* It is sufficient to reduce over a non-atomic type and
* do the assignment here. */
sharedPixelCount = reducePixelCount;
}
Here's an example on how to do it:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
int rayCount = 0;
int previousRayCount = 0;
#pragma omp parallel for schedule(dynamic, 1000) reduction(+:rayCount) firstprivate(previousRayCount)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, rayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
if ((i + 1) % 100 == 0)
{
sharedPixelCount += 100;
sharedRayCount += (rayCount - previousRayCount);
previousRayCount = rayCount;
}
}
sharedPixelCount = totalPixelCount;
sharedRayCount = rayCount;
}
It won't be 100% accurate while the loop is running, but the error is negligible. At the end exact values will be reported.