Can't get modf() to work in multithread - c++

I have implemented two different functions to round a double to an integer.
Here is the first function
static inline int round_v1(double value)
{
int t;
__asm
{
fld value;
fistp t;
}
return t;
}
Here is the second function
static inline int round_v2(double value)
{
double intpart, fractpart;
fractpart = modf(value, &intpart);
if ((fabs(fractpart) != 0.5) || ((((int)intpart) % 2) != 0))
return (int)(value + (value >= 0 ? 0.5 : -0.5));
else
return (int)intpart;
}
Both functions work well in a single thread, but the second one does not work with multiple threads (using OpenMP). The program just crashes when I use the second one.
Here is the main code where the round_v1 or round_v2 function is called.
void
BilateralFilter_Invoker::doFilter() const
{
if (!src || !dst) return;
int i, j;
int src_width = width + (radius << 1);
omp_set_num_threads(2);
#pragma omp parallel for
for (i = 0; i < height; ++i)
{
unsigned char* pSrc = src + (i+radius)*src_step + radius;
unsigned char* pDst = dst + i*dst_step;
for (j = 0; j < width; ++j)
{
float sum = 0.f, wsum = 0.f;
int val0 = pSrc[j];
for (int k = 0; k < maxk; ++k)
{
int val = pSrc[j + space_offset[k]];
float w = space_weight[k] * color_weight[std::abs(val-val0)];
sum += val * w;
wsum += w;
}
//pDst[j] = (unsigned char)round_v2(sum / wsum);
pDst[j] = (unsigned char)round_v1(sum / wsum);
}
}
}
The variables src, dst, height, width, src_step, dst_step, radius, maxk, space_offset, space_weight, color_weight are member variables of the class BilateralFilter_Invoker.
I called round_v1 and round_v2 in turn to test, and the program crashes only when round_v2 is called. I wonder whether the modf(double, double*) function causes this problem. As a further test, I commented out this line
fractpart = modf(value, &intpart);
and replace it by
fractpart = intpart = value;
I ran the program again and it no longer crashed. I have no idea whether modf(double, double*) causes this problem, or whether something else in my code is wrong rather than the modf(double, double*) function.
Note that the operating system I use is Windows 7 and the compiler is VC10.

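You have made the most common mistake with OpenMP on SO. The counter of your inner loop, j, is declared outside the parallel region, so it is shared between the threads and they race on it. It needs to be made private. You can either do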
#pragma omp parallel for private(j)
or declare the counter in the loop statement itself
for (int j = 0; j < width; ++j)
In fact, since you never use i or j outside of the loops they apply to, there is no reason to declare them C89 style outside of the loops.
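As a minimal sketch of the second option (this is not the asker's filter; fill_rows and its parameters are made up for illustration), declaring both counters in their for statements gives each thread its own copy without any private clause:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the filter loop: writes row * width + col
// into every cell of a row-major buffer.
std::vector<int> fill_rows(int height, int width)
{
    std::vector<int> out(static_cast<std::size_t>(height) * width);
    #pragma omp parallel for
    for (int i = 0; i < height; ++i)      // i is private as the parallel loop counter
    {
        for (int j = 0; j < width; ++j)   // j is declared inside the region, so it is private too
        {
            out[static_cast<std::size_t>(i) * width + j] = i * width + j;
        }
    }
    return out;
}
```

Each thread writes only to its own rows, so no further synchronisation is needed for this particular pattern.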

Related

C++ omp no significant improvement

I am on MSVC 2019 with the default compiler. The code I am working on is a Mandelbrot image. Relevant bits of my code looks like:
#pragma omp parallel for
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
All of the variables outside of the loop are constexpr, eliminating any dependencies. The mandel function does about 1000 iterations with each call. I would expect the outer loop to run on several threads but my msvc records each run at about 5-6 seconds with or without the omp directive.
Edit (The mandel function):
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = (z_x * z_x) - (z_y * z_y) + x;
z_y = 2 * temp * z_y + y;
if ((z_x * z_x + z_y * z_y) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Your mandel function has a vastly differing runtime cost depending on how early the if condition within the loop is met. As a result, each iteration of your outer loop takes a different amount of time. With no schedule clause, OpenMP implementations typically use static scheduling (i.e. break the loop into N equal partitions up front). That is a poor fit here, because your workload is uneven across rows. See what happens when you use dynamic scheduling.
#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < HEIGHT; y++)
{
for (int x = 0; x < WIDTH; x++)
{
unsigned int retVal = mandel(x_val + x_incr * x, y_val + y_incr * y);
mtest.setPixels(x, y,
static_cast<unsigned char>(retVal / 6),
static_cast<unsigned char>(retVal / 5),
static_cast<unsigned char>(retVal / 4));
}
}
Also time to rule out the really dumb stuff.....
Have you included omp.h at least once in your program?
Have you enabled omp in the project settings (the /openmp compiler option)?
IIRC, if you haven't done those two things, omp will be disabled under MSVC.
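One way to check this from inside the program, as a sketch: the standard _OPENMP macro is only defined when the compiler has OpenMP enabled, and omp_get_max_threads() reports how many threads a parallel region may use. The function name available_threads is hypothetical.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads an OpenMP parallel region may use,
// or 1 when the code was compiled without OpenMP support.
int available_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

If this returns 1 on a multi-core machine, the pragma is probably being ignored, or the thread count has been limited elsewhere (for example through the OMP_NUM_THREADS environment variable).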
This is not an answer, but please do this:
unsigned int mandel(long double x, long double y)
{
long double z_x = 0;
long double z_y = 0;
long double z_x_squared = 0;
long double z_y_squared = 0;
for (int i = 0; i < ITER; i++)
{
long double temp = z_x;
z_x = z_x_squared - z_y_squared + x;
z_y = 2 * temp * z_y + y;
z_x_squared = z_x * z_x;
z_y_squared = z_y * z_y;
if ((z_x_squared + z_y_squared) > 4)
return i;
}
return ITER; //ITER is a #define macro
}
Also, try inverting the order of your two for loops.

Access violation when reading 2d array C++

My code seems to have a bug somewhere but I just can't catch it. I'm passing a 2d array to three sequential functions. First function populates it, second function modifies the values to 1's and 0's, the third function counts the 1's and 0's. I can access the array easily inside the first two functions, but I get an access violation at the first iteration of the third one.
Main
text_image_data = new int*[img_height];
for (i = 0; i < img_height; i++) {
text_image_data[i] = new int[img_width];
}
cav_length = new int[numb_of_files];
// Start processing - load each image and find max cavity length
for (proc = 0; proc < numb_of_files; proc++)
{
readImage(filles[proc], text_image_data, img_height, img_width);
threshold = makeBinary(text_image_data, img_height, img_width);
cav_length[proc] = measureCavity(bullet[0], img_width, bullet[1], img_height, text_image_data);
}
Functions
int makeBinary(int** img, int height, int width)
{
int threshold = 0;
unsigned long int sum = 0;
for (int k = 0; k < width; k++)
{
sum = sum + img[1][k] + img[2][k] + img[3][k] + img[4][k] + img[5][k];
}
threshold = sum / (width * 5);
for (int i = 0; i < height; i++)
{
for (int j = 0; j < width; j++)
{
img[i][j] = img[i][j] > threshold ? 1 : 0;
}
}
return threshold;
}
// Count pixels - find length of cavity here
int measureCavity(int &x, int& width, int &y, int &height, int **img)
{
double mean = 1.;
int maxcount = 0;
int pxcount = 0;
int i = x - 1;
int j;
int pxsum = 0;
for (j = 0; j < height - 2; j++)
{
while (mean > 0.0)
{
for (int ii = i; ii > i - 4; ii--)
{
pxsum = pxsum + img[ii][j] + img[ii][j + 1];
}
mean = pxsum / 4.;
pxcount += 2;
i += 2;
pxsum = 0;
}
maxcount = std::max(maxcount, pxcount);
pxcount = 0;
j++;
}
return maxcount;
}
I keep getting an access violation in the measureCavity() function. I'm passing and accessing the array text_image_data the same way as in makeBinary() and readImage(), and it works just fine for those functions. The size is [550][70], I'm getting the error when trying to access [327][0].
Is there a better, more reliable way to pass this array between the functions?
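One possible direction for the last question, as a sketch only: keep the pixels in a single std::vector and go through a checked accessor, so that a bad index is reported at the point of access. Image and its members are hypothetical names, not part of the code above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical row-major image that replaces int** and checks every access.
class Image
{
public:
    Image(int height, int width)
        : height_(height), width_(width),
          data_(static_cast<std::size_t>(height) * width, 0) {}

    int height() const { return height_; }
    int width() const { return width_; }

    // Throws std::out_of_range instead of reading past the allocation.
    int& at(int row, int col)
    {
        if (row < 0 || row >= height_ || col < 0 || col >= width_)
            throw std::out_of_range("Image::at: index outside image");
        return data_[static_cast<std::size_t>(row) * width_ + col];
    }

private:
    int height_;
    int width_;
    std::vector<int> data_;
};
```

Passing an Image by reference also carries the dimensions along with the data. Note that the while loop in measureCavity() advances i with no upper bound check, so a checked accessor like this may be a quick way to see which index actually goes out of range.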

Equivalent of curand for OpenCL

I am looking at switching from NVIDIA to AMD for my compute card because I want double precision support. Before doing this I decided to learn OpenCL on my NVIDIA card to see if I like it. I want to convert the following code from CUDA to OpenCL. I am using the curand library to generate uniformly and normally distributed random numbers. Each thread needs to be able to create a different sequence of random numbers and generate a few million per thread. Here is the code. How would I go about this in OpenCL? Everything I have read online seems to imply that I should generate a buffer of random numbers and then use that on the GPU, but this is not practical for me.
template<int NArgs, typename OptimizationFunctor>
__global__
void statistical_solver_kernel(float* args_lbounds,
float* args_ubounds,
int trials,
int initial_temp,
unsigned long long seed,
float* results,
OptimizationFunctor f)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx >= trials)
return;
curandState rand;
curand_init(seed, idx, 0, &rand);
float x[NArgs];
for(int i = 0; i < NArgs; i++)
{
x[i] = curand_uniform(&rand) * (args_ubounds[i]- args_lbounds[i]) + args_lbounds[i];
}
float y = f(x);
for(int t = initial_temp - 1; t > 0; t--)
{
float t_percent = (float)t / initial_temp;
float x_prime[NArgs];
for(int i = 0; i < NArgs; i++)
{
x_prime[i] = curand_normal(&rand) * (args_ubounds[i] - args_lbounds[i]) * t_percent + x[i];
x_prime[i] = fmaxf(args_lbounds[i], x_prime[i]);
x_prime[i] = fminf(args_ubounds[i], x_prime[i]);
}
float y_prime = f(x_prime);
if(y_prime < y || (y_prime - y) / y_prime < t_percent)
{
y = y_prime;
for(int i = 0; i < NArgs; i++)
{
x[i] = x_prime[i];
}
}
}
float* rptr = results + idx * (NArgs + 1);
rptr[0] = y;
for(int i = 1; i <= NArgs; i++)
rptr[i] = x[i - 1];
}
The VexCL library provides an implementation of counter-based generators. You can use those inside larger expressions, see this slide for an example.
EDIT: Take this with a grain of salt, as I am the author of VexCL :).
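To illustrate what "counter-based" means, here is a minimal host-side sketch. It is not VexCL's API and not the generator cuRAND uses: the mixer is the SplitMix64 output function, chosen only because it is short, and the names mix64, counter_rng and counter_uniform are made up. The point is that a draw depends only on (seed, stream, counter), so each work-item can use its global id as the stream and needs no stored state or pre-filled buffer.

```cpp
#include <cstdint>

// SplitMix64 output function, used here only as a simple 64-bit mixer.
// Illustrative stand-in; a vetted generator such as Philox or Threefry
// would be the safer choice for real simulations.
inline std::uint64_t mix64(std::uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Stateless draw: the value depends only on (seed, stream, counter).
inline std::uint64_t counter_rng(std::uint64_t seed, std::uint64_t stream,
                                 std::uint64_t counter)
{
    const std::uint64_t key = mix64(seed ^ mix64(stream));
    return mix64(key + counter);
}

// Maps the top 24 bits of a draw to a float in [0, 1).
inline float counter_uniform(std::uint64_t seed, std::uint64_t stream,
                             std::uint64_t counter)
{
    return static_cast<float>(counter_rng(seed, stream, counter) >> 40)
           * (1.0f / 16777216.0f);
}
```

The same integer arithmetic can be written in OpenCL C with ulong in place of std::uint64_t. Normally distributed values would still need a transform (for example Box-Muller) on top of the uniform draws.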

Increment shared loop counter in OpenMP for progress reporting

I want to keep track of total pixels and rays processed by a long running raytracing process. If I update the shared variables every iteration, the process will slow down noticeably because of synchronization. I'd like to keep track of the progress and still get accurate count results at the end. Is there a way to do this with OpenMP for loops?
Here's some code of the loop in question:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
#pragma omp parallel for schedule(dynamic, 4096)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount); // will increment sharedRayCount
renderTarget.setPixel(x, y, rayToScene.color.clamped());
++sharedPixelCount;
}
}
Since you have a chunk size of 4096 for your dynamically scheduled parallel-for loop, why not use that as the granularity for amortizing the counter updates?
For example, something like the following might work. I didn't test this code and you probably need to add some bookkeeping for totalPixelCount%4096!=0.
Unlike the previous answer, this does not add a branch to your loop, other than the one implied by the loop itself, for which many processors have optimized instructions. It also does not require any extra variables or arithmetic.
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
#pragma omp parallel for schedule(dynamic, 1)
for (int j = 0; j < totalPixelCount; j+=4096)
{
for (int i = j; i < (j+4096); ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
}
sharedPixelCount += 4096;
}
}
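The bookkeeping for a total that is not a multiple of the chunk size could look like the sketch below. process_item and process_chunked are hypothetical stand-ins for the per-pixel work and the trace function, and the sketch keeps the std::atomic counter from the question, whose use inside OpenMP regions another answer in this thread cautions about.

```cpp
#include <algorithm>
#include <atomic>

// Hypothetical stand-in for the per-pixel work (shootRay/setPixel in the question).
inline void process_item(int /*index*/) {}

// Processes [0, total) in chunks and bumps the shared counter once per chunk.
void process_chunked(int total, int chunk, std::atomic<int>& progress)
{
    const int chunkCount = (total + chunk - 1) / chunk;   // round up
    #pragma omp parallel for schedule(dynamic, 1)
    for (int c = 0; c < chunkCount; ++c)
    {
        const int begin = c * chunk;
        const int end = std::min(begin + chunk, total);   // last chunk may be short
        for (int i = begin; i < end; ++i)
            process_item(i);
        progress += end - begin;                          // one atomic update per chunk
    }
}
```

Because the last chunk is clamped to total, the counter ends at exactly total even when total % chunk != 0.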
It's not really clear why sharedPixelCount needs to be updated inside of this loop at all, since it is not referenced in the loop body. If this is correct, I suggest the following instead.
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
int reducePixelCount = 0;
#pragma omp parallel for schedule(dynamic, 4096) \
reduction(+:reducePixelCount)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, sharedRayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
++reducePixelCount; /* thread-local operation, not atomic */
}
/* The interoperability of C++11 atomics and OpenMP is not defined yet,
* so this should just be avoided until OpenMP 5 at the earliest.
* It is sufficient to reduce over a non-atomic type and
* do the assignment here. */
sharedPixelCount = reducePixelCount;
}
Here's an example of how to do it:
void Raytracer::trace(RenderTarget& renderTarget, const Scene& scene, std::atomic<int>& sharedPixelCount, std::atomic<int>& sharedRayCount)
{
int width = renderTarget.getWidth();
int height = renderTarget.getHeight();
int totalPixelCount = width * height;
int rayCount = 0;
int previousRayCount = 0;
#pragma omp parallel for schedule(dynamic, 1000) reduction(+:rayCount) firstprivate(previousRayCount)
for (int i = 0; i < totalPixelCount; ++i)
{
int x = i % width;
int y = i / width;
Ray rayToScene = scene.camera.getRay(x, y);
shootRay(rayToScene, scene, rayCount);
renderTarget.setPixel(x, y, rayToScene.color.clamped());
if ((i + 1) % 100 == 0)
{
sharedPixelCount += 100;
sharedRayCount += (rayCount - previousRayCount);
previousRayCount = rayCount;
}
}
sharedPixelCount = totalPixelCount;
sharedRayCount = rayCount;
}
It won't be 100% accurate while the loop is running, but the error is negligible. At the end exact values will be reported.

C++: Time for filling an array is too long

We are writing a method (myFunc) that writes some data to the array. The array must be a field of the class (MyClass).
Example:
class MyClass {
public:
MyClass(int dimension);
~MyClass();
void myFunc();
protected:
float* _nodes;
};
MyClass::MyClass(int dimension){
_nodes = new float[dimension];
}
void MyClass::myFunc(){
for (int i = 0; i < _dimension; ++i)
_nodes[i] = (i % 2 == 0) ? 0 : 1;
}
The method myFunc is called about 10000 times and it takes about 9-10 seconds (together with the other methods).
But if we define myFunc as:
void MyClass::myFunc(){
float* test = new float[_dimension];
for (int i = 0; i < _dimension; ++i)
test[i] = (i % 2 == 0) ? 0 : 1;
}
our program works much faster: it takes about 2-3 seconds (if it is called about 10000 times).
Thanks in advance!
This may help (in either case)
for (int i = 0; i < _dimension; )
{
test[i++] = 0.0f;
test[i++] = 1.0f;
}
I'm assuming _dimension is even, but easy to fix if it is not.
If you want to speed up Debug mode, you can help the compiler by copying the members into locals first; try
void MyClass::myFunc(){
float* const nodes = _nodes;
const int dimension = _dimension;
for (int i = 0; i < dimension; ++i)
nodes[i] = (i % 2 == 0) ? 0.0f : 1.0f;
}
Of course, in reality you should focus on using Release-mode for everything performance-related.
In your example code, you do not initialise _dimension in the constructor, but use it in myFunc. So you might be filling millions of entries in the array even though you have only allocated a few thousand entries. In the example that works, you use the same dimension for creating and filling the array, so you are probably initialising it correctly in that case.
Just make sure that _dimension is properly initialised.
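A sketch of what that could look like, assuming _dimension is meant to be a member holding the constructor argument. Swapping new[] for std::vector is my own substitution here, so that the size and the storage cannot drift apart; the nodes() accessor is added only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the class with the dimension stored at construction time.
class MyClass
{
public:
    explicit MyClass(int dimension)
        : _dimension(dimension),
          _nodes(static_cast<std::size_t>(dimension)) {}

    void myFunc()
    {
        for (int i = 0; i < _dimension; ++i)
            _nodes[i] = (i % 2 == 0) ? 0.0f : 1.0f;
    }

    // Hypothetical accessor, not in the original class.
    const std::vector<float>& nodes() const { return _nodes; }

protected:
    int _dimension;              // set once, in the initializer list
    std::vector<float> _nodes;   // std::vector assumed here in place of new[]
};
```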
This is faster on most machines.
void MyClass::myFunc(){
float* const nodes = _nodes;
const int dimension = _dimension;
if(dimension < 2){
if(dimension < 1)
return;
nodes[0] = 0.0f;
return;
}
nodes[0] = 0.0f;
nodes[1] = 1.0f;
for (int i = 2; ; i <<= 1){
if( (i << 1) < dimension ){
memcpy(nodes + i, nodes, i * sizeof(float));
}else{
memcpy(nodes + i, nodes, (dimension - i) * sizeof(float));
break;
}
}
}
Try this:
memset(test, 0, sizeof(float) * _dimension);
for (int i = 1; i < _dimension; i += 2)
{
test[i] = 1.0f;
}
You can also run this piece once and store the array at static location.
For each consecutive iteration you can address the stored data without any computation.
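A rough sketch of that idea, with hypothetical names (cached_pattern, fill_from_cache): the 0,1,0,1,... pattern is built once in a function-local static and later calls only copy from it. As written it is not thread-safe, since the cache may grow on a call.

```cpp
#include <algorithm>
#include <vector>

// Returns a 0,1,0,1,... pattern holding at least `dimension` floats.
// The pattern is built on first use and extended only when a larger size is requested.
const std::vector<float>& cached_pattern(int dimension)
{
    static std::vector<float> pattern;
    for (int i = static_cast<int>(pattern.size()); i < dimension; ++i)
        pattern.push_back((i % 2 == 0) ? 0.0f : 1.0f);
    return pattern;
}

// Copies the first `dimension` cached values into dst.
void fill_from_cache(float* dst, int dimension)
{
    const std::vector<float>& p = cached_pattern(dimension);
    std::copy(p.begin(), p.begin() + dimension, dst);
}
```

Whether the copy beats recomputing the pattern depends on the array size and the compiler, so it is worth measuring in Release mode.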