CUDA Math API: unable to use atan2 function [duplicate] - c++

I'm new to CUDA and cannot understand what I'm doing wrong.
I'm trying to calculate the distance between objects in order to find the neighbors of each object. Each object has an id, an x coordinate and a y coordinate, each stored in its own array.
__global__
void dist(int *id_d, int *x_d, int *y_d,
int *dist_dev, int dimBlock, int i)
{
int idx = threadIdx.x + blockIdx.x*blockDim.x;
while(idx < dimBlock){
int i;
for(i= 0; i< dimBlock; i++){
if (idx == i)continue;
dist_dev[idx] = pow(x_d[idx] - x_d[i], 2) + pow(y_d[idx] - y_d[i], 2); // error here
}
}
}
Is pow not defined in kernel code?

Your problem is that while pow is defined in the CUDA math API, it has no overload that takes integer arguments, i.e. there is no version like this:
__device__ int pow ( int x, int y )
This is why you are getting an error. You will need to explicitly cast the base argument to a floating point type like this:
dist_dev[idx] = pow((double)(x_d[idx] - x_d[i]), 2.0) +
pow((double)(y_d[idx] - y_d[i]), 2.0);
Having said that, using double-precision floating-point exponentiation to square an integer, as in your example, is poor from an efficiency point of view. It would be preferable to perform the calculation using integer multiplication instead:
int dx = x_d[idx] - x_d[i];
int dy = y_d[idx] - y_d[i];
dist_dev[idx] = (dx * dx) + (dy * dy);

Related

C++ boost library to generate negative binomial random variables

I'm new to C++ and I'm using the boost library to generate random variables. I want to generate random variables from a negative binomial distribution.
The first parameter of boost::random::negative_binomial_distribution<int> freq_nb(r, p); has to be an integer. I want to extend that to a real value, so I would like to use a Poisson-gamma mixture, but I can't get it to work.
Here's an excerpt from my code:
int nr_sim = 1000000;
double mean = 2.0;
double variance = 15.0;
double r = mean * mean / (variance - mean);
double p = mean / variance;
double beta = (1 - p) / p;
typedef boost::mt19937 RNGType;
RNGType rng(5);
boost::random::gamma_distribution<double> my_gamma(r, beta);
boost::random::poisson_distribution<int> my_poi(my_gamma(rng));
int simulated_mean = 0;
for (int i = 0; i < nr_sim; i++) {
simulated_mean += my_poi(rng);
}
double my_result = (double)simulated_mean / (double)nr_sim;
With my_result == 0.5 there is definitely something wrong, since the mean should be close to 2.0. Is it my_poi(my_gamma(rng))? If so, what is the correct way to solve that problem?
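One likely cause is that my_poi(my_gamma(rng)) evaluates my_gamma(rng) only once, when the Poisson distribution is constructed, so every sample shares the same rate. A minimal sketch of the mixture with a fresh gamma draw per sample is shown below. It uses the standard <random> header instead of Boost as an assumption, and simulate_nb_mean is a hypothetical helper name, not part of either library.

```cpp
#include <random>

// Hypothetical helper: average of n draws from a Poisson-gamma mixture
// parameterised by the target mean and variance (variance > mean).
double simulate_nb_mean(double mean, double variance, int n, unsigned seed) {
    double r = mean * mean / (variance - mean);  // gamma shape
    double p = mean / variance;
    double beta = (1.0 - p) / p;                 // gamma scale
    std::mt19937 rng(seed);
    std::gamma_distribution<double> gamma(r, beta);
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        // Draw a new Poisson rate for every sample.
        double lambda = gamma(rng);
        if (lambda > 0.0) {  // a rate of zero always yields a count of zero
            std::poisson_distribution<int> poisson(lambda);
            total += poisson(rng);
        }
    }
    return static_cast<double>(total) / n;
}
```

Calling simulate_nb_mean(2.0, 15.0, 1000000, 5) should give a value near the requested mean of 2.0, up to sampling noise.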

weird inaccuracy in line rotation - c++

I have programmed a simple dragon curve fractal. It seems to work for the most part, but there is an odd logical error that shifts the rotation of certain lines by one pixel. This wouldn't normally be an issue, but after a few generations, at the right size, the fractal begins to look wonky.
I am using OpenCV in C++ to generate it, but I'm pretty sure it's a logical error rather than a display error. I have printed the values to the console multiple times and seen for myself that there is a difference of one between values that are intended to be exactly the same, meaning a line may have a y of 200 at one end and 201 at the other.
Here is the full code:
#include<iostream>
#include<cmath>
#include<opencv2/opencv.hpp>
const int width=500;
const int height=500;
const double PI=std::atan(1)*4.0;
struct point{
double x;
double y;
point(double x_,double y_){
x=x_;
y=y_;
}};
cv::Mat img(width,height,CV_8UC3,cv::Scalar(255,255,255));
double deg_to_rad(double degrees){return degrees*PI/180;}
point rotate(int degree, int centx, int centy, int ll) {
double radians = deg_to_rad(degree);
return point(centx + (ll * std::cos(radians)), centy + (ll * std::sin(radians)));
}
void generate(point & r, std::vector < point > & verticies, int rotation = 90) {
int curRotation = 90;
bool start = true;
point center = r;
point rot(0, 0);
std::vector<point> verticiesc(verticies);
for (point i: verticiesc) {
double dx = center.x - i.x;
double dy = center.y - i.y;
//distance from centre
int ll = std::sqrt(dx * dx + dy * dy);
//angle from centre
curRotation = std::atan2(dy, dx) * 180 / PI;
//add 90 degrees of rotation
rot = rotate(curRotation + rotation, center.x, center.y, ll);
verticies.push_back(rot);
//endpoint, where the next centre will be
if (start) {
r = rot;
start = false;
}
}
}
void gen(int gens, int bwidth = 1) {
int ll = 7;
std::vector < point > verticies = {
point(width / 2, height / 2 - ll),
point(width / 2, height / 2)
};
point rot(width / 2, height / 2);
for (int i = 0; i < gens; i++) {
generate(rot, verticies);
}
//draw lines
for (int i = 0; i < verticies.size(); i += 2) {
cv::line(img, cv::Point(verticies[i].x, verticies[i].y), cv::Point(verticies[i + 1].x, verticies[i + 1].y), cv::Scalar(0, 0, 0), 1, 8);
}
}
int main() {
gen(10);
cv::imshow("", img);
cv::waitKey(0);
return 0;
}
First, you use int to store point coordinates - that's a bad idea - you lose all accuracy of point position. Use double or float.
Second, your method for drawing fractals is not very stable numerically. You'd be better off storing the original shape and all the rotations/translations/scales that indicate where and how to draw scaled copies of the original shape.
Also, I believe this is a bug:
for(point i: vertices)
{
...
vertices.push_back(rot);
...
}
Changing size of vertices while inside such a for-loop might cause a crash or UB.
Turns out it was to do with floating-point precision. I changed
x=x_;
y=y_;
to
x=std::round(x_);
y=std::round(y_);
and it works.
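Because every step of a dragon curve is a quarter turn, the rotation can also be done without cos, sin or atan2, which avoids the rounding issue altogether. The following is a minimal sketch under that assumption; IPoint and rotate90 are hypothetical names, and the turn direction may need to be flipped to match the original code.

```cpp
// Hypothetical integer point type for pixel coordinates.
struct IPoint {
    int x;
    int y;
};

// Rotate p by a quarter turn about c using only integer arithmetic:
// the offset (dx, dy) from the centre becomes (-dy, dx).
inline IPoint rotate90(IPoint p, IPoint c) {
    int dx = p.x - c.x;
    int dy = p.y - c.y;
    return IPoint{c.x - dy, c.y + dx};
}
```

Since no floating-point values are involved, four successive quarter turns return exactly the starting point.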

No match for 'operator* in '(1.0e + 0 - ((double)u)) * bezPoints[i][(j + 1)]'

I am getting this error while trying to implement the Bezier curve pseudocode in C++ with Qt. The method implementation is below.
void GLWidget::drawBezierCurve() {
int N_PTS = vertices.size();
Point bezPoints[N_PTS][N_PTS];
for (float u = 0.0; u <= 1.0; u += 0.01){
for(int diag = N_PTS/2; diag >= 0;diag--){
for(int i = 0; i <= diag; i++){
int j = diag - i;
bezPoints[i][j] = (1.0 - u) * bezPoints[i][j+1] + u * bezPoints[i+1][j];
}
}
theImage.setPixel(bezPoints[0][0], bezPoints[0][0], RGBValue(100,12,140), 255);
}
}
This looks like it is because you are multiplying a float by a Point object. You will most likely need to define your own multiplication method for this operation, or overload the * operator to perform it, depending on which fields in the Point object you intend to multiply the floating-point number by.
Something like:
float operator* (const float num, const Point& point) {
return num * point.floating_point_field;
}
Where the floating_point_field is the member of the class that you want to multiply and it should also be of the same type as float, otherwise you'll have to start doing something more involved to define the multiplication.
Alternatively, if the multiplication is as simple as in the example above you could just use a getter in the code such as:
u * bezPoints[i][j+1].get_floating_point_value()
Hope that helps,
Matt
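For the Bezier expression in the question, the result of the multiplication has to be a Point again so that the two products can be added. A minimal sketch, assuming a hypothetical Point with two float members x and y (the real class in the question may look different), could be:

```cpp
// Hypothetical 2D point; the fields of the actual Point class are an assumption.
struct Point {
    float x;
    float y;
};

// Scale both coordinates by a scalar.
inline Point operator*(float s, const Point& p) {
    return Point{s * p.x, s * p.y};
}

// Add two points component-wise.
inline Point operator+(const Point& a, const Point& b) {
    return Point{a.x + b.x, a.y + b.y};
}

// One de Casteljau step: (1 - u) * a + u * b.
inline Point blend(const Point& a, const Point& b, float u) {
    return (1.0f - u) * a + u * b;
}
```

With these two operators in scope, an expression of the form (1.0 - u) * bezPoints[i][j+1] + u * bezPoints[i+1][j] has a matching operator* and operator+.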

Need help optimizing code (minimum image convention)

I have written some simulation code and am using the "randomly break in GDB" method of debugging. I am finding that 99.9% of my program's time is spent in this routine (it's the minimum image convention):
inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) {
double xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
}
The optimizations I have performed so far (maybe not very significant ones):
Return the square of the distance instead of the square root
Inline it
Const what I can
No standard library bloat
Compiling with every g++ optimization flag I can think of
I am running out of things I can do with this. Maybe I could use floats instead of doubles but I would prefer that be a last resort. And maybe I could somehow use SIMD on this, but I've never done that so I imagine that's a lot of work. Any ideas?
Thanks
First, you're not using the right algorithm. What if the two points are greater than boxWidth apart? Second, if you have multiple particles, calling a single function that does all of the distance calculations and places the results in an output buffer is going to be significantly more efficient. Inlining helps reduce some of this, but not all. Any of the precalculation -- like dividing the box length by 2 in your algorithm -- is going to be repeated when it doesn't need to be.
Here is some SIMD code to do the calculation. You need to compile with -msse4. Using -O3, on my machine (MacBook Pro, llvm-gcc-4.2), I get a speedup of about 2x. This does require using 32-bit floats instead of double-precision arithmetic.
SSE really isn't that complicated, it just looks terrible. e.g. instead of writing a*b, you have to write the clunky _mm_mul_ps(a,b).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>
// you can compile this code with -DDOUBLE to try using doubles vs. floats
// in the unoptimized code. The SSE code uses only floats.
#ifdef DOUBLE
typedef double real;
#else
typedef float real;
#endif
static inline __m128 loadFloat3(const float* value) {
// Load (x,y,z) into a SSE register, leaving the last entry
// set to zero.
__m128 x = _mm_load_ss(&value[0]);
__m128 y = _mm_load_ss(&value[1]);
__m128 z = _mm_load_ss(&value[2]);
__m128 xy = _mm_movelh_ps(x, y);
return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0));
}
int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth,
float* out, const int n_points) {
int i;
__m128 r1, r2, r12, s12, r12_2, s, box, invBox;
box = _mm_set1_ps(boxWidth);
invBox = _mm_div_ps(_mm_set1_ps(1.0f), box);
for (i = 0; i < n_points; i++) {
r1 = loadFloat3(position1);
r2 = loadFloat3(position2);
r12 = _mm_sub_ps(r1, r2);
s12 = _mm_mul_ps(r12, invBox);
s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT));
r12 = _mm_mul_ps(box, s12);
r12_2 = _mm_mul_ps(r12, r12);
// double horizontal add instruction accumulates the sum of
// all four elements into each of the elements
// (e.g. s.x = s.y = s.z = s.w = r12_2.x + r12_2.y + r12_2.z + r12_2.w)
s = _mm_hadd_ps(r12_2, r12_2);
s = _mm_hadd_ps(s, s);
_mm_store_ss(out++, s);
position1 += 3;
position2 += 3;
}
return 1;
}
inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) {
real xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
}
int main(void) {
real* position1;
real* position2;
real* output;
int n_runs = 10000000;
posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real));
posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real));
posix_memalign((void**) &output, 16, n_runs*sizeof(real));
real boxWidth = 1.8;
real result = 0;
int i;
clock_t t;
#ifdef OPT
printf("Timing optimized SSE implementation\n");
#else
printf("Timing original implementation\n");
#endif
#ifdef DOUBLE
printf("Using double precision\n");
#else
printf("Using single precision\n");
#endif
t = clock();
#ifdef OPT
fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs);
#else
for (i = 0; i < n_runs; i++) {
*output = distanceSqPeriodic(position1, position2, boxWidth);
position1 += 3;
position2 += 3;
output++;
}
#endif
t = clock() - t;
printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC);
}
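The rounding step in the SSE loop above also works in plain scalar code, and it handles separations larger than boxWidth, which the if/else version does not. A minimal double-precision sketch follows; the function name distanceSqPeriodicRound is hypothetical, and whether it beats the branching version should be measured rather than assumed.

```cpp
#include <cmath>

// Squared distance under the minimum image convention for a cubic box.
// Each component is shifted by a whole number of box widths so that it
// ends up within half a box width of zero.
inline double distanceSqPeriodicRound(const double* position1,
                                      const double* position2,
                                      double boxWidth) {
    double sum = 0.0;
    for (int k = 0; k < 3; ++k) {
        double d = position2[k] - position1[k];
        d -= boxWidth * std::round(d / boxWidth);
        sum += d * d;
    }
    return sum;
}
```

For example, with boxWidth = 1.8, x coordinates 0.1 and 1.7 give a wrapped separation of -0.2, and so do 0.1 and 3.5.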
You may want to use fabs (standardized in ISO C90), since this should reduce to a single non-branching instruction.
Return the square of the distance instead of the square root
That's a good idea as long as you are comparing squares to squares.
Inline it
This is sometimes a counter-optimization: Inlined code takes up space in the execution pipeline/cache, whether it is branched to or not.
Often it makes no difference because the compiler has the final word on whether to inline or not.
Const what I can
Normally no difference at all.
No standard library bloat
What bloat?
Compiling with every g++ optimization flag I can think of
That's good: Leave most optimizations to the compiler. Only if you measured your real bottleneck, and determined if that bottleneck is significant, invest money on hand optimizing.
What you could try is to make your code branch-free. Without using bitmasks, this may look like this:
//if (z > zhw)
// z -= boxWidths[2];
//else if (z < -zhw)
// z += boxWidths[2];
const double z_a[] = {
z,
z - boxWidths[2]
};
z = z_a[z>zhw];
...
or
z -= (z>zhw) * boxWidths[2];
However, there is no guarantee that this is faster. Your compiler may now have a harder time identifying SIMD opportunities in your code, or the branch target buffer may already do a good job because most of the time the same code paths are taken through your function.
You need to get rid of the comparisons, as those are hard to predict.
The function to be implemented is:
/ / /\ /\
/ / / \/ \
----0----- or ------------ , as (-x)^2 == x^2
/ /
/ /
The latter is a result of two abs statements:
x = abs(abs(diff)-half)-half;
The code
double tst(double a[4], double b[4], double half)
{
double sum=0.0,t;
int i;
for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;}
return sum;
}
beats the original implementation by a factor of four (+some) -- and at this point there's not even full parallelism: only the lower half of xmm registers are used.
With parallel processing of x && y, there's a theoretical gain of about 50% to be achieved. Using floats instead of doubles could in theory make it still about 3x faster.

Optimization method for finding floating status of an object

The problem to solve is finding the floating status of a floating body, given its weight and center of gravity.
The function I use calculates the displaced volume and center of buoyancy of the body, given sinkage, heel and trim.
Sinkage is a length, and heel and trim are angles limited to values from -90 to 90.
The floating status is found when the displaced volume is equal to the weight and the center of gravity is in a vertical line with the center of buoyancy.
I have this implemented as a non-linear Newton-Raphson root-finding problem with 3 variables (sinkage, trim, heel) and 3 equations.
This method works, but needs good initial guesses. So I am hoping to find either a better approach for this, or a good method to find the initial values.
Below is the code for the newton and jacobian algorithms used for the Newton-Raphson iteration. The function volume takes the parameters sinkage, heel and trim, and returns the volume and the coordinates of the center of buoyancy.
I have also included the absmax and GSolve2 algorithms; I believe these are taken from Numerical Recipes.
void jacobian(float x[], float weight, float vcg, float tcg, float lcg, float jac[][3], float f0[]) {
float h = 0.0001f;
float temp;
float j_volume, j_vcb, j_lcb, j_tcb;
float f1[3];
volume(x[0], x[1], x[2], j_volume, j_lcb, j_vcb, j_tcb);
f0[0] = j_volume-weight;
f0[1] = j_tcb-tcg;
f0[2] = j_lcb-lcg;
for (int i=0;i<3;i++) {
temp = x[i];
x[i] = temp + h;
volume(x[0], x[1], x[2], j_volume, j_lcb, j_vcb, j_tcb);
f1[0] = j_volume-weight;
f1[1] = j_tcb-tcg;
f1[2] = j_lcb-lcg;
x[i] = temp;
jac[0][i] = (f1[0]-f0[0])/h;
jac[1][i] = (f1[1]-f0[1])/h;
jac[2][i] = (f1[2]-f0[2])/h;
}
}
void newton(float weight, float vcg, float tcg, float lcg, float &sinkage, float &heel, float &trim) {
float x[3] = {10,1,1};
float accuracy = 0.000001f;
int ntryes = 30;
int i = 0;
float jac[3][3];
float max;
float f0[3];
float gauss_f0[3];
while (i < ntryes) {
jacobian(x, weight, vcg, tcg, lcg, jac, f0);
if (sqrt((f0[0]*f0[0]+f0[1]*f0[1]+f0[2]*f0[2])/2) < accuracy) {
break;
}
gauss_f0[0] = -f0[0];
gauss_f0[1] = -f0[1];
gauss_f0[2] = -f0[2];
GSolve2(jac, 3, gauss_f0);
x[0] = x[0]+gauss_f0[0];
x[1] = x[1]+gauss_f0[1];
x[2] = x[2]+gauss_f0[2];
// absmax(x) - Return absolute max value from an array
max = absmax(x);
if (max < 1) max = 1;
if (sqrt((gauss_f0[0]*gauss_f0[0]+gauss_f0[1]*gauss_f0[1]+gauss_f0[2]*gauss_f0[2])) < accuracy*max) {
// the step is small enough; x already holds the updated estimate
break;
}
i++;
}
sinkage = x[0];
heel = x[1];
trim = x[2];
}
int GSolve2(float a[][3],int n,float b[]) {
float x,sum,max,temp;
int i,j,k,p,m,pos;
int nn = n-1;
for (k=0;k<=n-1;k++)
{
/* pivot*/
max=fabs(a[k][k]);
pos=k;
for (p=k;p<n;p++){
if (max < fabs(a[p][k])){
max=fabs(a[p][k]);
pos=p;
}
}
if (fabs(a[pos][k]) < EPS) {
writeLog("Matrix is singular");
break;
}
if (pos != k) {
for(m=k;m<n;m++){
temp=a[pos][m];
a[pos][m]=a[k][m];
a[k][m]=temp;
}
/* swap the right-hand side along with the rows */
temp=b[pos];
b[pos]=b[k];
b[k]=temp;
}
/* convert to upper triangular form */
if ( fabs(a[k][k])>=1.e-6)
{
for (i=k+1;i<n;i++)
{
x = a[i][k]/a[k][k];
for (j=k+1;j<n;j++) a[i][j] = a[i][j] -a[k][j]*x;
b[i] = b[i] - b[k]*x;
}
}
else
{
writeLog("zero pivot found in line:%d",k);
return 0;
}
}
/* back substitution */
b[nn] = b[nn] / a[nn][nn];
for (i=n-2;i>=0;i--)
{
sum = b[i];
for (j=i+1;j<n;j++)
sum = sum - a[i][j]*b[j];
b[i] = sum/a[i][i];
}
return 0;
}
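float absmax(float x[]) {
// x decays to a pointer here, so sizeof(x) would not give the element count;
// the solver always passes 3 values (sinkage, heel, trim)
int n = 3;
float max = fabs(x[0]);
for (int i = 1; i < n; i++) {
if (max < fabs(x[i])) {
max = fabs(x[i]);
}
}
return max;
}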
Have you considered some stochastic search methods to find the initial value and then fine-tuning with Newton Raphson? One possibility is evolutionary computation, you can use the Inspyred package. For a physical problem similar in many ways to the one you describe, look at this example: http://inspyred.github.com/tutorial.html#lunar-explorer
What about using a damped version of Newton's method? You could quite easily modify your implementation to do this. Think about Newton's method as finding a direction
d_k = f(x_k) / f'(x_k)
and updating the variable
x_k+1 = x_k - L_k d_k
In the usual Newton's method, L_k is always 1, but this might create overshoots or undershoots. So, let your method choose L_k. Suppose that your method usually overshoots. A possible strategy consists of taking the largest L_k in the set {1,1/2,1/4,1/8,... L_min} such that the condition
|f(x_k+1)| <= (1-L_k/2) |f(x_k)|
is satisfied (or L_min if none of the values satisfies this criterion).
With the same criterion, another possible strategy is to start with L_0=1 and, if the criterion is not met, try L_0/2 and so on until it works (or until L_0 = L_min). Then for L_1, start with min(1, 2L_0) and do the same. Then start with L_2=min(1, 2L_1) and so on.
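The first strategy can be sketched for a scalar equation as follows. This is only an illustration under stated assumptions: damped_newton and its parameters are hypothetical names, and the floating body problem would need the same step-halving loop wrapped around the 3x3 solve, with |f| replaced by a vector norm.

```cpp
#include <cmath>
#include <functional>

// Hypothetical helper: damped Newton iteration for a scalar equation f(x) = 0.
// Assumes df(x) is non-zero at every iterate.
double damped_newton(const std::function<double(double)>& f,
                     const std::function<double(double)>& df,
                     double x, double l_min = 1.0 / 1024.0,
                     int max_iter = 100, double tol = 1e-12) {
    for (int k = 0; k < max_iter; ++k) {
        double fx = f(x);
        if (std::fabs(fx) < tol) {
            break;  // residual is small enough
        }
        double d = fx / df(x);  // Newton direction d_k
        double L = 1.0;
        // Halve L until |f(x - L d)| <= (1 - L/2) |f(x)| or L reaches l_min.
        while (L > l_min &&
               std::fabs(f(x - L * d)) > (1.0 - L / 2.0) * std::fabs(fx)) {
            L /= 2.0;
        }
        x -= L * d;
    }
    return x;
}
```

As a usage example, the plain Newton iteration for atan(x) = 0 overshoots further on every step when started at x = 3, whereas the damped version shortens the first step and then converges to the root at 0.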
By the way: are you sure that your problem has a unique solution? I guess that the answer to this question depends on the shape of your object. If you have a rugby ball, there's one angle that you cannot fix. So if your shape is close to such an object, I would not be surprised that the problem is difficult to solve for that angle.