c++ integer division

c++ integer division - c++

Say I've got
SDL_Rect rect;
rect.x = 5; // rect.x is of type "Uint16"
int y = 11;
and I want to perform the operation rect.x/y.
I want to get the result as a floating point number and then round up to the nearest int.
Something like this:
float result = rect.x/y;
int rounded = ceil(result);
How should I do this?

Cast either rect.x or y to float, and then do the division. This will force the entire division operation to take place in floating point.
float result = rect.x/(float)y;
int rounded = ceil(result);

You need to cast your variables.
float result = (float)rect.x / (float)y;
int rounded = ceil(result);

You can do this without any floating point types. If x and y are positive integers, then x divided by y, rounded up is (x+y-1)/y.

Related

C++: Rotating point around origin, but output point is not correct by some margin

first time asking.
I want to rotate a point in 3d in c++ in the XY Plane and am using the following function for the task.
void rotateXY(double angle){
//save the x and y and z coordinates in seperate variables
double x = this->pos[0]; // value 1
double y = this->pos[1]; // value 0
double z = this->pos[2]; // value 0, but in the xy rotation it is not important
double radian = angle*M_PI/180;
this->pos[0] = cos(radian)*x - sin(radian)*y;
this->pos[1] = sin(radian)*x + cos(radian)*y;
this->pos[2] = 1*z;
};
I got the matrix from https://gamedevelopment.tutsplus.com/tutorials/lets-build-a-3d-graphics-engine-linear-transformations--gamedev-7716
In this I am directly manipulating the coordinates of the point, hence the this->pos[0]
If I call another function called rotateXYP where I first substract a mathematical vector from the rotating point and add the same mathematical vector to it after the rotate I get the wanted results.
void rotateXYP(double angle, eng::point originOfRotation){
this->subVec(originOfRotation);
this->rotateXY(angle);
this->addVec(originOfRotation);
};
void rotateXY(double angle){
//save x,y and z in seperate variables for manipulation
double x = this->pos[0]; // value 1
double y = this->pos[1]; // value 0
double z = this->pos[2]; // value 0, but in the xy rotation it is not important
//convert from degrees to radians because cmath requires it
double radian = angle*M_PI/180;
//apply the values according to a rotation matrix found on the internet
this->pos[0] = cos(radian)*x - sin(radian)*y;
this->pos[1] = sin(radian)*x + cos(radian)*y;
this->pos[2] = 1*z;
};
My Question
Why am I getting with the point (1|0|0) as an input to the function rotateXY(90) the follwing as an output.
(6.12323e-17|1|0)
instead of
(0|1|0)
and if I call the function rotateXYP(90, some point), I get the correct point, without the tiny number on the x-coordinate.
I suspect it has something to do with the cos and sin in the following line of code:
this->pos[0] = cos(radian)*x - sin(radian)*y;
As I am too inexperienced with c++ I seek answeres and hope that this was not a bad question.

Your implementation is correct. This is just the nature of floating point arithmetic. All numbers are represented as approximations. When translating the point you get a better numeric condition.
I might add that this effect will occur independent of the programming language and hardware used.

I have solved my problem by adding a variable named accuracy which controls the number of decimal places the double is allowed to have.
void rotateXY(double angle){
//Accuracy: a is the number of decimal places
int a = 2;
int acc = pow(10,a);
//save x,y and z in seperate variables for manipulation
double x = this->pos[0]; // value 1
double y = this->pos[1]; // value 0
double z = this->pos[2]; // value 0, but in the xy rotation it is not important
//convert from degrees to radians because cmath requires it
double radian = angle*M_PI/180;
//apply the values according to a rotation matrix found on the internet
this->pos[0] = round((cos(radian)*x - sin(radian)*y)*acc)/acc;
this->pos[1] = round((sin(radian)*x + cos(radian)*y)*acc)/acc;
this->pos[2] = round((1*z)*acc)/acc;
};

How to correctly floor the floating point pair sum

Let me have two floating point variables coming as function arguments:
float fun(float x, float y) {
// ...
}
I would like to calculate the floor of their sum. Is it possible to do it correctly not relating on current floating point rounding mode?
I mean the following. Consider the expression:
floorf(x+y)
It is possible that the exact value of the function argument (x + y) < n for some integer n will be rounded to the nearest integer n during the floating point operation, and then the floorf() function will return n instead of (n-1).

Here is a demonstration using the numbers given by Bathsheba and the effect of the floating point rounding mode:
#include <stdio.h>
#include <fenv.h>
#include <math.h>
int main(void) {
double y = 0.49999999999999994;
double x = 0.5;
double z1 = x + y;
// set floating point rounding downwards
fesetround(FE_DOWNWARD);
double z2 = x + y;
printf("y < 0.5: %d\nz1 == 1: %d\nz2 == 1: %d\n", y < x, z1 == 1, z2 == 1);
printf("floor(z1): %f\nfloor(z2): %f\n", floor(z1), floor(z2));
}
y is less than 0.5, so the sum of y + 0.5 should be less than 1, but it is rounded to 1 using the default mode (z1). If the floating point rounding mode is set to round downwards, the result is less than 1 (z2), which would floor to 0. Clearly it is not possible to do this "correctly" under "any arbitrary floating point rounding mode"...
The output is
y < 0.5: 1
z1 == 1: 1
z2 == 1: 0
floor(z1): 1.000000
floor(z2): 0.000000

Yes this is possible.
A well-known example is an IEEE754 64 bit float (unusual but possible by the standard)
x = 0.5 and y = 0.49999999999999994.
(x + y) is exactly 1 and yes floorf will return 1.

Is a whole number float divided by itself guaranteed to be 1.f?

If I write:
int x = /* any non-zero integer value */;
float y = x;
float z = y / y;
Is z guaranteed to be exactly 1.f ?

If your C++ implementation uses IEEE754 then yes, this is guaranteed. (The division operator is required to return the best possible floating point value).
The only exceptions for y / y, in general, not being 1.f are the cases when y is NaN, +Inf, -Inf, 0.f, and -0.f, or if you are on a platform where int is so wide that certain instances of it cannot be represented in a float without that float being set to +Inf or -Inf1. Setting aside that final point, in your case that means that int x = 0; will produce the only exception.
IEEE754 is extremely common. But to check for sure, test the value of
std::numeric_limits<float>::is_iec559;
1A platform, for example, with a 128 bit int and an IEEE754 32 bit float would exhibit this behaviour for certain values of x.

No, not in all cases, even for IEEE754.
For example, with int x = 0;, you'll get NaN. (Live)

Is there any "standard" way to calculate the numerical gradient?

I am trying to calculate the numerical gradient of a smooth function in c++. And the parameter value could vary from zero to a very large number(maybe 1e10 to 1e20?)
I used the function f(x,y) = 10*x^3 + y^3 as a testbench, but I found that if x or y is too large, I can't get correct gradient.
Here is my code to calculate the graidient:
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y)
{
// black box expensive function
return 10 * pow(x, 3) + pow(y, 3);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+19;
double y = -5.49756e+14;
const double epsi = 1e-4;
double f1 = f(x, y);
double f2 = f(x, y+epsi);
double f3 = f(x, y-epsi);
cout << f1 << endl;
cout << f2 << endl;
cout << f3 << endl;
cout << f1 - f2 << endl; // 0
cout << f2 - f3 << endl; // 0
return 0;
}
If I use the above code to calculate the gradient, the gradient would be zero!
The testbench function, 10*x^3 + y^3, is just a demo, the real problem I need to solve is actually a black box function.
So, is there any "standard" way to calculate the numerical gradient?

In the first place, you should use the central difference scheme, which is more accurate (by cancellation of one more term of the Taylor develoment).
(f(x + h) - f(x - h)) / 2h
rather than
(f(x + h) - f(x)) / h
Then the choice of h is critical and using a fixed constant is the worst thing you can do. Because for small x, h will be too large so that the approximation formula no more works, and for large x, h will be too small, resulting in severe truncation error.
A much better choice is to take a relative value, h = x√ε, where ε is the machine epsilon (1 ulp), which gives a good tradeoff.
(f(x(1 + √ε)) - f(x(1 - √ε))) / 2x√ε
Beware that when x = 0, a relative value cannot work and you need to fall back to a constant. But then, nothing tells you which to use !

You need to consider the precision needed.
At first glance, since |y| = 5.49756e14 and epsi = 1e-4, you need at least ⌈log2(5.49756e14)-log2(1e-4)⌉ = 63 bits of significand precision (that is the number of bits used to encode the digits of your number, also known as mantissa) for y and y+epsi to be considered different.
The double-precision floating-point format only has 53 bits of significand precision (assuming it is 8 bytes). So, currently, f1, f2 and f3 are exactly the same because y, y+epsi and y-epsi are equal.
Now, let's consider the limit : y = 1e20, and the result of your function, 10x^3 + y^3. Let's ignore x for now, so let's take f = y^3. Now we can calculate the precision needed for f(y) and f(y+epsi) to be different : f(y) = 1e60 and f(epsi) = 1e-12. This gives a minimum significand precision of ⌈log2(1e60)-log2(1e-12)⌉ = 240 bits.
Even if you were to use the long double type, assuming it is 16 bytes, your results would not differ : f1, f2 and f3 would still be equal, even though y and y+epsi would not.
If we take x into account, the maximum value of f would be 11e60 (with x = y = 1e20). So the upper limit on precision is ⌈log2(11e60)-log2(1e-12)⌉ = 243 bits, or at least 31 bytes.
One way to solve your problem is to use another type, maybe a bignum used as fixed-point.
Another way is to rethink your problem and deal with it differently. Ultimately, what you want is f1 - f2. You can try to decompose f(y+epsi). Again, if you ignore x, f(y+epsi) = (y+epsi)^3 = y^3 + 3*y^2*epsi + 3*y*epsi^2 + epsi^3. So f(y+epsi) - f(y) = 3*y^2*epsi + 3*y*epsi^2 + epsi^3.

The only way to calculate gradient is calculus.
Gradient is a vector:
g(x, y) = Df/Dx i + Df/Dy j
where (i, j) are unit vectors in x and y directions, respectively.
One way to approximate derivatives is first order differences:
Df/Dx ~ (f(x2, y)-f(x1, y))/(x2-x1)
and
Df/Dy ~ (f(x, y2)-f(x, y1))/(y2-y1)
That doesn't look like what you're doing.
You have a closed form expression:
g(x, y) = 30*x^2 i + 3*y^2 j
You can plug in values for (x, y) and calculate the gradient exactly at any point. Compare that to your differences and see how well your approximation is doing.
How you implement it numerically is your responsibility. (10^19)^3 = 10^57, right?
What is the size of double on your machine? Is it a 64 bit IEEE double precision floating point number?

Use
dx = (1+abs(x))*eps, dfdx = (f(x+dx,y) - f(x,y)) / dx
dy = (1+abs(y))*eps, dfdy = (f(x,y+dy) - f(x,y)) / dy
to get meaningful step sizes for large arguments.
Use eps = 1e-8 for one-sided difference formulas, eps = 1e-5 for central difference quotients.
Explore automatic differentiation (see autodiff.org) for derivatives without difference quotients and thus much smaller numerical errors.

We can examine the behaviour of the error in the derivative using the following program - it calculates the 1-sided derivative and the central difference based derivative using a varying step size. Here I'm using x and y ~ 10^10, which is smaller than what you were using, but should illustrate the same point.
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y) {
return 10 * pow(x, 3) + pow(y, 3);
}
double f_x(double x, double y) {
return 3 * 10 * pow(x,2);
}
double f_y(double x, double y) {
return 3 * pow(y,2);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+10;
double y = -5.49756e+10;
//double x = 10.1;
//double y = -5.2;
double epsi = 1e8;
for(int i=0; i<60; ++i) {
double dfx_n = (f(x+epsi,y) - f(x,y))/epsi;
double dfx_cd = (f(x+epsi,y) - f(x-epsi,y))/(2*epsi);
double dfx = f_x(x,y);
cout<<epsi<<" "<<fabs(dfx-dfx_n)<<" "<<fabs(dfx - dfx_cd)<<std::endl;
epsi/=1.5;
}
return 0;
}
The output shows that a 1-sided difference gets us an optimal error of about 1.37034e+13 at a step length of about 100.0. Note that while this error looks large, as a relative error it is 3.5746632302764072e-09 (since the exact value is 3.833e+21)
In comparison the 2-sided difference gets an optimal error of about 1.89493e+10 with a step size of about 45109.3. This is three-orders of magnitude better, (with a much larger step-size).
How can we work out the step size? The link in the comments of Yves Daosts answer gives us a ballpark value:
h=x_c sqrt(eps) for 1-Sided, and h=x_c cbrt(eps) for 2-Sided.
But either way, if the required step size for decent accuracy at x ~ 10^10 is 100.0, the required step size with x ~ 10^20 is going to be 10^10 larger too. So the problem is simply that your step size is way too small.
This can be verified by increasing the starting step-size in the above code and resetting the x/y values to the original values.
Then expected derivative is O(1e39), best 1-sided error of about O(1e31) occurs near a step length of 5.9e10, best 2-sided error of about O(1e29) occurs near a step length of 6.1e13.

As numerical differentiation is ill conditioned (which means a small error could alter your result significantly) you should consider to use Cauchy's integral formula. This way you can calculate the n-th derivative with an integral. This will lead to less problems with considering accuracy and stability.

World coordinates as float to chunk and pixel coordinates

I have no idea right now for how to calculate pixel coordinates within the chunk.
I calculate the chunk from the world coordinates like this:
float xCoord = ..., yCoord = ...; //Can be positive and negative.
int xChunk = static_cast<int>(std::floor((xCoord + WORLD_OFFSET_X_F) / CHUNK_XY_SIZE_F));
int yChunk = static_cast<int>(std::floor((yCoord + WORLD_OFFSET_Y_F) / CHUNK_XY_SIZE_F));
It would be easy to calculate the pixel coordinates within the chunk if they were integers. Example:
int xCoord = ..., yCoord = ...; //Can be positive and negative.
int xPixel = (xCoord + WORLD_OFFSET_X_I) % CHUNK_XY_SIZE_I; Or by using the AND(&) operator.
int yPixel = (yCoord + WORLD_OFFSET_Y_I) % CHUNK_XY_SIZE_I;
This doesn't work by using floating point numbers. How can I accomplish the same results with floating point numbers?
Thanks in advance.

You can do it exactly the same way. You just have to use a floating point modulo instead of integer. Take a look at fmod

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

c++ integer division - c++

Cast either rect.x or y to float, and then do the division. This will force the entire division operation to take place in floating point. float result = rect.x/(float)y; int rounded = ceil(result);

You need to cast your variables. float result = (float)rect.x / (float)y; int rounded = ceil(result);

You can do this without any floating point types. If x and y are positive integers, then x divided by y, rounded up is (x+y-1)/y.

Related

C++: Rotating point around origin, but output point is not correct by some margin

How to correctly floor the floating point pair sum

Is a whole number float divided by itself guaranteed to be 1.f?

Is there any "standard" way to calculate the numerical gradient?

World coordinates as float to chunk and pixel coordinates

Categories

Resources