C++ and Octave results differ for simple arithmetic - c++

I have a C++ code and an Octave that both compute the same equation
In C++
#include <math.h>
int main()
{
float x = 1.5f;
float y = pow(x, 6) * 235809835.41f - pow(x, 5) * 2110439254.2f + pow(x, 4) *7869448124.8f - pow(x, 3) * 15648965509.0f + pow(x, 2) * 17503313074.0f - (x)* 10440563329.0f + 2594694745.0f; // result y = 3584
return 0;
}
In Octave
x = 1.5
y = (x ^ 6) * 235809835.41 - (x ^ 5) * 2110439254.2 + (x ^ 4) *7869448124.8 - (x ^ 3) * 15648965509 + (x ^ 2) * 17503313074 - (x)* 10440563329 + 2594694745‏ // result y = 26
The computed value of y differs in the two cases. C++ computes y to be 3584 and Octave computes y to be 26. What could be the cause for this divergence?
EDIT : Excel produces the same result as Octave, and the result is logical too within the context of the equation. So, something is wrong with the C++ code or compiler.

This appears to be due to the limited precision of the float type, which is likely causing one of the operations to be effectively discarded because one operand is of a smaller enough magnitude than the other to cause a significant change to the result. (See this extremely contrived example that shows what this might look like.)
If you rewrite the code to use the double type, which is more precise, then the result is 26.810783, which matches the result I get from evaluating the formula in Maxima.
Further reading: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Related

How can you calculate a factor if you have the other factor and the product with overflows?

a * x = b
I have a seemingly rather complicated multiplication / imul problem: if I have a and I have b, how can I calculate x if they're all 32-bit dwords (e.g. 0-1 = FFFFFFFF, FFFFFFFF+1 = 0)?
For example:
0xcb9102df * x = 0x4d243a5d
In that case, x is 0x1908c643. I found a similar question but the premises were different and I'm hoping there's a simpler solution than those given.
Numbers have a modular multiplicative inverse modulo a power of two precisely iff they are odd. Everything else is a bit-shifted odd number (even zero, which might be anything, with all bits shifted out). So there are a couple of cases:
Given a * x = b
tzcnt(a) > tzcnt(b) no solution
tzcnt(a) <= tzcnt(b) solvable, with 2tzcnt(a) solutions
The second case has a special case with 1 solution, for odd a, namely x = inverse(a) * b
More generally, x = inverse(a >> tzcnt(a)) * (b >> tzcnt(a)) is a solution, because you write a as (a >> tzcnt(a)) * (1 << tzcnt(a)), so we cancel the left factor with its inverse, we leave the right factor as part of the result (cannot be cancelled anyway) and then multiply by what remains to get it up to b. Still only works in the second case, obviously. If you wanted, you could enumerate all solutions by filling in all possibilities for the top tzcnt(a) bits.
The only thing that remains is getting the inverse, you've probably seen it in the other answer, whatever it was, but for completeness you can compute it as follows: (not tested)
; input x
dword y = (x * x) + x - 1;
dword t = y * x;
y *= 2 - t;
t = y * x;
y *= 2 - t;
t = y * x;
y *= 2 - t;
; result y

Does gcc optimize c++ code algebraically and if so to what extent?

Consider the following piece of code showing some simple arithmetic operations
int result = 0;
result = c * (a + b) + d * (a + b) + e;
To get the result in the expression above the cpu would need to execute two integer multiplications and three integer additions. However algebraically the above expression could be simplified to the code below.
result = (c + d) * (a + b) + e
The two expressions are algebraically identical however the second expression only contains one multiplication and three additions. Is gcc (or other compilers for that matter) able to make this simple optimization on their own.
Now assuming that the compiler is intelligent enough to make this simple optimization, would it be able to optimize something more complex such as the Trapezoidal rule (used for numerical integration). Example below approximates the area under sin(x) where 0 <= x <= pi with a step size of pi/4 (small for the sake of simplicity). Please assume all literals are runtime variables.
#include <math.h>
// Please assume all literals are runtime variables. Have done it this way to
// simplify the code.
double integral = 0.5 * ((sin(0) + sin(M_PI/4) * (M_PI/4 - 0) + (sin(M_PI/4) +
sin(M_PI/2)) * (M_PI/2 - M_PI/4) + (sin(M_PI/2) + sin(3 * M_PI/4)) *
(3 * M_PI/4 - M_PI/2) + (sin(3 * M_PI/4) + sin(M_PI)) * (M_PI - 3 * M_PI/4));
Now the above function could be written like this once simplified using the trapezoidal rule. This drastically reduces the number of multiplications/divisions needed to get the same answer.
integral = 0.5 * (1 / no_steps /* 4 in th case above*/) *
(M_PI - 0 /* Upper and lower limit*/) * (sin(0) + 2 * (sin(M_PI/4) +
sin(3 * M_PI/4)) + sin(M_PI));
GCC (and most C++ compilers, for that matter) does not refactor algebraic expressions.
This is mainly because as far as GCC and general software arithmetic is concerned, the lines
double x = 0.5 * (4.6 + 6.7);
double y = 0.5 * 4.6 + 0.5 * 6.7;
assert(x == y); //Will probably fail!
Are not guaranteed to be evaluate to the exact same number. GCC can't optimize these structures without that kind of guarantee.
Furthermore, order of operations can matter a lot. For example:
int x = y;
int z = (y / 16) * 16;
assert(x == z); //Will only be true if y is a whole multiple of 16
Algebraically, those two lines should be equivalent, right? But if y is an int, what it will actually do is make x equal to "y rounded to the lower whole multiple of 16". Sometimes, that's intended behavior (Like if you're byte aligning). Other times, it's a bug. The important thing is, both are valid computer code and both can be useful depending on the circumstances, and if GCC optimized around those structures, it would prevent programmers from having agency over their code.
Yes, optimizers, gcc's included, do optimizations of this type. Not necessarily the expression that you quoted exactly, or other arbitrarily complex expressions. But a simpler expresion, (a + a) - a is likely to be optimized to a for example. Another example of possible optimization is a*a*a*a to temp = a*a; (temp)*(temp)
Whether a given compiler optimizes the expressions that you quote, can be observed by reading the output assembly code.
No, this type of optimization is not used with floating points by default (unless maybe if the optimizer can prove that no accuracy is lost). See Are floating point operations in C associative? You can let for example gcc do this with -fassociative-math option. At your own peril.

Numerical precision for difference of squares

in my code I often compute things like the following piece (here C code for simplicity):
float cos_theta = /* some simple operations; no cosf call! */;
float sin_theta = sqrtf(1.0f - cos_theta * cos_theta); // Option 1
For this example ignore that the argument of the square root might be negative due to imprecisions. I fixed that with additional fdimf call. However, I wondered if the following is more precise:
float sin_theta = sqrtf((1.0f + cos_theta) * (1.0f - cos_theta)); // Option 2
cos_theta is between -1 and +1 so for each choice there will be situations where I subtract similar numbers and thus will loose precision, right? What is the most precise and why?
The most precise way with floats is likely to compute both sin and cos using a single x87 instruction, fsincos.
However, if you need to do the computation manually, it's best to group arguments with similar magnitudes. This means the second option is more precise, especially when cos_theta is close to 0, where precision matters the most.
As the article
What Every Computer Scientist Should Know About Floating-Point Arithmetic notes:
The expression x2 - y2 is another formula that exhibits catastrophic
cancellation. It is more accurate to evaluate it as (x - y)(x + y).
Edit: it's more complicated than this. Although the above is generally true, (x - y)(x + y) is slightly less accurate when x and y are of very different magnitudes, as the footnote to the statement explains:
In this case, (x - y)(x + y) has three rounding errors, but x2 - y2 has only two since the rounding error committed when computing the smaller of x2 and y2 does not affect the final subtraction.
In other words, taking x - y, x + y, and the product (x - y)(x + y) each introduce rounding errors (3 steps of rounding error). x2, y2, and the subtraction x2 - y2 also each introduce rounding errors, but the rounding error obtained by squaring a relatively small number (the smaller of x and y) is so negligible that there are effectively only two steps of rounding error, making the difference of squares more precise.
So option 1 is actually going to be more precise. This is confirmed by dev.brutus's Java test.
I wrote small test. It calcutates expected value with double precision. Then it calculates an error with your options. The first option is better:
Algorithm: FloatTest$1
option 1 error = 3.802792362162126
option 2 error = 4.333273185303996
Algorithm: FloatTest$2
option 1 error = 3.802792362167937
option 2 error = 4.333273185305868
The Java code:
import org.junit.Test;
public class FloatTest {
#Test
public void test() {
testImpl(new ExpectedAlgorithm() {
public double te(double cos_theta) {
return Math.sqrt(1.0f - cos_theta * cos_theta);
}
});
testImpl(new ExpectedAlgorithm() {
public double te(double cos_theta) {
return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
}
});
}
public void testImpl(ExpectedAlgorithm ea) {
double delta1 = 0;
double delta2 = 0;
for (double cos_theta = -1; cos_theta <= 1; cos_theta += 1e-8) {
double[] delta = delta(cos_theta, ea);
delta1 += delta[0];
delta2 += delta[1];
}
System.out.println("Algorithm: " + ea.getClass().getName());
System.out.println("option 1 error = " + delta1);
System.out.println("option 2 error = " + delta2);
}
private double[] delta(double cos_theta, ExpectedAlgorithm ea) {
double expected = ea.te(cos_theta);
double delta1 = Math.abs(expected - t1((float) cos_theta));
double delta2 = Math.abs(expected - t2((float) cos_theta));
return new double[]{delta1, delta2};
}
private double t1(float cos_theta) {
return Math.sqrt(1.0f - cos_theta * cos_theta);
}
private double t2(float cos_theta) {
return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
}
interface ExpectedAlgorithm {
double te(double cos_theta);
}
}
The correct way to reason about numerical precision of some expression is to:
Measure the result discrepancy relative to the correct value in ULPs (Unit in the last place), introduced in 1960. by W. H. Kahan. You can find C, Python & Mathematica implementations here, and learn more on the topic here.
Discriminate between two or more expressions based on the worst case they produce, not average absolute error as done in other answers or by some other arbitrary metric. This is how numerical approximation polynomials are constructed (Remez algorithm), how standard library methods' implementations are analysed (e.g. Intel atan2), etc...
With that in mind, version_1: sqrt(1 - x * x) and version_2: sqrt((1 - x) * (1 + x)) produce significantly different outcomes. As presented in the plot below, version_1 demonstrates catastrophic performance for x close to 1 with error > 1_000_000 ulps, while on the other hand error of version_2 is well behaved.
That is why I always recommend using version_2, i.e. exploiting the square difference formula.
Python 3.6 code that produces square_diff_error.csv file:
from fractions import Fraction
from math import exp, fabs, sqrt
from random import random
from struct import pack, unpack
def ulp(x):
"""
Computing ULP of input double precision number x exploiting
lexicographic ordering property of positive IEEE-754 numbers.
The implementation correctly handles the special cases:
- ulp(NaN) = NaN
- ulp(-Inf) = Inf
- ulp(Inf) = Inf
Author: Hrvoje Abraham
Date: 11.12.2015
Revisions: 15.08.2017
26.11.2017
MIT License https://opensource.org/licenses/MIT
:param x: (float) float ULP will be calculated for
:returns: (float) the input float number ULP value
"""
# setting sign bit to 0, e.g. -0.0 becomes 0.0
t = abs(x)
# converting IEEE-754 64-bit format bit content to unsigned integer
ll = unpack('Q', pack('d', t))[0]
# computing first smaller integer, bigger in a case of ll=0 (t=0.0)
near_ll = abs(ll - 1)
# converting back to float, its value will be float nearest to t
near_t = unpack('d', pack('Q', near_ll))[0]
# abs takes care of case t=0.0
return abs(t - near_t)
with open('e:/square_diff_error.csv', 'w') as f:
for _ in range(100_000):
# nonlinear distribution of x in [0, 1] to produce more cases close to 1
k = 10
x = (exp(k) - exp(k * random())) / (exp(k) - 1)
fx = Fraction(x)
correct = sqrt(float(Fraction(1) - fx * fx))
version1 = sqrt(1.0 - x * x)
version2 = sqrt((1.0 - x) * (1.0 + x))
err1 = fabs(version1 - correct) / ulp(correct)
err2 = fabs(version2 - correct) / ulp(correct)
f.write(f'{x},{err1},{err2}\n')
Mathematica code that produces the final plot:
data = Import["e:/square_diff_error.csv"];
err1 = {1 - #[[1]], #[[2]]} & /# data;
err2 = {1 - #[[1]], #[[3]]} & /# data;
ListLogLogPlot[{err1, err2}, PlotRange -> All, Axes -> False, Frame -> True,
FrameLabel -> {"1-x", "error [ULPs]"}, LabelStyle -> {FontSize -> 20}]
As an aside, you will always have a problem when theta is small, because the cosine is flat around theta = 0. If theta is between -0.0001 and 0.0001 then cos(theta) in float is exactly one, so your sin_theta will be exactly zero.
To answer your question, when cos_theta is close to one (corresponding to a small theta), your second computation is clearly more accurate. This is shown by the following program, that lists the absolute and relative errors for both computations for various values of cos_theta. The errors are computed by comparing against a value which is computed with 200 bits of precision, using GNU MP library, and then converted to a float.
#include <math.h>
#include <stdio.h>
#include <gmp.h>
int main()
{
int i;
printf("cos_theta abs (1) rel (1) abs (2) rel (2)\n\n");
for (i = -14; i < 0; ++i) {
float x = 1 - pow(10, i/2.0);
float approx1 = sqrt(1 - x * x);
float approx2 = sqrt((1 - x) * (1 + x));
/* Use GNU MultiPrecision Library to get 'exact' answer */
mpf_t tmp1, tmp2;
mpf_init2(tmp1, 200); /* use 200 bits precision */
mpf_init2(tmp2, 200);
mpf_set_d(tmp1, x);
mpf_mul(tmp2, tmp1, tmp1); /* tmp2 = x * x */
mpf_neg(tmp1, tmp2); /* tmp1 = -x * x */
mpf_add_ui(tmp2, tmp1, 1); /* tmp2 = 1 - x * x */
mpf_sqrt(tmp1, tmp2); /* tmp1 = sqrt(1 - x * x) */
float exact = mpf_get_d(tmp1);
printf("%.8f %.3e %.3e %.3e %.3e\n", x,
fabs(approx1 - exact), fabs((approx1 - exact) / exact),
fabs(approx2 - exact), fabs((approx2 - exact) / exact));
/* printf("%.10f %.8f %.8f %.8f\n", x, exact, approx1, approx2); */
}
return 0;
}
Output:
cos_theta abs (1) rel (1) abs (2) rel (2)
0.99999988 2.910e-11 5.960e-08 0.000e+00 0.000e+00
0.99999970 5.821e-11 7.539e-08 0.000e+00 0.000e+00
0.99999899 3.492e-10 2.453e-07 1.164e-10 8.178e-08
0.99999684 2.095e-09 8.337e-07 0.000e+00 0.000e+00
0.99998999 1.118e-08 2.497e-06 0.000e+00 0.000e+00
0.99996835 6.240e-08 7.843e-06 9.313e-10 1.171e-07
0.99989998 3.530e-07 2.496e-05 0.000e+00 0.000e+00
0.99968380 3.818e-07 1.519e-05 0.000e+00 0.000e+00
0.99900001 1.490e-07 3.333e-06 0.000e+00 0.000e+00
0.99683774 8.941e-08 1.125e-06 7.451e-09 9.376e-08
0.99000001 5.960e-08 4.225e-07 0.000e+00 0.000e+00
0.96837723 1.490e-08 5.973e-08 0.000e+00 0.000e+00
0.89999998 2.980e-08 6.837e-08 0.000e+00 0.000e+00
0.68377221 5.960e-08 8.168e-08 5.960e-08 8.168e-08
When cos_theta is not close to one, then the accuracy of both methods is very close to each other and to round-off error.
[Edited for major think-o] It looks to me like option 2 will be better, because for a number like 0.000001 for example option 1 will return the sine as 1 while option will return a number just smaller than 1.
No difference in my option since (1-x) preserves the precision not effecting the carried bit. Then for (1+x) the same is true. Then the only thing effecting the carry bit precision is the multiplication. So in both cases there is one single multiplication, so they are both as likely to give the same carry bit error.

How i can make matlab precision to be the same as in c++?

I have problem with precision. I have to make my c++ code to have same precision as matlab. In matlab i have script which do some stuff with numbers etc. I got code in c++ which do the same as that script. Output on the same input is diffrent :( I found that in my script when i try 104 >= 104 it returns false. I tried to use format long but it did not help me to find out why its false. Both numbers are type of double. i thought that maybe matlab stores somewhere the real value of 104 and its for real like 103.9999... So i leveled up my precision in c++. It also didnt help because when matlab returns me value of 50.000 in c++ i got value of 50.050 with high precision. Those 2 values are from few calculations like + or *. Is there any way to make my c++ and matlab scrips have same precision?
for i = 1:neighbors
y = spoints(i,1)+origy;
x = spoints(i,2)+origx;
% Calculate floors, ceils and rounds for the x and y.
fy = floor(y); cy = ceil(y); ry = round(y);
fx = floor(x); cx = ceil(x); rx = round(x);
% Check if interpolation is needed.
if (abs(x - rx) < 1e-6) && (abs(y - ry) < 1e-6)
% Interpolation is not needed, use original datatypes
N = image(ry:ry+dy,rx:rx+dx);
D = N >= C;
else
% Interpolation needed, use double type images
ty = y - fy;
tx = x - fx;
% Calculate the interpolation weights.
w1 = (1 - tx) * (1 - ty);
w2 = tx * (1 - ty);
w3 = (1 - tx) * ty ;
w4 = tx * ty ;
%Compute interpolated pixel values
N = w1*d_image(fy:fy+dy,fx:fx+dx) + w2*d_image(fy:fy+dy,cx:cx+dx) + ...
w3*d_image(cy:cy+dy,fx:fx+dx) + w4*d_image(cy:cy+dy,cx:cx+dx);
D = N >= d_C;
end
I got problems in else which is in line 12. tx and ty eqauls 0.707106781186547 or 1 - 0.707106781186547. Values from d_image are in range 0 and 255. N is value 0..255 of interpolating 4 pixels from image. d_C is value 0.255. Still dunno why matlab shows that when i have in N vlaues like: x x x 140.0000 140.0000 and in d_C: x x x 140 x. D gives me 0 on 4th position so 140.0000 != 140. I Debugged it trying more precision but it still says that its 140.00000000000000 and it is still not 140.
int Codes::Interpolation( Point_<int> point, Point_<int> center , Mat *mat)
{
int x = center.x-point.x;
int y = center.y-point.y;
Point_<double> my;
if(x<0)
{
if(y<0)
{
my.x=center.x+LEN;
my.y=center.y+LEN;
}
else
{
my.x=center.x+LEN;
my.y=center.y-LEN;
}
}
else
{
if(y<0)
{
my.x=center.x-LEN;
my.y=center.y+LEN;
}
else
{
my.x=center.x-LEN;
my.y=center.y-LEN;
}
}
int a=my.x;
int b=my.y;
double tx = my.x - a;
double ty = my.y - b;
double wage[4];
wage[0] = (1 - tx) * (1 - ty);
wage[1] = tx * (1 - ty);
wage[2] = (1 - tx) * ty ;
wage[3] = tx * ty ;
int values[4];
//wpisanie do tablicy 4 pixeli ktore wchodza do interpolacji
for(int i=0;i<4;i++)
{
int val = mat->at<uchar>(Point_<int>(a+help[i].x,a+help[i].y));
values[i]=val;
}
double moze = (wage[0]) * (values[0]) + (wage[1]) * (values[1]) + (wage[2]) * (values[2]) + (wage[3]) * (values[3]);
return moze;
}
LEN = 0.707106781186547 Values in array values are 100% same as matlab values.
Matlab uses double precision. You can use C++'s double type. That should make most things similar, but not 100%.
As someone else noted, this is probably not the source of your problem. Either there is a difference in the algorithms, or it might be something like a library function defined differently in Matlab and in C++. For example, Matlab's std() divides by (n-1) and your code may divide by n.
First, as a rule of thumb, it is never a good idea to compare floating point variables directly. Instead of, for example instead of if (nr >= 104) you should use if (nr >= 104-e), where e is a small number, like 0.00001.
However, there must be some serious undersampling or rounding error somewhere in your script, because getting 50050 instead of 50000 is not in the limit of common floating point imprecision. For example, Matlab can have a step of as small as 15 digits!
I guess there are some casting problems in your code, for example
int i;
double d;
// ...
d = i/3 * d;
will will give a very inaccurate result, because you have an integer division. d = (double)i/3 * d or d = i/3. * d would give a much more accurate result.
The above example would NOT cause any problems in Matlab, because there everything is already a floating-point number by default, so a similar problem might be behind the differences in the results of the c++ and Matlab code.
Seeing your calculations would help a lot in finding what went wrong.
EDIT:
In c and c++, if you compare a double with an integer of the same value, you have a very high chance that they will not be equal. It's the same with two doubles, but you might get lucky if you perform the exact same computations on them. Even in Matlab it's dangerous, and maybe you were just lucky that as both are doubles, both got truncated the same way.
By you recent edit it seems, that the problem is where you evaluate your array. You should never use == or != when comparing floats or doubles in c++ (or in any languages when you use floating-point variables). The proper way to do a comparison is to check whether they are within a small distance of each other.
An example: using == or != to compare two doubles is like comparing the weight of two objects by counting the number of atoms in them, and deciding that they are not equal even if there is one single atom difference between them.
MATLAB uses double precision unless you say otherwise. Any differences you see with an identical implementation in C++ will be due to floating-point errors.

Fast ceiling of an integer division in C / C++

Given integer values x and y, C and C++ both return as the quotient q = x/y the floor of the floating point equivalent. I'm interested in a method of returning the ceiling instead. For example, ceil(10/5)=2 and ceil(11/5)=3.
The obvious approach involves something like:
q = x / y;
if (q * y < x) ++q;
This requires an extra comparison and multiplication; and other methods I've seen (used in fact) involve casting as a float or double. Is there a more direct method that avoids the additional multiplication (or a second division) and branch, and that also avoids casting as a floating point number?
For positive numbers where you want to find the ceiling (q) of x when divided by y.
unsigned int x, y, q;
To round up ...
q = (x + y - 1) / y;
or (avoiding overflow in x+y)
q = 1 + ((x - 1) / y); // if x != 0
For positive numbers:
q = x/y + (x % y != 0);
Sparky's answer is one standard way to solve this problem, but as I also wrote in my comment, you run the risk of overflows. This can be solved by using a wider type, but what if you want to divide long longs?
Nathan Ernst's answer provides one solution, but it involves a function call, a variable declaration and a conditional, which makes it no shorter than the OPs code and probably even slower, because it is harder to optimize.
My solution is this:
q = (x % y) ? x / y + 1 : x / y;
It will be slightly faster than the OPs code, because the modulo and the division is performed using the same instruction on the processor, because the compiler can see that they are equivalent. At least gcc 4.4.1 performs this optimization with -O2 flag on x86.
In theory the compiler might inline the function call in Nathan Ernst's code and emit the same thing, but gcc didn't do that when I tested it. This might be because it would tie the compiled code to a single version of the standard library.
As a final note, none of this matters on a modern machine, except if you are in an extremely tight loop and all your data is in registers or the L1-cache. Otherwise all of these solutions will be equally fast, except for possibly Nathan Ernst's, which might be significantly slower if the function has to be fetched from main memory.
You could use the div function in cstdlib to get the quotient & remainder in a single call and then handle the ceiling separately, like in the below
#include <cstdlib>
#include <iostream>
int div_ceil(int numerator, int denominator)
{
std::div_t res = std::div(numerator, denominator);
return res.rem ? (res.quot + 1) : res.quot;
}
int main(int, const char**)
{
std::cout << "10 / 5 = " << div_ceil(10, 5) << std::endl;
std::cout << "11 / 5 = " << div_ceil(11, 5) << std::endl;
return 0;
}
There's a solution for both positive and negative x but only for positive y with just 1 division and without branches:
int div_ceil(int x, int y) {
return x / y + (x % y > 0);
}
Note, if x is positive then division is towards zero, and we should add 1 if reminder is not zero.
If x is negative then division is towards zero, that's what we need, and we will not add anything because x % y is not positive
How about this? (requires y non-negative, so don't use this in the rare case where y is a variable with no non-negativity guarantee)
q = (x > 0)? 1 + (x - 1)/y: (x / y);
I reduced y/y to one, eliminating the term x + y - 1 and with it any chance of overflow.
I avoid x - 1 wrapping around when x is an unsigned type and contains zero.
For signed x, negative and zero still combine into a single case.
Probably not a huge benefit on a modern general-purpose CPU, but this would be far faster in an embedded system than any of the other correct answers.
I would have rather commented but I don't have a high enough rep.
As far as I am aware, for positive arguments and a divisor which is a power of 2, this is the fastest way (tested in CUDA):
//example y=8
q = (x >> 3) + !!(x & 7);
For generic positive arguments only, I tend to do it like so:
q = x/y + !!(x % y);
This works for positive or negative numbers:
q = x / y + ((x % y != 0) ? !((x > 0) ^ (y > 0)) : 0);
If there is a remainder, checks to see if x and y are of the same sign and adds 1 accordingly.
simplified generic form,
int div_up(int n, int d) {
return n / d + (((n < 0) ^ (d > 0)) && (n % d));
} //i.e. +1 iff (not exact int && positive result)
For a more generic answer, C++ functions for integer division with well defined rounding strategy
For signed or unsigned integers.
q = x / y + !(((x < 0) != (y < 0)) || !(x % y));
For signed dividends and unsigned divisors.
q = x / y + !((x < 0) || !(x % y));
For unsigned dividends and signed divisors.
q = x / y + !((y < 0) || !(x % y));
For unsigned integers.
q = x / y + !!(x % y);
Zero divisor fails (as with a native operation). Cannot cause overflow.
Corresponding floored and modulo constexpr implementations here, along with templates to select the necessary overloads (as full optimization and to prevent mismatched sign comparison warnings):
https://github.com/libbitcoin/libbitcoin-system/wiki/Integer-Division-Unraveled
Compile with O3, The compiler performs optimization well.
q = x / y;
if (x % y) ++q;