calculating constant library functions at compile time - c++

I want to use the Boltzmann constant in my functions. I am using the following code to declare it:
const double boltzmann_constant = 1.3806503 * pow (10,-23);
Will this be calculated at compile time? If not, how can I ensure that it is? Is there any other way to declare the constant?

The pow() function is very unlikely to be calculated at compile time. However, the operation requested is directly expressible in scientific notation, a standard aspect of floating point numbers:
const double boltzmann_constant = 1.3806503e-23;
For a more complex situation, like sin(M_PI / 3), it can be useful to write a program to calculate and display such values so they can be edited into a program. If you do this, do everyone a favor and include a comment explaining what the constant is:
const double magic_val = 0.8660254037844385965883; // sin(M_PI / 3);

Related

How can I make this expression involving floating-point functions a compile-time constant?

I have a constant integer, steps, which is calculated using the floor function of the quotient of two other constant variables. However, when I attempt to use this as the length of an array, Visual Studio tells me it must be a constant value and the current value cannot be used as a constant. How do I make this a "true" constant that can be used as an array length? Is the floor function the problem, and is there an alternative I could use?
const int simlength = 3.154*pow(10,7);
const float timestep = 100;
const int steps = floor(simlength / timestep);
struct body bodies[bcount];
struct body {
    string name;
    double mass;
    double position[2];
    double velocity[2];
    double radius;
    double trace[2][steps];
};
It is not possible with the standard library's std::pow and std::floor functions, because they are not constexpr-qualified.
You can probably replace std::pow with a hand-written implementation my_pow that is marked constexpr. Since you are just trying to take the power of integers, that shouldn't be too hard. If you are only using powers of 10, floating point literals may be written in scientific notation as well, e.g. 1e7, which makes the pow call unnecessary.
The floor call is not needed since float/double to int conversion already does flooring implicitly. Or, more correctly, it truncates, which for non-negative values is equivalent to flooring.
Then you should also replace the const with constexpr in the variable declarations to make sure that the variables are usable in constant expressions:
constexpr int simlength = 3.154*my_pow(10,7); // or `3.154e7`
constexpr float timestep = 100;
constexpr int steps = simlength / timestep;
Theoretically only float requires this change, since there is a special exception for const integral types, but it seems more consistent this way.
Also, I have a feeling that there is something wrong with the types of your variables. A length and a step count should not be determined by floating-point operations and types, but by integer types and operations alone. Floating-point operations are not exact and introduce errors relative to the mathematically precise calculations on the real numbers. It is easy to get unexpected off-by-one or worse errors this way.
You cannot define an array of a class type before defining the class.
Solution: Define body before defining bodies.
Furthermore, you cannot use undefined names.
Solution: Define bcount before using it as the size of the array.
Is the floor function the problem, and is there an alternative I could use?
std::floor is one problem. There's an easy solution: Don't use it. Converting a floating point number to an integer performs a similar operation implicitly (the behaviour is different in the case of negative numbers).
std::pow is another problem. It cannot be replaced as trivially in general, but in this case we can use a floating point literal in scientific notation instead.
Lastly, a non-constexpr floating point variable isn't a compile-time constant. Solution: Use constexpr.
Here is a working solution:
constexpr int simlength = 3.154e7;
constexpr float timestep = 100;
constexpr int steps = simlength / timestep;
P.S. trace is a very large array. I would recommend against using such large member variables, because it's easy for the user of the class to not notice such a detail, and they are likely to create instances of the class in automatic storage. This is a problem because such large objects in automatic storage are prone to cause stack overflow errors. Using std::vector instead of an array is an easy solution. If you do use std::vector, then as a side effect the requirement of a compile-time constant size disappears and you will no longer have trouble using std::pow etc.
Because simlength is 3.154e7 and timestep is 1e2, the value of steps can be written as:
3.154e7 / 1e2 == 3.154e5
And, adding a type-cast, you should be able to write the array as:
double trace[2][(int)(3.154e5)];
Note that this is HIGHLY IRREGULAR, and should have extensive comments describing why you did this.
Try switching to constexpr:
constexpr int simlength = 3.154e7;
constexpr float timestep = 1e2;
constexpr int steps = simlength / timestep;
struct body {
    string name;
    double mass;
    double position[2];
    double velocity[2];
    double radius;
    double trace[2][steps];
};

Gecode: constraining integer variables using a float value

I use Gecode through its C++ API in a kind of learning context with positive and negative examples.
In this context I have two BoolVarArray: positive_bags_ and negative_bags_.
And what I want to do seems very simple: I want to constrain these bags with a minimal growth rate constraint based on a user parameter gmin.
Thereby, the constraint should look like: sum(positive_bags_) >= gmin * sum(negative_bags_).
It works using the rel function defined like this: rel(*this, sum(positive_bags_) >= gmin * sum(negative_bags_)), but my problem is that gmin is a float and rel casts it to an integer.
Therefore I can only constrain positive_bags_ to be 2, 3, ... times bigger than negative_bags_, but for my experiments I need to define gmin as 1.5, for example.
I checked the documentation and did not find a definition of linear that uses both Boolean/Integer and Float variables.
Is there some way to define this constraint using a float gmin?
Thanks in advance!
If your factor gmin can be expressed as a reasonably small rational n/d (3/2 in your example), then you could use
d * sum(positive_bags_) >= n * sum(negative_bags_)
as your constraint. If there is no small rational that is suitable, then you need to channel your variables to FloatVars and use the FloatVar linear constraint.
If implicit type-casting is an issue you can try:
(float) sum(positive_bags_) >= (gmin * (float) sum(negative_bags_))
Assuming gmin is a float.
Implicit casting will convert your float to an int. If you want to control what type of rounding you want to apply, wrap the result into <math.h>'s roundf or a rounding function of your choice depending on the type.

How to force pow(float, int) to return float

The overloaded function float pow(float base, int iexp) was removed in C++11 and now pow returns a double. In my program, I am computing lots of these (in single precision) and I am interested in the most efficient way to do it.
Is there some special function (in standard libraries or any other) with the above signature?
If not, is it better (in terms of performance in single precision) to explicitly cast result of pow into float before any other operations (which would cast everything else into double) or cast iexp into float and use overloaded function float pow(float base, float exp)?
EDIT: Why I need float and do not use double?
The primary reason is RAM: I need tens or hundreds of GB, so this reduction is a huge advantage. So I need to get a float from a float, and I need the most efficient way to achieve that (fewer casts, using already optimized algorithms, etc).
You could easily write your own fpow using exponentiation by squaring.
float my_fpow(float base, unsigned exp)
{
    float result = 1.f;
    while (exp)
    {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
Boring part:
This algorithm gives the best accuracy that can be achieved with the float type when |base| > 1.
Proof:
Suppose we want to calculate pow(a, n), where a is the base and n is the exponent.
Define $b_1 = a^1$, $b_2 = a^2$, $b_3 = a^4$, $b_4 = a^8$, and so on, so that $b_j = a^{2^{j-1}}$.
Then $a^n$ is the product of all $b_i$ for which the i-th bit of n is set.
So we have an ordered set $B = \{b_{k_1}, b_{k_2}, \ldots, b_{k_m}\}$, where for every j the bit $k_j$ is set in n.
The following obvious algorithm A can be used to minimize the rounding error:
1. If B contains a single element, then it is the result.
2. Pick the two elements p and q from B with the smallest absolute values.
3. Remove them from B.
4. Calculate the product s = p*q and put it into B.
5. Go to the first step.
Now, let's prove that the elements of B can simply be multiplied from left to right without losing accuracy. This follows from the fact that
$|b_j| > |b_1 b_2 \cdots b_{j-1}|$
because $b_j = b_{j-1} b_{j-1} = b_{j-1} b_{j-2} b_{j-2} = \cdots = b_{j-1} b_{j-2} \cdots b_1 b_1$.
Since $b_1 = a^1 = a$ and its absolute value is greater than one, it follows that
$|b_j| > |b_1 b_2 \cdots b_{j-1}|$.
Hence we may conclude that during multiplication from left to right the accumulator variable is smaller in absolute value than any remaining element of B.
Then the expression result *= base; (except in the very first iteration, of course) multiplies the two smallest numbers from B, so the rounding error is minimal. So the code implements algorithm A.
Another question that can only be honestly answered with "wrong question". Or at least: "Are you really willing to go there?". float theoretically needs ca. 80% less die space (for the same number of cycles) and so can be much cheaper for bulk processing. GPUs love float for this reason.
However, let's look at x86 (admittedly, you didn't say what architecture you're on, so I picked the most common). The price in die space has already been paid. You literally gain nothing by using float for calculations. Actually, you may even lose throughput because additional extensions from float to double are required, and additional rounding to intermediate float precision. In other words, you pay extra to have a less accurate result. This is typically something to avoid except maybe when you need maximum compatibility with some other program.
See Jens' comment as well. These options give the compiler permission to disregard some language rules to achieve higher performance. Needless to say this can sometimes backfire.
There are two scenarios where float might be more efficient, on x86:
GPU (including GPGPU), in fact many GPUs don't even support double and if they do, it's usually much slower. Yet, you will only notice when doing very many calculations of this sort.
CPU SIMD aka vectorization
You'd know if you did GPGPU. Explicit vectorization by using compiler intrinsics is also a choice – one you could make, for sure, but this requires quite a cost-benefit analysis. Possibly your compiler is able to auto-vectorize some loops, but this is usually limited to "obvious" applications, such as where you multiply each number in a vector<float> by another float, and this case is not so obvious IMO. Even if you pow each number in such a vector by the same int, the compiler may not be smart enough to vectorize this effectively, especially if pow resides in another translation unit, and without effective link time code generation.
If you are not ready to consider changing the whole structure of your program to allow effective use of SIMD (including GPGPU), and you're not on an architecture where float is indeed much cheaper by default, I suggest you stick with double by all means, and consider float at best a storage format that may be useful to conserve RAM, or to improve cache locality (when you have a lot of them). Even then, measuring is an excellent idea.
That said, you could try ivaigult's algorithm (only with double for the intermediate and for the result), which is related to a classical algorithm called Egyptian multiplication (and a variety of other names), only that the operands are multiplied and not added. I don't know how pow(double, double) works exactly, but it is conceivable that this algorithm could be faster in some cases. Again, you should be meticulous about benchmarking.
If you're targeting GCC you can try
float __builtin_powif(float, int)
I have no idea about its performance though.
Is there some special function (in standard libraries or any other) with the above signature?
Unfortunately, not that I know of.
But, as many have already mentioned benchmarking is necessary to understand if there is even an issue at all.
I've assembled a quick benchmark online. Benchmark code:
#include <iostream>
#include <vector>
#include <boost/timer/timer.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#include <cmath>
int main ()
{
    boost::random::mt19937 gen;
    boost::random::uniform_real_distribution<> dist(0, 10000000);
    const size_t size = 10000000;
    std::vector<float> bases(size);
    std::vector<float> fexp(size);
    std::vector<int> iexp(size);
    std::vector<float> res(size);
    for(size_t i=0; i<size; i++)
    {
        bases[i] = dist(gen);
        iexp[i] = std::floor(dist(gen));
        fexp[i] = iexp[i];
    }
    std::cout << "float pow(float, int):" << std::endl;
    {
        boost::timer::auto_cpu_timer timer;
        for(size_t i=0; i<size; i++)
            res[i] = std::pow(bases[i], iexp[i]);
    }
    std::cout << "float pow(float, float):" << std::endl;
    {
        boost::timer::auto_cpu_timer timer;
        for(size_t i=0; i<size; i++)
            res[i] = std::pow(bases[i], fexp[i]);
    }
    return 0;
}
Benchmark results (quick conclusions):
gcc: c++11 is consistently faster than c++03.
clang: indeed the int version of c++03 seems a little faster. I'm not sure if it is within the margin of error, since I only ran the benchmark online.
Both: even with c++11 calling pow with int seems to be a tad more performant.
It would be great if others could verify if this holds for their configurations as well.
Try using powf() instead. This is a C99 function that should also be available in C++11.

How to floor a number using the NTL library (C++)

I am building a C++ program to verify a mathematical conjecture for up to 100 billion iterations. In order to test such high numbers, I cannot use a C++ int, so I am using the NTL library, using the type ZZ as my number type.
My algorithm looks like this:
ZZ generateNthSeq(ZZ n)
{
    return floor(n*sqrt(2));
}
I have the two libraries being imported:
#include <cmath>
#include <NTL/ZZ.h>
But this does not compile; I get the error:
$ g++ deepness*.cpp
deepness.cpp: In function ‘NTL::ZZ generateNthSeq(NTL::ZZ)’:
deepness.cpp:41: error: no matching function for call to ‘floor(NTL::ZZ)’
/usr/include/bits/mathcalls.h:185: note: candidates are: double floor(double)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/cmath:262: note: long double std::floor(long double)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/cmath:258: note: float std::floor(float)
The error states that the floor operation cannot accept a ZZ class type. But I need the numbers to be pretty big. How can I accomplish what I want, which is to floor the result, while using the NTL library?
Note that it doesn't really make sense to apply floor to an integral type (well, it does, it's just a no-op). What you should be really worried about is the fact that your code is apparently passing something of type ZZ into floor!
That is, what can n * sqrt(2) possibly mean here?
Also, before even writing that, I'd've checked the documentation to see if integer * floating point actually exists in the library -- usually for that to be useful at all, you need arbitrary precision floating types available.
Checking through the headers, there is only one multiplication operator:
ZZ operator*(const ZZ& a, const ZZ& b);
and there is a conversion constructor:
explicit ZZ(long a); // promotion constructor
I can't figure out how your code is even compiling. Maybe you're using a different version of the library than I'm looking at, and the conversion constructor is implicit, and your double is getting "promoted" to a ZZ. This is surely not what you want, since promoting sqrt(2) to a ZZ is simply going to give you the integer 1.
You either need to:
look into whether or not NTL has arbitrary precision floating point capabilities
switch to a library that does have arbitrary precision floating point capabilities
convert your calculation to pure integer arithmetic
That last one is fairly easy here: you want
return SqrRoot(sqr(n) * 2); // sqr(n) will be a bit more efficient than `n * n`

Unexpected loss of precision when dividing doubles

I have a function getSlope which takes 4 doubles as parameters and returns another double calculated from these parameters in the following way:
double QSweep::getSlope(double a, double b, double c, double d){
    double slope;
    slope=(d-b)/(c-a);
    return slope;
}
The problem is that when calling this function with arguments for example:
getSlope(2.71156, -1.64161, 2.70413, -1.72219);
the returned result is:
10.8557
and this is not a good result for my computations.
I have calculated the slope using Mathematica and the result for the slope for the same parameters is:
10.8452
or with more digits for precision:
10.845222072678331.
The result returned by my program is not good in my further computations.
Moreover, I do not understand how the program returns 10.8557 instead of 10.845222072678331 (supposing that this is the approximate result of the division).
How can I get the good result for my division?
thank you in advance,
madalina
I print the result using the command line:
std::cout<<slope<<endl;
It may be that my parameters are not good, as I read them from another program (which computes a graph; after reading the parameters from this graph I just displayed them to see their values, but maybe the displayed values do not have the same precision as the internally calculated ones. I do not know; it is really strange. Some numerical errors appear.)
When the graph from which I am reading my parameters is computed, some numerical libraries written in C++ (with templates) are used. No OpenGL is used for this computation.
thank you,
madalina
I've tried with float instead of double and I get 10.845110 as a result. It still looks better than madalina's result.
EDIT:
I think I know why you get these results. If you get the a, b, c and d parameters from somewhere else and you print them, you see rounded values. Then if you put those into Mathematica (or calc ;) ) it will give you a different result.
I tried changing one of your parameters a little bit. When I did:
double c = 2.7041304;
I got 10.845806. I only added 0.0000004 to c!
So I think your "errors" aren't errors. Print a, b, c and d with better precision and then put them into Mathematica.
The following code:
#include <iostream>
using namespace std;
double getSlope(double a, double b, double c, double d){
    double slope;
    slope=(d-b)/(c-a);
    return slope;
}
int main( ) {
    double s = getSlope(2.71156, -1.64161, 2.70413, -1.72219);
    cout << s << endl;
}
gives a result of 10.8452 with g++. How are you printing out the result in your code?
Could it be that you use DirectX or OpenGL in your project? If so they can turn off double precision and you will get strange results.
You can check your precision settings with
std::sqrt(x) * std::sqrt(x)
The result has to be pretty close to x.
I ran into this problem a long time ago and spent a month checking all the formulas. But then I found
D3DCREATE_FPU_PRESERVE
The problem here is that (c-a) is small, so the rounding errors inherent in floating point operations are magnified in this example. A general solution is to rework your equation so that you're not dividing by a small number; I'm not sure how you would do it here though.
EDIT:
Neil is right in his comment to this question. I computed the answer in VB using Doubles and got the same answer as Mathematica.
The results you are getting are consistent with 32bit arithmetic. Without knowing more about your environment, it's not possible to advise what to do.
Assuming the code shown is what's running, ie you're not converting anything to strings or floats, then there isn't a fix within C++. It's outside of the code you've shown, and depends on the environment.
As Patrick McDonald and Treb both brought up the accuracy of your inputs and the error on a-c, I thought I'd take a look at that. One technique for examining rounding errors is interval arithmetic, which makes explicit the upper and lower bounds that a value represents (in floating point numbers they are implicit and fixed by the precision of the representation). By treating each value as a pair of bounds, and by widening the bounds by the error in the representation (approximately x * 2^-53 for a double value x), you get a result that gives lower and upper bounds on the accuracy of a value, taking into account worst-case precision errors.
For example, if you have a value in the range [1.0, 2.0] and subtract from it a value in the range [0.0, 1.0], then the result must lie in the range [below(0.0),above(2.0)] as the minimum result is 1.0-1.0 and the maximum is 2.0-0.0. below and above are equivalent to floor and ceiling, but for the next representable value rather than for integers.
Using intervals which represent worst-case double rounding:
getSlope(
a = [2.7115599999999995262:2.7115600000000004144],
b = [-1.6416099999999997916:-1.6416100000000002357],
c = [2.7041299999999997006:2.7041300000000005888],
d = [-1.7221899999999998876:-1.7221900000000003317])
(d-b) = [-0.080580000000000526206:-0.080579999999999665783]
(c-a) = [-0.0074300000000007129439:-0.0074299999999989383218]
to double precision [10.845222072677243474:10.845222072679954195]
So although c-a is small compared to c or a, it is still large compared to double rounding, so even with the worst imaginable double precision rounding you could trust that value to be precise to 12 figures: 10.8452220727. You've lost a few figures off double precision, but you're still working to more than your input's significance.
But if the inputs were only accurate to the number of significant figures shown, then rather than being the double value 2.71156 +/- eps, the input range would be [2.711555,2.711565], so you get the result:
getSlope(
a = [2.711555:2.711565],
b = [-1.641615:-1.641605],
c = [2.704125:2.704135],
d = [-1.722195:-1.722185])
(d-b) = [-0.08059:-0.08057]
(c-a) = [-0.00744:-0.00742]
to specified accuracy [10.82930108:10.86118598]
which is a much wider range.
But you would have to go out of your way to track the accuracy in the calculations, and the rounding errors inherent in floating point are not significant in this example - it's precise to 12 figures with the worst case double precision rounding.
On the other hand, if your inputs are only known to 6 figures, it doesn't actually matter whether you get 10.8557 or 10.8452. Both are within [10.82930108:10.86118598].
Better print out the arguments, too. When you are, as I guess, transferring parameters in decimal notation, you will lose precision for each and every one of them. The problem is that 1/5 is an infinite series in binary, so e.g. 0.2 becomes 0.001100110011.... Also, digits are chopped when converting a binary float to a textual representation in decimal.
Next to that, sometimes the compiler chooses speed over precision. This should be a documented compiler switch.
Patrick seems to be right about (c-a) being the main cause:
d-b = -1,72219 - (-1,64161) = -0,08058
c-a = 2,70413 - 2,71156 = -0,00743
S = (d-b)/(c-a)= -0,08058 / -0,00743 = 10,845222
You start out with six digits of precision; the subtraction reduces that to three and four digits. My best guess is that you lose additional precision because the number -0,00743 cannot be represented exactly in a double. Try using intermediate variables with a bigger precision, like this:
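double QSweep::getSlope(double a, double b, double c, double d)
{
    // Widen one operand first so that each subtraction is carried out in
    // long double; without the cast, (d-b) would be computed in double
    // and only widened afterwards.
    long double temp1 = static_cast<long double>(d) - b;
    long double temp2 = static_cast<long double>(c) - a;
    // The quotient is narrowed back to double on return.
    return static_cast<double>(temp1 / temp2);
}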
While the academic discussion going on is great for learning about the limitations of programming languages, you may find the simplest solution to the problem is a data structure for arbitrary precision arithmetic.
This will have some overhead, but you should be able to find something with fairly guaranteeable accuracy.