I have a cross-platform audio application, so it uses sine waves a lot, along with std::sin() and other trigonometric functions.
I noticed that on the iOS platform in particular, the precision of std::sin() appears to be extremely poor. I wrote the following test:
void TestSineZeroCrossings()
{
const static float kTwoPi = 6.28318530718f;
const static float epsilon = 1e-5f;
for (int ii = 0; ii < 10000; ++ii)
{
const float difference = std::abs(std::sin(kTwoPi * static_cast<float>(ii)));
if (difference > epsilon)
printf("Zero crossing fail, difference: %f\n", difference);
}
}
On Windows and Mac OS X this passes (i.e. no print-outs), but on iOS it fails on pretty much every iteration. In fact, it only succeeds with an epsilon > 0.004f. That results in clearly audible noise in my application.
Is there a way to tell the compiler to use a better implementation that's not as lossy?
I would assume the implementation is quite accurate.
Your actual problem is that kTwoPi * static_cast<float>(ii) gets rounded to the nearest float. E.g., for ii=10000 the value is (if I did not miscalculate): 62831.8515625
If you subtract 10000*2*pi in exact math from that, you get approximately -0.001509, and the sine of that value is approximately the same (and not 0). It is "relatively" close to zero but far away from your desired 1e-5 "accuracy".
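The size of this argument error can be measured directly. The following is a minimal sketch, assuming IEEE-754 single and double precision; the helper name float_argument_error is hypothetical and only serves the illustration. It compares the float product against the same product evaluated in double precision.

```cpp
#include <cmath>

// Hypothetical helper: difference between the float argument kTwoPi * ii
// and a double-precision reference for ii * 2 * pi.
double float_argument_error(int ii)
{
    const float kTwoPi = 6.28318530718f;
    const float arg = kTwoPi * static_cast<float>(ii);                     // rounded to float
    const double reference = 6.283185307179586 * static_cast<double>(ii);  // double reference
    return static_cast<double>(arg) - reference;
}

// Sine of the float argument, evaluated in double so that any deviation
// from zero comes from the argument and not from the sine routine itself.
double sine_at_float_argument(int ii)
{
    const float kTwoPi = 6.28318530718f;
    const float arg = kTwoPi * static_cast<float>(ii);
    return std::sin(static_cast<double>(arg));
}
```

For small arguments sin(x) is approximately x, so the value returned by sine_at_float_argument should be close to the argument error itself, which is why the test in the question fails once ii gets large.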
If you want to have more accurate values for sin(x*pi), have a look at boost::math::sin_pi:
https://www.boost.org/doc/libs/1_69_0/libs/math/doc/html/math_toolkit/powers/sin_pi.html
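If Boost is not available, the same idea (reduce the phase before scaling it to radians) can be sketched by hand. This is not the Boost implementation; the helper name sine_of_turns and the choice of "turns" as the unit (1 turn = 2*pi radians) are assumptions made for this example.

```cpp
#include <cmath>

// Hypothetical helper (not part of Boost): sine of a phase given in turns.
// Whole turns are removed before multiplying by 2*pi, so the argument passed
// to std::sin stays within one period and keeps its float resolution.
float sine_of_turns(float turns)
{
    const float kTwoPi = 6.28318530718f;
    const float fraction = turns - std::floor(turns); // fractional part of the phase
    return std::sin(kTwoPi * fraction);
}
```

With this formulation the zero-crossing test from the question would call sine_of_turns(static_cast<float>(ii)), and every whole number of turns maps to an argument of exactly zero.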
If you want more precision, use double or long double rather than float.
For instance,
replace
const static float kTwoPi = 6.28318530718f;
const static float epsilon = 10e-6f;
with
const static double kTwoPi = 6.28318530718;
const static double epsilon = 10e-6;
and
const float difference = std::abs(std::sin(kTwoPi * static_cast<float>(ii)));
with
const double difference = std::abs(std::sin(kTwoPi * ii));
At the risk of repetition, your problem is obviously the use of float rather than double or long double.
You could verify this by doing
cout << kTwoPi << endl ;
and seeing how many digits get printed out and how they compare to your original value.
const static float kTwoPi = 6.28318530718f;
is roughly equivalent to
const static float kTwoPi = 6.283185 ;
on many (most?) systems. Your epsilon is far too small for a single-precision value at these magnitudes: float carries only about 7 significant decimal digits, which is too little precision for many applications.
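The digit count can also be checked programmatically. A small sketch, assuming IEEE-754 single precision; the names kFloatDigits and two_pi_storage_error are hypothetical.

```cpp
#include <limits>

// Number of decimal digits a float is guaranteed to preserve
// (6 for IEEE-754 single precision).
constexpr int kFloatDigits = std::numeric_limits<float>::digits10;

// Hypothetical helper: how far the stored float constant is from the
// decimal literal that was written in the source code.
double two_pi_storage_error()
{
    const float kTwoPi = 6.28318530718f;
    return static_cast<double>(kTwoPi) - 6.28318530718;
}
```

The storage error is on the order of 1e-7, so the digits of the literal beyond roughly the seventh significant one are not retained.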
As a premise, I am aware that this problem has been addressed already, but never in this specific scenario, from what I could find searching.
In a time-critical piece of code, I have a loop where a float value x must grow linearly from exactly 0 to-and-including exactly 1 in 'z' steps.
The un-optimized solution, but which would work without rounding errors, is:
const int z = (some number);
int c;
float x;
for(c=0; c<z; c++)
{
x = (float)c/(float)(z-1);
// do something with x here
}
Obviously, I can avoid the float conversions by using two loop variables and caching (float)(z-1):
const int z = (some number);
int c;
float xi,x;
const float fzm1 = (float)(z-1);
for(c=0,xi=0.f; c<z; c++, xi+=1.f)
{
x=xi/fzm1;
// do something with x
}
But who would ever repeat a division by a constant on every loop pass? Obviously anyone would turn it into a multiplication:
const int z = (some number);
int c;
float xi,x;
const float invzm1 = 1.f/(float)(z-1);
for(c=0,xi=0.f; c<z; c++, xi+=1.f)
{
x=xi * invzm1;
// do something with x
}
Here is where obvious rounding issues may start to manifest.
For some integer values of z, (z-1)*(1.f/(float)(z-1)) won't give exactly one but 0.999999..., so the value assumed by x in the last loop cycle won't be exactly one.
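Which values of z are affected can be found by brute force. A minimal sketch, assuming IEEE-754 single precision with round-to-nearest; both helper names are hypothetical.

```cpp
// Hypothetical helper: the value x takes in the last iteration (c == z-1)
// when the division is replaced by a multiplication with the reciprocal.
float last_x_with_reciprocal(int z)
{
    const float last = static_cast<float>(z - 1);
    const float invzm1 = 1.f / last;
    const float x = last * invzm1;
    return x;
}

// Hypothetical helper: counts the z in [2, max_z] whose last x is not exactly 1.
int count_inexact_endpoints(int max_z)
{
    int count = 0;
    for (int z = 2; z <= max_z; ++z)
        if (last_x_with_reciprocal(z) != 1.f)
            ++count;
    return count;
}
```

Whenever z-1 is a power of two the reciprocal is exact and the endpoint is exactly 1; for other values the result depends on how 1/(z-1) happens to round.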
If I use an adder instead, i.e.
const int z = (some number);
int c;
float x;
const float x_adder = 1.f/(float)(z-1);
for(c=0,x=0.f; c<z; c++, x+=x_adder)
{
// do something with x
}
the situation is even worse, because the error in x_adder will build up.
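The accumulated drift can be inspected with a small sketch like the following; the helper name last_x_with_adder is hypothetical, and the function only reproduces the adder loop above without any loop body.

```cpp
#include <cmath>

// Hypothetical helper: value of x at the start of the last iteration
// (c == z-1) when x is advanced by repeatedly adding x_adder.
float last_x_with_adder(int z)
{
    const float x_adder = 1.f / static_cast<float>(z - 1);
    float x = 0.f;
    for (int c = 0; c < z - 1; ++c)
        x += x_adder; // each addition may add a small rounding error
    return x;
}
```

Comparing last_x_with_adder(z) against 1.f for the z values of interest shows how far the endpoint drifts; the deviation generally grows with the number of additions.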
So the only solution I can see is using a conditional somewhere, like:
const int z = (some number);
int c;
float xi,x;
const float invzm1 = 1.f/(float)(z-1);
for(c=0,xi=0.f; c<z; c++, xi+=1.f)
{
x = (c==z-1) ? 1.f : xi * invzm1;
// do something with x
}
but in a time-critical loop a branch should be avoided if possible!
Oh, and I can't even split the loop and do
for(c=0,xi=0.f; c<z-1; c++, xi+=1.f) // note: loop runs now up to and including z-2
{
x=xi * invzm1;
// do something with x
}
x=1.f;
// do something with x
because I would have to replicate the whole 'do something with x' block, which is neither short nor simple. I cannot turn it into a function call (that would be inefficient, with too many local variables to pass), nor do I want to use #defines (that would be poor, inelegant and impractical).
Can you think of an efficient or smart solution to this problem?
First, a general consideration: xi += 1.f introduces a loop-carried dependency chain of however many cycles your CPU needs for floating point addition (probably 3 or 4). It also kills any attempt at vectorization unless you compile with -ffast-math. If you run on a modern super-scalar desktop CPU, I recommend using an integer counter and converting to float in each iteration.
In my opinion, avoiding int->float conversions is outdated advice from the era of x87 FPUs. Of course you have to consider the entire loop for the final verdict, but the throughput is generally comparable to floating point addition.
For the actual problem, we may look at what others have done, for example Eigen in the implementation of their LinSpaced operation. There is also a rather extensive discussion in their bug tracker.
Their final solution is so simple that I think it is okay to paraphrase it here, simplified for your specific case:
float step = 1.f / (n - 1);
for(int i = 0; i < n; ++i)
float x = (i + 1 == n) ? 1.f : i * step;
The compiler may choose to peel off the last iteration to get rid of the branch but in general it is not too bad anyway. In scalar code branch prediction will work well. In vectorized code it's a packed compare and a blend instruction.
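As a self-contained illustration of the ternary approach, here is a sketch that fills a vector with n samples. The function name linspace01 and the use of std::vector are assumptions for this example, not Eigen's API, and n >= 2 is assumed.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the ternary approach: n samples from exactly 0 to exactly 1.
// Assumes n >= 2; the function name is an illustrative assumption.
std::vector<float> linspace01(int n)
{
    std::vector<float> out(static_cast<std::size_t>(n));
    const float step = 1.f / static_cast<float>(n - 1);
    for (int i = 0; i < n; ++i)
    {
        // Integer counter converted per iteration; the last sample is forced to 1.
        out[static_cast<std::size_t>(i)] =
            (i + 1 == n) ? 1.f : static_cast<float>(i) * step;
    }
    return out;
}
```

The first element is exactly 0 because 0 * step is exact, and the last element is exactly 1 because it never goes through the multiplication.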
We may also force the decision to peel off the last iteration by restructuring the code appropriately. Lambdas are very helpful for this since they are a) convenient to use and b) very strongly inlined.
auto loop_body = [&](int i, float x) mutable {
...;
};
for(int i = 0; i < n - 1; ++i)
loop_body(i, i * step);
if(n > 0)
loop_body(n - 1, 1.f);
Checking with Godbolt (using a simple array initialization for the loop body), GCC only vectorizes the second version. Clang vectorizes both but does a better job with the second.
What you need is Bresenham's line algorithm.
It would allow you to avoid multiplication and divisions and use add/sub only. Just scale your range so that it could be represented by integer numbers and round up at final stage if precise split to parts is mathematically (or "representatively") impossible.
Consider using this:
const int z = (some number > 1);
const int step = 1000000/(z-1);
int x = 0; // x is scaled by 1000000
for(int c=0; c<z-1; ++c)
{
// do something with x; only if you really need the conversion, divide it by 1000000
x += step;
}
x = 1000000; // exactly 1.0 after scaling
//do the last step with x
No conversions if you don't really need it, first and last values are as expected, multiplication is reduced to accumulation.
By changing 1000000 you can manually control the precision.
I suggest that you start with the last alternative you have shown and use lambda to avoid passing local variables:
auto do_something_with_x = [&](float x){/*...*/};
for(c=0,xi=0.f; c<z-1; c++, xi+=1.f) // note: loop runs now up to and including z-2
{
x=xi * invzm1;
do_something_with_x(x);
}
do_something_with_x(1.f);
For signal processing I need to compute relatively large C arrays as shown in the code below. This works fine so far; unfortunately, the implementation is slow. The size of "calibdata" is around 150k, and the result needs to be calculated for different frequencies/phases. Is there a way to improve speed significantly? Doing the same with logical indexing in MATLAB is way faster.
What I tried already:
using a Taylor approximation of sine: no significant improvement.
using std::vector: also no significant improvement.
code:
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result = 0;
for (int i = 0; i < size; i++)
result += calibdata[i] * cos((2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180) - (PI / 2)));
result = fabs(result / size);
return result;}
Best regards,
Thomas
When optimizing code for speed, step 1 is to enable compiler optimizations. I hope you've done that already.
Step 2 is to profile the code and see exactly how the time is being spent. Without profiling, you're just guessing, and you could end up trying to optimize the wrong thing.
For example, your guess seems to be that the cos function is the bottleneck. But the other possibility is that the calculation of the angle is the bottleneck. Here's how I would refactor the code to reduce the time spent calculating the angle.
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier)
{
double result = 0;
double angle = phase * (PI / 180) - (PI / 2);
double delta = 2 * PI * freqscale[currentcarrier] / fs;
for (int i = 0; i < size; i++)
{
result += calibdata[i] * cos( angle );
angle += delta;
}
return fabs(result / size);
}
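Whether a refactor like this pays off is best measured rather than guessed. The following is a minimal timing sketch based on std::chrono; it is not a substitute for a real profiler, and the helper name average_milliseconds is an assumption for illustration.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `repeats` times and returns the average
// wall-clock time per run in milliseconds.
template <typename Work>
double average_milliseconds(Work&& work, int repeats)
{
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        work();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

A call could look like average_milliseconds([&] { sink = phase_func(calibdata, size, freqscale, fs, phase, carrier); }, 20), where sink is a variable that is used afterwards so the optimizer cannot discard the computation.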
Okay, I'm probably going to get flogged for this answer, but I would use the GPU for this. Because your array doesn't appear to be self-referential, the best speedup you're going to get for large arrays is through parallelization... by far. I don't use MATLAB, but I just did a quick search for GPU utilization on the MathWorks site:
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html?requestedDomain=www.mathworks.com
Outside of MATLAB you could use OpenCL or CUDA yourself.
Your enemies in execution time are:
Division
Function calls (including implicit ones in loops)
Accessing data from different areas
Operating dissimilar instructions
You should research data-oriented design and how to use the data cache effectively.
Division
Whether implemented in hardware or in software, division takes a long time by its very nature. Eliminate it if possible by changing the numeric base or by factoring it out of the loop.
Function Calls
The most efficient method of execution is sequential. Processors are optimized for this. A branch may require the processor to perform some additional calculation (branch prediction) or to reload the instruction cache / pipeline, which wastes time that could be spent executing data instructions.
The optimization for this is to use techniques like loop unrolling and inlining of small functions. Also reduce the quantity of branches by simplifying expressions and using Boolean algebra.
Accessing data from different areas
Modern processors are optimized to operate on local data (data in one area). One example is loading an internal cache with data. Specifically, loading a cache line with data. For example, if the data from your arrays is in one location and the cosine data in another, this may cause the data cache to be reloaded, again wasting time.
A better solution is to place all data contiguously or to contiguously access all the data. Rather than making many discontiguous accesses to the cosine table, look up a batch of cosine values sequentially (without any other data accesses between).
Dissimilar Instructions
Modern processors are more efficient at processing a batch of similar instructions. For example the pattern load, add, store is more efficient for blocks when all the loading is performed, then all adding, then all storing.
Summary
Here's an example:
register double result = 0.0;
register unsigned int i = 0U;
for (i = 0; i < size; i += 2)
{
register double cos_angle1 = /* ... */;
register double cos_angle2 = /* ... */;
result += calibdata[i + 0] * cos_angle1;
result += calibdata[i + 1] * cos_angle2;
}
The above loop is unrolled and like operations are performed in groups.
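Although the register keyword is deprecated in C++11 and removed in C++17 (it is still valid in C), it is only a suggestion to the compiler to keep the variable in a dedicated register (if possible); modern compilers mostly ignore it.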
You can try to use the definition of cosine based on the complex exponential:
cos(x) = (exp(j*x) + exp(-j*x)) / 2
where j^2=-1.
Store exp((2 * PI*freqscale[currentcarrier] / fs)*j) and exp(phase*j). Evaluating cos(...) then reduces to a couple of products and additions in the for loops, and sin(), cos() and exp() are only called a couple of times.
Here goes the implementation:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h>
#define PI 3.141592653589
typedef struct cos_plan{
double complex* expo;
int size;
}cos_plan;
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result=0; //initialization
for (int i = 0; i < size; i++){
result += calibdata[i] * cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) );
//printf("i %d cos %g\n",i,cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) ));
}
result = fabs(result / size);
return result;
}
double phase_func2(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier, cos_plan* plan){
//first, let's compute the exponentials:
//double complex phaseexp=cos(phase*(PI / 180.) - (PI / 2.))+sin(phase*(PI / 180.) - (PI / 2.))*I;
//double complex phaseexpm=conj(phaseexp);
double phasesin=sin(phase*(PI / 180.) - (PI / 2.));
double phasecos=cos(phase*(PI / 180.) - (PI / 2.));
if (plan->size<size){
double complex *tmp=realloc(plan->expo,size*sizeof(double complex));
if(tmp==NULL){fprintf(stderr,"realloc failed\n");exit(1);}
plan->expo=tmp;
plan->size=size;
}
plan->expo[0]=1;
//plan->expo[1]=exp(2 *I* PI*freqscale[currentcarrier]/fs);
plan->expo[1]=cos(2 * PI*freqscale[currentcarrier]/fs)+sin(2 * PI*freqscale[currentcarrier]/fs)*I;
//printf("%g %g\n",creall(plan->expo[1]),cimagl(plan->expo[1]));
for(int i=2;i<size;i++){
if(i%2==0){
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2];
}else{
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2+1];
}
}
//computing the result
double result=0; //initialization
for(int i=0;i<size;i++){
//double coss=0.5*creall(plan->expo[i]*phaseexp+conj(plan->expo[i])*phaseexpm);
double coss=creal(plan->expo[i])*phasecos-cimag(plan->expo[i])*phasesin;
//printf("i %d cos %g\n",i,coss);
result+=calibdata[i] *coss;
}
result = fabs(result / size);
return result;
}
int main(){
//the parameters
long n=100000000;
double* calibdata=malloc(n*sizeof(double));
if(calibdata==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int freqnb=42;
double* freqscale=malloc(freqnb*sizeof(double));
if(freqscale==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
for (int i = 0; i < freqnb; i++){
freqscale[i]=i*i*0.007+i;
}
double fs=n;
double phase=0.05;
//populate calibdata
for (int i = 0; i < n; i++){
calibdata[i]=i/((double)n);
calibdata[i]=calibdata[i]*calibdata[i]-calibdata[i]+0.007/(calibdata[i]+3.0);
}
//call to sample code
clock_t t;
t = clock();
double res=phase_func(calibdata,n, freqscale, fs, phase, 13);
t = clock() - t;
printf("first call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//initialize
cos_plan plan;
plan.expo=malloc(n*sizeof(double complex));
plan.size=n;
t = clock();
res=phase_func2(calibdata,n, freqscale, fs, phase, 13,&plan);
t = clock() - t;
printf("second call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//cleaning
free(plan.expo);
free(calibdata);
free(freqscale);
return 0;
}
Compile with gcc main.c -o main -std=c99 -lm -Wall -O3. Using the code you provided, it takes 8 seconds with size=100000000 on my computer, while the proposed solution takes 1.5 seconds. That is not hugely impressive, but it is not negligible.
The solution that is presented does not involve any call to cos or sin in the for loops; there are only multiplications and additions. The bottleneck is either the memory bandwidth or the tests and memory accesses in the exponentiation by squaring (most likely the former, since I had to use an additional array of complex numbers).
For complex number in c, see:
How to work with complex numbers in C?
Computing e^(-j) in C
If the problem is memory bandwidth, then parallelism is required... and directly computing cos would be easier. Additional simplifications could have been performed if freqscale[currentcarrier] / fs were an integer. Your problem is really close to the computation of the Discrete Cosine Transform, the present trick is close to the Discrete Fourier Transform, and the FFTW library is really good at computing these transforms.
Notice that the present code can produce inaccurate results due to loss of significance: result can be much larger than cos(...)*calibdata[] when size is large. Using partial sums can resolve the issue.
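One way to read "partial sums" is block-wise accumulation: each block is summed into a small local accumulator before being added to the total. The sketch below sums a plain array to keep it short; in phase_func2 the term data[i] would be the product calibdata[i]*coss. The function name blocked_sum and the block parameter are assumptions, and block must be greater than zero.

```cpp
#include <cstddef>

// Sketch of block-wise partial sums; `block` is a hypothetical tuning
// parameter and must be greater than zero.
double blocked_sum(const double* data, std::size_t size, std::size_t block)
{
    double total = 0.0;
    for (std::size_t start = 0; start < size; start += block)
    {
        const std::size_t stop = (size - start < block) ? size : start + block;
        double partial = 0.0; // stays comparable in magnitude to the terms added to it
        for (std::size_t i = start; i < stop; ++i)
            partial += data[i];
        total += partial;
    }
    return total;
}
```

Because each term is added to a partial sum of similar magnitude, fewer low-order bits tend to be lost than when every term is added directly to one large running total.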
Simple trig identity to eliminate the - (PI / 2). This is also more accurate than attempting the subtraction which uses machine_PI. This is important when values are near π/2.
cosine(x - π/2) == sine(x)
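A quick numerical check of the phase-shift identity cos(x - π/2) = sin(x) can guard against sign mistakes when applying it. This is only a sketch; the helper name phase_shift_identity_gap is hypothetical.

```cpp
#include <cmath>

// Hypothetical helper: difference between the two sides of
// cos(x - pi/2) == sin(x), evaluated in double precision.
double phase_shift_identity_gap(double x)
{
    const double half_pi = 1.5707963267948966; // double nearest to pi/2
    return std::cos(x - half_pi) - std::sin(x);
}
```

The gap should be within a few units of rounding error for any moderate x. Since phase_func returns fabs(result / size), an overall sign flip of the sum would not change its return value.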
Use of const and restrict: Good compilers can perform more optimizations with this knowledge. (See also @user3528438)
// double phase_func(double* calibdata, long size,
// double* freqscale, double fs, double phase, int currentcarrier) {
double phase_func(const double* restrict calibdata, long size,
const double* restrict freqscale, double fs, double phase, int currentcarrier) {
Some platforms perform faster calculations with float vs double with a tolerable loss of precision. YMMV. Profile code both ways.
// result += calibdata[i] * cos(...
result += calibdata[i] * cosf(...
Minimize recalculations.
double angle_delta = ...;
double angle_current = ...;
for (int i = 0; i < size; i++) {
result += calibdata[i] * cos(angle_current);
angle_current += angle_delta;
}
Unclear why the code uses long size and int currentcarrier. I'd expect the same type and to use type size_t. This is idiomatic for array indexing. @Daniel Jour
Reversing loops can allow a compare to 0 rather than a compare to a variable. Sometimes a modest performance gain.
Ensure compiler optimizations are well enabled.
All together
double phase_func2(const double* restrict calibdata, size_t size,
const double* restrict freqscale, double fs, double phase,
size_t currentcarrier) {
double result = 0.0;
double angle_delta = 2.0 * PI * freqscale[currentcarrier] / fs;
double angle_current = angle_delta * (size - 1) + phase * (PI / 180);
size_t i = size;
while (i) {
result += calibdata[--i] * sinf(angle_current); // cos(x - PI/2) == sin(x)
angle_current -= angle_delta;
}
result = fabs(result / size);
return result;
}
To leverage the cores you have without resorting to the GPU, use OpenMP. Testing with VS2015, the invariants are lifted out of the loop by the optimizer. I enabled AVX2 and OpenMP:
double phase_func3(double* calibdata, const int size, const double* freqscale,
const double fs, const double phase, const size_t currentcarrier)
{
double result{};
constexpr double PI = 3.141592653589;
#pragma omp parallel
#pragma omp for reduction(+: result)
for (int i = 0; i < size; ++i) {
result += calibdata[i] *
cos( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.0) - (PI / 2.0)));
}
result = fabs(result / size);
return result;
}
The original version with AVX enabled took: ~1.4 seconds
and adding OpenMP brought it down to: ~0.51 seconds.
Pretty nice return for two pragmas and a compiler switch.
tl;dr: double b=a-(size_t)(a) faster than double b=a-trunc(a)
I am implementing a rotation function for an image and I noticed that the trunc function seems to be awfully slow.
This is the code that loops over the image; the actual pixel assignment is commented out for the performance test, so I don't even access the pixels.
double sina(sin(angle)), cosa(cos(angle));
int h = (int) (_in->h*cosa + _in->w*sina);
int w = (int) (_in->w*cosa + _in->h*sina);
int offsetx = (int)(_in->h*sina);
SDL_Surface* out = SDL_CreateARGBSurface(w, h); //wrapper over SDL_CreateRGBSurface
SDL_FillRect(out, NULL, 0x0);//transparent black
for (int y = 0; y < _in->h; y++)
for (int x = 0; x < _in->w; x++){
//calculate the new position
const double destY = y*cosa + x*sina;
const double destX = x*cosa - y*sina + offsetx;
So here is the code using trunc
size_t tDestX = (size_t) trunc(destX);
size_t tDestY = (size_t) trunc(destY);
double left = destX - trunc(destX);
double top = destY - trunc(destY);
And here is the faster equivalent
size_t tDestX = (size_t)(destX);
size_t tDestY = (size_t)(destY);
double left = destX - tDestX;
double top = destY - tDestY;
The answers suggest not to use trunc when converting back to integral so I also tried that case:
size_t tDestX = (size_t) (destX);
size_t tDestY = (size_t) (destY);
double left = destX - trunc(destX);
double top = destY - trunc(destY);
The fast version seems to take an average of 30ms to go through the full image (2048x1200) while the slow version using trunc takes about 135ms for the same image. The version with only two calls to trunc is still much slower than the one without (about 100ms).
As far as I understand the C++ rules, both expressions should always return the same thing. Am I missing something here? destX and destY are declared const, so only one call to the trunc function should be needed, and even then that by itself would not explain the more than threefold slowdown.
I'm compiling with Visual Studio 2013 with optimizations (/O2). Is there any reason to use the trunc function at all? Even for getting the fractional part using an integer seems to be faster.
The way you're using it, there's no reason for you to use the trunc function at all. It transforms a double into a double, which you then cast into an integral and throw away. The fact that the alternative is faster, is not that surprising.
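The two ways of obtaining the fractional part can be put side by side. A minimal sketch with hypothetical helper names; note the assumption that the cast version only receives non-negative values that fit in size_t, since converting an out-of-range double to an integer type is undefined behavior.

```cpp
#include <cmath>
#include <cstddef>

// Fractional part via an integer cast; assumes 0 <= v and v fits in size_t.
double fraction_by_cast(double v)
{
    const std::size_t whole = static_cast<std::size_t>(v);
    return v - static_cast<double>(whole);
}

// Fractional part via std::trunc; valid for any finite double.
double fraction_by_trunc(double v)
{
    return v - std::trunc(v);
}
```

For pixel coordinates inside the image both functions return the same value, so the cast version is a reasonable choice there; std::trunc is the safer option when negative or very large inputs are possible.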
On modern x86 CPUs, int <-> float conversions are quite fast: typically inline SSE code is generated for the conversion, and the cost is of the order of a few instruction cycles. [1]
For trunc, however, a function call is required, and the function call overhead alone is almost certainly greater than the cost of an inline float -> int conversion. Furthermore, the trunc function itself may be relatively costly: it has to be fully IEEE-754 compliant, so the full range of floating point values has to be dealt with correctly, as do edge cases such as NaN, INF, denorms, values which are out of range, etc. So overall I would expect the cost of trunc to be of the order of tens of instruction cycles, i.e. an order of magnitude or so greater than the cost of an inline float -> int conversion.
[1] Note that float <-> int conversions are not always inexpensive: other CPU families, and even older x86 CPUs, may not have ISA support for such conversions, in which case a library function will normally be used, and the cost of this would be similar to that of trunc. Modern x86 CPUs are a special case in this regard.
I've been working on implementing black-body radiation according to Planck's law with the following:
double BlackBody(double T, double wavelength) {
wavelength /= 1e9; // pre-scale wavelength to meters
static const double h = 6.62606957e-34; // Planck constant
static const double c = 299792458.0; // speed of light in vacuum
static const double k = 1.3806488e-23; // Boltzmann constant
double exparg = h*c / (k*wavelength*T);
double exppart = std::exp(exparg) - 1.0;
double constpart = (2.0*h*c*c);
double powpart = pow(wavelength, -5.0);
double v = constpart * powpart / exppart;
return v;
}
I have a float[max-min+1] array, where static const int max=780, static const int min = 380. I simply iterate over the array, and put in what the BlackBody gives for the wavelength (wavelength = array-index + min). The IntensitySpectrum::BlackBody performs this iteration, while both min and max are static member vars, and the array is also inside IntensitySpectrum.
IntensitySpectrum spectrum;
Vec3 rgb = spectrum.ToRGB();
rgb /= std::max(rgb.x, std::max(rgb.y, rgb.z));
for (int xc = 0; xc < grapher.GetWidth(); xc++) {
if (xc % 10 == 0) {
spectrum.BlackBody(200.f + xc * 200.f);
spectrum.Scale(1.0f / 1e+14f);
rgb = spectrum.ToRGB();
rgb /= std::max(rgb.x, std::max(rgb.y, rgb.z));
}
for (int yc = 20; yc < 40; yc++) {
grapher(xc, yc) = grapher.FloatToUint(rgb.x, rgb.y, rgb.z);
}
}
The problem is that the line spectrum.BlackBody() sets the 0th element of the array to NaN, and only the 0th. It also does not happen on the very first iteration, only on the following ones where xc>=10.
The text from the VS debugger:
spectrum = {intensity=0x009bec50 {-1.#IND0000, 520718784., 537559104., 554832896., 572547904., 590712128., 609333504., ...} }
I tracked the error down: exppart in the ::BlackBody() function becomes NaN. Basically exp() returns NaN even though its argument is near 2.0, so it is definitely not an overflow. But this happens only for array index 0; it magically starts working for the remaining 400 indices.
I know memory overruns might cause things like that. That's why I double checked my memory handling.
I'm linking Vec3 from another self-made library, which is much bigger, and might contain errors, but what I use from Vec3 has nothing to do with memory.
After many hours I'm completely clueless. What else can cause this? Is the optimizer or WINAPI fooling me...? (Uhm, yes, the program creates a window, with WINAPI, and uses a nearly empty WndProc that calls my code on WM_PAINT.)
Thanks for your help in advance.
Sorry for making it unclear. This is the layout:
// member
class IntensitySpectrum {
public:
void BlackBody(float temperature) {
// ...
this->intensity[i] = ::BlackBody(temperature, wavelength(i));
// ...
}
private:
static const int min = 380;
static const int max = 780;
float intensity[max-min+1];
};
// global
double BlackBody(double T, double wavelength);
If you happen to be using MSVC 2013, one possible explanation is that you have some code somewhere that is trying to convert a float infinity to int. A bug in MSVC 2013 causes an unbalanced push on the x87 FPU stack when this happens. Trigger that bug 8 times and your FPU stack is totally full, and any subsequent attempt to push a value (such as calling 'exp()') will result in an 'invalid operation' and return an indefinite (like 1.#IND). Note that even if you are compiling with SSE2 floating point instructions, this bug can still bite because the calling convention dictates that floating point return values are returned on the top of the FPU stack.
To check if this is your issue, have a look at your FPU registers just prior to the bad call to 'exp()'. If your TAGS register is all zero, then your FPU stack is full.
http://connect.microsoft.com/VisualStudio/feedback/details/806362/vc12-pollutes-the-floating-point-stack-when-casting-infinity-nan-to-unsigned-long
MS claims this will be fixed in update 2 for MSVC 2013.
The following function call only has 1 parameter:
spectrum.BlackBody(200.f + xc * 200.f);
So it cannot be calling the function you defined as
double BlackBody(double T, double wavelength)
If you look at the ::BlackBody implementation, I'm betting you have a divide by 0 error somewhere.
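To narrow down which intermediate value goes wrong, the function can be instrumented with finiteness checks. The following is a sketch of a guarded variant of the BlackBody function from the question; the name BlackBodyChecked and the bool/out-parameter interface are assumptions made for this example.

```cpp
#include <cmath>

// Sketch of a guarded variant of the BlackBody function from the question.
// Returns false as soon as an intermediate value is NaN or infinite, so a
// breakpoint on a `return false` line shows which step went wrong.
bool BlackBodyChecked(double T, double wavelength_nm, double& radiance)
{
    const double h = 6.62606957e-34; // Planck constant
    const double c = 299792458.0;    // speed of light in vacuum
    const double k = 1.3806488e-23;  // Boltzmann constant
    const double wavelength = wavelength_nm / 1e9; // nanometers to meters

    const double exparg = h * c / (k * wavelength * T);
    if (!std::isfinite(exparg)) return false;

    const double exppart = std::exp(exparg) - 1.0;
    if (!std::isfinite(exppart)) return false;

    radiance = (2.0 * h * c * c) * std::pow(wavelength, -5.0) / exppart;
    return std::isfinite(radiance);
}
```

If exparg is finite and reasonable but exppart is not, the problem lies in the call to exp() itself rather than in the arithmetic feeding it.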
Say I have a method returning a double, but I want to control the precision after the decimal point of the returned value. I don't know the value of the double variable in advance.
Example:
double i = 3.365737;
return i;
I want the return value to have a precision of 3 digits after the decimal point.
Meaning: the return value is 3.365.
Another example:
double i = 4644.322345;
return i;
I want the return value to be: 4644.322
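What you want is truncation of decimal digits after a certain digit. You can do that by scaling with a power of ten, dropping the fractional part, and scaling back. The version below drops it with an integer cast, which only works while In*f fits in an int; std::trunc from <cmath> (C++11) does not have that limit: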
double TruncateNumber(double In, unsigned int Digits)
{
double f=pow(10, Digits);
return ((int)(In*f))/f;
}
Still, I think that in some cases you may get some strange results (the last digit being one over/off) due to how floating point internally works.
On the other hand, most of the time you just pass the double around as is and truncate it only when outputting it on a stream, which is done automatically with the right stream flags.
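The stream flags in question are std::fixed and std::setprecision. A minimal sketch, with the helper name format_fixed being an assumption; note that stream formatting rounds to the requested digit rather than truncating.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats a double with a fixed number of decimals.
// Stream formatting rounds to the requested digit; it does not truncate.
std::string format_fixed(double value, int decimals)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(decimals) << value;
    return out.str();
}
```

With three decimals, 4644.322345 is formatted as 4644.322, while 3.365737 becomes 3.366 instead of the truncated 3.365 asked for in the question.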
You are going to need to take care with the borderline cases. Any implementation based solely on pow and casting or fmod will occasionally give wrong results, particularly so an implementation based on pow(- PRECISION).
The safest bet is to implement something that neither C nor C++ provide: A fixed point arithmetic capability. Lacking that, you will need to find the representations of the pertinent borderline cases. This question is similar to the question on how Excel does rounding. Adapting my answer there, How does Excel successfully Rounds Floating numbers even though they are imprecise? , to this problem,
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Truncate number to some precision.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double truncate (double x, int nplaces) {
bool is_neg = false;
// Things will be easier if we only have to deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the supposedly truncated value (round down) and the nearest
// truncated value above it.
double round_down, round_up;
if (nplaces < 0) {
double scale = pow10 (-nplaces);
round_down = std::floor (x / scale);
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
else {
double scale = pow10 (nplaces);
round_down = std::floor (x * scale);
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
// Usually the round_down value is the desired value.
// On rare occasions it is the rounded-up value that is.
// This is one of those cases where you do want to compare doubles by ==.
if (x != round_up) x = round_down;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}
You cannot "remove" precision from a double. You could have: 4644.322000. It's a different number but the precision is the same.
As @David Heffernan said, do it when you convert it to a string for display.
You want to truncate your double to n decimal places, then you can use this function:
#include <cmath>
double truncate_to_places(double d, int n) {
return d - fmod(d, pow(10.0, -n));
}
Instead of multiplying and dividing by powers of 10 like the other answers, you can use the fmod function to find the digits after the precision you want, and then subtract to remove them.
#include <math.h>
#define PRECISION 0.001
double truncate(double x) {
x -= fmod(x,PRECISION);
return x;
}
There is no good way to do this with plain doubles, but you can write a class or simply struct like
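#include <iomanip>
#include <iostream>
#include <sstream>
struct lim_prec_float {
double value; // double avoids a narrowing conversion in the brace initializers
int precision;
};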
then have your function
lim_prec_float testfn() {
double i = 3.365737;
return lim_prec_float{i, 4};
}
(4 = 1 before point + 3 after. This uses a C++11 initialization list, it would be better if lim_prec_float was a class with proper constructors.)
When you now want to output the variable, do this with a custom
std::ostream &operator<<(std::ostream &tgt, const lim_prec_float &v) {
std::stringstream s;
s << std::setprecision(v.precision) << v.value;
return (tgt << s.str());
}
Now you can, for instance,
int main() {
std::cout << testfn() << std::endl
<< lim_prec_float{4644.322345, 7} << std::endl;
return 0;
}
which will output
3.366
4644.322
This is because std::setprecision means rounding to the desired number of places, which is likely what you really want. If you actually mean truncation, you can modify the operator<< with one of the truncation functions given in the other answers.
In the same way you format a date before displaying it, you should do the same with double.
However, here are two approaches I have used for rounding.
double roundTo3Places(double d) {
return round(d * 1000) / 1000.0;
}
double roundTo3Places(double d) {
return (long long) (d * 1000 + (d > 0 ? 0.5 : -0.5)) / 1000.0;
}
The latter is faster; however, the numbers cannot be larger than 9e15.