Half-precision multiplication seems to produce the wrong result (C++)

First of all, an IEEE 754 half-precision floating-point number uses 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits. For normal numbers the value is (-1)^sign * 2^(exponent-15) * (1 + mantissa/1024).
I'm trying to run an image detection program using half precision. The original program uses single precision (float). I'm using the half-precision class from http://half.sourceforge.net/. With the half class I can at least run the same program (by using half instead of float, compiling with g++ instead of gcc, and adding many, many type casts).
I found a case where multiplication seems to be wrong.
Here is sample code that shows the problem. (To print a half-precision number I have to cast it to float, and automatic conversion doesn't take place in operations between half and int, so I added some casts.)
#include <stdio.h>
#include "half.h"
using half_float::half;
typedef half Dtype;
int main()
{
#if 0 // method 0 : this makes sx 600, which is wrong.
    int c = 325;
    Dtype w_scale = (Dtype)1.847656;
    Dtype sx = Dtype(c*w_scale);
    printf("sx = %f\n", (float)sx); // <== shows 600.000 which is wrong.
#else // method 1, which also produces wrong result..
    int c = 325;
    Dtype w_scale = (Dtype)1.847656;
    Dtype sx = (Dtype)((Dtype)c*w_scale);
    printf("sx = %f\n", (float)sx);
    printf("w_scale specified as 1.847656 was 0x%x\n", *(unsigned short *)&w_scale);
#endif
}
The result looks like this :
w_scale = 0x3f63
sx = 600
sx = 0x60b0
But sx should be 325 * 1.847656 = 600.4882. What could be wrong?
ADD: When I first posted this question, I didn't expect the value to be exactly 600.4882, just somewhere close to it. I later found that with half precision, which can express only 3 to 4 significant digits, the result of the multiplication simply turned out to be 600.00. Although everybody knows floating point has this kind of limitation, some people will make the same mistake I did by overlooking the fact that half precision carries only 3 to 4 significant digits. So I think this question is worth a look by future askers. (On Stack Overflow, I think some people treat every question as the same old question when it's actually a slightly different case, and it doesn't hurt to have a couple of similar questions.)

I figured out why. Half precision has an effective precision of approximately log10(2^10) ≈ 3, i.e. 3 to 4 decimal digits. I wanted sx to be printed as 600.488 or something close, but this cannot be represented in half precision.
This part occurs during image preprocessing, which doesn't need to run at 16-bit precision (the precision of our tentative hardware), so I can simply use float operations for this stage.
ADD: This anomaly occurred during an image dimension calculation, and we have no reason to use 16-bit floats there. Only image data (pixels or feature map data) should use 16-bit floats. Having written this down, it looks like a good general rule.
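The limitation described above can be reproduced without the half library. What follows is a minimal sketch, not the library's implementation: the helper names to_half_grid, half_spacing and half_product are hypothetical, and the code only models the 11 significant bits (10 stored plus 1 implicit) of a normal half-precision value, ignoring the exponent range, subnormals and infinities. Note that 0x3f63 decodes to 1.8466796875, which lies just below 1.847656, so the printed values are consistent with conversions that truncate. That rounding mode is an assumption inferred from the output, which is why the sketch exposes it as a flag.

```cpp
#include <cmath>

// Hypothetical helper: keep only 11 significant bits (10 stored + 1 implicit),
// which is what a normal half-precision value can hold.
// Assumption: exponent range, subnormals and infinities are ignored.
double to_half_grid(double x, bool truncate) {
    if (x == 0.0 || !std::isfinite(x)) return x;
    int exp = 0;
    const double m = std::frexp(x, &exp);     // x == m * 2^exp, 0.5 <= |m| < 1
    const double scaled = std::ldexp(m, 11);  // 1024 <= |scaled| < 2048
    const double kept = truncate ? std::trunc(scaled) : std::nearbyint(scaled);
    return std::ldexp(kept, exp - 11);
}

// Spacing between neighbouring half-precision values around a nonzero x
// (normal range only).
double half_spacing(double x) {
    int exp = 0;
    std::frexp(x, &exp);
    return std::ldexp(1.0, exp - 11);
}

// Mimics Dtype(c * w_scale) from the question with the chosen rounding mode.
double half_product(int c, double w_scale, bool truncate) {
    const double w = to_half_grid(w_scale, truncate);
    return to_half_grid(c * w, truncate);
}
```

Under these assumptions half_spacing(600.0) is 0.5, so 600.4882 cannot be stored in either mode: half_product(325, 1.847656, true) gives 600.0, and with round-to-nearest the same call gives 600.5.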

Related

Loss of precision with pow function when surpassing 10^10 limit?

I'm doing one of my first homework assignments at uni and have run into this problem:
Task: Find the sum of all numbers that have n digits (for example, n=1 means 1, 2, 3, ... 8, 9, and the answer is 45).
Problem: The code I wrote gets all the test answers right up to 10 to the power of 9, but once it reaches 10 to the power of 10 territory the answers start being wrong. They are really close to what I should be getting, but not quite there (for example, my output = 49499999995499995136, expected result = 49499999995500000000).
I would really appreciate some help or insight. I'm guessing it has something to do with the variable types, but I'm not sure what a possible solution would be.
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;
int main()
{
    int n;
    double ats = 0, maxi, mini;
    cin >> n;
    maxi = pow(10, n) - 1;
    mini = pow(10, n-1) - 1;
    ats = (maxi * (maxi + 1)) / 2 - (mini * (mini + 1)) / 2;
    cout << setprecision(0) << fixed << ats;
}

The main source of the problem is the pow() function. It works with double, not int. Loss of accuracy is the price of representing huge numbers.
There are 3 ways to solve the problem:
1. For small n you can write your own long long int pow(int x, int pow) function. The catch is that even long long int can overflow.
2. Use long (arbitrary-precision) arithmetic, as rustyx said. You can write your own with a vector, or find and include a library.
3. There is a math solution specific to this task. It sidesteps the big-number problem.
You can write your formula as
((10^n - 1) * 10^n - (10^m - 1) * 10^m) / 2, where m = n-1
Then multiply out the numerator, regroup the terms, and factor out the common factor 10^(n-1). You can then see that the answer has the structure
X9...9Y0...0 for big enough n, where X and Y are constant digit groups.
So you can just print the answer as a string without calculating it.
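The string approach can be sketched as follows. The answer does not spell out X and Y; the values used here come from simplifying the formula above to 10^(n-2) * (495 * 10^(n-1) - 45), which is my own derivation, and the function name n_digit_sum is hypothetical.

```cpp
#include <string>

// Sum of all n-digit numbers, returned as a decimal string.
// Based on the closed form 10^(n-2) * (495 * 10^(n-1) - 45), valid for n >= 2.
std::string n_digit_sum(int n) {
    if (n < 1) return "";       // no such numbers; empty string is an arbitrary choice
    if (n == 1) return "45";    // 1 + 2 + ... + 9
    if (n == 2) return "4905";  // 10 + 11 + ... + 99
    // n >= 3: "494", then (n-3) nines, then "55", then (n-2) zeros
    return "494" + std::string(n - 3, '9') + "55" + std::string(n - 2, '0');
}
```

For n = 10 this produces 49499999995500000000, the expected result quoted in the question, without any floating-point arithmetic.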

I think you're stretching floating point beyond its precision. Let me explain:
The C pow() function takes doubles as arguments. You're passing ints, so the compiler adds code to convert them to doubles before they reach pow(). (And in any case you're storing the return value as a double, since you declared it that way.)
Floating-point numbers are called that precisely because the point "floats". Inside a double there's a sign bit, 52 bits for the mantissa and 11 bits for the exponent. In binary, multiplying by a power of two is equivalent to moving the fractional point to the right (or to the left if the power is negative). So basically the exponent says where the fractional point is, in binary. The great advantage of this in-memory representation is that you get very fine absolute precision for numbers close to 0, and gradually lose it as numbers become bigger.
That last thing is exactly what's happening to you. Your number is too large to be stored exactly, so it's being rounded to the closest representable value, which is a sum of a limited number of powers of two (powers of two are the numbers that have all zeroes to the right in binary).
Quick experiment: press F12 in your browser, open the JavaScript console and type 49499999995499995136. In my case, in Chrome, I reproduce the same problem.
If you really, really want precision with such big numbers then you can try one of the arbitrary-precision libraries, but that's too advanced for a student program; you don't need it. Just add an if block and print an error message if the number the user typed is too big (professors love that, and it is actually quite correct).
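To see how coarse a double is at this magnitude, here is a small sketch (the helper names gap_above and plus_one_is_lost are made up for illustration). It assumes the usual 64-bit IEEE 754 double with the default round-to-nearest mode.

```cpp
#include <cmath>

// Distance from x to the next larger representable double.
double gap_above(double x) {
    return std::nextafter(x, INFINITY) - x;
}

// True if adding 1 to x is lost to rounding.
bool plus_one_is_lost(double x) {
    volatile double y = x + 1.0;  // volatile: force the sum to be stored as a double
    return y == x;
}
```

Around 4.95e19 neighbouring doubles are 8192 apart, and the expected answer 49499999995500000000 is not a multiple of 8192, so no double can hold it exactly. Integers stop being exactly representable in general above 2^53 (9007199254740992), where plus_one_is_lost first returns true.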

How do I reach accurate values using Newton-Raphson method?

Here is a link to the problem:
http://www.spoj.com/problems/TRIGALGE/
The problem is quite simple: we only have to solve the given equation. I decided to try the Newton-Raphson method (https://en.wikipedia.org/wiki/Newton%27s_method).
Here is a link to the code I submitted, which got a wrong answer: http://ideone.com/dYev3P
I am unable to understand the logic behind the precision.
For
a=1 b=1 c=20
x should be x=19.441787
but I am getting
x=19.441786
I printed the whole series for 100 iterations, but nowhere did I get the exact value. Please tell me the correct approach and how to get correct precision when dealing with floating-point numbers.

The basic idea of Newton-Raphson is successive approximations. Within certain bounds, you hope that each approximation is better than the previous (though that's not entirely guaranteed, by any means).
As far as your code goes, you're using floats, which are only good for about 7 significant digits of precision. That gets you 19.44179 as being about the best you should really hope for (and, indeed, after rounding to 7 digits, that's exactly what you got).
If you really need that 8th digit of precision, you should use double instead of float as your data type. That doesn't change the fact that you're dealing with successive approximations, but it does mean you can expect around 15 digits of precision instead of only 7.
I should probably also note that computer floating point is almost always a matter of approximation in general. With the right libraries, you can approximate to hundreds or even trillions of digits, but when you deal with floating point, you shouldn't normally expect that one specific answer is right and other answers are wrong, even if they're nearly equal to the "right" one. On the contrary, you should expect minor variations as a rule (with "minor" being a relative term, i.e., acceptable errors are relative to the magnitude of the numbers involved).

In addition to Jerry's answer, don't forget to remove the f suffix from each 1.0f you multiply by, so that those constants are treated as double rather than float. This, together with changing all your floats to double, should definitely do the trick.

If you just change your float to double, you do indeed get 19.441786710 already after iteration 7.
#include <cmath>
#include <iostream>
void solve(int a,int b,int c){
    double x = 0.2 ;
    for(int i=0;i<15;i++){
        x = x*1.0 - ( ( (a*x*1.0 + ( b*sin(x)*1.0 ) )*1.0 - c*1.0 )/
            (a*1.0 + b*(cos(x))*1.0 ) ) ;
        std::cout << i << " " << x-19.441787 << std::endl;
    }
}
int main(){
    solve(1,1,20) ;
}
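Instead of a fixed iteration count, the loop can stop once the Newton step becomes negligible. The sketch below assumes the equation is a*x + b*sin(x) = c, as in the code above; the name newton_solve, the starting guess c/a, the tolerance and the iteration cap are illustrative choices of mine, not part of the original submission.

```cpp
#include <cmath>

// Solves a*x + b*sin(x) = c with Newton-Raphson in double precision.
// Assumptions: a != 0, and the starting guess c/a is close enough to converge.
double newton_solve(double a, double b, double c,
                    double tol = 1e-12, int max_iter = 100) {
    double x = c / a;
    for (int i = 0; i < max_iter; ++i) {
        const double f = a * x + b * std::sin(x) - c;  // function value
        const double df = a + b * std::cos(x);         // derivative
        if (df == 0.0) break;                          // flat tangent: no Newton step
        const double step = f / df;
        x -= step;
        if (std::fabs(step) < tol) break;              // converged
    }
    return x;
}
```

For a=1, b=1, c=20 this settles near 19.4417867, which prints as 19.441787 when rounded to six decimals.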

Different results from similar floating-point functions

So I have 2 functions that should do the same thing:
float ver1(float a0, float a1) {
    float r0 = a0 - a1;
    if (abs(r0) > PI) {
        if (r0 > 0) {
            r0 -= PI2;
        } else {
            r0 += PI2;
        }
    }
    return r0;
}
float ver2(float a0, float a1) {
    float a2 = a1 - PI2;
    float r0 = a0 - a1;
    float r1 = a0 - a2;
    if (abs(r0) < abs(r1)) {
        return r0;
    }
    if (abs(r0) > abs(r1)) {
        return r1;
    }
    return 0;
}
Note: PI and PI2 are float constants for pi and 2*pi.
The odd thing is that sometimes they produce different results. For example, if you feed them 0.28605145 and 5.9433694, the first one returns 0.62586737 and the second one 0.62586755, and I can't figure out what's causing this.
If you manually calculate what the result should be, you'll find that the second answer is correct. I use this function in a 2D physics sim, and the really odd thing is that the first answer (the wrong one) works there while the second one (the right one) makes it act all kinds of crazy. Such a tiny difference from an unknown source, and such a profound effect :|
At this point I'm switching to matrices anyway, but this odd situation got me curious. Does anybody know what's going on?

float typically has a precision of about 24 bits, or about 7 decimal places.
You are subtracting two numbers of similar magnitude (r0+PI2 in the first, a1-PI2 in the second), and so are experiencing loss of significance - several of the most significant bits of the result are zero, so there are fewer bits left to represent the difference. That is why the answers match to only about 6 decimal places.
If you need more precision, then a double or a 32-bit or larger fixed-point representation might be more suitable than a float. There are also arbitrary-precision libraries available, such as GMP, which can represent numbers with all the precision you need, although arithmetic will be significantly slower than with built-in types.

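You should use the fabs() function instead of abs(). In C, abs() only works with integers, and in C++ an unqualified abs() can resolve to the int version unless the floating-point overloads from <cmath> are visible. If the int version is picked, the argument is truncated and you get weird, wrong results with floating-point values.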
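A small sketch of what goes wrong when the integer version ends up being called. The function names are hypothetical, and the explicit cast stands in for the implicit float-to-int conversion that would happen at the call site; whether a given abs() call actually takes that path depends on the headers and compiler in use.

```cpp
#include <cmath>
#include <cstdlib>

// What an int-only abs() effectively computes for a float argument:
// the value is converted to int first, which drops the fraction.
int abs_via_int(float x) {
    return std::abs(static_cast<int>(x));
}

// Floating-point absolute value, as recommended above.
float abs_via_fabs(float x) {
    return std::fabs(x);
}
```

With an angle difference such as -0.75, the integer path returns 0 while fabs returns 0.75, so a comparison like abs(r0) > PI would silently use the wrong magnitude.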

Floating-point numbers don't behave like mathematical real numbers. Every sum of two numbers may introduce an "error". So I wouldn't call the first correct and the second incorrect just because of one example. You need to be careful with every operation you do on floats if you want to keep the error small.
The error is generally smaller if the magnitudes of the numbers are in the same range.
And if the ranges are different, the error tends to be bigger.
For example, 10000000.0 + 0.1 - 10000000.0 is hardly ever 0.1.
If you know the ranges of the input, you can adjust the code to reduce errors.
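The 10000000.0 + 0.1 example can be checked directly. This is a minimal sketch with hypothetical function names; the volatile intermediate is there only to force the sum to be rounded to the stated type, and the comments assume IEEE 754 float and double.

```cpp
// (big + tiny) - big, with the intermediate sum stored as a float.
float add_then_subtract_f(float big, float tiny) {
    volatile float sum = big + tiny;  // neighbouring floats near 1e7 are 1.0 apart
    return sum - big;
}

// Same computation with the intermediate sum stored as a double.
double add_then_subtract_d(double big, double tiny) {
    volatile double sum = big + tiny;
    return sum - big;
}
```

In float the 0.1 is absorbed completely and the result is 0. In double the result is close to 0.1 but still not equal to it.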

Getting the fractional part of a float without using modf()

I'm developing for a platform without a math library, so I need to build my own tools. My current way of getting the fraction is to convert the float to fixed point (multiply with (float)0xFFFF, cast to int), get only the lower part (mask with 0xFFFF) and convert it back to a float again.
However, the imprecision is killing me. I'm using my Frac() and InvFrac() functions to draw an anti-aliased line. Using modf I get a perfectly smooth line. With my own method pixels start jumping around due to precision loss.
This is my code:
const float fp_amount = (float)(0xFFFF);
const float fp_amount_inv = 1.f / fp_amount;
inline float Frac(float a_X)
{
    return ((int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
inline float InvFrac(float a_X)
{
    return (0xFFFF - (int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
Thanks in advance!

If I understand your question correctly, you just want the part after the decimal right? You don't need it actually in a fraction (integer numerator and denominator)?
So we have some number, say 3.14159 and we want to end up with just 0.14159. Assuming our number is stored in float f;, we can do this:
f = f-(long)f;
Which, if we insert our number, works like this:
0.14159 = 3.14159 - 3;
What this does is remove the whole-number portion of the float, leaving only the decimal portion. When you convert the float to a long, it drops the decimal portion. Then when you subtract that from your original float, you're left with only the decimal portion. We use a long here for its range: the conversion is only well defined when the truncated value fits in the integer type, and a long is at least as wide as an int (often 8 bytes versus 4). Note that a float (4 bytes on most systems) can still hold values far larger than any long, so very large inputs need separate handling; such values have no fractional part anyway.
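A guarded version of the cast approach might look like the sketch below. It uses no math library calls; the name frac_by_cast is hypothetical, and it assumes an IEEE 754 single-precision float and an int of at least 32 bits.

```cpp
// Fractional part of x without any math library call.
// The result keeps the sign of x (frac_by_cast(-1.5f) is -0.5f).
float frac_by_cast(float x) {
    const float limit = 8388608.0f;  // 2^23: from here on a float has no fraction bits
    if (!(x > -limit && x < limit)) {
        return 0.0f;                 // huge values have no fractional part; NaN also lands here
    }
    return x - static_cast<float>(static_cast<int>(x));  // the cast truncates toward zero
}
```

The range check keeps the cast well defined, since every value that passes it fits comfortably in a 32-bit int.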

As I suspected, modf does not use any arithmetic per se -- it's all shifts and masks, take a look here. Can't you use the same ideas on your platform?

I would recommend taking a look at how modf is implemented on the systems you use today. Check out uClibc's version.
http://git.uclibc.org/uClibc/tree/libm/s_modf.c
(For legal reasons, it appears to be BSD licensed, but you'd obviously want to double check)
Some of the macros are defined here.

There's a bug in your constants. You're basically trying to do a left shift of the number by 16 bits, mask off everything but the lower bits, then right shift by 16 bits again. Shifting is the same as multiplying by a power of 2, but you're not using a power of 2 - you're using 0xFFFF, which is off by 1. Replacing this with 0x10000 will make the formula work as intended.
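With that change applied, the question's two helpers could look like this. It is only a sketch: the constant names fp_scale and fp_scale_inv are mine, InvFrac is assumed to mean 1 - Frac, and the input must satisfy |a_X| < 32768 so that the scaled value fits in a 32-bit int.

```cpp
// The question's fixed-point helpers with the scale changed from 0xFFFF to 0x10000.
const float fp_scale = 65536.0f;             // 0x10000 == 2^16
const float fp_scale_inv = 1.0f / fp_scale;  // exactly 2^-16

// Fractional part via 16.16 fixed point: shift left 16, keep low 16 bits, shift back.
inline float Frac(float a_X) {
    return ((int)(a_X * fp_scale) & 0xFFFF) * fp_scale_inv;
}

// Assumed meaning: 1 - Frac(a_X), computed in fixed point.
inline float InvFrac(float a_X) {
    return (0x10000 - ((int)(a_X * fp_scale) & 0xFFFF)) * fp_scale_inv;
}
```

Because both scale factors are exact powers of two, the multiply and the final scale-back introduce no rounding of their own; only the 16-bit truncation of the fraction remains.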

I'm not completely sure, but I think that what you are doing is wrong, since you are only considering the mantissa and forgetting the exponent completely.
You need to use the exponent to shift the value in the mantissa to find the actual integer part.
For a description of the storage mechanism of 32bit floats, take a look here.

Why go to floating point at all for your line drawing? You could just stick to your fixed-point version and use an integer/fixed-point based line drawing routine instead. Bresenham's comes to mind. While this version isn't anti-aliased, I know there are others that are.
Bresenham's line drawing

Seems like maybe you want this.
float f = something;
float fractionalPart = f - floor(f);

Your method assumes that there are 16 bits in the fractional part (and as Mark Ransom notes, that means you should shift by 16 bits, i.e. multiply by 0x10000). That might not be true. The exponent is what determines how many bits there are in the fractional part.
To put this in a formula, your method calculates (x mod 1.0) as ((x << 16) mod (1 << 16)) >> 16, and it's that hardcoded 16 which should depend on the exponent. The exact replacement depends on your float format.

double frac(double val)
{
    return val - trunc(val);
}
// frac(1.0) = 1.0 - 1.0 = 0.0 correct
// frac(-1.0) = -1.0 - -1.0 = 0.0 correct
// frac(1.4) = 1.4 - 1.0 = 0.4 correct
// frac(-1.4) = -1.4 - -1.0 = -0.4 correct
Simple, and works for both negative and positive values.

One option is to use fmod(x, 1).

C/C++ rounding up decimals with a certain precision, efficiently

I'm trying to optimize the following. The code below does this:
If a = 0.775 and I need a precision of 2 dp, then a => 0.78
Basically, if the last digit is 5, it rounds the next digit upwards; otherwise it doesn't.
My problem was that 0.45 doesn't round to 0.5 with 1 decimal place, as the value is stored as 0.44999999343... and setprecision rounds it to 0.4.
That's why setprecision is forced higher, to setprecision(p+10), and then, if the value really ends in a 5, a small amount is added so that it rounds up correctly.
Once that's done, it compares a with the string b and returns the result. The problem is that this function is called a few billion times, making the program crawl. Any better ideas on how to rewrite or optimize this, and which functions in the code are so heavy on the machine?
bool match(double a,string b,int p) { //p = precision no greater than 7dp
    double t[] = {0.2, 0.02, 0.002, 0.0002, 0.00002, 0.000002, 0.0000002, 0.00000002};
    stringstream buff;
    string temp;
    buff << setprecision(p+10) << setiosflags(ios_base::fixed) << a; // 10 decimal precision
    buff >> temp;
    if(temp[temp.size()-10] == '5') a += t[p]; // help to round upwards
    ostringstream test;
    test << setprecision(p) << setiosflags(ios_base::fixed) << a;
    temp = test.str();
    if(b.compare(temp) == 0) return true;
    return false;
}

I wrote an integer square root subroutine with nothing more than a couple dozen lines of ASM, with no API calls whatsoever - and it still could only do about 50 million SqRoots/second (this was about five years ago ...).
The point I'm making is that if you're going for billions of calls, even today's technology is going to choke.
But if you really want to make an effort to speed it up, remove as many API usages as humanly possible. This may require you to perform API tasks manually, instead of letting the libraries do it for you. Specifically, remove any type of stream operation. Those are slower than dirt in this context. You may really have to improvise there.
The only thing left to do after that is to replace as many lines of C++ as you can with custom ASM - but you'll have to be a perfectionist about it. Make sure you are taking full advantage of every CPU cycle and register - as well as every byte of CPU cache and stack space.
You may consider using integer values instead of floating-points, as these are far more ASM-friendly and much more efficient. You'd have to multiply the number by 10^7 (or 10^p, depending on how you decide to form your logic) to move the decimal all the way over to the right. Then you could safely convert the floating-point into a basic integer.
You'll have to rely on the computer hardware to do the rest.
<--Microsoft Specific-->
I'll also add that C++ identifiers (including static ones, as Donnie DeBoer mentioned) are directly accessible from ASM blocks nested into your C++ code. This makes inline ASM a breeze.
<--End Microsoft Specific-->

Depending on what you want the numbers for, you might want to use fixed point numbers instead of floating point. A quick search turns up this.

I think you can just add 0.005 for precision to hundredths, 0.0005 for thousandths, etc. snprintf the result with something like "%1.2f" (hundredths; "%1.3f" for thousandths, etc.) and compare the strings. You should be able to table-ize or parameterize this logic.

You could save some major cycles in your posted code by just making that double t[] static, so that it's not allocating it over and over.

Try this instead:
#include <cmath>
double setprecision(double x, int prec) {
    return
        ceil( x * pow(10,(double)prec) - .4999999999999)
        / pow(10,(double)prec);
}
It's probably faster. Maybe try inlining it as well, but that might hurt if it doesn't help.
Example of how it works:
2.345* 100 (10 to the 2nd power) = 234.5
234.5 - .4999999999999 = 234.0000000000001
ceil( 234.0000000000001 ) = 235
235 / 100 (10 to the 2nd power) = 2.35
The .4999999999999 was chosen to stay within the precision of a C++ double, which is about 15 to 16 significant decimal digits. (That limit comes from the 64-bit IEEE 754 format that double normally uses, not from whether the platform is 32 or 64 bit.) If you add more nines, the constant becomes indistinguishable from 0.5 at this magnitude and the result rounds down instead of up, i.e. 234.00000000000001 is stored as 234 in a double in (my) environment.

Using floating point (an inexact representation) means you've lost some information about the true number. You can't simply "fix" the value stored in the double by adding a fudge value. That might fix certain cases (like .45), but it will break other cases. You'll end up rounding up numbers that should have been rounded down.
Here's a related article:
http://www.theregister.co.uk/2006/08/12/floating_point_approximation/

I'm taking a guess at what you really mean to do. I suspect you're trying to see if a string contains a decimal representation of a double to some precision. Perhaps it's an arithmetic quiz program and you're trying to see if the user's response is "close enough" to the real answer. If that's the case, then it may be simpler to convert the string to a double and see if the absolute value of the difference between the two doubles is within some tolerance.
#include <sstream>
#include <string>
double string_to_double(const std::string &s)
{
    std::stringstream buffer(s);
    double d = 0.0;
    buffer >> d;
    return d;
}
bool match(const std::string &guess, double answer, int precision)
{
    const static double thresh[] = { 0.5, 0.05, 0.005, 0.0005, /* etc. */ };
    const double g = string_to_double(guess);
    const double delta = g - answer;
    return -thresh[precision] < delta && delta <= thresh[precision];
}
Another possibility is to round the answer first (while it's still numeric) BEFORE converting it to a string.
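#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>
bool match2(const std::string &guess, double answer, int precision)
{
    // Round numerically first; adding half a unit and then letting the
    // stream round again would round twice (0.42 would print as 0.5).
    const double scale = std::pow(10.0, precision);
    const double rounded = std::floor(answer * scale + 0.5) / scale;
    std::stringstream buffer;
    buffer << std::fixed << std::setprecision(precision) << rounded;
    return guess == buffer.str();
}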
Both of these solutions should be faster than your sample code, but I'm not sure if they do what you really want.

As far as I can see, you are checking whether a, rounded to p decimal places, is equal to b.
Instead of converting a to a string, go the other way and convert the string to a double
- (just multiplications and additions, or only additions using a small table)
- then subtract the two numbers and check whether the difference is in the proper range (if p==1 => abs(d-a) < 0.05, with d the double parsed from b)

An old-time developer's trick from the dark ages of pounds, shillings and pence in the old country.
The trick was to store the value as a whole number of halfpennies (or whatever your smallest unit is). Then all your subsequent arithmetic is straightforward integer arithmetic, and rounding etc. will take care of itself.
So in your case you store your data in units of 200ths of whatever you are counting,
do simple integer calculations on these values, and divide by 200 into a float variable whenever you want to display the result.
I believe Boost has a "BigDecimal" library these days, but your requirement for run-time speed would probably rule out this otherwise excellent solution.
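The integer-unit idea that several answers suggest can be sketched like this. The names round_to_units and match_units are hypothetical, and the sketch assumes the expected value has already been parsed into an integer count of 10^-p units, so no streams or strings are involved in the hot path.

```cpp
#include <cmath>

// Rounds a to p decimal places and returns the result as an integer count of
// 10^-p units, e.g. 0.775 with p = 2 gives 78 (meaning 0.78).
// Assumption: 0 <= p <= 7, matching the limit stated in the question.
long long round_to_units(double a, int p) {
    static const double scale[] = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7};
    return std::llround(a * scale[p]);  // llround rounds halves away from zero
}

// Compares a against an expected value already expressed in the same units.
bool match_units(double a, long long expected_units, int p) {
    return round_to_units(a, p) == expected_units;
}
```

This handles the 0.775 and 0.45 examples from the question when a is a double, but it is not a cure-all: a value whose stored binary representation falls just below a tie can still round down after scaling, as the answer about inexact representation points out.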
