I have the following code:
typedef __int64 BIG_INT;
typedef double CUT_TYPE;
#define CUT_IT(amount, percent) (amount * percent)
int main()
{
CUT_TYPE cut_percent = 1;
BIG_INT bintOriginal = 0x1FFFFFFFFFFFFFF;
BIG_INT bintAfter = CUT_IT(bintOriginal, cut_percent);
}
bintAfter's value after the calculation is 144115188075855872 instead of 144115188075855871 (see the "2" in the end, instead of "1"??).
On smaller values such as 0xFFFFFFFFFFFFF I get the correct result.
How do I get it to work in a 32-bit app? What do I have to take into account?
My aim is to cut a certain percentage of a very big number.
I use VC++ 2008, Vista.
A double has a 53-bit significand (52 bits stored explicitly), so you lose precision when you load a 57-bit value such as 0x1FFFFFFFFFFFFFF into it.
Floating point calculations aren't guaranteed to be perfectly accurate, and you've defined CUT_TYPE as double.
See this answer for more info: Dealing with accuracy problems in floating-point numbers
I want to round a float to at most 4 decimal places. That means 0.333333333 becomes 0.3333, but 0.33 stays 0.33.
Use the std::round() function
The C++ standard library offers functions for performing rounding. For floats, it is:
float round ( float arg );
This rounds arg to the nearest integral value. You want a different decimal resolution, so don't round your value; round your value times 10000, so that the ones digit is the former 0.0001 digit, and then divide by 10000 again. Or, more generally:
float my_round(
float x,
int num_decimal_precision_digits)
{
float power_of_10 = std::pow(10, num_decimal_precision_digits);
return std::round(x * power_of_10) / power_of_10;
}
Note that there may be accuracy issues, as floating-point computations and representations are only accurate to within a certain number of digits, and in my_round we have at least four sources of such inaccuracy: the power-of-10 calculation, the multiplication, the division and the actual rounding.
Cast it into a fixed-point type
If you want to have your results rounded, with a fixed number of decimal digits, you're hinting that you don't really need the "floating" aspect of floating point numbers. Well, in this case, cast your value to a type which represents such numbers. Essentially, that would be a (run-time-variable) integer numerator and a compile-time-fixed denominator (which in your case would be 10,000).
There's an old question here on the site about doing fixed-point math:
What's the best way to do fixed-point math?
but I would suggest you consider the CNL library as something recent and popular. Also, several proposals have been made to add fixed-point types to the standard library. I don't know which one is furthest along, but have a look at this one: Fixed-Point Real Numbers by John McFarlane.
Back to your specific case: Fixed-point types can typically be constructed from floating-point ones. Just do that.
Here is a solution, for example:
float ret = float(round(0.333333333 * 10000)) / 10000;
You can write it as a function. Maybe there would be a better way?
Assuming you only need to print the rounded number, this is one proper solution:
cout << setprecision(4) << x << '\n';
std::setprecision documentation.
Unless more details are provided, it is impossible to give a better answer.
Please note that if you plan to round the number x and then print it, you are in for a big headache, since some corner cases can produce much longer output than expected.
Use _vsnprintf
I think the best solution for this is to just format the string. After all, we may not want to output the number to the console at all, but to save it in a std::string variable, a char[] or something similar. This solution is flexible: it gives the same result as printing the number to the console with std::setprecision(), but it can also write the number into a char[].
For this I used _vsnprintf and va_list. All it does is format the string as needed, in this case a float value.
int FormatString(char* buf, size_t buf_size, const char* fmt, ...) {
va_list args;
va_start(args, fmt);
int w = _vsnprintf(buf, buf_size, fmt, args);
va_end(args);
if (buf == NULL)
return w;
if (w == -1 || w >= (int)buf_size)
w = (int)buf_size - 1;
buf[w] = 0;
return w;
}
int FormatStringFloat(char* buf, int buf_size, const void* p_data, const char* format) {
return FormatString(buf, buf_size, format, *(const float*)p_data);
}
Example
#include <iostream>
#include <string>
#define ARRAY_SIZE(_ARR) ((int)(sizeof(_ARR) / sizeof(*(_ARR))))
int main() {
float value = 3.343535f;
char buf[64];
FormatStringFloat(buf, ARRAY_SIZE(buf), (void*)&value, "%.2f");
std::cout << std::string(buf) << std::endl;
return 0;
}
So with "%.2f", 3.343535 is formatted as 3.34.
I keep getting the wrong result when I try to sum big arrays. I have isolated the problem to the following code sample (it's not a sum of a big array, but I believe this is the problem).
Compilable Sample:
#include <cstdio>
#include <iomanip>
#include <iostream>
using namespace std;
template <typename T>
void cpu_sum(const unsigned int size, T & od_value) {
od_value = 0;
for (unsigned int i = 0; i < size; i++) {
od_value += 1;
}
}
int main() {
typedef float Data;
const unsigned int size = 800000000;
Data sum;
cpu_sum(size, sum);
cout << setprecision(35) << sum << endl; // prints: 16777216 // ERROR !!!
getchar();
}
Environment:
OS: Windows 8.1 x64 home
IDE: Microsoft Visual Studio 2015
Error Description:
While my result should obviously be sum == 800000000, I keep getting sum == 16777216.
That is very weird to me, because the float max value is far above this one, and yet it looks like the sum variable has reached its limit.
What did I miss??
It is a well-known problem. Gradually your sum becomes so big that the next summand is smaller than the spacing between adjacent floats at that magnitude (a relative epsilon of about 10^-7 for float). From that moment on you start losing precision: at 16777216 (2^24), adding 1 no longer changes a float at all.
The standard solution is to change the summation tactics: when the array is larger than, say, 100 elements, split it in halves and sum each half separately. This goes on recursively and tends to keep precision much better.
As pointed out by Michael Simbursky, this is a well-known problem when trying to sum a large number of limited-precision numbers. The solution he provides works well, and is quite efficient.
For the curious reader, a slower method is to sort your array before computing the sum, assuring that the values are added in order of increasing absolute value. This ensures that the least-significant values make their contributions to the overall sum before being overwhelmed by the other values. This technique is used more as an example/illustration than in any serious programming ventures.
For serious scientific programming, where the overall size of the data can vary a great deal, another algorithm to be aware of is the Kahan Summation Algorithm; the referenced Wikipedia page provides a nice description of it.
I've been stumped on this one for days. I've written this program from chapter four of a book called Write Great Code, Volume 1: Understanding the Machine.
The project is to do Floating Point operations in C++. I plan to implement the other operations in C++ on my own; the book uses HLA (High Level Assembly) in the project for other operations like multiplication and division.
I wanted to display the exponent and other field values after they've been extracted from the FP number, for debugging. Yet I have a problem: when I look at these values in memory they are not what I think they should be. Key words: what I think. I believe I understand the IEEE FP format; it's fairly simple and I understand all I've read so far in the book.
The big problem is why the Rexponent variable seems to be almost unpredictable; in this example with the given values it's 5. Why is that? By my guess it should be two. Two because the decimal point is two digits right of the implied one.
I've commented the actual values that are produced by the program into the code so you don't have to run the program to get a sense of what's happening (at least in the important parts).
It is unfinished at this point. The entire project has not been created on my computer yet.
Here is the code (quoted from the file which I copied from the book and then modified):
#include<iostream>
typedef long unsigned real; //typedef our long unsigned ints in to the label "real" so we don't confuse it with other datatypes.
using namespace std; //Just so I don't have to type out std::cout any more!
#define asreal(x) (*((float *) &x)) //Reinterpret the storage of x as a float through a pointer cast, so the compiler does not convert (and truncate) our FP values.
inline int extractExponent(real from) {
return ((from >> 23) & 0xFF) - 127; //Shift right 23 bits; & with eight ones (0xFF == 1111_1111 ) and make bias with the value by subtracting all ones from it.
}
void fpadd ( real left, real right, real *dest) {
//Left operand field containers
long unsigned int Lexponent = 0;
long unsigned Lmantissa = 0;
int Lsign = 0;
//RIGHT operand field containers
long unsigned int Rexponent = 0;
long unsigned Rmantissa = 0;
int Rsign = 0;
//Resulting operand field containers
long int Dexponent = 0;
long unsigned Dmantissa = 0;
int Dsign = 0;
std::cout << "Size of datatype: long unsigned int is: " << sizeof(long unsigned int); //For debugging
//Properly initialize the above variable's:
//Left
Lexponent = extractExponent(left); //Zero. This value is NOT a flat zero when displayed because we subtract 127 from the exponent after extracting it! //Value is: 0xffffff81
Lmantissa = extractMantissa (left); //Zero. We don't do anything to this number except add a whole number one to it. //Value is: 0x00000000
Lsign = extractSign(left); //Simple.
//Right
Rexponent = extractExponent(right); //Value is: 0x00000005 <-- why???
Rmantissa = extractMantissa (right);
Rsign = extractSign(right);
}
int main (int argc, char *argv[]) {
real a, b, c;
asreal(a) = -0.0;
asreal(b) = 45.67;
fpadd(a,b, &c);
printf("Sum of A and B is: %f", c);
std::cin >> a;
return 0;
}
Help would be much appreciated; I'm several days in to this project and very frustrated!
in this example with the given values its 5. Why is that?
The floating point number 45.67 is internally represented as
2^5 * 1.0110110101011100001010001111010111000010100011110110
which actually represents the number
45.6700000000000017053025658242404460906982421875
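This is as close as you can get to 45.67 in a double. Your program stores the value in a float, which keeps only 23 of those fraction bits, but the exponent is the same.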
If all you are interested in is the exponent of a number, simply compute its base 2 logarithm and round down. Since 45.67 is between 32 (2^5) and 64 (2^6), the exponent is 5.
Computers use binary representation for all numbers. Hence, the exponent is for base two, not base ten. int(log2(45.67)) = 5.
/*
* Returns time in s.usec
*/
float mtime()
{
struct timeval stime;
gettimeofday(&stime,0x0);
return (float)stime.tv_sec+((float)stime.tv_usec)/1000000; // tv_usec is in microseconds
}
int main(){
while(true){
cout<<setprecision(15)<<mtime()<<endl;
// shows the same time irregularly for some reason and can mess up triggers
usleep(500000);
}
}
Why does it show the same time irregularly? (compiled on ubuntu 64bit and C++)
What other standard methods are available to generate a unix timestamp with millisecond accuracy?
A float has between 6 and 9 decimal digits of precision.
So if the integer part is e.g. 1,391,432,494 (UNIX time as I write this, requiring 10 digits), you're already out of digits before you get to the fractional part. Not so good, and this is why float is failing here.
Jumping to double gives you about 15 digits, which seems to suffice as long as you can assume that the integer part is a UNIX timestamp, i.e. seconds since 1970, since that is not likely to need drastically more digits any time soon.
Seems float doesn't have enough precision, replaced with double and all is ok now.
/*
* Returns time in s.usec
*/
double mtime()
{
struct timeval stime;
gettimeofday(&stime,0x0);
return (double)stime.tv_sec+((double)stime.tv_usec)/1000000; // tv_usec is in microseconds
}
Still don't exactly understand the reason for the random behavior...
PS. I was capturing a mtime() and comparing it with current time to get duration.
Program:
void DibLaplacian8Direct(CDib sourceImg)
{
register int i,j;
int w = sourceImg.GetWidth();
int h = sourceImg.GetHeight();
CDib cpyImage = sourceImg;
BYTE** pSourceImg = sourceImg.GetPtr();
BYTE** pCpyImage = cpyImage.GetPtr();
float G;
for(j =1;j<h-1;j++)
{
for(i =1;i<w-1;i++)
{
G = -1*pCpyImage[j-1][i-1] + -1*pCpyImage[j-1][i] + (-1)*pCpyImage[j-1][i+1]+
(-1)*pCpyImage[j][i-1] + 8*pCpyImage[j][i] + (-1)*pCpyImage[j][i+1]+
-1*pCpyImage[j+1][i-1] + (-1)*pCpyImage[j+1][i] + -1*pCpyImage[j+1][i+1];
pSourceImg[j][i] = (BYTE)G;
}
}
}
Warnings:
warning: can't convert from int to float
Warning 1 warning C4819: The file contains a character that cannot be represented in the current code page (1257). Save the file in Unicode format to prevent data loss D:\2nd\imagetool\dibfilter.cpp 1 1 ImageTool
I don't understand why it is giving me a warning about int to float.
As for warning 1,
I am using VS 2010. I don't know why I am getting the warning in the StdAfx.h include file.
Can anyone help me with this?
The first warning is due to the fact that a float has only about six to seven significant decimal digits, whereas an int can have more. If it does, accuracy is lost.
In general, you cannot convert an integer to floating point without possibly losing data. Also, you cannot convert from floating point back to integer without losing the decimal places, so you get a warning again.
A simple minimalistic code example of the above case:
#include<iostream>
using namespace std;
int main()
{
int a=10;
int b=3;
float c;
c=a/b;
cout << c << endl;
return 0;
}
If you are sure that the data is in range and there won't be any loss of accuracy, you can use a cast to get rid of the warning.
G = (float) (.....)
Check this for the second warning.
To get rid of the second warning you need to save the file in Unicode format.
Go to file->advanced save options and under that select the new encoding you want to save it as. UTF-8 or UNICODE codepage 1200 are the settings you want.
It is important to understand what the compiler is telling you with warning 20. The issue is that a float has only 24 bits of precision (23 stored plus an implicit leading one), while an int has 31. If your number is larger than 2^24, you may lose the low bits by storing it in a float.
Now your number can never be larger than 2^24, so you are fine. Still, it is important to know what is going on here. There is a reason for that warning, and simply putting in the cast without understanding what is going on may mess you up some day.
In your specific case, I am not at all clear on why you are using a float. You are adding nine integers, none of which can be greater than 2^11. You have plenty of precision in an int to do that. Using a float is just going to slow your program down. (Probably quite a bit.)
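Finally, that last cast to BYTE is worrisome. What happens if your value is out of range? Probably not what you want. For example, if BYTE is unsigned and your float ends up as -3.0, you will most likely be storing 253 as the result (strictly speaking, converting an out-of-range floating-point value to an integer type is undefined behavior in C++).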