long double increment operator not working on large numbers - c++

I'm converting a C++ system from Solaris (SUN box, Solaris compiler) to Linux (Intel box, gcc). I'm running into several problems when dealing with large "long double" values. (We use "long double" because of some very large integers, not for decimal precision.) The problem manifests itself in several weird ways, but I've simplified it to the following program: it tries to increment a number, but the value never changes. I get no compile or runtime errors; the number just isn't incremented.
I've also tried a few different compiler switches (-malign-double and -m128bit-long-double, in various combinations), but it made no difference.
I've run this under gdb too, and gdb's "print" command shows the same value as the cout statement.
Anyone seen this behavior?
Thanks
Compile commands and output
$ /usr/bin/c++ --version
c++ (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
$ /usr/bin/c++ -g -Wall -fPIC -c SimpleLongDoubleTest.C -o SimpleLongDoubleTest.o
$ /usr/bin/c++ -g SimpleLongDoubleTest.o -o SimpleLongDoubleTest
$ ./SimpleLongDoubleTest
Maximum value for long double: 1.18973e+4932
digits 10 = 18
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
SimpleLongDoubleTest.C
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <limits>
#include <iomanip>

int main( int argc, char* argv[] )
{
    std::cout << "Maximum value for long double: "
              << std::numeric_limits<long double>::max() << '\n';
    std::cout << "digits 10 = " << std::numeric_limits<long double>::digits10
              << std::endl;

    // This doesn't work (there might be smaller numbers that also don't work;
    // that is, I'm not sure of the exact number between this and the number
    // defined below where things break).
    long double ld = 1268035319515045691392.0L ;

    // But this (or any smaller number) works (there might be larger numbers
    // that work; that is, I'm not sure of the exact number between this and
    // the number defined above where things break).
    //long double ld = 268035319515045691392.0L ;

    for ( int i = 0 ; i < 5 ; i++ )
    {
        ld++ ;
        std::cout << std::setiosflags( std::ios::fixed )
                  << std::setprecision( 0 )
                  << "ld = " << ld
                  << std::endl ;
    }
}

This is expected behavior. float, double, long double, etc. are internally represented in the form 1.xxxxx × 2^(exp − bias), where xxxxx is an N-digit binary fraction: N = 23 for float, 52 for double, and 63 for the x87 80-bit long double (whose 64-bit significand stores the leading 1 explicitly). When the number grows larger than 2^N, it's no longer possible to add 1 to that variable; you can only add multiples of 2^(e − N), where e is the number's binary exponent. Your value, 1268035319515045691392, is roughly 2^70, so adjacent long doubles at that magnitude are 2^(70 − 63) = 128 apart, and adding 1 rounds straight back to the same value.
It's also possible that your architecture treats long double the same as double (even though x86 can use 80-bit extended precision internally).
See also the Wikipedia article on floating point; a 128-bit long double is the exception rather than the norm (SPARC supports it, which likely explains why this worked on your Solaris box).
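To see the gap directly, here is a small sketch (my own illustration) that measures the spacing between adjacent long doubles at the question's magnitude, using std::nextafter from <cmath>:

#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>

int main()
{
    long double ld = 1268035319515045691392.0L;
    // Distance from ld to the next representable long double.
    long double gap = std::nextafter(ld, std::numeric_limits<long double>::max()) - ld;
    std::cout << std::fixed << std::setprecision(0)
              << "gap at ld = " << gap << '\n';   // prints 128 with an x87 80-bit long double
}

Once that gap exceeds 1, ld++ computes ld + 1 and the result rounds straight back to ld.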

Related

Can't make C++ program use high enough precision

...
cout << setprecision(100) << pow((3+sqrt(5.0)),28) << endl;
...
outputs
135565048129406451712
which isn't precise enough.
$ bc <<< "scale = 100; (3+sqrt(5.0))^28"
outputs
135565048129406369791.9994684648068789538123313610677119237534230237579838585720347675878761558402979025019238688523799354
which is what I want. I'm setting the cout precision, so it must be the sqrt, pow, or + that are losing the precision?
Setting precision on cout doesn't have any effect on how the underlying computation is done in C++. floats typically have about 7 decimal digits of precision, doubles about 16; your C++ output has only the first 15 digits matching the bc output.
If you want more precision then you'll have to use another method, such as an arbitrary precision numerical library. That's how the bc program implements arbitrary precision math.
For example, using:
https://gmplib.org
#include <gmp.h>
#include <gmpxx.h>
#include <iostream>
#include <iomanip>

int main() {
    mpf_set_default_prec(402);    // 402 bits is about 121 decimal digits of working precision
    mpf_class a = 3_mpf + sqrt(5_mpf);
    mpf_class output;
    mpf_pow_ui(output.get_mpf_t(), a.get_mpf_t(), 28);
    std::cout << std::setprecision(121);
    std::cout << output << '\n';
}
This prints:
135565048129406369791.9994684648068789538123313610677119237534230237579838585720347675878761558402979528909982661363879709
Interestingly, this is different from the output of bc <<< "scale = 100; (3+sqrt(5.0))^28", but if you set the scale higher for bc you'll see that GMP's output is correct.
It looks like bc is willing to print out however many digits it has even if the operands to expressions that produced those digits didn't have enough precision to get them right. In contrast GMP appears to set the precision for results based on what's accurate given the precision of the inputs.

C++ precision and truncation with file stream [duplicate]

I have a file.txt with hundreds of numbers.
They have many digits (up to 20) after the decimal point, and I need to read them all without truncation; otherwise they introduce errors into the following computations. I generated these numbers with MATLAB, so they have very high precision, and now I must replicate that behaviour in my program.
I've done it this way:
fstream in;
in.open("file.txt", ios::in);   // the file name must be a string, e.g. a quoted literal
long double number;
in >> number;
I also tried this
in.precision(20);
in>>number;
before each ">>" operation, but in vain.
std::numeric_limits<long double>::digits10 can tell you what your target's actual precision is for long double.
If you find that it's insufficient to represent your data, you probably want arbitrary precision. There are several arbitrary-precision number libraries you can use (see the sketch after this list), none of which is part of standard C++:
boost::multiprecision
GNU MP
MPFR
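For instance, here is a minimal sketch using boost::multiprecision (an illustration only; it assumes Boost is installed and that the input file is named file.txt as in the question):

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <fstream>
#include <iomanip>
#include <iostream>

int main()
{
    using boost::multiprecision::cpp_dec_float_50;   // 50 decimal digits of precision

    std::ifstream in("file.txt");
    cpp_dec_float_50 number;
    while (in >> number)                             // stream extraction keeps all the digits
        std::cout << std::setprecision(20) << number << '\n';
}

cpp_dec_float_50 carries 50 decimal digits, comfortably more than the 20 the question needs.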
The following works fine on my system (Win7, VS2012):
#include <fstream>
#include <iostream>

int main (void)
{
    std::ifstream file ("test.txt") ;
    long double d = 0 ;
    file >> d ;
    std::cout.precision (20) ;
    std::cout << d << "\n" ;
    return 0 ;
}
The text file:
2.7239385667867091
The output:
2.7239385667867091
If this doesn't work on your system, then you need to use a third-party number library.
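Before reaching for a third-party library, a quick diagnostic sketch can tell you what your platform actually supports:

#include <iostream>
#include <limits>

int main()
{
    // If digits10 is below 20, a 20-digit decimal cannot survive a
    // round trip through long double on this platform.
    std::cout << "long double digits10:     "
              << std::numeric_limits<long double>::digits10 << '\n'
              << "long double max_digits10: "
              << std::numeric_limits<long double>::max_digits10 << '\n';
}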

Left bit shift by 16 in Ruby and C++

I have the following code in Ruby.
x = 33078
x << 16
# => 2167799808
In C++, that code is
int x = 33078;
x << 16
// => -2127167488
I know this has to do with overflows, but how can I get the C++ to give the same result as Ruby?
33078 << 16 does not fit into a 32-bit signed integer, which is why in C++ it overflows and you end up with a negative value. In Ruby, meanwhile, the value is automatically promoted to a type big enough to store the result of the computation.
If you want to be able to compute this value in C++, use a type with a higher maximum value. unsigned int will be enough in this case, but if you want to compute bigger values you may need long long or even unsigned long long, as in the sketch below.
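A minimal sketch of that fix, reusing the question's value:

#include <iostream>

int main()
{
    int x = 33078;
    // Widen before shifting so the shifted result has 64 bits to land in.
    long long y = static_cast<long long>(x) << 16;
    std::cout << y << '\n';   // prints 2167799808, matching Ruby
}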
You need to use an integer that is the same byte size as a Ruby int.
pry(main)> x = 33078
=> 33078
pry(main)> x.size
=> 8
Try
long int x
Generally, ints in C and C++ are 32-bit (4 bytes), not 64-bit (8 bytes).
#include <cstdint>    // for uint64_t
#include <iostream>

int main()
{
    uint64_t x = 33078;
    std::cout << (x << 16);
}
$ g++ -std=c++11 test.cpp && ./a.out
2167799808

a c++ program returns different results in two IDE

I wrote the following C++ program in Code::Blocks, and the result was 9183. Then I built the same code in Eclipse, and after running it returned 9220. Both use MinGW. The correct result is 9183. What's wrong with this code?
Thanks.
source code:
#include <iostream>
#include <set>
#include <cmath>

int main()
{
    using namespace std;
    set<double> set_1;
    for(int a = 2; a <= 100; a++)
    {
        for(int b = 2; b <= 100; b++)
        {
            set_1.insert(pow(double(a), b));
        }
    }
    cout << set_1.size();
    return 0;
}
You are probably seeing precision errors due to CodeBlocks compiling in 32-bit mode and Eclipse compiling in 64-bit mode:
$ g++ -m32 test.cpp
$ ./a.out
9183
$ g++ -m64 test.cpp
$ ./a.out
9220
If I cast both arguments to double I get what you would expect:
pow(static_cast<double>(a), static_cast<double>(b))
The difference appears to be due to whether the floating-point operations use 53-bit or 64-bit precision. If you add the following two lines in front of the loop (assuming an Intel architecture, with <cstdint> included), the FPU will use 53-bit precision and give the 9220 result even when compiled as a 32-bit application:
uint16_t precision = 0x27f;            // x87 control word with precision-control bits = 10 (53-bit)
asm("fldcw %0" : : "m" (*&precision)); // load the control word (GCC inline asm, x86 only)
It is bits 8 and 9 of the x87 FPU control word that select this precision. The value above sets those two bits to 10. Setting them to 11 (value 0x37f, the default) selects 64-bit precision. And, just for completeness, if you set the bits to 00 (value 0x7f, 24-bit precision), the size is printed as 9230.
Actually, you're not really supposed to rely on == (or, technically, x <= y && y <= x, which is how set<double> detects duplicates) for doubles anyway. So this code produces implementation-dependent results (not strictly speaking UB, per the comments, but that is what I meant :)). One way around it is sketched below.
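One way to make the count independent of floating-point precision is to avoid doubles entirely and compare exact (base, exponent) pairs. A C++17 sketch of that idea (my own illustration, not from the original answers; it reuses the question's bounds):

#include <iostream>
#include <set>
#include <utility>

// Reduce a to (m, k) such that a == m^k and m is not itself a perfect power.
static std::pair<int, int> canonical_base(int a)
{
    for (int m = 2; m * m <= a; ++m) {
        long long p = m;
        for (int k = 2; p <= a; ++k) {
            p *= m;
            if (p == a) {
                auto inner = canonical_base(m);
                return { inner.first, inner.second * k };
            }
        }
    }
    return { a, 1 };   // a is not a perfect power
}

int main()
{
    std::set<std::pair<int, int>> values;   // (minimal base, exponent) identifies a^b exactly
    for (int a = 2; a <= 100; ++a) {
        auto [m, k] = canonical_base(a);
        for (int b = 2; b <= 100; ++b)
            values.insert({ m, k * b });
    }
    std::cout << values.size() << '\n';     // prints 9183, regardless of FPU settings
}

Since a^b == m^(k*b), two powers are equal exactly when their canonical pairs are equal, so no floating-point comparison is ever needed.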

Undefined behavior when exceed 64 bits

I have written a function that converts a decimal number to a binary number. I enter my decimal number as a long long int. It works fine with small numbers, but my task is to determine how the computer handles overflow, so when I enter (2^63) - 1 the function outputs that the decimal value is 9223372036854775808 and that the binary value is -954437177. When I input 2^63, a value a 64-bit machine can't hold, I get warnings that the integer constant is so large that it is unsigned (and that the decimal constant is unsigned only in ISO C90); the decimal value is output as negative 2^63 and the binary number is 0. I'm using gcc as the compiler. Is that outcome correct?
The code is provided below:
#include <iostream>
#include <sstream>
using namespace std;

int main()
{
    long long int answer;
    long long dec;
    string binNum;
    stringstream ss;
    cout<<"Enter the decimal to be converted:"<<endl;
    cin>>dec;
    cout<<"The dec number is: "<<dec<<endl;
    while(dec>0)
    {
        answer = dec%2;
        dec=dec/2;
        ss<<answer;
        binNum=ss.str();
    }
    cout<<"The binary of the given number is: ";
    for (int i=sizeof(binNum);i>=0;i--){
        cout<<binNum[i];
    }
    return 0;
}
First, “on a 64-bit computer” is meaningless: long long is guaranteed to be at least 64 bits regardless of the computer. If you could press a modern C++ compiler onto a Commodore 64 or a Sinclair ZX80, or for that matter a KIM-1, a long long would still be at least 64 bits. This is a machine-independent guarantee given by the C++ standard.
Secondly, specifying a too-large value is not the same thing as “overflow”.
The only thing that makes this question a little bit interesting is that there is a difference, and that the standard treats these two cases differently. For initialization of a signed integer with an integer value, a conversion is performed if necessary, with an implementation-defined effect if the value cannot be represented …
C++11 §4.7/3:
“If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined”
while for the case of, e.g., a multiplication that produces a value that cannot be represented by the argument type, the effect is undefined (it might even crash) …
C++11 §5/4:
“If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined.”
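To illustrate the first case, a small sketch (the printed value is what typical two's-complement implementations produce; the standard only promises an implementation-defined result here):

#include <iostream>

int main()
{
    unsigned long long big = 9223372036854775808ULL;   // 2^63, not representable in long long
    long long x = static_cast<long long>(big);         // implementation-defined (C++11 §4.7/3)
    std::cout << x << '\n';                            // typically -9223372036854775808
}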
Regarding the code: I only discovered it after writing the above, but it does look like it will necessarily produce overflow (i.e., undefined behavior) for sufficiently large numbers. Put your digits in a vector or string. Note that you can also just use a bitset to display the binary digits, as in the sketch below.
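A minimal bitset sketch (an illustration):

#include <bitset>
#include <iostream>

int main()
{
    long long dec = 9223372036854775807LL;    // 2^63 - 1
    // bitset prints all 64 bits, most significant first, with leading zeros.
    std::cout << std::bitset<64>(dec) << '\n';
}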
Oh, the KIM-1. Not many are familiar with it. It was, reportedly, very nice, in spite of the somewhat restricted keyboard.
This adaptation of your code produces the answer you need; your original code is apt to produce the answer with the bits in the wrong order. Testing the decimal values 123, 1234567890, and 12345678901234567 shows it working OK (G++ 4.7.1 on Mac OS X 10.7.4).
#include <iostream>
#include <sstream>
using namespace std;

int main()
{
    long long int answer;
    long long dec;
    string binNum;
    cout<<"Enter the decimal to be converted:"<<endl;
    cin>>dec;
    cout<<"The dec number is: "<<dec<<endl;
    while(dec>0)
    {
        stringstream ss;
        answer = dec%2;
        dec=dec/2;
        ss<<answer;
        binNum.insert(0, ss.str());   // prepend the new bit, keeping MSB-first order
        // cout << "ss<<" << ss.str() << ">> bn<<" << binNum.c_str() << ">>" << endl;
    }
    cout<<"The binary of the given number is: " << binNum.c_str() << endl;
    return 0;
}
Test runs:
$ ./bd
Enter the decimal to be converted:
123
The dec number is: 123
The binary of the given number is: 1111011
$ ./bd
Enter the decimal to be converted:
1234567890
The dec number is: 1234567890
The binary of the given number is: 1001001100101100000001011010010
$ ./bd
Enter the decimal to be converted:
12345678901234567
The dec number is: 12345678901234567
The binary of the given number is: 101011110111000101010001011101011010110100101110000111
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
obase=2
123
1111011
1234567890
1001001100101100000001011010010
12345678901234567
101011110111000101010001011101011010110100101110000111
$
When I compile this with the largest value possible for a 64-bit machine, nothing shows up for my binary value.
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
2^63-1
9223372036854775807
quit
$ ./bd
Enter the decimal to be converted:
9223372036854775807
The dec number is: 9223372036854775807
The binary of the given number is: 111111111111111111111111111111111111111111111111111111111111111
$
If you choose a value larger than the largest that can be represented, all bets are off; you may get 0 back from cin >> dec;, and the code does not handle 0 properly.
Prelude
The original code in the question was:
#include <iostream>
using namespace std;

int main()
{
    int rem,i=1,sum=0;
    long long int dec = 9223372036854775808; // = 2^63; 9223372036854775807 = 2^63-1
    cout<<"The dec number is"<<dec<<endl;
    while(dec>0)
    {
        rem=dec%2;
        sum=sum + (i*rem);
        dec=dec/2;
        i=i*10;
    }
    cout<<"The binary of the given number is:"<<sum<<endl;
    return 0;
}
I gave this analysis of the earlier code:
You are multiplying the plain int variable i by 10 for every bit position in the 64-bit number. Given that i is probably a 32-bit quantity, you are running into signed integer overflow, which is undefined behaviour. Even if i was a 128-bit quantity, it would not be big enough to handle all possible 64-bit numbers (such as 263-1) accurately.