Out of nowhere I get quite a big result for this function... It should be very simple, but I can't see it now.
double prob_calculator_t::pimpl_t::B_full_term() const
{
double result = 0.0;
for (uint32_t j=0, j_end=U; j<j_end; j++)
{
uint32_t inhabited_columns = doc->row_sums[j];
// DEBUG
cout << "inhabited_columns: " << inhabited_columns << endl;
cout << "log_of_sum[j]: " << log_of_sum[j] << endl;
cout << "sum_of_log[j]: " << sum_of_log[j] << endl;
// end DEBUG
result += ( -inhabited_columns * log( log_of_sum[j] ) + sum_of_log[ j ] );
cout << "result: " << result << endl;
}
return result;
}
and here is the trace:
inhabited_columns: 1
log_of_sum[j]: 110.56
sum_of_log[j]: -2.81341
result: 2.02102e+10
inhabited_columns: 42
log_of_sum[j]: 110.56
sum_of_log[j]: -143.064
result: 4.04204e+10
Thanks for the help!
inhabited_columns is unsigned and I see a unary - just before it: -inhabited_columns.
(Note that unary - has higher precedence than *, so the negation is applied to inhabited_columns alone, before the multiplication).
That is where your problem is! To quote Mike Seymour's answer:
When you negate it, the result is still unsigned; the value is reduced modulo 2^32 to give a large positive value.
One fix would be to write
-(inhabited_columns * log(log_of_sum[j]))
as then the negation will be carried out in floating point.
inhabited_columns is an unsigned type. When you negate it, the result is still unsigned; the value is reduced modulo 2^32 to give a large positive value.
You should change it to a sufficiently large signed type (maybe int32_t, if you're not going to have more than a couple of billion columns), or perhaps double since you're about to use it in double-precision arithmetic.
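A minimal standalone sketch of the wrap-around and of both fixes, reusing the names and values from the trace purely for illustration:
#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    uint32_t inhabited_columns = 1;      // unsigned, as in the question
    double   log_of_sum        = 110.56;

    // Unsigned negation wraps modulo 2^32 (on platforms where unsigned int is 32 bits),
    // so this term explodes:
    std::cout << -inhabited_columns * std::log(log_of_sum) << '\n';             // ~2.02e+10, as in the trace

    // Fix 1: negate the whole product, so the negation happens in double:
    std::cout << -(inhabited_columns * std::log(log_of_sum)) << '\n';           // ~-4.71

    // Fix 2: convert to double (or a signed type) before negating:
    std::cout << -static_cast<double>(inhabited_columns) * std::log(log_of_sum) << '\n';
}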
Consider the following code for integral types:
template <class T>
std::string as_binary_string( T value ) {
return std::bitset<sizeof( T ) * 8>( value ).to_string();
}
int main() {
unsigned char a(2);
char b(4);
unsigned short c(2);
short d(4);
unsigned int e(2);
int f(4);
unsigned long long g(2);
long long h(4);
std::cout << "a = " << +a << " " << as_binary_string( a ) << std::endl;
std::cout << "b = " << +b << " " << as_binary_string( b ) << std::endl;
std::cout << "c = " << c << " " << as_binary_string( c ) << std::endl;
std::cout << "d = " << c << " " << as_binary_string( d ) << std::endl;
std::cout << "e = " << e << " " << as_binary_string( e ) << std::endl;
std::cout << "f = " << f << " " << as_binary_string( f ) << std::endl;
std::cout << "g = " << g << " " << as_binary_string( g ) << std::endl;
std::cout << "h = " << h << " " << as_binary_string( h ) << std::endl;
std::cout << "\nPress any key and enter to quit.\n";
char q;
std::cin >> q;
return 0;
}
Pretty straightforward; it works well and is quite simple.
EDIT
How would one go about writing a function to extract the binary or bit pattern of arbitrary floating point types at compile time?
When it comes to floats I have not found anything similar in any existing libraries that I know of. I've searched Google for days looking for one, and then resorted to trying to write my own function, without success. I no longer have the attempted code available from when I originally asked this question, so I cannot show you all of the different attempted implementations along with their compiler/build errors. I was interested in generating the bit pattern for floats in a generic way at compile time and wanted to integrate that into my existing class that seamlessly does the same for any integral type. As for the floating types themselves, I have taken into consideration the different formats as well as architecture endianness. For my general purposes, the standard IEEE versions of the floating point types are all that I should need to be concerned with.
iBug suggested that I write my own function when I originally asked this question, which is what I was attempting to do. I understand binary numbers, memory sizes, and the mathematics, but putting it all together with how floating point types are stored in memory, with their different parts {sign bit, exponent, mantissa}, is where I was having the most trouble.
Since then, with the suggestions of those who have given a great answer and example, I was able to write a function that fits nicely into my existing class template, and it now works for my intended purposes.
What about writing one by yourself?
#include <bitset>
#include <climits>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(float) == sizeof(std::uint32_t), "unexpected float size");
static_assert(sizeof(double) == sizeof(std::uint64_t), "unexpected double size");
std::string as_binary_string( float value ) {
std::uint32_t t;
std::memcpy(&t, &value, sizeof(value));
return std::bitset<sizeof(float) * 8>(t).to_string();
}
std::string as_binary_string( double value ) {
std::uint64_t t;
std::memcpy(&t, &value, sizeof(value));
return std::bitset<sizeof(double) * 8>(t).to_string();
}
You may need to change the helper variable t in case the sizes for the floating point numbers are different.
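For example, calling the float overload above (this needs <iostream>; the expected output assumes IEEE-754 single precision):
#include <iostream>

int main() {
    std::cout << as_binary_string(1.0f) << '\n';
    // typically: 00111111100000000000000000000000
    // (sign 0, exponent 01111111 = 127, mantissa all zeros)
}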
You can alternatively copy the value bit by bit. This is slower but works for a type of any size.
template <typename T>
std::string as_binary_string( T value )
{
const std::size_t nbytes = sizeof(T), nbits = nbytes * CHAR_BIT;
std::bitset<nbits> b;
std::uint8_t buf[nbytes];
std::memcpy(buf, &value, nbytes);
for(std::size_t i = 0; i < nbytes; ++i)
{
std::uint8_t cur = buf[i];
int offset = i * CHAR_BIT;
for(int bit = 0; bit < CHAR_BIT; ++bit)
{
b[offset] = cur & 1;
++offset; // Move to next bit in b
cur >>= 1; // Move to next bit in array
}
}
return b.to_string();
}
You said it doesn't need to be standard. So, here is what works in clang on my computer:
#include <iostream>
#include <algorithm>
using namespace std;
int main()
{
char *result;
result=new char[33];
fill(result,result+32,'0');
float input;
cin >>input;
asm(
"mov %0,%%eax\n"
"mov %1,%%rbx\n"
".intel_syntax\n"
"mov rcx,20h\n"
"loop_begin:\n"
"shr eax\n"
"jnc loop_end\n"
"inc byte ptr [rbx+rcx-1]\n"
"loop_end:\n"
"loop loop_begin\n"
".att_syntax\n"
:
: "m" (input), "m" (result)
);
cout <<result <<endl;
delete[] result;
return 0;
}
This code makes a bunch of assumptions about the computer architecture and I am not sure on how many computers it would work.
EDIT:
My computer is a 64-bit Mac-Air. This program basically works by allocating a 33-byte string and filling the first 32 bytes with '0' (the 33rd byte will automatically be '\0').
Then it uses inline assembly to store the float into a 32-bit register and then it repeatedly shifts it to the right by one bit.
If the last bit in the register was 1 before the shift, it gets stored into the carry flag.
The assembly code then checks the carry flag and, if it contains 1, it increases the corresponding byte in the string by 1.
Since it was previously initialized to '0', it will turn to '1'.
So, effectively, when the loop in the assembly is finished, the binary representation of a float is stored into a string.
This code only works for x64 (it uses 64-bit registers "rbx" and "rcx" to store the pointer and the counter for the loop), but I think it's easy to tweak it to work on other processors.
An IEEE double-precision floating point number looks like the following:
sign (1 bit) | exponent (11 bits) | mantissa (52 bits)
Note that there's a hidden 1 before the mantissa, and the exponent is biased, so a stored 1023 means an exponent of 0 (it is not two's complement).
By memcpy()ing to a 64-bit unsigned integer you can then apply AND and OR masks to get the bit pattern. The arrangement could be big endian or little endian. You can easily work out which arrangement you have by passing easy numbers such as 1 or 2.
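A sketch of that memcpy-and-mask approach (the example value, masks, and shifts are my own, assuming a 64-bit IEEE double):
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    double d = -2.5;
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof d);

    std::uint64_t sign     = bits >> 63;                 // 1 bit
    std::uint64_t exponent = (bits >> 52) & 0x7FF;       // 11 bits, biased by 1023
    std::uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;  // 52 bits, hidden leading 1 not stored

    std::cout << "sign=" << sign
              << " exponent=" << exponent << std::hex    // 1024, i.e. unbiased exponent 1
              << " mantissa=0x" << mantissa << '\n';     // 0x4000000000000, i.e. 1.01b = 1.25
    // -2.5 = -1 * 1.25 * 2^1
}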
Generally people either use std::hexfloat or cast a pointer to the floating-point value to a pointer to an unsigned integer of the same size and print the indirected value in hex format. Both methods facilitate bit-level analysis of floating-point in a productive fashion.
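For instance, the std::hexfloat route needs no bit twiddling at all (output shown for a typical IEEE double):
#include <iostream>

int main() {
    std::cout << std::hexfloat << 0.1 << '\n';   // 0x1.999999999999ap-4 on IEEE doubles
}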
You could roll your own by casting the address of the float/double to a char pointer and iterating over it that way:
#include <memory>
#include <iostream>
#include <limits>
#include <iomanip>
template <typename T>
std::string getBits(T t) {
std::string returnString{""};
char *base{reinterpret_cast<char *>(std::addressof(t))};
char *tail{base + sizeof(t) - 1};
do {
for (int bits = std::numeric_limits<unsigned char>::digits - 1; bits >= 0; bits--) {
returnString += ( ((*tail) & (1 << bits)) ? '1' : '0');
}
} while (--tail >= base);
return returnString;
}
int main() {
float f{10.0};
double d{100.0};
double nd{-100.0};
std::cout << std::setprecision(1);
std::cout << getBits(f) << std::endl;
std::cout << getBits(d) << std::endl;
std::cout << getBits(nd) << std::endl;
}
Output on my machine (note the sign flip in the third output):
01000001001000000000000000000000
0100000001011001000000000000000000000000000000000000000000000000
1100000001011001000000000000000000000000000000000000000000000000
I was writing a little function to calculate the binomial coefficient using the tgamma function provided by C++. tgamma returns floating-point values, but I wanted to return an integer. Please take a look at this example program comparing three ways of converting the float back to an int:
#include <iostream>
#include <cmath>
int BinCoeffnear(int n,int k){
return std::nearbyint( std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)) );
}
int BinCoeffcast(int n,int k){
return static_cast<int>( std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)) );
}
int BinCoeff(int n,int k){
return (int) std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1));
}
int main()
{
int n = 7;
int k = 2;
std::cout << "Correct: " << std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1)); //returns 21
std::cout << " BinCoeff: " << BinCoeff(n,k); //returns 20
std::cout << " StaticCast: " << BinCoeffcast(n,k); //returns 20
std::cout << " nearby int: " << BinCoeffnear(n,k); //returns 21
return 0;
}
Why is it that even though the calculation seemingly returns a value equal to 21, 'normal' conversion fails and only nearbyint returns the correct value? What is the nicest way to implement this?
EDIT: according to the C++ documentation here, tgamma(int) returns a double.
From this std::tgamma reference:
If arg is a natural number, std::tgamma(arg) is the factorial of arg-1. Many implementations calculate the exact integer-domain factorial if the argument is a sufficiently small integer.
It seems that the compiler you're using is doing that, calculating the factorial of 7 for the expression std::tgamma(7+1).
The result might differ between compilers, and also between optimization levels. As demonstrated by Jonas there is a big difference between optimized and unoptimized builds.
The remark by #nos is on point. Note that the first line
std::cout << "Correct: " <<
std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1));
prints a double value and does not perform a floating-point-to-integer conversion.
The result of your calculation in floating point is indeed less than 21, yet this double precision value is printed by cout as 21.
On my machine (x86_64, gnu libc, g++ 4.8, optimization level 0) setting cout.precision(18) makes the results explicit.
Correct: 20.9999999999999964 BinCoeff: 20 StaticCast: 20 nearby int: 21
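A self-contained way to see this (the exact trailing digits may differ between implementations):
#include <cmath>
#include <iostream>

int main() {
    double v = std::tgamma(7 + 1) / (std::tgamma(2 + 1) * std::tgamma(7 - 2 + 1));
    std::cout << v << '\n';                       // prints 21 at the default precision
    std::cout.precision(18);
    std::cout << v << '\n';                       // e.g. 20.9999999999999964
    std::cout << static_cast<int>(v) << '\n';     // 20: the conversion truncates
}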
In this case it is practical to replace the integer operations with floating point operations, but one has to keep in mind that the result must be an integer. The idea is to use std::round.
The problem with std::nearbyint is that depending on the rounding mode it may produce different results.
std::fesetround(FE_DOWNWARD);
std::cout << " nearby int: " << BinCoeffnear(n,k);
would return 20.
So with std::round the BinCoeff function might look like
int BinCoeffRound(int n,int k){
return static_cast<int>(
std::round(
std::tgamma(n+1) /
(std::tgamma(k+1)*std::tgamma(n-k+1))
));
}
Floating-point numbers have rounding errors associated with them. Here is a good article on the subject: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
In your case the floating-point number holds a value very close but less than 21. Rules for implicit floating–integral conversions say:
The fractional part is truncated, that is, the fractional part is
discarded.
Whereas std::nearbyint:
Rounds the floating-point argument arg to an integer value in floating-point format, using the current rounding mode.
In this case the floating-point number will be exactly 21 and the following implicit conversion would return 21.
The first cout outputs 21 because of the rounding that cout does by default. See std::setprecision.
Here's a live example.
What is the nicest way to implement this?
Use the exact integer factorial function that takes and returns unsigned int instead of tgamma.
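A sketch of that suggestion: a hypothetical binomial() helper that stays in integers the whole way (the function name and signature are mine):
#include <iostream>

// C(n, k) computed purely with integer arithmetic.
unsigned long long binomial(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;               // use symmetry to shorten the loop
    unsigned long long result = 1;
    for (unsigned i = 1; i <= k; ++i) {
        result = result * (n - k + i) / i;  // exact at every step; can overflow for large n
    }
    return result;
}

int main() {
    std::cout << binomial(7, 2) << '\n';    // 21, with no rounding involved
}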
The problem is in handling the floats. A float may not hold a value like 2 as exactly 2, but as something like 1.99999. Converting to int then drops the decimal part. So instead of converting to int immediately, first round it by calling the ceil function, which is declared in cmath or math.h.
This code returns 21 for all three versions:
#include <iostream>
#include <cmath>
int BinCoeffnear(int n,int k){
return std::nearbyint( std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)) );
}
int BinCoeffcast(int n,int k){
return static_cast<int>( ceil(std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1))) );
}
int BinCoeff(int n,int k){
return (int) ceil(std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)));
}
int main()
{
int n = 7;
int k = 2;
std::cout << "Correct: " << (std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1))); //returns 21
std::cout << " BinCoeff: " << BinCoeff(n,k); //returns 20
std::cout << " StaticCast: " << BinCoeffcast(n,k); //returns 20
std::cout << " nearby int: " << BinCoeffnear(n,k); //returns 21
std::cout << "\n" << (int)(2.9995) << "\n";
}
Be gentle ... I'm 5 weeks into studying C++. I've dug and dug and cannot figure out why Visual Studio Express (and online compilers) are throwing errors about this.
Note that I've included all my declarations for the sake of clarity -- most are used in different sections of the code. The line that gets the errors is this one: newsharePrice = perchangeEnd * profitLoss << '\n';
The error I get is C2296: the left operand has type double. I have no idea why it doesn't like this ... I multiply other doubles just fine.
double numberShares,
sharePrice,
profitLoss,
profitGain,
commissionPaid,
commissionCost,
sharesCost,
totalshareCost,
newtotalshareCost,
newcommissionCost,
newsharePrice;
double perChange;
double perchangeEnd;
const int minVALUE = 1;
const int maxVALUE = 100;
int seed = time(0);
srand (seed);
perChange = (rand() % (maxVALUE - minVALUE + 1)) + minVALUE;
cout << perChange << '\n';
perchangeEnd = perChange / 100;
int flip = rand() % 2 + 1;
if (flip == 1)
profitLoss = 1;
else
profitLoss = -1;
newsharePrice = perchangeEnd * profitLoss << '\n';
newsharePrice = newsharePrice + sharePrice;
cout << newsharePrice << '\n';
newtotalshareCost = numberShares * newsharePrice;
cout << "You've now paid " << newtotalshareCost << " for your shares." << '\n';
newcommissionCost = newtotalshareCost * commissionRate;
cout << "The new amount of commission for this is " << newcommissionCost << " ." << '/n';
Well, just read the problematic line:
newsharePrice = perchangeEnd * profitLoss << '\n';
// ▲▲▲▲▲▲▲▲
That << '\n' is not part of the multiplication; a copy-pasta fail from your cout lines?
In this context, the compiler has no choice but to assume you're trying to perform a bitwise left-shift operation, which cannot be performed on doubles; only on integers.
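A minimal before/after sketch, with placeholder values standing in for the variables from the question:
#include <iostream>

int main() {
    double perchangeEnd = 0.42;     // placeholder values, just for illustration
    double profitLoss   = -1.0;
    double newsharePrice;

    // newsharePrice = perchangeEnd * profitLoss << '\n';   // error C2296: << applied to a double
    newsharePrice = perchangeEnd * profitLoss;               // just the multiplication
    std::cout << newsharePrice << '\n';                      // printing is a separate statement
}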
While the compilation error is now fixed, your domain error is still there (today is Friday, isn't it?). Why would the share price fluctuation affect your commission in any way? You already have your position. You also measure your number of shares with floating-point precision. While in some cases you might have a fractional number of shares, this happens quite seldom. Do you really account for this, or are you just using double incorrectly? Most systems count the number of shares as an integer. Also, you can have a negative position, which after all the calculations will give a negative commission! Brokers would not agree to that ;). Last but not least, in the US commission is rarely expressed as a percentage of transaction value. It is usually charged in the form of cents per share (or a fixed transaction cost for most retail brokers).
I am racking my brain trying to figure out why this code does not get the right result. I am looking for the hexadecimal representations of the floating point positive and negative overflow/underflow levels. The code is based off this site and a Wikipedia entry:
7f7f ffff ≈ 3.4028234 × 10^38 (max single precision) -- from the Wikipedia entry; corresponds to positive overflow
Here's the code:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
using namespace std;
int main(void) {
float two = 2;
float twentyThree = 23;
float one27 = 127;
float one49 = 149;
float posOverflow, negOverflow, posUnderflow, negUnderflow;
posOverflow = two - (pow(two, -twentyThree) * pow(two, one27));
negOverflow = -(two - (pow(two, one27) * pow(two, one27)));
negUnderflow = -pow(two, -one49);
posUnderflow = pow(two, -one49);
cout << "Positive overflow occurs when value greater than: " << hex << *(int*)&posOverflow << endl;
cout << "Neg overflow occurs when value less than: " << hex << *(int*)&negOverflow << endl;
cout << "Positive underflow occurs when value greater than: " << hex << *(int*)&posUnderflow << endl;
cout << "Neg overflow occurs when value greater than: " << hex << *(int*)&negUnderflow << endl;
}
The output is:
Positive overflow occurs when value greater than: f3800000
Neg overflow occurs when value less than: 7f800000
Positive underflow occurs when value greater than: 1
Neg overflow occurs when value greater than: 80000001
To get the hexadecimal representation of the floating point, I am using a method described here:
Why isn't the code working? I know it'll work if positive overflow = 7f7f ffff.
Your expression for the highest representable positive float is wrong. The page you linked uses (2-pow(2, -23)) * pow(2, 127), and you have 2 - (pow(2, -23) * pow(2, 127)). Similarly for the smallest representable negative float.
Your underflow expressions look correct, however, and so do the hexadecimal outputs for them.
Note that posOverflow and negOverflow are simply +FLT_MAX and -FLT_MAX. Also note that your posUnderflow and negUnderflow are actually smaller than FLT_MIN (because they are denormal, and FLT_MIN is the smallest positive normal float).
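A sketch of the corrected expressions, using std::ldexp for exact powers of two and checking against the standard limits (this check is mine, not part of the original code):
#include <cfloat>
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    // (2 - 2^-23) * 2^127 is the largest finite single-precision value
    float posOverflow = (2.0f - std::ldexp(1.0f, -23)) * std::ldexp(1.0f, 127);
    float negOverflow = -posOverflow;

    std::cout << std::boolalpha
              << (posOverflow ==  std::numeric_limits<float>::max()) << '\n'   // true
              << (negOverflow == -FLT_MAX) << '\n';                            // true
}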
Floating point loses precision as the number gets bigger. A number of the magnitude 2^127 does not change when you add 2 to it.
Other than that, I'm not really following your code. Using words to spell out numbers makes it hard for me to read.
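To illustrate the precision point, a quick check (assuming a 32-bit IEEE float):
#include <cmath>
#include <iostream>

int main() {
    float big = std::pow(2.0f, 127.0f);                           // about 1.7e38
    std::cout << std::boolalpha << (big + 2.0f == big) << '\n';   // true: adding 2 changes nothing
}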
Here is the standard way to get the floating-point limits of your machine:
#include <limits>
#include <iostream>
#include <iomanip>
std::ostream &show_float( std::ostream &s, float f ) {
s << f << " = ";
std::ostream s_hex( s.rdbuf() );
s_hex << std::hex << std::setfill( '0' );
for ( char const *c = reinterpret_cast< char const * >( & f );
c != reinterpret_cast< char const * >( & f + 1 );
++ c ) {
s_hex << std::setw( 2 ) << ( static_cast< unsigned int >( * c ) & 0xff );
}
return s;
}
int main() {
std::cout << std::hex;
std::cout << "Positive overflow occurs when value greater than: ";
show_float( std::cout, std::numeric_limits< float >::max() ) << '\n';
std::cout << "Neg overflow occurs when value less than: ";
show_float( std::cout, - std::numeric_limits< float >::max() ) << '\n';
std::cout << "Positive underflow occurs when value less than: ";
show_float( std::cout, std::numeric_limits< float >::min() ) << '\n';
std::cout << "Neg underflow occurs when value greater than: ";
show_float( std::cout, - std::numeric_limits< float >::min() ) << '\n';
}
output:
Positive overflow occurs when value greater than: 3.40282e+38 = ffff7f7f
Neg overflow occurs when value less than: -3.40282e+38 = ffff7fff
Positive underflow occurs when value less than: 1.17549e-38 = 00008000
Neg underflow occurs when value greater than: -1.17549e-38 = 00008080
The output depends on the endianness of the machine. Here the bytes are reversed due to little-endian order.
Note, "underflow" in this case isn't a catastrophic zero result, but just denormalization which gradually reduces precision. (It may be catastrophic to performance, though.) You might also check numeric_limits< float >::denorm_min() which produces 1.4013e-45 = 01000000.
Your code assumes integers have the same size as a float (so do all but a few of the posts on the page you've linked, btw.) You probably want something along the lines of:
for (size_t s = 0; s < sizeof(myVar); ++s) {
unsigned char byte = reinterpret_cast<unsigned char*>(&myVar)[s];
// the s-th byte is byte
}
that is, something akin to the templated code on that page.
Your compiler may not be using those specific IEEE 754 types. You'll need to check its documentation.
Also, consider using std::numeric_limits<float>::min()/max() or the cfloat FLT_ constants for determining some of those values.
I'm making an attempt to learn C++ over again, using Sams Teach Yourself C++ in 21 Days (6th ed.). I'm trying to work through it very thoroughly, making sure I understand each chapter (although I'm acquainted with C-syntax languages already).
Near the start of chapter 5 (Listing 5.2), a point is made about unsigned integer overflow. Based on their example I wrote this:
#include <iostream>
int main () {
unsigned int bignum = 100;
unsigned int smallnum = 50;
unsigned int udiff;
int diff;
udiff = bignum - smallnum;
std::cout << "Difference (1) is " << udiff << "\n";
udiff = smallnum - bignum;
std::cout << "Difference (2) is " << udiff << "\n";
diff = bignum - smallnum;
std::cout << "Difference (3) is " << diff << "\n";
diff = smallnum - bignum;
std::cout << "Difference (4) is " << diff << "\n";
return 0;
}
This gives the following output, which is not surprising to me:
Difference (1) is 50
Difference (2) is 4294967246
Difference (3) is 50
Difference (4) is -50
If I change the program so that the line declaring bignum reads instead unsigned int bignum = 3000000000; then the output is instead
Difference (1) is 2999999950
Difference (2) is 1294967346
Difference (3) is -1294967346
Difference (4) is 1294967346
The first of these is obviously fine. The number 1294967346 is explained by the fact that 1294967346 is precisely 2^32 - 3000000000. I don't understand why the second line doesn't read 1294967396, owing to the 50 contributed by smallnum.
The third and fourth lines I can't explain. How do these results come about?
Edit: For the third line - does it give this result just by finding the solution modulo 2^32 that fits in the range of values allowed for a signed int?
2^32 - 3000000000 = 1294967296 (!)
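To spell the arithmetic out (my own working, following on from the correction above):
#include <iostream>

int main() {
    unsigned int bignum = 3000000000u, smallnum = 50u;

    // 50 - 3000000000 wraps modulo 2^32:
    //   50 - 3000000000 + 4294967296 = 1294967346       (Difference (2) and (4))
    std::cout << smallnum - bignum << '\n';               // 1294967346

    // bignum - smallnum = 2999999950 does not fit in a signed int; the conversion is
    // implementation-defined, and on a typical two's-complement machine it wraps to
    //   2999999950 - 4294967296 = -1294967346            (Difference (3))
    std::cout << static_cast<int>(bignum - smallnum) << '\n';   // typically -1294967346

    // smallnum - bignum = 1294967346 already fits in a signed int, so Difference (4)
    // prints the same value as Difference (2)
    std::cout << static_cast<int>(smallnum - bignum) << '\n';   // 1294967346
}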