Going crazy, why are my variables changing on me? - c++

Okay, I've had this happen to me before, where variables randomly change value because of memory allocation issues or wrong addressing, such as going out of bounds with an array. However, I'm not using arrays, pointers, or addresses here, so I have no idea why, after being set to 0, "exponent" suddenly equals 288 inside this loop:
EDIT: It breaks specifically on the input 0x80800000.
This does not break in one test. We have a "testing" client which iterates through several test cases and calls this function again for each one; every time the function is called, the values should be reset to their original values.
/*
* float_i2f - Return bit-level equivalent of expression (float) x
* Result is returned as unsigned int, but
* it is to be interpreted as the bit-level representation of a
* single-precision floating point values.
* Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
* Max ops: 30
* Rating: 4
*/
unsigned float_i2f(int x) {
int sign= 0;
int a=0;
int exponent=0;
int crash_test=0;
int exp=0;
int fraction=0;
int counter=0;
if (x == 0) return 0;
if (!(x ^ (0x01 << 31)))
{
return 0xCF << 24;
}
if (x>>31)
{
sign = 0xFF << 31;
x = (~x) + 1;
}
else
{
sign = 0x00;
}
//printf(" After : %x ", x);
a = 1;
exponent = 0;
crash_test = 0;
while ((a*2) <= x)
{
if (a == 0) a =1;
if (a == 1) crash_test = exponent;
/*
if(exponent == 288)
{exponent =0;
counter ++;
if(counter <=2)
printf("WENT OVERBOARD WTF %d ORIGINAL %d", a, crash_test);
}
*/
if (exponent > 300) break;
exponent ++;
a *= 2;
}
exp = (exponent + 0x7F) << 23;
fraction = (~(((0x01)<< 31) >> 7)) & (x << (25 - (exponent + 1)));
return sign | exp | fraction;
}

Use a debugger or IDE, and set a watch/breakpoint/assert on the value of exponent (e.g. exponent > 100).
What was the offending value of x that float_i2f() was called with? Did exponent blow up for all x, or some range?
(Did you just say when x = 0x80800000 ? Did you set a watch on exponent and step that in a debugger for that value? Should answer your question. Did you check that 0x807FFFFF works, for example?)

I tried it myself with Visual Studio, and an input of "10", and it seemed to work OK.
Q: Can you give me an input value of "x" where it fails?
Q: What compiler are you using? What platform are you running on?

You have a line that increments exponent at the end of your while loop.
while((a*2) <= x)
{
if(a == 0) a =1;
if(a == 1) crash_test = exponent;
/*
if(exponent == 288)
{
exponent =0;
counter ++;
if(counter <=2)
printf("WENT OVERBOARD WTF %d ORIGINAL %d", a, crash_test);
}
*/
if(exponent > 300) break;
exponent ++;
a *= 2;
}

The variable exponent isn't doing anything mysterious. You are incrementing exponent each time through the loop, so it eventually hits any number you like. The real question is why doesn't your loop exit when you think it should?
Your loop condition depends on a. Try printing out the successive values of a as your loop repeats. Do you notice anything funny happening after a reaches 1073741824? Have you heard about integer overflow in your classes yet?

Just handle the case where "a" goes negative (or better, validate your input so it never goes negative in the first place), and you should be fine :)
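To make the overflow point concrete, here is a minimal sketch (not the asker's code; the name highest_bit is made up for illustration) that finds the exponent by shifting the input down instead of doubling a signed counter, so no intermediate value can overflow. It assumes the input has already been converted to a non-zero unsigned magnitude.

```cpp
#include <cstdint>

// Hypothetical helper: index of the highest set bit of a non-zero value.
// Shifting x to the right cannot overflow, unlike doubling a signed counter.
int highest_bit(std::uint32_t x) {
    int exponent = 0;
    while (x >> 1) {
        x >>= 1;
        ++exponent;
    }
    return exponent;
}
```

For the failing input 0x80800000 the magnitude is 0x7F800000, and highest_bit returns 30 for it.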

There were many useless attempts at optimization in there, I've removed them so the code is easier to read. Also I used <stdint.h> types as appropriate.
There was signed integer overflow in a *= 2 in the loop, but the main problem was lack of constants and weird computation of magic numbers.
This still isn't exemplary because the constants should all be named, but this seems to work reliably.
#include <stdio.h>
#include <stdint.h>
uint32_t float_i2f(int32_t x) {
uint32_t sign= 0;
uint32_t exponent=0;
uint32_t fraction=0;
if (x == 0) return 0;
if ( x == 0x80000000 )
{
return 0xCF000000u;
}
if ( x < 0 )
{
sign = 0x80000000u;
x = - x;
}
else
{
sign = 0;
}
/* Count order of magnitude, this will be excessive by 1. */
for ( exponent = 1; ( 1u << exponent ) <= x; ++ exponent ) ;
if ( exponent < 24 ) {
fraction = 0x007FFFFF & ( x << 24 - exponent ); /* strip leading 1-bit */
} else {
fraction = 0x007FFFFF & ( x >> exponent - 24 );
}
exponent = (exponent + 0x7E) << 23;
return sign | exponent | fraction;
}

a overflows. a*2==0 when a==1<<31, so every time exponent%32==0, a==0 and you loop until exponent==300.
There are a few other issues as well:
Your fraction calculation is off when exponent>=24. Negative left shifts do not automatically turn into positive right shifts.
The mask to generate the fraction is also slightly wrong. The leading bit is always assumed to be 1, and the mantissa is only 23 bits, so fraction for x<2^23 should be:
fraction = (~(((0x01)<< 31) >> 8)) & (x << (24 - (exponent + 1)));
The loop to calculate the exponent fails when abs(x)>=1<<31 (and incidentally results in precision loss if you don't round appropriately); a loop that takes the implicit 1 into account would be better here.
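As a rough sketch of what these points add up to, the version below works on an unsigned magnitude (so nothing overflows), locates the leading 1 bit directly, and rounds to nearest-even when bits have to be dropped. It ignores the assignment's operator and op-count restrictions, assumes IEEE-754 binary32, and the name int_to_float_bits is invented here.

```cpp
#include <cstdint>

// Sketch: bit pattern of (float)x, assuming IEEE-754 binary32 and
// round-to-nearest-even. Not constrained by the assignment's "legal ops".
std::uint32_t int_to_float_bits(std::int32_t x) {
    if (x == 0) return 0;
    std::uint32_t sign = 0;
    std::uint32_t mag = static_cast<std::uint32_t>(x);
    if (x < 0) {
        sign = 0x80000000u;
        mag = 0u - mag;              // unsigned negation, also fine for INT32_MIN
    }
    int msb = 31;                    // position of the leading (implicit) 1 bit
    while (!(mag >> msb)) --msb;     // terminates because mag != 0
    std::uint32_t exponent = static_cast<std::uint32_t>(msb) + 127u;
    std::uint32_t fraction;
    if (msb <= 23) {
        // Everything fits: shift left and strip the leading 1.
        fraction = (mag << (23 - msb)) & 0x007FFFFFu;
    } else {
        // Too many bits: shift right and round to nearest, ties to even.
        const int shift = msb - 23;
        const std::uint32_t dropped = mag & ((1u << shift) - 1u);
        const std::uint32_t half = 1u << (shift - 1);
        std::uint32_t kept = mag >> shift;          // 24 bits including the leading 1
        if (dropped > half || (dropped == half && (kept & 1u))) ++kept;
        if (kept >> 24) {                           // rounding carried into bit 24
            kept >>= 1;
            ++exponent;
        }
        fraction = kept & 0x007FFFFFu;
    }
    return sign | (exponent << 23) | fraction;
}
```

With this sketch, the input 0x80800000 (that is, -2139095040) yields 0xCEFF0000.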

Related

Adding positive and negative numbers in IEEE-754 format

My problem seems to be pretty simple: I wrote a program that manually adds floating point numbers together. This program has certain restrictions (such as no iostream and no use of any unary operators), which is the reason for the lack of those things. As for the problem, the program seems to function correctly when adding two positive floats (1.5 + 1.5 = 3.0, for example), but when adding a positive and a negative number (10.0 + -5.0) I get very wacky numbers. Here is the code:
#include <cstdio>
#define BIAS32 127
struct Real
{
//sign bit
int sign;
//UNBIASED exponent
long exponent;
//Fraction including implied 1. at bit index 23
unsigned long fraction;
};
Real Decode(int float_value);
int Encode(Real real_value);
Real Normalize(Real value);
Real Add(Real left, Real right);
unsigned long Add(unsigned long leftop, unsigned long rightop);
unsigned long Multiply(unsigned long leftop, unsigned long rightop);
void alignExponents(Real* left, Real* right);
bool is_neg(Real real);
int Twos(int op);
int main(int argc, char* argv[])
{
int left, right;
char op;
int value;
Real rLeft, rRight, result;
if (argc < 4) {
printf("Usage: %s <left> <op> <right>\n", argv[0]);
return -1;
}
sscanf(argv[1], "%f", (float*)&left);
sscanf(argv[2], "%c", &op);
sscanf(argv[3], "%f", (float*)&right);
rLeft = Decode(left);
rRight = Decode(right);
if (op == '+') {
result = Add(rLeft, rRight);
}
else {
printf("Unknown operator '%c'\n", op);
return -2;
}
value = Encode(result);
printf("%.3f %c %.3f = %.3f (0x%08x)\n",
*((float*)&left),
op,
*((float*)&right),
*((float*)&value),
value
);
return 0;
}
Real Decode(int float_value)
{ // Test sign bit of float_value - Test exponent bits of float_value & apply bias - Test mantissa bits of float_value
Real result{ float_value >> 31 & 1 ? 1 : 0, ((long)Add(float_value >> 23 & 0xFF, -BIAS32)), (unsigned long)float_value & 0x7FFFFF };
return result;
};
int Encode(Real real_value)
{
int x = 0;
x |= real_value.fraction; // Set the fraction bits of x
x |= real_value.sign << 31; // Set the sign bits of x
x |= Add(real_value.exponent, BIAS32) << 23; // Set the exponent bits of x
return x;
}
Real Normalize(Real value)
{
if (is_neg(value))
{
value.fraction = Twos(value.fraction);
}
unsigned int i = 0;
while (i < 9)
{
if ((value.fraction >> Add(23, i)) & 1) // If there are set bits past the mantissa section
{
value.fraction >>= 1; // shift mantissa right by 1
value.exponent = Add(value.exponent, 1); // increment exponent to accomodate for shift
}
i = Add(i, 1);
}
return value;
}
Real Add(Real left, Real right)
{
Real a = left, b = right;
alignExponents(&a, &b); // Aligns exponents of both operands
unsigned long sum = Add(a.fraction, b.fraction);
Real result = Normalize({ a.sign, a.exponent, sum }); // Normalize result if need be
return result;
}
unsigned long Add(unsigned long leftop, unsigned long rightop)
{
unsigned long sum = 0, test = 1; // sum initialized to 0, test created to compare bits
while (test) // while test is not 0
{
if (leftop & test) // if the digit being tested is 1
{
if (sum & test) sum ^= test << 1; // if the sum tests to 1, carry a bit over
sum ^= test;
}
if (rightop & test)
{
if (sum & test) sum ^= test << 1;
sum ^= test;
}
test <<= 1;
}
return sum;
}
void alignExponents(Real* a, Real* b)
{
if (a->exponent != b->exponent) // If the exponents are not equal
{
if (a->exponent > b->exponent)
{
int disp = a->exponent - b->exponent; // number of shifts needed based on difference between two exponents
b->fraction |= 1 << 23; // sets the implicit bit for shifting
b->exponent = a->exponent; // sets exponents equal to each other
b->fraction >>= disp; // mantissa is shifted over to accomodate for the increase in power
return;
}
int disp = b->exponent - a->exponent;
a->fraction |= 1 << 23;
a->exponent = b->exponent;
a->fraction >>= disp;
return;
}
return;
}
bool is_neg(Real real)
{
if (real.sign) return true;
return false;
}
int Twos(int op)
{
return Add(~op, -1); // NOT the operand and add 1 to it
}
On top of that, I just tested the values 10.5 + 5.5 and got a 24.0, so there appears to be even more wrong with this than I initially thought. I've been working on this for days and would love some help/advice.
Here is some help/advice. Now that you have worked on some of the code, I suggest going back and reworking your data structure. The declaration of such a crucial data structure would benefit from a lot more comments, making sure you know exactly what each field means.
For example, the implicit bit is not always 1. It is zero if the exponent is zero. That should be dealt with in your Encode and Decode functions. For the rest of your code, it is just a significand bit and should not have any special handling.
When you start thinking about rounding, you will find you often need more than 23 bits in an intermediate result.
Making the significand of negative numbers 2's complement will create a problem of having the same information stored two ways. You will have both a sign bit as though doing sign-and-magnitude and have the sign encoded in the signed integer significand. Keeping them consistent will be a mess. Whatever you decide about how Real will store negative numbers, document it and keep it consistent throughout.
If I were implementing this I would start by defining Real very, very carefully. I would then decide what operations I wanted to be able to do on Real, and write functions to do them. If you get those right each function will be relatively simple.
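One possible shape for such a definition, as a sketch only: sign-and-magnitude storage, with the leading significand bit made explicit in decode and removed again in encode, so the rest of the code can treat it as an ordinary bit. The field names and the lower-case decode/encode names are choices made for this illustration, not the asker's API, and rounding, infinities and NaNs are not treated specially.

```cpp
#include <cstdint>

// Sketch of a carefully documented Real (sign-and-magnitude).
struct Real {
    bool negative;              // true if the sign bit was set
    std::int32_t exponent;      // unbiased exponent of bit 23 of significand
    std::uint32_t significand;  // magnitude; bit 23 is the (explicit) leading bit
};

Real decode(std::uint32_t bits) {
    const std::uint32_t e = (bits >> 23) & 0xFFu;
    const std::uint32_t m = bits & 0x007FFFFFu;
    Real r;
    r.negative = (bits >> 31) != 0;
    if (e == 0) {
        // Zero or subnormal: the leading bit is 0 and the exponent is fixed.
        r.exponent = -126;
        r.significand = m;
    } else {
        r.exponent = static_cast<std::int32_t>(e) - 127;
        r.significand = m | (1u << 23);
    }
    return r;
}

// Assumes r is either normalized (bit 23 set) or a subnormal with exponent -126.
std::uint32_t encode(const Real& r) {
    const bool normal = (r.significand & (1u << 23)) != 0;
    const std::uint32_t e = normal ? static_cast<std::uint32_t>(r.exponent + 127) : 0u;
    return (r.negative ? 0x80000000u : 0u) | (e << 23) | (r.significand & 0x007FFFFFu);
}
```

With a definition like this, an Add function only has to align exponents, add or subtract magnitudes depending on the two sign flags, and renormalize.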

How can I get numerator and denominator from a fractional number?

How can I get the numerator and denominator from a fractional number? For example, from "1.375" I want to get "1375/1000" or "11/8" as a result. How can I do it with C++?
I have tried to do it by separating the numbers before the point and after the point but it doesn't give any idea how to get my desired output.
You didn't really specify whether you need to convert a floating point or a string to ratio, so I'm going to assume the former one.
Instead of trying string or arithmetic-based approaches, you can directly use properties of IEEE-754 encoding.
Floats (called binary32 by the standard) are encoded in memory like this:
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
^ ^
bit 31 bit 0
where S is the sign bit, Es are exponent bits (8 of them), and Ms are mantissa bits (23 of them).
The number can be decoded like this:
value = (-1)^S * significand * 2 ^ exponent
where:
significand = 1.MMMMMMMMMMMMMMMMMMMMMMM (as binary)
exponent = EEEEEEEE (as binary) - 127
(note: this is for so called "normal numbers", there are also zeroes, subnormals, infinities and NaNs - see Wikipedia page I linked)
This can be used here. We can rewrite the equation above like this:
(-1)^S * significand * 2 ^ exponent = (-1)^S * (significand * 2^23) * 2 ^ (exponent - 23)
The point is that significand * 2^23 is an integer (equal to 1MMMMMMMMMMMMMMMMMMMMMMM in binary: by multiplying by 2^23, we moved the point 23 places right). 2 ^ (exponent - 23) is an integer too when exponent - 23 >= 0; otherwise it is 1 divided by the integer 2 ^ (-(exponent - 23)).
In other words: we can write the number as:
(significand * 2^23) / 2^(-(exponent - 23)) (when exponent - 23 < 0)
or
[(significand * 2^23) * 2^(exponent - 23)] / 1 (when exponent - 23 >= 0)
So we have both numerator and denominator - directly from binary representation of the number.
All of the above could be implemented like this in C++:
struct Ratio
{
int64_t numerator; // numerator includes sign
uint64_t denominator;
float toFloat() const
{
return static_cast<float>(numerator) / denominator;
}
static Ratio fromFloat(float v)
{
// First, obtain bitwise representation of the value
const uint32_t bitwiseRepr = *reinterpret_cast<uint32_t*>(&v);
// Extract sign, exponent and mantissa bits (as stored in memory) for convenience:
const uint32_t signBit = bitwiseRepr >> 31u;
const uint32_t expBits = (bitwiseRepr >> 23u) & 0xffu; // 8 bits set
const uint32_t mntsBits = bitwiseRepr & 0x7fffffu; // 23 bits set
// Handle some special cases:
if(expBits == 0 && mntsBits == 0)
{
// special case: +0 and -0
return {0, 1};
}
else if(expBits == 255u && mntsBits == 0)
{
// special case: +inf, -inf
// Let's agree that infinity is always represented as 1/0 in Ratio
return {signBit ? -1 : 1, 0};
}
else if(expBits == 255u)
{
// special case: nan
// Let's agree that if we get NaN, we return max int64_t over 0
return {std::numeric_limits<int64_t>::max(), 0};
}
// prepend the implicit leading 1 (bit 23) to the 23 mantissa bits
uint32_t significand = (1u << 23u) | mntsBits;
const int64_t signFactor = signBit ? -1 : 1;
const int32_t exp = expBits - 127 - 23;
if(exp < 0)
{
// shift a 64-bit 1 so that denominators of 2^32 and above do not overflow
return {signFactor * static_cast<int64_t>(significand), uint64_t{1} << static_cast<uint32_t>(-exp)};
}
else
{
return {signFactor * static_cast<int64_t>(significand * (uint64_t{1} << static_cast<uint32_t>(exp))), 1};
}
}
};
(hopefully comments and description above are understandable - let me know, if there's something to improve)
I've omitted checks for out of range values for simplicity.
We can use it like this:
float fv = 1.375f;
Ratio rv = Ratio::fromFloat(fv);
std::cout << "fv = " << fv << ", rv = " << rv << ", rv.toFloat() = " << rv.toFloat() << "\n";
And the output is:
fv = 1.375, rv = 11534336/8388608, rv.toFloat() = 1.375
As you can see, exactly the same values on both ends.
The problem is that the numerators and denominators are big. This is because the code always multiplies the significand by 2^23, even if a smaller value would be enough to make it an integer (this is equivalent to writing 0.2 as 2000000/10000000 instead of 2/10: it's the same thing, only written differently).
This can be solved by changing the code to multiply the significand by the smallest sufficient power of two (and adjust the exponent to match), like this (ellipsis stands for parts which are the same as above):
// counts number of subsequent least significant bits equal to 0
// example: for 1001000 (binary) returns 3
uint32_t countTrailingZeroes(uint32_t v)
{
uint32_t counter = 0;
while(counter < 32 && (v & 1u) == 0)
{
v >>= 1u;
++counter;
}
return counter;
}
struct Ratio
{
...
static Ratio fromFloat(float v)
{
...
uint32_t significand = (1u << 23u) | mntsBits;
const uint32_t nTrailingZeroes = countTrailingZeroes(significand);
significand >>= nTrailingZeroes;
const int64_t signFactor = signBit ? -1 : 1;
const int32_t exp = expBits - 127 - 23 + nTrailingZeroes;
if(exp < 0)
{
// shift a 64-bit 1 so that denominators of 2^32 and above do not overflow
return {signFactor * static_cast<int64_t>(significand), uint64_t{1} << static_cast<uint32_t>(-exp)};
}
else
{
return {signFactor * static_cast<int64_t>(significand * (uint64_t{1} << static_cast<uint32_t>(exp))), 1};
}
}
};
And now, for the following code:
float fv = 1.375f;
Ratio rv = Ratio::fromFloat(fv);
std::cout << "fv = " << fv << ", rv = " << rv << ", rv.toFloat() = " << rv.toFloat() << "\n";
We get:
fv = 1.375, rv = 11/8, rv.toFloat() = 1.375
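For reference, the same idea can be condensed into one self-contained function. This is a sketch under stated assumptions, not a drop-in replacement for the Ratio struct: the name float_to_ratio and the std::pair return type are invented here, std::memcpy is used to read the bits, only zero and normal numbers are handled, and the exponent is assumed to be small enough in magnitude that both parts fit into 64 bits.

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Sketch: {numerator, denominator} in lowest terms for a zero or normal float.
// Assumes IEEE-754 binary32 and that the result fits into 64-bit integers.
std::pair<std::int64_t, std::uint64_t> float_to_ratio(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);           // read the representation safely
    const bool negative = (bits >> 31) != 0;
    const std::uint32_t expBits = (bits >> 23) & 0xFFu;
    const std::uint32_t mntsBits = bits & 0x7FFFFFu;
    if (expBits == 0 && mntsBits == 0) return {0, 1};   // +0 and -0

    std::uint64_t significand = (1u << 23) | mntsBits;  // add the implicit leading 1
    int exp = static_cast<int>(expBits) - 127 - 23;
    while ((significand & 1u) == 0) {                   // drop trailing zero bits
        significand >>= 1;
        ++exp;
    }
    std::uint64_t num = significand;
    std::uint64_t den = 1;
    if (exp < 0) {
        den <<= -exp;    // value = significand / 2^(-exp)
    } else {
        num <<= exp;     // value = significand * 2^exp
    }
    const std::int64_t n = static_cast<std::int64_t>(num);
    return {negative ? -n : n, den};
}
```

Because the significand is odd after the loop and the denominator is a power of two, the fraction is already in lowest terms and no separate gcd step is needed.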
In C++ you can use the Boost rational class, but you need to give it the numerator and denominator.
For this you need to find the number of digits in the input string after the decimal point. You can do this with string manipulation functions: read the input character by character and count the characters after the '.'.
char inputstr[30];
int noint=0, nodec=0;
char intstr[30], dec[30];
int decimalfound = 0;
int denominator = 1;
int numerator;
scanf("%s",inputstr);
int len = strlen(inputstr);
for (int i=0; i<len; i++)
{
if (decimalfound ==0)
{
if (inputstr[i] == '.')
{
decimalfound = 1;
}
else
{
intstr[noint++] = inputstr[i];
}
}
else
{
dec[nodec++] = inputstr[i];
denominator *=10;
}
}
dec[nodec] = '\0';
intstr[noint] = '\0';
numerator = atoi(dec) + (atoi(intstr) * denominator); // scale the integer part by the denominator, not a fixed 1000
// You can now use the numerator and denominator as the fraction,
// either in the Rational class or you can find gcd and divide by
// gcd.
What about this simple code:
double n = 1.375;
int num = 1, den = 1;
double frac = (num * 1.0 / den);
double margin = 0.000001;
while (fabs(frac - n) > margin){ // fabs from <cmath>; plain abs may pick the int overload
if (frac > n){
den++;
}
else{
num++;
}
frac = (num * 1.0 / den);
}
I haven't really tested it much; it's only an idea.
I hope I'll be forgiven for posting an answer which uses "only the C language". I know you tagged the question with C++ - but I couldn't turn down the bait, sorry. This is still valid C++ at least (although it does, admittedly, use mainly C string-processing techniques).
int num_string_float_to_rat(char *input, long *num, long *den) {
char *tok = NULL, *end = NULL;
char buf[128] = {'\0'};
long a = 0, b = 0;
int den_power = 1;
strncpy(buf, input, sizeof(buf) - 1);
tok = strtok(buf, ".");
if (!tok) return 1;
a = strtol(tok, &end, 10);
if (*end != '\0') return 2;
tok = strtok(NULL, ".");
if (!tok) return 1;
den_power = strlen(tok); // Denominator power of 10
b = strtol(tok, &end, 10);
if (*end != '\0') return 2;
*den = static_cast<int>(pow(10.00, den_power));
*num = a * *den + b;
num_simple_fraction(num, den);
return 0;
}
Sample usage:
int rc = num_string_float_to_rat("0015.0235", &num, &den);
// Check return code -> should be 0!
printf("%ld/%ld\n", num, den);
Output:
30047/2000
Full example at http://codepad.org/CFQQEZkc .
Notes:
strtok() is used to parse the input in to tokens (no need to reinvent the wheel in that regard). strtok() modifies its input - so a temporary buffer is used for safety
it checks for invalid characters - and will return a non-zero return code if found
strtol() has been used instead of atoi() - as it can detect non-numeric characters in the input
scanf() has not been used to slurp the input - due to rounding issues with floating point numbers
the base for strtol() has been explicitly set to 10 to avoid problems with leading zeros (otherwise a leading zero will cause the number to be interpreted as octal)
it uses a num_simple_fraction() helper (not shown) - which in turn uses a gcd() helper (also not shown) - to convert the result to a simple fraction
log10() of the denominator is determined by calculating the length of the token after the decimal point
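Since the num_simple_fraction() helper is not shown, here is one guess at what it could look like. The signature is inferred from the call site above, and std::gcd (C++17, from <numeric>) stands in for the gcd() helper that is also not shown; the real helpers may differ.

```cpp
#include <numeric>  // std::gcd (C++17)

// Hypothetical stand-in for the helper that is not shown above:
// divides numerator and denominator by their greatest common divisor.
void num_simple_fraction(long *num, long *den) {
    const long g = std::gcd(*num, *den);
    if (g > 1) {
        *num /= g;
        *den /= g;
    }
}
```

For the sample input, 150235/10000 has a greatest common divisor of 5, which gives the 30047/2000 shown in the output.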
I'd do this in three steps.
1) find the decimal point, so that you know how large the denominator has to be.
2) get the numerator. That's just the original text with the decimal point removed.
3) get the denominator. If there was no decimal point, the denominator is 1. Otherwise, the denominator is 10^n, where n is the number of digits to the right of the (now-removed) decimal point.
struct fraction {
std::string num, den;
};
fraction parse(std::string input) {
// 1:
std::size_t dec_point = input.find('.');
// 2:
std::size_t digits_after = 0; // digits to the right of the decimal point
if (dec_point != std::string::npos) {
digits_after = input.length() - dec_point - 1;
input.erase(dec_point, 1); // remove the decimal point itself
}
// 3:
int denom = 1;
for (std::size_t i = 0; i < digits_after; ++i)
denom *= 10;
fraction result = { input, std::to_string(denom) };
return result;
}

Converting a 'long' type into a binary String

My objective is to write an algorithm that would be able to convert a long number into a binary number stored in a string.
Here is my current block of code:
#include <iostream>
#define LONG_SIZE 64; // size of a long type is 64 bits
using namespace std;
string b10_to_b2(long x)
{
string binNum;
if(x < 0) // determine if the number is negative; a number in two's complement is negative if its first bit is one.
{
binNum = "1";
}
else
{
binNum = "0";
}
int i = LONG_SIZE - 1;
while(i > 0)
{
i --;
if( (x & ( 1 << i) ) == ( 1 << i) )
{
binNum = binNum + "1";
}
else
{
binNum = binNum + "0";
}
}
return binNum;
}
int main()
{
cout << b10_to_b2(10) << endl;
}
The output of this program is:
00000000000000000000000000000101000000000000000000000000000001010
I want the output to be:
00000000000000000000000000000000000000000000000000000000000001010
Can anyone identify the problem? For whatever reason the function outputs 10 represented by 32 bits concatenated with another 10 represented by 32 bits.
why would you assume long is 64 bit?
try const size_t LONG_SIZE=sizeof(long)*8;
check this, the program works correctly with my changes
http://ideone.com/y3OeB3
Edit: as @Mats Petersson pointed out, you can make it more robust by changing this line
if( (x & ( 1 << i) ) == ( 1 << i) )
to something like
if( (x & ( 1UL << i) ) ) where that UL is important; you can see his explanation in the comments
Several suggestions:
Make sure you use a type that is guaranteed to be 64-bit, such as uint64_t, int64_t or long long.
Use the above-mentioned 64-bit type for the value being shifted (for example 1ULL << i rather than 1 << i) to guarantee that the shift calculates correctly. A shift is only defined by the standard when the number of bits shifted is less than the number of bits in the type being shifted, and 1 has type int, which on most modern platforms (evidently including yours) is 32 bits.
Don't put semicolon on the end of your #define LONG_SIZE - or better yet, use const int long_size = 64; as this allows all manner of better behaviour, for example that you in the debugger can print long_size and get 64, where print LONG_SIZE where LONG_SIZE is a macro will yield an error in the debugger.
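Putting these suggestions together, a minimal sketch could look like the following. The function name to_binary is made up for this example; it takes a fixed-width std::int64_t and shifts an unsigned 64-bit copy, so neither the width of long nor the type of the literal 1 matters.

```cpp
#include <cstdint>
#include <string>

// Sketch: 64-character binary representation of a 64-bit value.
std::string to_binary(std::int64_t x) {
    // Work on the unsigned bit pattern so right shifts are well defined.
    const std::uint64_t bits = static_cast<std::uint64_t>(x);
    std::string out;
    out.reserve(64);
    for (int i = 63; i >= 0; --i) {
        out += ((bits >> i) & 1u) ? '1' : '0';
    }
    return out;
}
```

Calling to_binary(10) gives sixty zeros followed by 1010.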

How to convert a double to a C# decimal in C++?

Given the representation of decimal I have (you can find it here, for instance), I tried to convert a double this way:
explicit Decimal(double n)
{
DoubleAsQWord doubleAsQWord;
doubleAsQWord.doubleValue = n;
uint64 val = doubleAsQWord.qWord;
const uint64 topBitMask = (int64)(0x1 << 31) << 32;
//grab the 63th bit
bool isNegative = (val & topBitMask) != 0;
//bias is 1023=2^(k-1)-1, where k is 11 for double
uint32 exponent = (((uint64)(val >> 31) >> 21) & 0x7FF) - 1023;
//exclude both sign and exponent (<<12, >>12) and normalize mantissa
uint64 mantissa = ((uint64)(0x1 << 31) << 21) | (val << 12) >> 12;
// normalized mantissa is 53 bits long,
// the exponent does not care about normalizing bit
uint8 scale = exponent + 11;
if (scale > 11)
scale = 11;
else if (scale < 0)
scale = 0;
lo_ = ((isNegative ? -1 : 1) * n) * std::pow(10., scale);
signScale_ = (isNegative ? 0x1 : 0x0) | (scale << 1);
// will always be 0 since we cannot reach
// a 128 bits precision with a 64 bits double
hi_ = 0;
}
The DoubleAsQWord type is used to "cast" from double to its uint64 representation:
union DoubleAsQWord
{
double doubleValue;
uint64 qWord;
};
My Decimal type has these fields:
uint64 lo_;
uint32 hi_;
int32 signScale_;
All this stuff is encapsulated in my Decimal class. You can notice I extract the mantissa even if I'm not using it. I'm still thinking of a way to guess the scale accurately.
This is purely practical, and seems to work in the case of a stress test:
BOOST_AUTO_TEST_CASE( convertion_random_stress )
{
const double EPSILON = 0.000001f;
srand(time(0));
for (int i = 0; i < 10000; ++i)
{
double d1 = ((rand() % 10) % 2 == 0 ? -1 : 1)
* (double)(rand() % 1000 + 1000.) / (double)(rand() % 42 + 2.);
Decimal d(d1);
double d2 = d.toDouble();
double absError = fabs(d1 - d2);
BOOST_CHECK_MESSAGE(
absError <= EPSILON,
"absError=" << absError << " with " << d1 << " - " << d2
);
}
}
Anyway, how would you convert from double to this decimal representation?
I think you guys will be interested in an implementation of a C++ wrapper to the Intel Decimal Floating-Point Math Library:
C++ Decimal Wrapper Class
Intel DFP
What about using the VarDecFromR8 function? (VarR8FromDec converts in the opposite direction, from a decimal to a double.)
EDIT: This function is declared on Windows system only. However an equivalent C implementation is available with WINE, here: http://source.winehq.org/source/dlls/oleaut32/vartype.c
Perhaps you are looking for System::Convert::ToDecimal()
http://msdn.microsoft.com/en-us/library/a69w9ca0%28v=vs.80%29.aspx
Alternatively you could try recasting the Double as a Decimal.
An example from the MSDN.
http://msdn.microsoft.com/en-us/library/aa326763%28v=vs.71%29.aspx
// Convert the double argument; catch exceptions that are thrown.
void DecimalFromDouble( double argument )
{
Object* decValue;
// Convert the double argument to a Decimal value.
try
{
decValue = __box( (Decimal)argument );
}
catch( Exception* ex )
{
decValue = GetExceptionType( ex );
}
Console::WriteLine( formatter, __box( argument ), decValue );
}
If you do not have access to the .Net routines then this is tricky. I have done this myself for my hex editor (so that users can display and edit C# Decimal values using the Properties dialog) - see http://www.hexedit.com for more information. Also the source for HexEdit is freely available - see my article at http://www.codeproject.com/KB/cpp/HexEdit.aspx.
Actually my routines convert between Decimal and strings but you can of course use sprintf to convert the double to a string first. (Also when you talk about double I think you explicitly mean IEEE 64-bit floating point format, though this is what most compilers/systems use nowadays.)
Note that there are a few gotchas if you want to handle precisely all valid Decimal values and return an error for any value that cannot be converted, since the format is not well documented. (The Decimal format is awful really, e.g. the same number can have many representations.)
Here is my code that converts a string to a Decimal. Note that it uses the GNU Multiple Precision Arithmetic Library (functions that start with mpz_). The String2Decimal function returns false if it fails for some reason, such as the value being too big. The parameter 'presult' must point to a buffer of at least 16 bytes, to store the result.
bool String2Decimal(const char *ss, void *presult)
{
bool retval = false;
// View the decimal (result) as four 32 bit integers
unsigned __int32 *dd = (unsigned __int32 *)presult;
mpz_t mant, max_mant;
mpz_inits(mant, max_mant, NULL);
int exp = 0; // Exponent
bool dpseen = false; // decimal point seen yet?
bool neg = false; // minus sign seen?
// Scan the characters of the value
const char *pp;
for (pp = ss; *pp != '\0'; ++pp)
{
if (*pp == '-')
{
if (pp != ss)
goto exit_func; // minus sign not at start
neg = true;
}
else if (isdigit(*pp))
{
mpz_mul_si(mant, mant, 10);
mpz_add_ui(mant, mant, unsigned(*pp - '0'));
if (dpseen) ++exp; // Keep track of digits after decimal pt
}
else if (*pp == '.')
{
if (dpseen)
goto exit_func; // more than one decimal point
dpseen = true;
}
else if (*pp == 'e' || *pp == 'E')
{
char *end;
exp -= strtol(pp+1, &end, 10);
pp = end;
break;
}
else
goto exit_func; // unexpected character
}
if (*pp != '\0')
goto exit_func; // extra characters after end
if (exp < -28 || exp > 28)
goto exit_func; // exponent outside valid range
// Adjust mantissa for -ve exponent
if (exp < 0)
{
mpz_t tmp;
mpz_init_set_si(tmp, 10);
mpz_pow_ui(tmp, tmp, -exp);
mpz_mul(mant, mant, tmp);
mpz_clear(tmp);
exp = 0;
}
// Get max_mant = size of largest mantissa (2^96 - 1)
//mpz_set_str(max_mant, "79228162514264337593543950335", 10); // 2^96 - 1
static unsigned __int32 ffs[3] = { 0xFFFFffffUL, 0xFFFFffffUL, 0xFFFFffffUL };
mpz_import(max_mant, 3, -1, sizeof(ffs[0]), 0, 0, ffs);
// Check for mantissa too big.
if (mpz_cmp(mant, max_mant) > 0)
goto exit_func; // value too big
else if (mpz_sgn(mant) == 0)
exp = 0; // if mantissa is zero make everything zero
// Set integer part
dd[2] = mpz_getlimbn(mant, 2);
dd[1] = mpz_getlimbn(mant, 1);
dd[0] = mpz_getlimbn(mant, 0);
// Set exponent and sign
dd[3] = exp << 16;
if (neg && mpz_sgn(mant) > 0)
dd[3] |= 0x80000000;
retval = true; // indicate success
exit_func:
mpz_clears(mant, max_mant, NULL);
return retval;
}
How about this:
1) sprintf number into s
2) find decimal point (strchr), store in idx
3) atoi = obtain integer part easily, use union to separate high/lo
4) use strlen - idx to obtain number of digits after point
sprintf may be slow but you'll get the solution under 2 minutes of typing...
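The steps above can be sketched roughly as follows. The ScaledDecimal struct and the to_scaled name are invented for this illustration and are not the asker's Decimal class; the sketch assumes the "C" locale (so the decimal separator is '.'), a value small enough that its digits fit into 64 bits and into the buffer, and a caller-chosen number of fractional digits.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical result type: value = (negative ? -1 : 1) * mantissa / 10^scale
struct ScaledDecimal {
    bool negative;
    std::uint64_t mantissa;
    int scale;
};

ScaledDecimal to_scaled(double n, int digits = 6) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", digits, n);   // 1) print the number
    std::string s(buf);
    ScaledDecimal d{false, 0, 0};
    if (!s.empty() && s[0] == '-') {
        d.negative = true;
        s.erase(0, 1);
    }
    const std::size_t point = s.find('.');               // 2) find the decimal point
    if (point != std::string::npos) {
        d.scale = static_cast<int>(s.size() - point - 1); // 4) digits after the point
        s.erase(point, 1);
    }
    d.mantissa = std::strtoull(s.c_str(), nullptr, 10);  // 3) all digits as one integer
    while (d.scale > 0 && d.mantissa % 10 == 0) {        // drop trailing zeros
        d.mantissa /= 10;
        --d.scale;
    }
    if (d.mantissa == 0) d.negative = false;             // normalize negative zero
    return d;
}
```

The mantissa and scale could then be split into the lo/hi words and the scale field of a 128-bit decimal layout.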

Rounding up to the nearest multiple of a number

OK - I'm almost embarrassed posting this here (and I will delete if anyone votes to close) as it seems like a basic question.
Is this the correct way to round up to a multiple of a number in C++?
I know there are other questions related to this, but I am specifically interested to know the best way to do this in C++:
int roundUp(int numToRound, int multiple)
{
if(multiple == 0)
{
return numToRound;
}
int roundDown = ( (int) (numToRound) / multiple) * multiple;
int roundUp = roundDown + multiple;
int roundCalc = roundUp;
return (roundCalc);
}
Update:
Sorry I probably didn't make intention clear. Here are some examples:
roundUp(7, 100)
//return 100
roundUp(117, 100)
//return 200
roundUp(477, 100)
//return 500
roundUp(1077, 100)
//return 1100
roundUp(52, 20)
//return 60
roundUp(74, 30)
//return 90
This works for positive numbers, not sure about negative. It only uses integer math.
int roundUp(int numToRound, int multiple)
{
if (multiple == 0)
return numToRound;
int remainder = numToRound % multiple;
if (remainder == 0)
return numToRound;
return numToRound + multiple - remainder;
}
Edit: Here's a version that works with negative numbers, if by "up" you mean a result that's always >= the input.
int roundUp(int numToRound, int multiple)
{
if (multiple == 0)
return numToRound;
int remainder = abs(numToRound) % multiple;
if (remainder == 0)
return numToRound;
if (numToRound < 0)
return -(abs(numToRound) - remainder);
else
return numToRound + multiple - remainder;
}
Without conditions:
int roundUp(int numToRound, int multiple)
{
assert(multiple);
return ((numToRound + multiple - 1) / multiple) * multiple;
}
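Note that this does not round up correctly for negative numbers: integer division truncates toward zero, so for example roundUp(-117, 100) returns 0 instead of -100.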
Version that works also for negative numbers:
int roundUp(int numToRound, int multiple)
{
assert(multiple);
int isPositive = (int)(numToRound >= 0);
return ((numToRound + isPositive * (multiple - 1)) / multiple) * multiple;
}
If multiple is a power of 2 (about 3.7 times faster):
int roundUp(int numToRound, int multiple)
{
assert(multiple && ((multiple & (multiple - 1)) == 0));
return (numToRound + multiple - 1) & -multiple;
}
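Why the mask works: for a power of two m, the value m - 1 has exactly the low bits set, so clearing those bits after adding m - 1 snaps the value up to the next multiple. In two's complement, -m has the same bit pattern as ~(m - 1), which is why the answer can write & -multiple. A small unsigned sketch of the same trick (the name roundUpPow2 is illustrative, and it assumes n + m - 1 does not wrap around):

```cpp
// Sketch: round n up to a multiple of m, where m must be a power of two.
constexpr unsigned roundUpPow2(unsigned n, unsigned m) {
    // ~(m - 1u) keeps every bit except the low log2(m) bits.
    return (n + m - 1u) & ~(m - 1u);
}
```

For example, roundUpPow2(13, 8) is 16, while a value that is already a multiple, such as 16, is returned unchanged.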
This works when factor will always be positive:
int round_up(int num, int factor)
{
return num + factor - 1 - (num + factor - 1) % factor;
}
Edit: This returns round_up(0,100)=100. Please see Paul's comment below for a solution that returns round_up(0,100)=0.
This is a generalization of the problem of "how do I find out how many bytes n bits will take?" (A: (n bits + 7) / 8).
int RoundUp(int n, int roundTo)
{
// fails on negative? What does that mean?
if (roundTo == 0) return 0;
return ((n + roundTo - 1) / roundTo) * roundTo; // edit - fixed error
}
int roundUp(int numToRound, int multiple)
{
if(multiple == 0)
{
return 0;
}
return ((numToRound - 1) / multiple + 1) * multiple;
}
And no need to mess around with conditions
This is the modern C++ approach, using a template function that works for float, double, long, int and short (but not for long long and long double, because of the double values used).
#include <cmath>
#include <iostream>
template<typename T>
T roundMultiple( T value, T multiple )
{
if (multiple == 0) return value;
return static_cast<T>(std::round(static_cast<double>(value)/static_cast<double>(multiple))*static_cast<double>(multiple));
}
int main()
{
std::cout << roundMultiple(39298.0, 100.0) << std::endl;
std::cout << roundMultiple(20930.0f, 1000.0f) << std::endl;
std::cout << roundMultiple(287399, 10) << std::endl;
}
But you can easily add support for long long and long double with template specialisation as shown below:
template<>
long double roundMultiple<long double>( long double value, long double multiple)
{
if (multiple == 0.0l) return value;
return std::round(value/multiple)*multiple;
}
template<>
long long roundMultiple<long long>( long long value, long long multiple)
{
if (multiple == 0) return value;
return static_cast<long long>(std::round(static_cast<long double>(value)/static_cast<long double>(multiple))*static_cast<long double>(multiple));
}
To create functions that always round up, use std::ceil; to always round down, use std::floor. My example above rounds to the nearest multiple using std::round.
Create the "round up", better known as "round ceiling", template function as shown below:
template<typename T>
T roundCeilMultiple( T value, T multiple )
{
if (multiple == 0) return value;
return static_cast<T>(std::ceil(static_cast<double>(value)/static_cast<double>(multiple))*static_cast<double>(multiple));
}
Create the "round down", better known as "round floor", template function as shown below:
template<typename T>
T roundFloorMultiple( T value, T multiple )
{
if (multiple == 0) return value;
return static_cast<T>(std::floor(static_cast<double>(value)/static_cast<double>(multiple))*static_cast<double>(multiple));
}
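The three variants differ only in which <cmath> function is applied to the quotient. The following sketch puts them side by side; the enum Mode and the function roundToMultiple are hypothetical names, and it assumes values small enough to be represented exactly as double.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical selector for the rounding function to apply.
enum class Mode { Nearest, Up, Down };

// Rounds value to a multiple of `multiple` using double arithmetic.
// Assumes multiple != 0.
double roundToMultiple(double value, double multiple, Mode mode)
{
    assert(multiple != 0.0);
    const double quotient = value / multiple;
    switch (mode) {
        case Mode::Up:   return std::ceil(quotient) * multiple;   // toward +infinity
        case Mode::Down: return std::floor(quotient) * multiple;  // toward -infinity
        default:         return std::round(quotient) * multiple;  // nearest, ties away from zero
    }
}
```

With these definitions, roundToMultiple(39298.0, 100.0, Mode::Up) gives 39300 and the same call with Mode::Down gives 39200.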
For anyone looking for a short and sweet answer. This is what I used. No accounting for negatives.
n - (n % r)
That will return the previous factor.
(n + r) - (n % r)
Will return the next. Hope this helps someone. :)
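Wrapped into functions, the two expressions look like this. It is only a sketch: the names previousMultiple and nextMultiple are hypothetical, and n >= 0 and r > 0 are assumed, since negatives are not accounted for.

```cpp
#include <cassert>

// Largest multiple of r that is <= n.
int previousMultiple(int n, int r)
{
    assert(n >= 0 && r > 0);
    return n - (n % r);
}

// Smallest multiple of r that is strictly greater than n, so an n that is
// already a multiple moves on to the following one.
int nextMultiple(int n, int r)
{
    assert(n >= 0 && r > 0);
    return (n + r) - (n % r);
}
```

For n = 17 and r = 5 these give 15 and 20. For n = 20, previousMultiple returns 20 itself while nextMultiple returns 25.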
float roundUp(float number, float fixedBase) {
if (fixedBase != 0 && number != 0) {
float sign = number > 0 ? 1 : -1;
number *= sign;
number /= fixedBase;
int fixedPoint = (int) ceil(number);
number = fixedPoint * fixedBase;
number *= sign;
}
return number;
}
This works for any float number or base (e.g. you can round -4 to the nearest 6.75). In essence it is converting to fixed point, rounding there, then converting back. It handles negatives by rounding AWAY from 0. It also handles a negative round to value by essentially turning the function into roundDown.
An int specific version looks like:
int roundUp(int number, int fixedBase) {
if (fixedBase != 0 && number != 0) {
int sign = number > 0 ? 1 : -1;
int baseSign = fixedBase > 0 ? 1 : 0;
number *= sign;
int fixedPoint = (number + baseSign * (fixedBase - 1)) / fixedBase;
number = fixedPoint * fixedBase;
number *= sign;
}
return number;
}
Which is more or less plinth's answer, with the added negative input support.
First off, your error condition (multiple == 0) should probably have a return value. What? I don't know. Maybe you want to throw an exception, that's up to you. But, returning nothing is dangerous.
Second, you should check that numToRound isn't already a multiple. Otherwise, when you add multiple to roundDown, you'll get the wrong answer.
Thirdly, your casts are wrong. You cast numToRound to an integer, but it's already an integer. If you cast to double before the division, you need to cast back to int right after the division, before the multiplication; otherwise nothing gets truncated and you just get numToRound back.
Lastly, what do you want for negative numbers? Rounding "up" can mean rounding to zero (rounding in the same direction as positive numbers), or away from zero (a "larger" negative number). Or, maybe you don't care.
Here's a version with the first three fixes, but I don't deal with the negative issue:
int roundUp(int numToRound, int multiple)
{
if(multiple == 0)
{
return 0;
}
else if(numToRound % multiple == 0)
{
return numToRound;
}
int roundDown = ((int) ((double) numToRound / multiple)) * multiple;
int roundUp = roundDown + multiple;
int roundCalc = roundUp;
return (roundCalc);
}
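The two meanings of "up" for negative numbers raised above can be contrasted in a short sketch. Both function names are hypothetical, and a positive multiple is assumed.

```cpp
#include <cassert>

// Rounds toward positive infinity, i.e. the same direction for every sign.
// Integer division in C++ truncates toward zero, so only positive
// non-multiples need the extra step.
int roundUpTowardPositive(int numToRound, int multiple)
{
    assert(multiple > 0);
    int truncated = (numToRound / multiple) * multiple;
    if (numToRound > 0 && truncated != numToRound)
        truncated += multiple;
    return truncated;
}

// Rounds away from zero, so negatives become "larger" negative numbers.
int roundUpAwayFromZero(int numToRound, int multiple)
{
    assert(multiple > 0);
    const int truncated = (numToRound / multiple) * multiple;
    if (truncated == numToRound)
        return numToRound;
    return numToRound > 0 ? truncated + multiple : truncated - multiple;
}
```

For -110 and a multiple of 100, the first function returns -100 and the second returns -200; for positive inputs the two agree.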
Round to Power of Two:
Just in case anyone needs a solution for positive numbers rounded to the nearest multiple of a power of two (because that's how I ended up here):
// number: the number to be rounded (ex: 5, 123, 98345, etc.)
// pow2: selects the step 1 << (pow2 - 1) (ex: to round to 8, use '4')
int roundPow2 (int number, int pow2) {
pow2--; // the step is (1 << (pow2 - 1)), i.e. 2 exp (pow2 - 1)
pow2 = 0x01 << pow2;
pow2--; // subtracting one from a power of two
// yields a field of ones which we can
// use in a bitwise OR
number--; // so that exact multiples are not
// pushed up to the next multiple
number = number | pow2;
number++; // restore value by adding one back
return number;
}
The input number will stay the same if it is already a multiple.
Here is the x86_64 output that GCC gives with -O2 or -Os (9Sep2013 Build - godbolt GCC online):
roundPow2(int, int):
lea ecx, [rsi-1]
mov eax, 1
sub edi, 1
sal eax, cl
sub eax, 1
or eax, edi
add eax, 1
ret
Each C line of code corresponds perfectly with its line in the assembly: http://goo.gl/DZigfX
Each of those instructions is extremely fast, so the function is extremely fast too. Since the code is so small and quick, it might be useful to inline the function when using it.
Credit:
Algorithm: Hagen von Eitzen @ Math.SE
Godbolt Interactive Compiler: @mattgodbolt/gcc-explorer on GitHub
I'm using:
template <class _Ty>
inline _Ty n_Align_Up(_Ty n_x, _Ty n_alignment)
{
assert(n_alignment > 0);
//n_x += (n_x >= 0)? n_alignment - 1 : 1 - n_alignment; // causes to round away from zero (greatest absolute value)
n_x += (n_x >= 0)? n_alignment - 1 : -1; // causes to round up (towards positive infinity)
//n_x += (_Ty(-(n_x >= 0)) & n_alignment) - 1; // the same as above, avoids branch and integer multiplication
//n_x += n_alignment - 1; // only works for positive numbers (fastest)
return n_x - n_x % n_alignment; // rounds negative towards zero
}
and for powers of two:
template <class _Ty>
bool b_Is_POT(_Ty n_x)
{
return !(n_x & (n_x - 1));
}
template <class _Ty>
inline _Ty n_Align_Up_POT(_Ty n_x, _Ty n_pot_alignment)
{
assert(n_pot_alignment > 0);
assert(b_Is_POT(n_pot_alignment)); // alignment must be power of two
-- n_pot_alignment;
return (n_x + n_pot_alignment) & ~n_pot_alignment; // rounds towards positive infinity (i.e. negative towards zero)
}
Note that both of those round negative values towards zero (that is, they round towards positive infinity for all values), and neither of them relies on signed overflow (which is undefined in C/C++).
This gives:
n_Align_Up(10, 100) = 100
n_Align_Up(110, 100) = 200
n_Align_Up(0, 100) = 0
n_Align_Up(-10, 100) = 0
n_Align_Up(-110, 100) = -100
n_Align_Up(-210, 100) = -200
n_Align_Up_POT(10, 128) = 128
n_Align_Up_POT(130, 128) = 256
n_Align_Up_POT(0, 128) = 0
n_Align_Up_POT(-10, 128) = 0
n_Align_Up_POT(-130, 128) = -128
n_Align_Up_POT(-260, 128) = -256
Round up to a multiple that happens to be a power of 2:
unsigned int round(unsigned int value, unsigned int multiple){
return ((value-1u) & ~(multiple-1u)) + multiple;
}
This can be useful for when allocating along cachelines, where the rounding increment you want is a power of two, but the resulting value only needs to be a multiple of it. On gcc the body of this function generates 8 assembly instructions with no division or branches.
round( 0, 16) -> 0
round( 1, 16) -> 16
round( 16, 16) -> 16
round(257, 128) -> 384 (128 * 3)
round(333, 2) -> 334
Probably safer to cast to floats and use ceil() - unless you know that the int division is going to produce the correct result.
int noOfMultiples = int(((double)numToRound / multiple) + 0.5);
return noOfMultiples * multiple;
Converting to int in C++ truncates the fractional part, so adding 0.5 first rounds to nearest: 1.5 becomes 2.0 and gives 2, but 1.49 becomes 1.99 and gives 1.
EDIT - Sorry, I didn't see you wanted to round up; I would suggest using ceil() instead of the +0.5.
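The ceil() suggestion could look roughly like the following sketch. The name roundUpWithCeil is hypothetical, and a positive multiple is assumed.

```cpp
#include <cassert>
#include <cmath>

// Divide in floating point, take the ceiling, then multiply back.
// The cast to double matters: numToRound / multiple on two ints would
// already have truncated before ceil() sees the value.
int roundUpWithCeil(int numToRound, int multiple)
{
    assert(multiple > 0);
    const double quotient = static_cast<double>(numToRound) / multiple;
    return static_cast<int>(std::ceil(quotient)) * multiple;
}
```

With this, 149 rounds up to 200 for a multiple of 100, while 100 stays at 100.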
Well, for one thing, since I don't really understand what you want to do, the lines
int roundUp = roundDown + multiple;
int roundCalc = roundUp;
return (roundCalc);
could definitely be shortened to
int roundUp = roundDown + multiple;
return roundUp;
Maybe this can help:
int RoundUpToNearestMultOfNumber(int val, int num)
{
assert(0 != num);
return (floor((val + num) / num) * num);
}
To always round up
int alwaysRoundUp(int n, int multiple)
{
if (n % multiple != 0) {
n = ((n + multiple) / multiple) * multiple;
// Another way
//n = n - n % multiple + multiple;
}
return n;
}
alwaysRoundUp(1, 10) -> 10
alwaysRoundUp(5, 10) -> 10
alwaysRoundUp(10, 10) -> 10
To always round down
int alwaysRoundDown(int n, int multiple)
{
n = (n / multiple) * multiple;
return n;
}
alwaysRoundDown(1, 10) -> 0
alwaysRoundDown(5, 10) -> 0
alwaysRoundDown(10, 10) -> 10
To round the normal way
int normalRound(int n, int multiple)
{
n = ((n + multiple/2)/multiple) * multiple;
return n;
}
normalRound(1, 10) -> 0
normalRound(5, 10) -> 10
normalRound(10, 10) -> 10
I found an algorithm which is somewhat similar to one posted above:
int[(|x|+n-1)/n]*[(nx)/|x|], where x is a user-input value and n is the multiple being used.
It works for all nonzero integer values x (positive or negative); for x = 0 the factor (nx)/|x| divides by zero, so zero has to be handled separately. I wrote it specifically for a C++ program, but this can basically be implemented in any language.
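One possible C++ reading of the formula is sketched below. The function name roundAwayFromZero is hypothetical, and the explicit check for x == 0 is an assumption added here, because (nx)/|x| cannot be evaluated at zero.

```cpp
#include <cassert>
#include <cstdlib>

// int[(|x|+n-1)/n] * [(n*x)/|x|]: round |x| up to a whole number of
// multiples, then reapply the sign of x. Assumes n > 0 and that n * x
// does not overflow.
int roundAwayFromZero(int x, int n)
{
    assert(n > 0);
    if (x == 0)
        return 0;                                 // added guard, see lead-in
    const int magnitude = std::abs(x);
    const int count = (magnitude + n - 1) / n;    // int[(|x|+n-1)/n]
    const int signedStep = (n * x) / magnitude;   // n carrying the sign of x
    return count * signedStep;
}
```

For n = 5 this maps 7 to 10 and -7 to -10, so negative values are rounded away from zero.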
For negative numToRound:
It should be really easy to do this, but the standard modulo operator % doesn't handle negative numbers the way one might expect. For instance, -14 % 12 = -2 and not 10. The first thing to do is to get a modulo operator that never returns negative numbers. Then roundUp is really simple.
public static int mod(int x, int n)
{
return ((x % n) + n) % n;
}
public static int roundUp(int numToRound, int multiple)
{
return numToRound + mod(-numToRound, multiple);
}
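The same idea as free functions in C++, as a sketch. The names positiveMod and roundUpWithMod are hypothetical, and a positive multiple is assumed.

```cpp
#include <cassert>

// Remainder that always lies in [0, n). In C++ the % operator truncates
// toward zero, so -14 % 12 is -2; adding n and reducing again fixes that.
int positiveMod(int x, int n)
{
    assert(n > 0);
    return ((x % n) + n) % n;
}

// Rounds toward positive infinity by adding the distance to the next multiple.
int roundUpWithMod(int numToRound, int multiple)
{
    return numToRound + positiveMod(-numToRound, multiple);
}
```

Here positiveMod(-14, 12) is 10, roundUpWithMod(-14, 12) is -12 and roundUpWithMod(14, 12) is 24.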
This is what I would do:
#include <cmath>
int roundUp(int numToRound, int multiple)
{
// if our number is zero, return immediately
if (numToRound == 0)
return multiple;
// if multiple is zero, return immediately
if (multiple == 0)
return numToRound;
// how many times multiple fits into the number
float rounds = static_cast<float>(numToRound) / static_cast<float>(multiple);
// determine whether the number is an exact multiple
int floorRounds = static_cast<int>(floor(rounds));
if (rounds - floorRounds > 0)
// number is not an exact multiple -> advance to the next multiple
return (floorRounds+1) * multiple;
else
// number is already a multiple -> return it unchanged
return (floorRounds) * multiple;
}
The code might not be optimal, but I prefer clean code to raw performance.
int roundUp (int numToRound, int multiple)
{
return multiple * ((numToRound + multiple - 1) / multiple);
}
although:
won't work for negative numbers
won't work if numToRound + multiple overflows
I would suggest using unsigned integers instead, which have defined overflow behaviour.
You'll get a division by zero (undefined behaviour, typically a crash) if multiple == 0, but it isn't a well-defined problem in that case anyway.
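A sketch of the unsigned variant suggested above. The name roundUpUnsigned is hypothetical, and computing the remainder first (instead of numToRound + multiple - 1) is a choice made here so that the intermediate sum cannot wrap.

```cpp
#include <cassert>

// Unsigned arithmetic wraps around modulo 2^N instead of being undefined.
// Only the final addition can wrap, and only when the true result does not
// fit in an unsigned value.
unsigned roundUpUnsigned(unsigned numToRound, unsigned multiple)
{
    assert(multiple != 0u);
    const unsigned remainder = numToRound % multiple;
    if (remainder == 0u)
        return numToRound;
    return numToRound + (multiple - remainder);
}
```

For example, roundUpUnsigned(101u, 100u) gives 200 and roundUpUnsigned(100u, 100u) stays at 100.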
c:
int roundUp(int numToRound, int multiple)
{
return (multiple ? (((numToRound+multiple-1) / multiple) * multiple) : numToRound);
}
and for your ~/.bashrc:
roundup()
{
echo $(( ${2} ? ((${1}+${2}-1)/${2})*${2} : ${1} ))
}
I use a combination of modulus to nullify the addition of the remainder if x is already a multiple:
int round_up(int x, int div)
{
return x + (div - x % div) % div;
}
We take the complement of the remainder (div - x % div), then take that modulo the divisor again so that it becomes zero when it equals the divisor itself, and add the result to x.
round_up(19, 3) = 21
Here's my solution based on the OP's suggestion, and the examples given by everyone else. Since most everyone was looking for it to handle negative numbers, this solution does just that, without the use of any special functions, i.e. abs, and the like.
By avoiding the modulus and using division instead, the negative number is a natural result, although it's rounded down. After the rounded down version is calculated, then it does the required math to round up, either in the negative or positive direction.
Also note that no special functions are used to calculate anything, so there is a small speed boost there.
int RoundUp(int n, int multiple)
{
// prevent divide by 0 by returning n
if (multiple == 0) return n;
// calculate the rounded down version
int roundedDown = n / multiple * multiple;
// if the rounded version and original are the same, then return the original
if (roundedDown == n) return n;
// handle negative number and round up according to the sign
// NOTE: if n is < 0 then subtract the multiple, otherwise add it
return (n < 0) ? roundedDown - multiple : roundedDown + multiple;
}
I think this should help you. I have written the below program in C.
# include <stdio.h>
int main()
{
int i, j;
printf("\nEnter Two Integers i and j...");
scanf("%d %d", &i, &j);
int Round_Off=i+j-i%j;
printf("The Rounded Off Integer Is...%d\n", Round_Off);
return 0;
}
Endless possibilities, for signed integers only:
n + ((r - n) % r)
/// Rounding up 'n' to the nearest multiple of number 'b'.
/// - Not tested for negative numbers.
/// \see http://stackoverflow.com/questions/3407012/
#define roundUp(n,b) ( (b)==0 ? (n) : ( ((n)+(b)-1) - (((n)-1)%(b)) ) )
/// \c test->roundUp().
void test_roundUp() {
// yes_roundUp(n,b) ( (b)==0 ? (n) : ( (n)%(b)==0 ? n : (n)+(b)-(n)%(b) ) )
// yes_roundUp(n,b) ( (b)==0 ? (n) : ( ((n + b - 1) / b) * b ) )
// no_roundUp(n,b) ( (n)%(b)==0 ? n : (b)*( (n)/(b) )+(b) )
// no_roundUp(n,b) ( (n)+(b) - (n)%(b) )
if (true) // couldn't make it work without (?:)
{{ // test::roundUp()
unsigned m;
{ m = roundUp(17,8); } ++m;
assertTrue( 24 == roundUp(17,8) );
{ m = roundUp(24,8); }
assertTrue( 24 == roundUp(24,8) );
assertTrue( 24 == roundUp(24,4) );
assertTrue( 24 == roundUp(23,4) );
{ m = roundUp(23,4); }
assertTrue( 24 == roundUp(21,4) );
assertTrue( 20 == roundUp(20,4) );
assertTrue( 20 == roundUp(19,4) );
assertTrue( 20 == roundUp(18,4) );
assertTrue( 20 == roundUp(17,4) );
assertTrue( 17 == roundUp(17,0) );
assertTrue( 20 == roundUp(20,0) );
}}
}
This is getting the results you are seeking for positive integers:
#include <iostream>
using namespace std;
int roundUp(int numToRound, int multiple);
int main() {
cout << "answer is: " << roundUp(7, 100) << endl;
cout << "answer is: " << roundUp(117, 100) << endl;
cout << "answer is: " << roundUp(477, 100) << endl;
cout << "answer is: " << roundUp(1077, 100) << endl;
cout << "answer is: " << roundUp(52,20) << endl;
cout << "answer is: " << roundUp(74,30) << endl;
return 0;
}
int roundUp(int numToRound, int multiple) {
if (multiple == 0) {
return 0;
}
int result = (int) (numToRound / multiple) * multiple;
if (numToRound % multiple) {
result += multiple;
}
return result;
}
And here are the outputs:
answer is: 100
answer is: 200
answer is: 500
answer is: 1100
answer is: 60
answer is: 90
I think this works:
int roundUp(int numToRound, int multiple) {
return multiple? !(numToRound%multiple)? numToRound : ((numToRound/multiple)+1)*multiple: numToRound;
}
The accepted answer doesn't work very well, so I thought I'd try my hand at this problem. This should round up all integers you throw at it:
int round_up(int input, int multiple) {
if (input <= 0) { return input - input % multiple; }
return input + multiple - (((input - 1) % multiple) + 1);
}
If the number is negative (or zero) it's easy: take the remainder, which is then negative or zero, and subtract it from the input; that'll do the trick.
If the number is positive, you have to subtract the remainder from the multiple and add that to round up. The problem with that is that if input is exactly on a multiple, it will still get rounded up to the next multiple because multiple - 0 = multiple.
To remedy this we do a cool little hack: subtract one from input before doing the remainder, then add it back on to the resulting remainder. This doesn't affect anything at all unless input is on a multiple. In that case, subtracting one will cause the remainder to the previous multiple to be calculated. After adding one again, you'll have exactly the multiple. Obviously subtracting this from itself yields 0, so your input value doesn't change.
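The "subtract one, take the remainder, add one back" step can be isolated in a small sketch. The helper name distanceToNextMultiple is hypothetical, and it assumes input > 0 and multiple > 0.

```cpp
#include <cassert>

// Distance from a positive input up to the nearest multiple at or above it.
int distanceToNextMultiple(int input, int multiple)
{
    assert(input > 0 && multiple > 0);
    // Shifting by one moves the remainder from [0, multiple - 1]
    // to [1, multiple], so an exact multiple yields `multiple`, not 0.
    const int shiftedRemainder = ((input - 1) % multiple) + 1;
    return multiple - shiftedRemainder;  // in [0, multiple - 1]
}
```

For input 100 and multiple 100 the shifted remainder is 100, so the distance is 0 and the input is left unchanged. For input 101 the shifted remainder is 1 and the distance is 99, which takes it to 200.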