Is there any relationship between remainder obtained through modulo 2 binary division and remainder obtained in a normal decimal division? - crc

For some cases, modulo 2 binary division is giving the same remainder as a base 10 modulus but for some cases it is not. Is there some relationship between the two remainders?
1.) q = 101000110100000
p = 110101
modulo 2 binary division remainder = 01110
and in base 10,
q = 20896
p = 53
and q%p = 14 which is the same as 01110
2.) q = 11001001000
p = 1001
modulo 2 binary division remainder is 011
and in base 10,
q = 1608
p = 9
and q%p = 6 which is different from 011.
So is there some relationship, or are the two totally unrelated? I want to know if I can derive the base 2 modulo division remainder by doing a decimal modulus.

No. There is no relationship. A polynomial over GF(2) can be represented as a string of bits. An integer can be represented as a string of bits. There the similarity ends. They are two entirely different beasts.
And there is no inherent "base 10" or "decimal" here, except in displaying the numbers. You are comparing integer modulo with polynomial modulo. The integers don't care what base you display them in.
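To make the difference concrete, here is a minimal Python sketch (the helper name gf2_mod is made up for this illustration). It reduces the two bit strings from the question once as polynomials over GF(2), where XOR takes the place of subtraction, and once as ordinary integers.

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of modulo-2 (GF(2) polynomial) division; bits are coefficients."""
    if divisor == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    width = divisor.bit_length()
    # Cancel the leading term repeatedly; XOR is subtraction without borrows.
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend


cases = [(0b101000110100000, 0b110101), (0b11001001000, 0b1001)]
for q, p in cases:
    print(f"q={q:b} p={p:b}  polynomial: {gf2_mod(q, p):b}  integer: {q % p:b}")
```

The first pair happens to give 1110 both ways; the second gives 11 as a polynomial remainder and 110 as an integer remainder, so the agreement in the first case is a coincidence.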


Why is the result of a bitwise shift unrecoverable if there is a mathematical equivalent of the same operation?

Take for example the number 91. That number in binary is 1011011. If you shift that number to the right by 5 bits, you get 2 (10 in binary). According to a Google search, bit shifting to the left or right by a certain number of bits is the same as multiplying or dividing the number by 2 to the power of the number of bits shifted, respectively. So to get from 91 to 2 by bit shifting, the equation would look like this: 91 / 2^5, which is also 91 / 32. Now, of course, if you did that on your calculator, there would be some decimal places, which aren't included when bit shifting. The resulting 2 is actually 2.84375. I'm sure you know that if you do a certain operation on a number and then do the inverse, the result is what you had in the first place. So does decimal precision have something to do with this?
There is a mathematical equivalent of shifting to the right... and the mathematical operation is UNRECOVERABLE.
You seem to think that shifting to the right is:
bit shifting to the left or right by a certain amount of bits is the same as multiplying or dividing the number by 2
This is what you will hear people casually say, but it is only half right: it is not the same, only similar.
The correct statement is:
shifting a base-2 number one digit to the right is THE SAME as dividing by two in the integer domain
If you have an integer calculator and do 91/32, you will get 2. You will not get ANY fractional part because we are operating in the integer domain.
For real numbers, the equivalent operation is:
floor(91 / 32)
Which is also unrecoverable because it also results in 2.
The lesson here is be careful when listening to what people CASUALLY say. Casual speech is often imprecise and assumes the listener is familiar with the subject. You need to dig deeper what the statement is actually trying to say.
As for why it is unrecoverable: division of integers gives two results, the quotient (which is the main result) and the remainder. When we divide 91 by 32 we are doing this:
       2
    ----
32 ) 91
     64
    ----
     27
So we get the result of 2 and a remainder of 27. The reason you can't get 91 by multiplying 2*32 is because we threw away the remainder.
You can get the original value back if you saved the remainder. However, calculating the remainder is not a matter of simple shifts. Here's an example of how to make it reversible in C:
int test () {
    int a = 91;
    int b = 32;
    int result;
    int remainder;

    result = a / b;       // quotient: 2
    remainder = a % b;    // remainder: 27
    return (result * b) + remainder;  // returns 91
}
You can only recover the input of an operation if it has a 1-1 mapping between the inputs and outputs, i.e. it has an inverse function. But not all mathematical functions have an inverse function.
For example, if f(x) = x >> n, where >> is the shift operator, then it is equivalent to
f(x) = ⌊x / 2^n⌋
with ⌊ ⌋ being the floor function. Since there are many inputs that lead to the same output, the relationship isn't 1-1 and there can't be an inverse function for it. This function works the same for both signed and unsigned right shift (assuming the signed shift is an arithmetic shift, which C leaves implementation-defined for negative values):
91 >> 5 == floor(91.0/32.0) == 2
-91 >> 5 == floor(-91.0/32.0) == -3
Similarly, for an unsigned left shift function g(x) = x << n, the equivalent is
g(x) = (x * 2^n) mod 2^N
with N being the size in bits of x, because unsigned integer math in hardware, C and many other languages reduces modulo 2^N due to the limit of register size and the use of two's complement. And it's clear that the modulo function also isn't invertible/recoverable. The signed left shift is almost the same, with some small modifications.
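The points above can be checked with a short Python sketch. Python integers are unbounded, so the fixed register width (N_BITS = 8) and the helper shl_fixed are assumptions made purely for this illustration.

```python
N_BITS = 8  # assumed register width for the left-shift example


def shl_fixed(x: int, n: int, bits: int = N_BITS) -> int:
    """Left shift as an unsigned register of `bits` bits would perform it."""
    return (x << n) & ((1 << bits) - 1)


# Right shift is floor division by 2**n, for negative values too.
assert 91 >> 5 == 91 // 32 == 2
assert -91 >> 5 == -91 // 32 == -3

# Many inputs share one output, so there is nothing to invert.
same_output = [x for x in range(128) if x >> 5 == 2]
print(same_output[0], same_output[-1], len(same_output))  # 64 95 32

# Keeping the shifted-out low bits makes the operation reversible.
quotient, low_bits = 91 >> 5, 91 & 0b11111
print((quotient << 5) | low_bits)  # 91

# Left shift in a fixed-width register collides as well.
print(shl_fixed(0b00000011, 7), shl_fixed(0b00000001, 7))  # 128 128
```

Every value from 64 to 95 shifts down to 2, which is exactly why 2 alone cannot tell you that you started from 91.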

When Will static_casting the Result of ceil Compromise the Result?

static_casting from a floating point to an integer simply strips the fractional part of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo)).
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•b^e for a non-negative e is an integer). If so, then M•b^0 is an integer larger than M•b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b^0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609 and 8,388,610. Since 8,388,609.5 is equally far from both, the tie goes to the one whose significand is even, so the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifyingly small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
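The rounding described above can be reproduced without a C++ compiler. The following is only a sketch: to_float32 is a helper invented here, and it assumes the platform's float is IEEE-754 binary32 and that the conversion rounds to nearest, ties to even (the usual default).

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


for text in ("8388607.5", "8388608.5", "8388609.5", "12999999.5", "16777217"):
    stored = to_float32(float(text))
    print(f"{text:>12} -> stored {stored:.1f}, ceil {math.ceil(stored)}")
```

Under those assumptions, 8388607.5 is stored exactly, 8388608.5 becomes 8388608, 8388609.5 becomes 8388610, 12999999.5 becomes 13000000, and 16777217 becomes 16777216. In every case ceil is applied to an already-rounded value; the loss happens when the literal is converted to float, not inside ceil or the cast.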
Floating point numbers are represented by 3 integers, c, b and q, with the value c·b^q, where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
C++ provides numeric_limits<T>::digits as a method of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
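The same calculation can be sketched in Python for the 64-bit case, assuming Python's float is an IEEE-754 double (true on mainstream builds); sys.float_info.mant_dig plays the role of numeric_limits<double>::digits here.

```python
import sys

digits = sys.float_info.mant_dig      # 53 for IEEE-754 binary64
max_mantissa = (1 << digits) - 1      # 9,007,199,254,740,991

# Integers up to the maximum mantissa survive a round trip through float.
for n in (max_mantissa - 2, max_mantissa - 1, max_mantissa):
    assert int(float(n)) == n

# Two past it, the odd integer 2**53 + 1 is no longer representable.
print(int(float(max_mantissa + 2)) == max_mantissa + 2)  # False
```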
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6 This is a little confusing; it's because of the bias introduced here. Logically q = -6, but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3
b = 10 Again, the above rules are really only required for base-2, but I've shown them as they would apply to base-10 for the purpose of explanation
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. Once the exponent of a floating point number reaches numeric_limits<T>::digits - 1, increasing it further only introduces trailing zeros to the resulting number, so calling ceil has no effect when q is greater than or equal to numeric_limits<T>::digits - 1. And since we know the MSB of c will be used in the number, this means that the magnitude of the value must be smaller than 1LL << (numeric_limits<T>::digits - 1). Thus for ceil to have an effect on the traditional binary IEEE-754 floating point, the magnitude of:
A 32-bit (float) must be smaller than 8,388,608
A 64-bit (double) must be smaller than 4,503,599,627,370,496
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,096

Digit wise modulo for calculating power function for very very large positive integers

Hi, I am writing code to calculate P^Q, where
P and Q are positive integers which can have up to 100000 digits.
I want the result as
result = (P^Q)modulo(10^9+7)
P = 34534985349875439875439875349875
Q = 93475349759384754395743975349573495
Answer = 735851262
I tried using the trick:
(P^Q)modulo(10^9+7) = (P*P*...(Q times))modulo(10^9+7)
(P*P*...(Q times))modulo(10^9+7) = ((Pmodulo(10^9+7))*(Pmodulo(10^9+7))...(Q times))modulo(10^9+7)
Since both P and Q are very large, I should store them in an array and do modulo digit by digit.
Is there any efficient way of doing this or some number theory algorithm which I am missing?
Thanks in advance
Here is a rather efficient way:
1) Compute p1 = P modulo (10^9 + 7). If p1 is 0, the answer is 0, since Q is positive.
2) Compute q1 = Q modulo (10^9 + 6).
3) Then P^Q modulo (10^9 + 7) is equal to p1^q1 modulo (10^9 + 7). This equality holds because of Fermat's little theorem: 10^9 + 7 is prime, so p1^(10^9 + 6) is 1 modulo 10^9 + 7 whenever p1 is nonzero. Note that p1 and q1 are small enough to fit in a 32-bit integer, so you can implement binary exponentiation with a standard integer type (for intermediate computations, a 64-bit integer type is sufficient because the initial values fit in 32 bits).
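A minimal sketch of these steps in Python, taking P and Q as decimal strings so the digit-by-digit reduction is explicit. The function names are invented for this illustration, and Python's built-in three-argument pow is used only as a cross-check.

```python
MOD = 10**9 + 7  # prime modulus from the question


def mod_of_decimal_string(digits: str, m: int) -> int:
    """Reduce a decimal string modulo m one digit at a time (Horner's rule)."""
    r = 0
    for ch in digits:
        r = (r * 10 + int(ch)) % m
    return r


def power_mod(p_digits: str, q_digits: str, mod: int = MOD) -> int:
    """P^Q mod `mod` for huge positive P, Q given as strings; `mod` must be prime."""
    p1 = mod_of_decimal_string(p_digits, mod)
    if p1 == 0:
        return 0  # P is a multiple of the modulus and Q is positive
    q1 = mod_of_decimal_string(q_digits, mod - 1)  # Fermat's little theorem
    result, base = 1, p1
    while q1 > 0:  # binary exponentiation
        if q1 & 1:
            result = result * base % mod
        base = base * base % mod
        q1 >>= 1
    return result


P = "34534985349875439875439875349875"
Q = "93475349759384754395743975349573495"
print(power_mod(P, Q) == pow(int(P), int(Q), MOD))  # True
```

In a language with fixed-width integers, the only values that ever get multiplied are below 10^9 + 7, so each product fits in 64 bits.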

C++ floating-point console output issue

float x = 384.951257;
std::cout << std::fixed << std::setprecision(6) << x << std::endl;
The output is 384.951263. Why? I'm using gcc.
float is usually only 32-bit. With about 3 bits per decimal digit (2^10 roughly equals 10^3) that means it can't possibly represent more than about 11 decimal digits, and accounting for other information it also needs to represent, such as magnitude, let's say 6-7 decimal digits. Hey, that's what you got!
Check e.g. Wikipedia for details.
Use double or long double for better precision. double is the default in C++. E.g., the literal 3.14 is of type double.
Floats have a limited resolution. So it gets rounded when you assign the value to x.
All answers here talk as though the issue is due to floating-point numbers and their capacity, but those are just implementation details; the issue is deeper than that. It occurs whenever decimal numbers are represented in the binary number system. Even something as simple as 0.1₁₀ is not precisely representable in binary, since binary can only represent, as a finite fraction, those numbers whose denominator is a power of 2. Unfortunately, this does not include most of the numbers that can be represented as a finite fraction in base 10, like 0.1.
The single-precision float datatype usually gets mapped to what the IEEE 754 standard calls binary32. It has 32 bits, partitioned into 1 sign bit, 8 exponent bits and 23 significand bits (excluding the hidden/implicit bit). Thus we have to calculate up to 24 bits when converting to binary32.
Other answers here evade the actual calculations involved, so I'll try to do them. This method is explained in greater detail here. So let's convert the real number into a binary number:
Integer part 384₁₀ = 110000000₂ (using the usual method of successive division by 2)
Fractional part 0.951257₁₀ can be converted by successive multiplication by 2 and taking the integer part
0.951257 * 2 = 1.902514
0.902514 * 2 = 1.805028
0.805028 * 2 = 1.610056
0.610056 * 2 = 1.220112
0.220112 * 2 = 0.440224
0.440224 * 2 = 0.880448
0.880448 * 2 = 1.760896
0.760896 * 2 = 1.521792
0.521792 * 2 = 1.043584
0.043584 * 2 = 0.087168
0.087168 * 2 = 0.174336
0.174336 * 2 = 0.348672
0.348672 * 2 = 0.697344
0.697344 * 2 = 1.394688
0.394688 * 2 = 0.789376
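Gathering the obtained fractional part in binary we have 0.111100111000010₂. The overall number in binary would be 110000000.111100111000010₂; this has 24 bits as required.
Converting this back to decimal would give you 384 + (15585 / 16384) = 384.951232₁₀, which is the value with the remaining bits simply dropped. The next bit of the expansion (0.789376 * 2 = 1.578752) is 1 and the bits after it are not all zero, so round to nearest rounds the last kept bit up, adding 2^-15 (about 0.000031). The stored value is therefore 384 + (31171 / 32768), which prints as what you see: 384.951263₁₀.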
This can be verified here.
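For an offline check, here is a small Python sketch of the same calculation. The helper names are made up, and to_float32 assumes the platform's float is IEEE-754 binary32 with round to nearest.

```python
import struct
from fractions import Fraction


def fraction_bits(frac: Fraction, count: int) -> str:
    """First `count` binary digits of a fraction in [0, 1), by repeated doubling."""
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac >= 1)
        bits.append(str(bit))
        frac -= bit
    return "".join(bits)


def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(fraction_bits(Fraction("0.951257"), 15))  # 111100111000010
stored = to_float32(384.951257)
print(Fraction(stored) - 384)                   # 31171/32768
print(f"{stored:.6f}")                          # 384.951263
```

The truncated expansion corresponds to 31170/32768; the stored fraction is one unit in the last place higher, which accounts for the printed 384.951263.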

Is floating-point addition and multiplication associative?

I had a problem when I was adding three floating point values and comparing them to 1.
cout << ((0.7 + 0.2 + 0.1)==1)<<endl; //output is 0
cout << ((0.7 + 0.1 + 0.2)==1)<<endl; //output is 1
Why would these values come out different?
Floating point addition is not necessarily associative. If you change the order in which you add things up, this can change the result.
The standard paper on the subject is What Every Computer Scientist Should Know about Floating Point Arithmetic. It gives the following example:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
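The quoted example is easy to reproduce. A quick sketch in Python, whose float is an IEEE-754 double on mainstream builds:

```python
x, y, z = 1e30, -1e30, 1.0

left = (x + y) + z   # x and y cancel exactly, then 1 is added
right = x + (y + z)  # 1 is far below the spacing of doubles near 1e30 and is rounded away

print(left, right)   # 1.0 0.0
```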
What is likely, with currently popular machines and software, is:
The compiler encoded .7 as 0x1.6666666666666p-1 (this is the hexadecimal numeral 1.6666666666666 multiplied by 2 to the power of -1), .2 as 0x1.999999999999ap-3, and .1 as 0x1.999999999999ap-4. Each of these is the number representable in floating-point that is closest to the decimal numeral you wrote.
Observe that each of these hexadecimal floating-point constants has exactly 53 bits in its significand (the "fraction" part, often inaccurately called the mantissa). The hexadecimal numeral for the significand has a "1" and thirteen more hexadecimal digits (four bits each, 52 total, 53 including the "1"), which is what the IEEE-754 standard provides for, for 64-bit binary floating-point numbers.
Let's add the numbers for .7 and .2: 0x1.6666666666666p-1 and 0x1.999999999999ap-3. First, scale the exponent of the second number to match the first. To do this, we will multiply the exponent part by 4 (changing "p-3" to "p-1") and multiply the significand by 1/4, giving 0x0.66666666666668p-1. Then add 0x1.6666666666666p-1 and 0x0.66666666666668p-1, giving 0x1.ccccccccccccc8p-1. Note that this number has more than 53 bits in the significand: The "8" is the 14th digit after the period. Floating-point cannot return a result with this many bits, so it has to be rounded to the nearest representable number. In this case, there are two numbers that are equally near, 0x1.cccccccccccccp-1 and 0x1.ccccccccccccdp-1. When there is a tie, the number with a zero in the lowest bit of the significand is used. "c" is even and "d" is odd, so "c" is used. The final result of the addition is 0x1.cccccccccccccp-1.
Next, add the number for .1 (0x1.999999999999ap-4) to that. Again, we scale to make the exponents match, so 0x1.999999999999ap-4 becomes 0x.33333333333334p-1. Then add that to 0x1.cccccccccccccp-1, giving 0x1.fffffffffffff4p-1. Rounding that to 53 bits gives 0x1.fffffffffffffp-1, and that is the final result of .7+.2+.1.
Now consider .7+.1+.2. For .7+.1, add 0x1.6666666666666p-1 and 0x1.999999999999ap-4. Recall the latter is scaled to 0x.33333333333334p-1. Then the exact sum is 0x1.99999999999994p-1. Rounding that to 53 bits gives 0x1.9999999999999p-1.
Then add the number for .2 (0x1.999999999999ap-3), which is scaled to 0x0.66666666666668p-1. The exact sum is 0x1.fffffffffffff8p-1. This lies exactly halfway between 0x1.fffffffffffffp-1 and 0x2.0000000000000p-1. When there is a tie, the number with a zero in the lowest bit of the significand is used; "f" is odd, so the sum rounds up to 0x2.0000000000000p-1. Floating-point significands are always scaled to start with 1 (except for special cases: zero, infinity, and very small numbers at the bottom of the representable range), so we adjust this to 0x1.0000000000000p0.
Thus, because of errors that occur when rounding, .7+.2+.1 returns 0x1.fffffffffffffp-1 (very slightly less than 1), and .7+.1+.2 returns 0x1.0000000000000p0 (exactly 1).
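The encodings and both final results can be inspected with Python's float.hex(), assuming the usual IEEE-754 double behind Python's float. This is only a check of the values discussed above, not part of the original answer.

```python
for name, value in (("0.7", 0.7), ("0.2", 0.2), ("0.1", 0.1)):
    print(name, value.hex())

a = (0.7 + 0.2) + 0.1
b = (0.7 + 0.1) + 0.2
print(a.hex(), a == 1)  # 0x1.fffffffffffffp-1 False
print(b.hex(), b == 1)  # 0x1.0000000000000p+0 True
```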
Floating point multiplication is not associative in C or C++.
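#include <cstdio>
#include <cstdlib>
using namespace std;
int main() {
    int counter = 0;
    while (counter++ < 10) {
        // rand() / 100000 is an integer division; the quotient is then converted to float
        float a = rand() / 100000;
        float b = rand() / 100000;
        float c = rand() / 100000;
        if (a*(b*c) != (a*b)*c) {
            printf("Not equal\n");
            return 0;
        }
    }
    printf("Equal\n");
    return 0;
}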
In this program, about 30% of the time, (a*b)*c is not equal to a*(b*c).
Neither addition nor multiplication is associative with IEEE 754 double precision (64-bit) numbers. Here are examples for each (evaluated with Python 3.9.7):
>>> (.1 + .2) + .3
0.6000000000000001
>>> .1 + (.2 + .3)
0.6
>>> (.1 * .2) * .3
0.006000000000000001
>>> .1 * (.2 * .3)
0.006
Similar answer to Eric's, but for addition, and with Python.
import random
n = 1000
a = [random.random() for i in range(n)]
b = [random.random() for i in range(n)]
c = [random.random() for i in range(n)]
sum(1 if (a[i] + b[i]) + c[i] != a[i] + (b[i] + c[i]) else 0 for i in range(n))