Optimizing Fixed-Point Sqrt

I made what I think is a good fixed-point square root algorithm:
template<int64_t M, int64_t P>
typename enable_if<M + P == 32, FixedPoint<M, P>>::type sqrt(FixedPoint<M, P> f)
{
if (f.num == 0)
return 0;
//Reduce it to the 1/2 to 2 range (based around FixedPoint<2, 30> to avoid left/right shift branching)
int64_t num{ f.num }, faux_half{ 1 << 29 };
ptrdiff_t mag{ 0 };
while (num < (faux_half)) {
num <<= 2;
++mag;
}
int64_t res = (M % 2 == 0 ? SQRT_32_EVEN_LOOKUP : SQRT_32_ODD_LOOKUP)[(num >> (30 - 4)) - (1LL << 3)];
res >>= M / 2 + mag - 1; //Finish making an excellent guess
for (int i = 0; i < 2; ++i)
// \ | /
// \ | /
// _| V L
res = (res + (int64_t(f.num) << P) / res) >> 1; //Use Newton's method to improve greatly on guess
// 7 A r
// / | \
// / | \
// The Infamous Time Eater
return FixedPoint<M, P>(res, true);
}
However, after profiling (in release mode) I found out that the division here takes up 83% of the time this algorithm spends. I can speed it up 6x by replacing the division with multiplication, but that's just wrong. I found out that integer division is much slower than multiplication, unfortunately. Is there any way to optimize this?
In case this table is necessary.
const array<int32_t, 24> SQRT_32_EVEN_LOOKUP = {
0x2d413ccd, //magic numbers calculated by taking sqrt(0.5 + 0.5 * i / 8) from i = 0 to 23, multiplying by 2^30, and converting to hex
0x30000000,
0x3298b076,
0x3510e528,
0x376cf5d1,
0x39b05689,
0x3bddd423,
0x3df7bd63,
0x40000000,
0x41f83d9b,
0x43e1db33,
0x45be0cd2,
0x478dde6e,
0x49523ae4,
0x4b0bf165,
0x4cbbb9d6,
0x4e623850,
0x50000000,
0x5195957c,
0x532370b9,
0x54a9fea7,
0x5629a293,
0x57a2b749,
0x59159016
};
SQRT_32_ODD_LOOKUP is just SQRT_32_EVEN_LOOKUP divided by sqrt(2).

Reinventing the wheel, really, and not in a good way. The correct solution is to calculate 1/sqrt(x) using NR, and then multiply once to get x/sqrt(x) - just check for x==0 up front.
The reason why this is so much better is that the NR step for y=1/sqrt(x) is just y = (3-x*y*y)*y/2. That's all straightforward multiplication.
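A minimal sketch of that idea, assuming an unsigned Q16.16 input and 64-bit intermediates. The name `sqrt_q16`, the Q16.16 format, the linear initial guess and the choice of four iterations are assumptions made for illustration and are not taken from the question's FixedPoint class. The point is only that the loop body contains multiplications, subtractions and shifts, and no division.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: square root of an unsigned Q16.16 value, result in Q16.16.
// Runs Newton-Raphson on y = 1/sqrt(m), then multiplies once by m.
uint32_t sqrt_q16(uint32_t x)
{
    if (x == 0)
        return 0;

    // Normalise by an even shift so that m, read as Q0.32, lies in [0.25, 1).
    uint64_t m = x;
    int s = 0;
    while (m < (1ull << 30)) {
        m <<= 2;
        s += 2;
    }
    assert(m < (1ull << 32));

    // Initial guess y0 = 2.25 - 1.25*m in Q2.30 (within about 17% of 1/sqrt(m)).
    uint64_t y = (9ull << 28) - (m >> 2) - (m >> 4);

    // y <- y * (3 - m*y*y) / 2, using only multiplies and shifts.
    for (int i = 0; i < 4; ++i) {
        uint64_t my  = (m * y) >> 32;    // m*y   in Q2.30
        uint64_t myy = (my * y) >> 30;   // m*y*y in Q2.30
        y = (y * ((3ull << 30) - myy)) >> 31;
    }

    // sqrt(m) = m * (1/sqrt(m)); then undo the normalisation, with rounding.
    uint64_t root = (m * y) >> 32;       // sqrt(m) in Q2.30, in [0.5, 1)
    int sh = 6 + s / 2;
    return static_cast<uint32_t>((root + (1ull << (sh - 1))) >> sh);
}
```

For example, `sqrt_q16(4u << 16)` should come out as `2u << 16`. A lookup table like the one in the question could replace the linear guess and cut the iteration count.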

Related

how to wrap radians between -pi and pi with mod? [duplicate]

I'm looking for some nice C code that will accomplish effectively:
while (deltaPhase >= M_PI) deltaPhase -= M_TWOPI;
while (deltaPhase < -M_PI) deltaPhase += M_TWOPI;
What are my options?

Edit Apr 19, 2013:
Modulo function updated to handle boundary cases as noted by aka.nice and arr_sea:
static const double _PI= 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348;
static const double _TWO_PI= 6.2831853071795864769252867665590057683943387987502116419498891846156328125724179972560696;
// Floating-point modulo
// The result (the remainder) has same sign as the divisor.
// Similar to matlab's mod(); Not similar to fmod() - Mod(-3,4)= 1 fmod(-3,4)= -3
template<typename T>
T Mod(T x, T y)
{
static_assert(!std::numeric_limits<T>::is_exact , "Mod: floating-point type expected");
if (0. == y)
return x;
double m= x - y * floor(x/y);
// handle boundary cases resulted from floating-point cut off:
if (y > 0) // modulo range: [0..y)
{
if (m>=y) // Mod(-1e-16 , 360. ): m= 360.
return 0;
if (m<0 )
{
if (y+m == y)
return 0 ; // just in case...
else
return y+m; // Mod(106.81415022205296 , _TWO_PI ): m= -1.421e-14
}
}
else // modulo range: (y..0]
{
if (m<=y) // Mod(1e-16 , -360. ): m= -360.
return 0;
if (m>0 )
{
if (y+m == y)
return 0 ; // just in case...
else
return y+m; // Mod(-106.81415022205296, -_TWO_PI): m= 1.421e-14
}
}
return m;
}
// wrap [rad] angle to [-PI..PI)
inline double WrapPosNegPI(double fAng)
{
return Mod(fAng + _PI, _TWO_PI) - _PI;
}
// wrap [rad] angle to [0..TWO_PI)
inline double WrapTwoPI(double fAng)
{
return Mod(fAng, _TWO_PI);
}
// wrap [deg] angle to [-180..180)
inline double WrapPosNeg180(double fAng)
{
return Mod(fAng + 180., 360.) - 180.;
}
// wrap [deg] angle to [0..360)
inline double Wrap360(double fAng)
{
return Mod(fAng ,360.);
}

One-liner constant-time solution:
Okay, it's a two-liner if you count the second function for [min,max) form, but close enough — you could merge them together anyways.
/* change to `float/fmodf` or `long double/fmodl` or `int/%` as appropriate */
/* wrap x -> [0,max) */
double wrapMax(double x, double max)
{
/* integer math: `(max + x % max) % max` */
return fmod(max + fmod(x, max), max);
}
/* wrap x -> [min,max) */
double wrapMinMax(double x, double min, double max)
{
return min + wrapMax(x - min, max - min);
}
Then you can simply use deltaPhase = wrapMinMax(deltaPhase, -M_PI, +M_PI).
The solution is constant-time, meaning that the time it takes does not depend on how far your value is from [-PI,+PI), for better or for worse.
Verification:
Now, I don't expect you to take my word for it, so here are some examples, including boundary conditions. I'm using integers for clarity, but it works much the same with fmod() and floats:
Positive x:
wrapMax(3, 5) == 3: (5 + 3 % 5) % 5 == (5 + 3) % 5 == 8 % 5 == 3
wrapMax(6, 5) == 1: (5 + 6 % 5) % 5 == (5 + 1) % 5 == 6 % 5 == 1
Negative x:
Note: These assume that integer modulo copies left-hand sign; if not, you get the above ("Positive") case.
wrapMax(-3, 5) == 2: (5 + (-3) % 5) % 5 == (5 - 3) % 5 == 2 % 5 == 2
wrapMax(-6, 5) == 4: (5 + (-6) % 5) % 5 == (5 - 1) % 5 == 4 % 5 == 4
Boundaries:
wrapMax(0, 5) == 0: (5 + 0 % 5) % 5 == (5 + 0) % 5 == 5 % 5 == 0
wrapMax(5, 5) == 0: (5 + 5 % 5) % 5 == (5 + 0) % 5 == 5 % 5 == 0
wrapMax(-5, 5) == 0: (5 + (-5) % 5) % 5 == (5 + 0) % 5 == 5 % 5 == 0
Note: Possibly -0 instead of +0 for floating-point.
The wrapMinMax function works much the same: wrapping x to [min,max) is the same as wrapping x - min to [0,max-min), and then (re-)adding min to the result.
I don't know what would happen with a negative max, but feel free to check that yourself!
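The integer examples above can be checked mechanically. This is a small sketch of the integer form, assuming C++11 or later (where `%` takes the sign of the left operand); the names `wrapMaxInt` and `wrapMinMaxInt` are made up for this illustration.

```cpp
#include <cassert>

// Integer counterpart of wrapMax: wraps x into [0, max) for max > 0.
// Relies on % copying the sign of the left-hand operand (C++11 and later).
int wrapMaxInt(int x, int max)
{
    assert(max > 0);
    return (max + x % max) % max;
}

// Wraps x into [min, max) for min < max.
int wrapMinMaxInt(int x, int min, int max)
{
    return min + wrapMaxInt(x - min, max - min);
}
```

With degrees as integers, `wrapMinMaxInt(190, -180, 180)` gives -170, and both -180 and 180 map to -180.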

If ever your input angle can reach arbitrarily high values, and if continuity matters, you can also try
atan2(sin(x),cos(x))
This will preserve continuity of sin(x) and cos(x) better than modulo for high values of x, especially in single precision (float).
Indeed, exact_value_of_pi - double_precision_approximation ~= 1.22e-16
On the other hand, most library/hardware use a high precision approximation of PI for applying the modulo when evaluating trigonometric functions (though x86 family is known to use a rather poor one).
Result might be in [-pi,pi], you'll have to check the exact bounds.
Personally, I would keep any angle from growing to several revolutions by wrapping systematically, and stick to an fmod solution like the one from Boost.
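For reference, the atan2 approach is a one-liner. This is only a sketch; the function name `wrapViaAtan2` is invented here, and the exact behaviour at the ±pi boundary depends on the library, as noted above.

```cpp
#include <cassert>
#include <cmath>

// Wraps an angle in radians into [-pi, pi] by going through sin and cos.
double wrapViaAtan2(double x)
{
    assert(std::isfinite(x));
    return std::atan2(std::sin(x), std::cos(x));
}
```

It costs three transcendental calls per wrap, so it is a choice for robustness with large inputs rather than for speed.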

There is also the fmod function in math.h, but the sign causes trouble, so a subsequent operation is needed to make the result fit in the proper range (like you already do with the while loops). For big values of deltaPhase this is probably faster than subtracting/adding `M_TWOPI` hundreds of times.
deltaPhase = fmod(deltaPhase, M_TWOPI);
EDIT:
I didn't try it intensively but I think you can use fmod this way by handling positive and negative values differently:
if (deltaPhase>0)
deltaPhase = fmod(deltaPhase+M_PI, 2.0*M_PI)-M_PI;
else
deltaPhase = fmod(deltaPhase-M_PI, 2.0*M_PI)+M_PI;
The computational time is constant (unlike the while solution which gets slower as the absolute value of deltaPhase increases)

I would do this:
double wrap(double x) {
return x-2*M_PI*floor(x/(2*M_PI)+0.5);
}
There will be significant numerical errors. The best solution to the numerical errors is to store your phase scaled by 1/PI or by 1/(2*PI) and depending on what you are doing store them as fixed point.
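A sketch of the fixed-point suggestion, assuming the phase is stored as a fraction of a full turn in a 32-bit unsigned integer. The `Phase` struct and both function names are hypothetical. Unsigned overflow then performs the wrap exactly, and radians are only produced when a library function needs them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-point phase: the full circle [0, 2*pi) maps onto the
// 2^32 values of a uint32_t, so unsigned overflow does the wrapping.
struct Phase {
    uint32_t turns;  // phase / (2*pi), scaled by 2^32
};

inline Phase advance(Phase p, Phase delta)
{
    return Phase{ p.turns + delta.turns };  // wraps modulo 2^32
}

// Converts to radians in [-pi, pi).
inline double toRadians(Phase p)
{
    const double pi = 3.14159265358979323846;
    double t = p.turns / 4294967296.0;  // fraction of a turn in [0, 1)
    if (t >= 0.5)
        t -= 1.0;                       // move to [-0.5, 0.5)
    assert(t >= -0.5 && t < 0.5);
    return t * 2.0 * pi;
}
```

Adding half a turn twice lands exactly on zero, which is the property that accumulated radians cannot guarantee.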

Instead of working in radians, use angles scaled by 1/(2π) and use modf, floor etc. Convert back to radians to use library functions.
This also has the effect that rotating ten thousand and a half revolutions is the same as rotating half then ten thousand revolutions, which is not guaranteed if your angles are in radians, as you have an exact representation in the floating point value rather than summing approximate representations:
#include <iostream>
#include <cmath>
float wrap_rads ( float r )
{
while ( r > M_PI ) {
r -= 2 * M_PI;
}
while ( r <= -M_PI ) {
r += 2 * M_PI;
}
return r;
}
float wrap_grads ( float r )
{
float i;
r = modff ( r, &i );
if ( r > 0.5 ) r -= 1;
if ( r <= -0.5 ) r += 1;
return r;
}
int main ()
{
for (int rotations = 1; rotations < 100000; rotations *= 10 ) {
{
float pi = ( float ) M_PI;
float two_pi = 2 * pi;
float a = pi;
a += rotations * two_pi;
std::cout << rotations << " and a half rotations in radians " << a << " => " << wrap_rads ( a ) / two_pi << '\n' ;
}
{
float pi = ( float ) 0.5;
float two_pi = 2 * pi;
float a = pi;
a += rotations * two_pi;
std::cout << rotations << " and a half rotations in grads " << a << " => " << wrap_grads ( a ) / two_pi << '\n' ;
}
std::cout << '\n';
}}

Here is a version for other people finding this question that can use C++ with Boost:
#include <boost/math/constants/constants.hpp>
#include <boost/math/special_functions/sign.hpp>
template<typename T>
inline T normalizeRadiansPiToMinusPi(T rad)
{
// copy the sign of the value in radians to the value of pi
T signedPI = boost::math::copysign(boost::math::constants::pi<T>(),rad);
// set the value of rad to the appropriate signed value between pi and -pi
rad = fmod(rad+signedPI,(2*boost::math::constants::pi<T>())) - signedPI;
return rad;
}
C++11 version, no Boost dependency:
#include <cmath>
// Bring the 'difference' between two angles into [-pi; pi].
template <typename T>
T normalizeRadiansPiToMinusPi(T rad) {
// Copy the sign of the value in radians to the value of pi.
T signed_pi = std::copysign(M_PI,rad);
// Set the value of difference to the appropriate signed value between pi and -pi.
rad = std::fmod(rad + signed_pi,(2 * M_PI)) - signed_pi;
return rad;
}

I encountered this question when searching for how to wrap a floating point value (or a double) between two arbitrary numbers. It didn't answer specifically for my case, so I worked out my own solution which can be seen here. This will take a given value and wrap it between lowerBound and upperBound where upperBound perfectly meets lowerBound such that they are equivalent (ie: 360 degrees == 0 degrees so 360 would wrap to 0)
Hopefully this answer is helpful to others stumbling across this question looking for a more generic bounding solution.
double boundBetween(double val, double lowerBound, double upperBound){
if(lowerBound > upperBound){std::swap(lowerBound, upperBound);}
val-=lowerBound; //adjust to 0
double rangeSize = upperBound - lowerBound;
if(rangeSize == 0){return upperBound;} //avoid dividing by 0
return val - (rangeSize * std::floor(val/rangeSize)) + lowerBound;
}
A related question for integers is available here:
Clean, efficient algorithm for wrapping integers in C++

A two-liner, non-iterative, tested solution for normalizing arbitrary angles to [-π, π):
double normalizeAngle(double angle)
{
double a = fmod(angle + M_PI, 2 * M_PI);
return a >= 0 ? (a - M_PI) : (a + M_PI);
}
Similarly, for [0, 2π):
double normalizeAngle(double angle)
{
double a = fmod(angle, 2 * M_PI);
return a >= 0 ? a : (a + 2 * M_PI);
}

In the case where fmod() is implemented through truncated division and has the same sign as the dividend, it can be taken advantage of to solve the general problem thusly:
For the case of (-PI, PI]:
if (x > 0) x = x - 2PI * ceil(x/2PI) #Shift to the negative regime
return fmod(x - PI, 2PI) + PI
And for the case of [-PI, PI):
if (x < 0) x = x - 2PI * floor(x/2PI) #Shift to the positive regime
return fmod(x + PI, 2PI) - PI
[Note that this is pseudocode; my original was written in Tcl, and I didn't want to torture everyone with that. I needed the first case, so had to figure this out.]
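One possible C++ rendering of the two pseudocode cases, as a sketch only. The function and constant names are invented here, and inputs that sit exactly on a multiple of pi can land on either end of the interval because of floating-point rounding.

```cpp
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;
const double kTwoPi = 2.0 * kPi;

// Wraps x into (-pi, pi], following the first pseudocode case.
double wrapOpenLow(double x)
{
    assert(std::isfinite(x));
    if (x > 0)
        x -= kTwoPi * std::ceil(x / kTwoPi);   // shift to the negative regime
    return std::fmod(x - kPi, kTwoPi) + kPi;
}

// Wraps x into [-pi, pi), following the second pseudocode case.
double wrapOpenHigh(double x)
{
    assert(std::isfinite(x));
    if (x < 0)
        x -= kTwoPi * std::floor(x / kTwoPi);  // shift to the positive regime
    return std::fmod(x + kPi, kTwoPi) - kPi;
}
```

Both functions rely on std::fmod returning a result with the sign of its first argument, which the C and C++ standards specify.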

deltaPhase -= floor(deltaPhase/M_TWOPI)*M_TWOPI;

The way you suggested is best. It is fastest for small deflections. If angles in your program are constantly being deflected into the proper range, then you should only run into big out of range values rarely. Therefore paying the cost of a complicated modular arithmetic code every round seems wasteful. Comparisons are cheap compared to modular arithmetic (http://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/).

In C99:
float unwindRadians( float radians )
{
const bool radiansNeedUnwinding = radians < -M_PI || M_PI <= radians;
if ( radiansNeedUnwinding )
{
if ( signbit( radians ) )
{
radians = -fmodf( -radians + M_PI, 2.f * M_PI ) + M_PI;
}
else
{
radians = fmodf( radians + M_PI, 2.f * M_PI ) - M_PI;
}
}
return radians;
}

If linking against glibc's libm (including newlib's implementation) you can access
__ieee754_rem_pio2f() and __ieee754_rem_pio2() private functions:
extern __int32_t __ieee754_rem_pio2f (float,float*);
float wrapToPI(float xf){
const float p[4]={0,M_PI_2,M_PI,-M_PI_2};
float yf[2];
int q;
int qmod4;
q=__ieee754_rem_pio2f(xf,yf);
/* xf = q * M_PI_2 + yf[0] + yf[1] /
* yf[1] << y[0], not sure if it could be ignored */
qmod4= q % 4;
if (qmod4==2)
/* (yf[0] > 0) defines interval (-pi,pi]*/
return ( (yf[0] > 0) ? -p[2] : p[2] ) + yf[0] + yf[1];
else
return p[qmod4] + yf[0] + yf[1];
}
Edit: Just realised that you need to link to libm.a, I couldn't find the symbols declared in libm.so

I have used (in python):
def WrapAngle(Wrapped, UnWrapped ):
TWOPI = math.pi * 2
TWOPIINV = 1.0 / TWOPI
return UnWrapped + round((Wrapped - UnWrapped) * TWOPIINV) * TWOPI
c-code equivalent:
#define TWOPI 6.28318531
double WrapAngle(const double dWrapped, const double dUnWrapped )
{
const double TWOPIINV = 1.0/ TWOPI;
return dUnWrapped + round((dWrapped - dUnWrapped) * TWOPIINV) * TWOPI;
}
notice that this brings it in the wrapped domain +/- 2pi so for +/- pi domain you need to handle that afterward like:
if( angle > pi):
angle -= 2*math.pi

Calculating the summation of powers of a number modulo a number

There are 3 numbers: T, N, M. 1 ≤ T, M ≤ 10^9, 1 ≤ N ≤ 10^18 .
What is asked in the problem is to compute [Σ(T^i)]mod(m) where i varies from 0 to n. Obviously, O(N) or O(M) solutions wouldn't work because of 1 second time limit. How should I proceed?

As pointed out in previous answers, you may use the formula for the sum of a geometric progression. However, there is a small problem: if m is not prime, computing (T^n - 1) / (T - 1) cannot be done directly, because the division is not a well-defined operation. In fact there is a solution that can handle even non-prime moduli and has complexity O(log(n) * log(n)). The approach is similar to binary exponentiation. Here is my code written in C++ for this (note that my solution uses binary exponentiation internally):
typedef long long ll;
ll binary_exponent(ll x, ll y, ll mod) {
ll res = 1;
ll p = x;
while (y) {
if (y % 2) {
res = (res * p) % mod;
}
p = (p * p) % mod;
y /= 2;
}
return res;
}
ll gp_sum(ll a, ll n, ll mod) { // n is ll because N can be as large as 10^18
ll A = 1;
int num = 0;
ll res = 0;
ll degree = 1;
while (n) {
if (n & (1LL << num)) {
n &= (~(1LL << num));
res = (res + (A * binary_exponent(a, n, mod)) % mod) % mod;
}
A = (A + (A * binary_exponent(a, degree, mod)) % mod) % mod;
degree *= 2;
num++;
}
return res;
}
In this solution A stores consecutively the values 1, 1 + a, 1 + a + a^2 + a^3, ...1 + a + a^2 + ... a ^ (2^n - 1).
Also just like in binary exponentiation if I want to compute the sum of n degrees of a, I split n to sum of powers of two(essentially using the binary representation of n). Now having the above sequence of values for A, I choose the appropriate lengths(the ones that correspond to 1 bits of the binary representation of n) and multiply the sum by some value of a accumulating the result in res. Computing the values of A will take O(log(n)) time and for each value I may have to compute a degree of a which will result in another O(log(n)) - thus overall we have O(log(n) * log (n)).
Let's take an example - we want to compute 1 + a + a^2 .... + a ^ 10. In this case, we call gp_sum(a, 11, mod).
On the first iteration n & (1 << 0) is not zero as the first bit of 11(1011(2)) is 1. Thus I turn off this bit setting n to 10 and I accumulate in res: 0 + 1 * (a ^ (10)) = a^10. A is now a + 1.
The next second bit is also set in 10(1010(2)), so now n becomes 8 and res is a^10 + (a + 1)*(a^8)=a^10 + a^9 + a^8. A is now 1 + a + a^2 + a^3
Next bit is 0, thus res stays the same, but A will become 1 + a + a^2 + ... a^7.
On the last iteration the bit is 1 so we have:
res = a^10 + a^9 + a^8 + a^0 *(1 + a + a^2 + ... +a^7) = 1 + a .... + a ^10.

One can use an algorithm which is similar to binary exponentiation:
// Returns a pair <t^n mod m, sum of t^0..t^n mod m>,
// I assume that int is big enough to hold all values without overflowing.
pair<int, int> calc(int t, int n, int m)
if n == 0 // Base case. t^0 is always 1.
return (1 % m, 1 % m)
if n % 2 == 1
// We just compute the result for n - 1 and then add t^n.
(prevPow, prevSum) = calc(t, n - 1, m)
curPow = prevPow * t % m
curSum = (prevSum + curPow) % m
return (curPow, curSum)
// If n is even, we compute the sum for the first half.
(halfPow, halfSum) = calc(t, n / 2, m)
curPow = halfPow * halfPow % m // t^n = (t^(n/2))^2
curSum = (halfSum * halfPow + halfSum - halfPow + m) % m // both halves contain t^(n/2), so subtract it once
return (curPow, curSum)
The time complexity is O(log n)(the analysis is the same as for the binary exponentiation algorithm). Why is it better than a closed form formula for geometric progression? The latter involves division by (t - 1). But it is not guaranteed that there is an inverse of t - 1 mod m.
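A runnable C++ sketch of the recursion above, assuming the modulus fits in 32 bits so that every product fits in a uint64_t (the problem's limit of 10^9 satisfies this). The name `powAndSum` is made up. In the even case the term t^(n/2) appears in both halves, so it is subtracted once.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Returns {t^n mod m, (t^0 + ... + t^n) mod m}.
// Assumes 1 <= m <= 2^32 so that products of residues fit in uint64_t.
std::pair<uint64_t, uint64_t> powAndSum(uint64_t t, uint64_t n, uint64_t m)
{
    assert(m >= 1 && m <= (1ull << 32));
    if (n == 0)
        return { 1 % m, 1 % m };
    if (n % 2 == 1) {
        // Odd n: take the result for n - 1 and append t^n.
        auto prev = powAndSum(t, n - 1, m);
        uint64_t curPow = prev.first * (t % m) % m;
        return { curPow, (prev.second + curPow) % m };
    }
    // Even n: build the sum from the first half.
    auto half = powAndSum(t, n / 2, m);
    uint64_t curPow = half.first * half.first % m;
    uint64_t shifted = half.second * half.first % m;   // t^(n/2) + ... + t^n
    uint64_t curSum = (shifted + half.second + m - half.first) % m;
    return { curPow, curSum };
}
```

For the original problem the answer would be `powAndSum(T, N, M).second`; the recursion depth stays below about 2*log2(N).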

you can use this:
a^1 + a^2 + ... + a^n = a(1-a^n) / (1-a)
so, you just need to calc:
a * (1 - a^n) / (1 - a) mod M
and you can find O(logN) way to calc a^n mod M

It's a geometric series whose sum is equal to (T^(n+1) - 1) / (T - 1) for T ≠ 1.

performance of log10 function returning an int

Today I needed a cheap log10 function, of which I only use the int part. The result is floored, so the log10 of 999 would be 2. Would it be beneficial to write such a function myself? And if so, which way would be best? Assume the code would not be optimized.
The alternatives to log10 I've thought of:
use a for loop dividing or multiplying by 10;
use a string parser (probably extremely expensive);
use an integer log2() function and multiply by a constant.
Thanks in advance :)

The operation can be done in (fast) constant time on any architecture that has a count-leading-zeros or similar instruction (which is most architectures). Here's a C snippet I have sitting around to compute the number of digits in base ten, which is essentially the same task (assumes a gcc-like compiler and 32-bit int):
unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
static unsigned int baseTenDigits(unsigned int x) {
static const unsigned char guess[33] = {
0, 0, 0, 0, 1, 1, 1, 2, 2, 2,
3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
6, 6, 6, 6, 7, 7, 7, 8, 8, 8,
9, 9, 9
};
static const unsigned int tenToThe[] = {
1, 10, 100, 1000, 10000, 100000,
1000000, 10000000, 100000000, 1000000000,
};
unsigned int digits = guess[baseTwoDigits(x)];
return digits + (x >= tenToThe[digits]);
}
GCC and clang compile this down to ~10 instructions on x86. With care, one can make it faster still in assembly.
The key insight is to use the (extremely cheap) base-two logarithm to get a fast estimate of the base-ten logarithm; at that point we only need to compare against a single power of ten to decide if we need to adjust the guess. This is much more efficient than searching through multiple powers of ten to find the right one.
If the inputs are overwhelmingly biased to one- and two-digit numbers, a linear scan is sometimes faster; for all other input distributions, this implementation tends to win quite handily.
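A portable sketch of the same idea that returns floor(log10(x)) directly, as asked in the question. Two things here are assumptions of this sketch and not part of the answer's code: the bit count is taken with a plain loop instead of `__builtin_clz`, and the guess table is replaced by the multiplication `(bits * 1233) >> 12`, which gives the same values for 0 to 32 bits because 1233/4096 approximates log10(2).

```cpp
#include <cassert>
#include <cstdint>

// Number of significant bits in x (0 for x == 0); portable stand-in for
// 32 - __builtin_clz(x).
unsigned bitWidth(uint32_t x)
{
    unsigned bits = 0;
    while (x) {
        ++bits;
        x >>= 1;
    }
    return bits;
}

// floor(log10(x)) for x >= 1.
unsigned ilog10(uint32_t x)
{
    assert(x >= 1);
    static const uint32_t tenToThe[] = {
        1, 10, 100, 1000, 10000, 100000,
        1000000, 10000000, 100000000, 1000000000,
    };
    // Estimate the digit count from the bit count, then correct it with a
    // single comparison against one power of ten.
    unsigned guess = (bitWidth(x) * 1233u) >> 12;
    unsigned digits = guess + (x >= tenToThe[guess]);
    return digits - 1;
}
```

On compilers that provide it, swapping `bitWidth` for the count-leading-zeros builtin recovers the constant-time behaviour described above.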

One way to do it would be a loop subtracting powers of 10. These powers can be precomputed and stored in a table. Here is an example in Python:
table = [10**i for i in range(1, 10)]
# [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]
def fast_log10(n):
for i, k in enumerate(table):
if n - k < 0:
return i
Usage example:
>>> fast_log10(1)
0
>>> fast_log10(10)
1
>>> fast_log10(100)
2
>>> fast_log10(999)
2
>>> fast_log10(1000)
3
You may also use binary search with this table. Then the algorithm complexity would be only O(lg(n)), where n is the number of digits.
Here is an example with binary search in C:
long int table[] = {10, 100, 1000, 10000, 100000, 1000000};
#define TABLE_LENGHT sizeof(table) / sizeof(long int)
int bisect_log10(long int n, int s, int e) {
int a = (e - s) / 2 + s;
if(s >= e)
return s;
if((table[a] - n) <= 0)
return bisect_log10(n, a + 1, e);
else
return bisect_log10(n, s, a);
}
int fast_log10(long int n){
return bisect_log10(n, 0, TABLE_LENGHT);
}
Note that for small numbers this method would be slower than the method above.

Well, there's the old standby - the "poor man's log function".
(If you want to handle more than 63 integer digits, change the first "if" to a "while".)
n = 1;
if (v >= 1e32){n += 32; v /= 1e32;}
if (v >= 1e16){n += 16; v /= 1e16;}
if (v >= 1e8){n += 8; v /= 1e8;}
if (v >= 1e4){n += 4; v /= 1e4;}
if (v >= 1e2){n += 2; v /= 1e2;}
if (v >= 1e1){n += 1; v /= 1e1;}
so if you feed in 123456.7, here's how it goes:
n = 1;
if (v >= 1e32) no
if (v >= 1e16) no
if (v >= 1e8) no
if (v >= 1e4) yes, so n = 5, v = 12.34567
if (v >= 1e2) no
if (v >= 1e1) yes, so n = 6, v = 1.234567
so result is n = 6
Here's a variation that uses multiplication, rather than division:
int n = 1;
double d = 1, temp;
temp = d * 1e32; if (v >= temp){n += 32; d = temp;}
temp = d * 1e16; if (v >= temp){n += 16; d = temp;}
temp = d * 1e8; if (v >= temp){n += 8; d = temp;}
temp = d * 1e4; if (v >= temp){n += 4; d = temp;}
temp = d * 1e2; if (v >= temp){n += 2; d = temp;}
temp = d * 1e1; if (v >= temp){n += 1; d = temp;}
and an execution looks like this
v = 123456.7
n = 1
d = 1
temp = 1e32, if (v >= 1e32) no
temp = 1e16, if (v >= 1e16) no
temp = 1e8, if (v >= 1e8) no
temp = 1e4, if (v >= 1e4) yes, so n = 5, d = 1e4;
temp = 1e6, if (v >= 1e6) no
temp = 1e5, if (v >= 1e5) yes, so n = 6, d = 1e5;
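Wrapped into a function, the multiplication variant might look like the sketch below. The name `decimalDigits` and the stated input range are assumptions of this sketch; the result is the digit count of the integer part, so floor(log10(v)) is the return value minus one.

```cpp
#include <cassert>

// Number of decimal digits in the integer part of v, for 1 <= v < 1e64.
int decimalDigits(double v)
{
    assert(v >= 1.0 && v < 1e64);
    int n = 1;
    double d = 1.0, temp;
    // Binary search over the exponent: try steps of 32, 16, 8, 4, 2, 1.
    temp = d * 1e32; if (v >= temp) { n += 32; d = temp; }
    temp = d * 1e16; if (v >= temp) { n += 16; d = temp; }
    temp = d * 1e8;  if (v >= temp) { n += 8;  d = temp; }
    temp = d * 1e4;  if (v >= temp) { n += 4;  d = temp; }
    temp = d * 1e2;  if (v >= temp) { n += 2;  d = temp; }
    temp = d * 1e1;  if (v >= temp) { n += 1;  d = temp; }
    return n;
}
```

Calling `decimalDigits(123456.7)` follows exactly the trace shown above and returns 6.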

If you want a faster log function, you need to approximate its result. E.g. the exp function can be approximated using a 'short' Taylor approximation. You can find example approximations for exp, log, root and power here
edit:
You can find a short performance comparison here

Because an unsigned < or >= test is done simply by subtracting and checking the carry flag, it is possible to put both arrays (guess and negated tenToThe) in a single 64-bit value, combine both array lookups into one, and use the carry from 32-bit addition to adjust the guess. The high 32 bits of guess[n] provide the value of log10(2^n*2-1), while the low 32 bits contain -10^log10(2^n*2-1).
static unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
unsigned int baseTenDigits(unsigned int x) {
static uint64_t guess[33] = {
/* 1 */ 0, 0, 0,
/* 8 */ (1ull<<32)-10, (1ull<<32)-10, (1ull<<32)-10,
/* 64 */ (2ull<<32)-100, (2ull<<32)-100, (2ull<<32)-100,
/* 512 */ (3ull<<32)-1000, (3ull<<32)-1000, (3ull<<32)-1000,
(3ull<<32)-1000,
/* 8192 */ (4ull<<32)-10000, (4ull<<32)-10000, (4ull<<32)-10000,
/* 65536 */ (5ull<<32)-100000, (5ull<<32)-100000, (5ull<<32)-100000,
/* 524288 */ (6ull<<32)-1000000, (6ull<<32)-1000000, (6ull<<32)-1000000,
(6ull<<32)-1000000,
/* 8388608 */ (7ull<<32)-10000000, (7ull<<32)-10000000,
(7ull<<32)-10000000,
/* 67108864 */ (8ull<<32)-100000000, (8ull<<32)-100000000,
(8ull<<32)-100000000,
/* 536870912 */ (9ull<<32)-1000000000, (9ull<<32)-1000000000,
(9ull<<32)-1000000000,
/* index 32 (x >= 2^31): without this entry guess[32] would be 0 */
(9ull<<32)-1000000000,
};
uint64_t adjust = guess[baseTwoDigits(x)];
return (adjust + x) >> 32;
}

Without any specifications, I will just give a general answer:
The log function will be pretty efficient in most languages as it is such a basic function.
The fact that you are only interested in integers could give you some leverage, but probably this is not enough to easily beat the builtin standard solutions.
One of the few things that I can think of to be faster than a builtin function is a table lookup, so if you are only interested in the numbers up to 10000 for instance, you could simply create a table that you could use to look up any of these values when you need them.
Obviously this solution will not scale well, but it may be just what you need.
Sidenote: If you are importing the data for example, it may actually be faster to look at the string length directly (rather than first converting the string to a number and then looking at the value of the number). Of course this will require the input to be stored in just the right format, otherwise it won't gain you anything.

Fast bignum square computation

To speed up my bignum divisions I need to speed up the operation y = x^2 for bigints, which are represented as dynamic arrays of unsigned DWORDs. To be clear:
DWORD x[n+1] = { LSW, ......, MSW };
where n+1 is the number of used DWORDs
so the value of the number is x = x[0] + x[1]<<32 + ... + x[n]<<(32*n)
The question is: How do I compute y = x^2 as fast as possible without precision loss?
- Using C++ and with integer arithmetics (32bit with Carry) at disposal.
My current approach is applying multiplication y = x*x and avoid multiple multiplications.
For example:
x = x[0] + x[1]<<32 + ... x[n]<<32*(n)
For simplicity, let me rewrite it:
x = x0+ x1 + x2 + ... + xn
where index represent the address inside the array, so:
y = x*x
y = (x0 + x1 + x2 + ...xn)*(x0 + x1 + x2 + ...xn)
y = x0*(x0 + x1 + x2 + ...xn) + x1*(x0 + x1 + x2 + ...xn) + x2*(x0 + x1 + x2 + ...xn) + ...xn*(x0 + x1 + x2 + ...xn)
y0 = x0*x0
y1 = x1*x0 + x0*x1
y2 = x2*x0 + x1*x1 + x0*x2
y3 = x3*x0 + x2*x1 + x1*x2 + x0*x3
...
y(2n-2) = x(n-2)*x(n ) + x(n-1)*x(n-1) + x(n )*x(n-2)
y(2n-1) = x(n-1)*x(n ) + x(n )*x(n-1)
y(2n ) = x(n )*x(n )
After a closer look, it is clear that almost all xi*xj appears twice (not the first and last one) which means that N*N multiplications can be replaced by (N+1)*(N/2) multiplications. P.S. 32bit*32bit = 64bit so the result of every mul+add operation is handled as 64+1 bit.
Is there a better way to compute this fast? All I found during searches were sqrts algorithms, not sqr...
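To make the "each cross product only once" idea concrete, here is a minimal sketch on a plain vector of 32-bit limbs, LSW first as in the equations above. It is not the arbnum code: the name `squareLimbs` is hypothetical, and a uint64_t accumulator stands in for the 32-bit ALU with carry. It sums the products x[i]*x[j] for i < j, doubles that sum with a one-bit shift, and then adds the diagonal squares.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Squares a little-endian array of 32-bit limbs; the result has 2*n limbs.
std::vector<uint32_t> squareLimbs(const std::vector<uint32_t>& x)
{
    const std::size_t n = x.size();
    std::vector<uint32_t> y(2 * n, 0);

    // 1) Cross products x[i]*x[j] with i < j, each computed once.
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t carry = 0;
        for (std::size_t j = i + 1; j < n; ++j) {
            uint64_t t = static_cast<uint64_t>(x[i]) * x[j] + y[i + j] + carry;
            y[i + j] = static_cast<uint32_t>(t);
            carry = t >> 32;
        }
        y[i + n] = static_cast<uint32_t>(carry);
    }

    // 2) Double the cross-product sum (shift the whole number left by one bit).
    uint32_t top = 0;
    for (std::size_t k = 0; k < 2 * n; ++k) {
        uint32_t next = y[k] >> 31;
        y[k] = (y[k] << 1) | top;
        top = next;
    }

    // 3) Add the diagonal terms x[i]^2 at limb position 2*i.
    uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t sq = static_cast<uint64_t>(x[i]) * x[i];
        uint64_t lo = static_cast<uint64_t>(y[2 * i]) + static_cast<uint32_t>(sq) + carry;
        y[2 * i] = static_cast<uint32_t>(lo);
        uint64_t hi = static_cast<uint64_t>(y[2 * i + 1]) + (sq >> 32) + (lo >> 32);
        y[2 * i + 1] = static_cast<uint32_t>(hi);
        carry = hi >> 32;
    }
    assert(carry == 0);  // the square always fits in 2*n limbs
    return y;
}
```

This performs n*(n+1)/2 limb multiplications instead of n*n, matching the count in the question.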
Fast sqr
!!! Beware that all numbers in my code are MSW first,... not as in the text above (there they are LSW first for simplicity of the equations, otherwise it would be an index mess).
Current functional fsqr implementation
void arbnum::sqr(const arbnum &x)
{
// O((N+1)*N/2)
arbnum c;
DWORD h, l;
int N, nx, nc, i, i0, i1, k;
c._alloc(x.siz + x.siz + 1);
nx = x.siz - 1;
nc = c.siz - 1;
N = nx + nx;
for (i=0; i<=nc; i++)
c.dat[i]=0;
for (i=1; i<N; i++)
for (i0=0; (i0<=nx) && (i0<=i); i0++)
{
i1 = i - i0;
if (i0 >= i1)
break;
if (i1 > nx)
continue;
h = x.dat[nx-i0];
if (!h)
continue;
l = x.dat[nx-i1];
if (!l)
continue;
alu.mul(h, l, h, l);
k = nc - i;
if (k >= 0)
alu.add(c.dat[k], c.dat[k], l);
k--;
if (k>=0)
alu.adc(c.dat[k], c.dat[k],h);
k--;
for (; (alu.cy) && (k>=0); k--)
alu.inc(c.dat[k]);
}
c.shl(1);
for (i = 0; i <= N; i += 2)
{
i0 = i>>1;
h = x.dat[nx-i0];
if (!h)
continue;
alu.mul(h, l, h, h);
k = nc - i;
if (k >= 0)
alu.add(c.dat[k], c.dat[k],l);
k--;
if (k>=0)
alu.adc(c.dat[k], c.dat[k], h);
k--;
for (; (alu.cy) && (k >= 0); k--)
alu.inc(c.dat[k]);
}
c.bits = c.siz<<5;
c.exp = x.exp + x.exp + ((c.siz - x.siz - x.siz)<<5) + 1;
c.sig = sig;
*this = c;
}
Use of Karatsuba multiplication
(thanks to Calpis)
I implemented Karatsuba multiplication, but the results are massively slower even than simple O(N^2) multiplication, probably because of that horrible recursion that I can't see any way to avoid. Its break-even point must be at really large numbers (bigger than hundreds of digits), but even then there are a lot of memory transfers. Is there a way to avoid the recursive calls (a non-recursive variant; almost all recursive algorithms can be rewritten that way)? Still, I will try to tweak things and see what happens (avoid normalizations, etc.; it could also be some silly mistake in the code). Anyway, after specializing Karatsuba for the case x*x there is not much performance gain.
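For readers unfamiliar with what "Karatsuba for the case x*x" looks like, here is one level of it on a single 32-bit value split into 16-bit halves. This is a toy sketch with an invented name, not the arbnum implementation. With x = a*2^16 + b, the middle term 2ab is obtained as (a+b)^2 - a^2 - b^2, so all three sub-products are themselves squares.

```cpp
#include <cassert>
#include <cstdint>

// Squares a 32-bit value using three half-size squarings (one Karatsuba level).
uint64_t sqrKaratsuba32(uint32_t x)
{
    const uint64_t a = x >> 16;              // high half
    const uint64_t b = x & 0xFFFFu;          // low half
    const uint64_t a2 = a * a;
    const uint64_t b2 = b * b;
    const uint64_t s = a + b;                // at most 17 bits
    const uint64_t cross = s * s - a2 - b2;  // equals 2*a*b
    assert(cross == 2 * a * b);              // sanity check on the identity
    return (a2 << 32) + (cross << 16) + b2;
}
```

In a real bignum version the three squarings recurse on half-length limb arrays, which is where the recursion overhead mentioned above comes from.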
Optimized Karatsuba multiplication
Performance test for y = x^2 looped 1000x times, 0.9 < x < 1 ~ 32*98 bits:
x = 0.98765588997654321000000009876... | 98*32 bits
sqr [ 213.989 ms ] ... O((N+1)*N/2) fast sqr
mul1[ 363.472 ms ] ... O(N^2) classic multiplication
mul2[ 349.384 ms ] ... O(3*(N^log2(3))) optimized Karatsuba multiplication
mul3[ 9345.127 ms] ... O(3*(N^log2(3))) unoptimized Karatsuba multiplication
x = 0.98765588997654321000... | 195*32 bits
sqr [ 883.01 ms ]
mul1[ 1427.02 ms ]
mul2[ 1089.84 ms ]
x = 0.98765588997654321000... | 389*32 bits
sqr [ 3189.19 ms ]
mul1[ 5553.23 ms ]
mul2[ 3159.07 ms ]
After optimizations for Karatsuba, the code is massively faster than before. Still, for smaller numbers it is slightly less than half the speed of my O(N^2) multiplication. For bigger numbers, it is faster, with the ratio given by the complexities of both multiplications. The threshold for multiplication is around 32*98 bits and for sqr around 32*389 bits, so if the sum of input bits crosses this threshold, Karatsuba multiplication is used to speed up multiplication, and similarly for sqr.
BTW, optimizations included:
Minimize heap trashing by too-big recursion argument
Avoidance of any bignum arithmetic (+,-); a 32-bit ALU with carry is used instead.
Ignoring 0*y or x*0 or 0*0 cases
Reformatting input x,y number sizes to power of two to avoid reallocating
Implement modulo multiplication for z1 = (x0 + x1)*(y0 + y1) to minimize recursion
Modified Schönhage-Strassen multiplication to sqr implementation
I have tested use of FFT and NTT transforms to speed up sqr computation. The results are these:
FFT
It loses accuracy and therefore needs high-precision complex numbers. This actually slows things down considerably, so no speedup is present. The result is not precise (it can be wrongly rounded), so FFT is unusable (for now).
NTT
NTT is a finite-field DFT, so no accuracy loss occurs. It needs modular arithmetic on unsigned integers: modpow, modmul, modadd and modsub.
I use DWORD (32-bit unsigned integer numbers). The NTT input/output vector size is limited because of overflow issues! For 32-bit modular arithmetic, N is limited to (2^32)/(max(input[])^2), so the bigint must be divided into smaller chunks (I use BYTEs, so the maximum size of bigint processed is
(2^32)/((2^8)^2) = 2^16 bytes = 2^14 DWORDs = 16384 DWORDs).
The sqr uses only 1xNTT + 1xINTT instead of 2xNTT + 1xINTT for multiplication but NTT usage is too slow and the threshold number size is too large for practical use in my implementation (for mul and also for sqr).
It is possible that this is even over the overflow limit, so 64-bit modular arithmetic would have to be used, which can slow things down even more. So NTT is also unusable for my purposes.
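The four modular primitives named above can be sketched as follows for a 32-bit modulus, using a 64-bit intermediate where a sum or product can overflow. These are generic textbook versions under the assumption that operands are already reduced (a, b < p); they are not the optimized routines used in the measurements.

```cpp
#include <cassert>
#include <cstdint>

// All functions assume a, b < p and 0 < p < 2^32.
uint32_t modadd(uint32_t a, uint32_t b, uint32_t p)
{
    uint64_t s = static_cast<uint64_t>(a) + b;
    return static_cast<uint32_t>(s >= p ? s - p : s);
}

uint32_t modsub(uint32_t a, uint32_t b, uint32_t p)
{
    return a >= b ? a - b : a + (p - b);
}

uint32_t modmul(uint32_t a, uint32_t b, uint32_t p)
{
    return static_cast<uint32_t>(static_cast<uint64_t>(a) * b % p);
}

// a^e mod p by square-and-multiply.
uint32_t modpow(uint32_t a, uint32_t e, uint32_t p)
{
    assert(p > 0);
    uint32_t r = 1 % p;
    while (e) {
        if (e & 1)
            r = modmul(r, a, p);
        a = modmul(a, a, p);
        e >>= 1;
    }
    return r;
}
```

The `%` in `modmul` is a 64-bit division, which is typically the expensive part and the usual target of NTT-specific optimizations.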
Some measurements:
a = 0.98765588997654321000 | 389*32 bits
looped 1x times
sqr1[ 3.177 ms ] fast sqr
sqr2[ 720.419 ms ] NTT sqr
mul1[ 5.588 ms ] simple mul
mul2[ 3.172 ms ] karatsuba mul
mul3[ 1053.382 ms ] NTT mul
My implementation:
void arbnum::sqr_NTT(const arbnum &x)
{
// O(N*log(N)*(log(log(N)))) - 1x NTT
// Schönhage-Strassen sqr
// To prevent NTT overflow: n <= 48K * 8 bit -> result siz <= 12K * 32 bit -> x.siz + y.siz <= 12K!!!
int i, j, k, n;
int s = x.sig*x.sig, exp0 = x.exp + x.exp - ((x.siz+x.siz)<<5) + 2;
i = x.siz;
for (n = 1; n < i; n<<=1)
;
if (n + n > 0x3000) {
_error(_arbnum_error_TooBigNumber);
zero();
return;
}
n <<= 3;
DWORD *xx, *yy, q, qq;
xx = new DWORD[n+n];
#ifdef _mmap_h
if (xx)
mmap_new(xx, (n+n) << 2);
#endif
if (xx==NULL) {
_error(_arbnum_error_NotEnoughMemory);
zero();
return;
}
yy = xx + n;
// Zero padding (and split DWORDs to BYTEs)
for (i--, k=0; i >= 0; i--)
{
q = x.dat[i];
xx[k] = q&0xFF; k++; q>>=8;
xx[k] = q&0xFF; k++; q>>=8;
xx[k] = q&0xFF; k++; q>>=8;
xx[k] = q&0xFF; k++;
}
for (;k<n;k++)
xx[k] = 0;
//NTT
fourier_NTT ntt;
ntt.NTT(yy,xx,n); // init NTT for n
// Convolution
for (i=0; i<n; i++)
yy[i] = modmul(yy[i], yy[i], ntt.p);
//INTT
ntt.INTT(xx, yy);
//suma
q=0;
for (i = 0, j = 0; i<n; i++) {
qq = xx[i];
q += qq&0xFF;
yy[n-i-1] = q&0xFF;
q>>=8;
qq>>=8;
q+=qq;
}
// Merge BYTEs to DWORDs and copy them to result
_alloc(n>>2);
for (i = 0, j = 0; i<siz; i++)
{
q =(yy[j]<<24)&0xFF000000; j++;
q |=(yy[j]<<16)&0x00FF0000; j++;
q |=(yy[j]<< 8)&0x0000FF00; j++;
q |=(yy[j] )&0x000000FF; j++;
dat[i] = q;
}
#ifdef _mmap_h
if (xx)
mmap_del(xx);
#endif
delete[] xx; // xx was allocated with new[]
bits = siz<<5;
sig = s;
exp = exp0 + (siz<<5) - 1;
// _normalize();
}
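The two bookkeeping steps around the transform in sqr_NTT (splitting DWORDs into BYTE digits, then propagating carries and packing the digits back) can be looked at in isolation. The sketch below assumes the same layout as the text (most significant DWORD first in the number, least significant byte first in the digit array). The function names are made up for illustration, and any carry left over after the last digit is simply dropped.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Split words (most significant word first) into base-256 digits,
// least significant digit first.
std::vector<std::uint32_t> split_to_bytes(const std::vector<std::uint32_t>& words) {
    std::vector<std::uint32_t> digits;
    digits.reserve(words.size() * 4);
    for (auto it = words.rbegin(); it != words.rend(); ++it) {
        std::uint32_t q = *it;
        for (int b = 0; b < 4; ++b) { digits.push_back(q & 0xFF); q >>= 8; }
    }
    return digits;
}

// Propagate carries through digits that may exceed 255 (as they do after
// a convolution) and pack them back into words, most significant first.
std::vector<std::uint32_t> carry_and_merge(const std::vector<std::uint32_t>& digits) {
    std::vector<std::uint32_t> words(digits.size() / 4, 0);
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < words.size() * 4; ++i) {
        carry += digits[i];
        const std::uint32_t byte = static_cast<std::uint32_t>(carry & 0xFF);
        carry >>= 8;
        // Digit i belongs to word i / 4 counted from the least significant end.
        words[words.size() - 1 - i / 4] |= byte << (8 * (i % 4));
    }
    return words;
}
```

Round-tripping a number through both functions returns it unchanged, which is a cheap sanity check before plugging a transform in between.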
Conclusion
For smaller numbers, my fast sqr approach is the best option; above the threshold, Karatsuba multiplication is better. But I still think there should be something trivial that we have overlooked. Does anyone have other ideas?
NTT optimization
After intensive optimizations (mostly of the NTT; see the Stack Overflow question Modular arithmetics and NTT (finite field DFT) optimizations), some values have changed:
a = 0.98765588997654321000 | 1553*32bits
looped 10x times
mul2[ 28.585 ms ] Karatsuba mul
mul3[ 26.311 ms ] NTT mul
So now NTT multiplication is finally faster than Karatsuba above a threshold of about 1500*32 bits.
Some measurements and bug spotted
a = 0.99991970486 | 1553*32 bits
looped: 10x
sqr1[ 58.656 ms ] fast sqr
sqr2[ 13.447 ms ] NTT sqr
mul1[ 102.563 ms ] simple mul
mul2[ 28.916 ms ] Karatsuba mul Error
mul3[ 19.470 ms ] NTT mul
I found out that my Karatsuba (over/under)flows the LSB of each DWORD segment of the bignum. Once I have investigated, I will update the code...
Also, after further NTT optimizations the thresholds changed, so for NTT sqr it is 310*32 bits = 9920 bits of operand, and for NTT mul it is 1396*32 bits = 44672 bits of result (sum of bits of operands).
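Thresholds like these are usually wired into a small dispatcher. The sketch below uses the two measured NTT thresholds from the text. The enum names and functions are hypothetical, and the Karatsuba threshold is left as a parameter because no value for it is given here.

```cpp
#include <cstddef>

enum class SqrAlgo { Fast, Ntt };
enum class MulAlgo { Simple, Karatsuba, Ntt };

// Thresholds in 32-bit words, taken from the measurements in the text.
constexpr std::size_t kNttSqrWords = 310;   // operand size
constexpr std::size_t kNttMulWords = 1396;  // sum of both operand sizes

SqrAlgo choose_sqr(std::size_t words) {
    return words >= kNttSqrWords ? SqrAlgo::Ntt : SqrAlgo::Fast;
}

// karatsuba_words is an assumed, machine-specific tuning value.
MulAlgo choose_mul(std::size_t x_words, std::size_t y_words,
                   std::size_t karatsuba_words) {
    if (x_words + y_words >= kNttMulWords) return MulAlgo::Ntt;
    if (x_words >= karatsuba_words && y_words >= karatsuba_words)
        return MulAlgo::Karatsuba;
    return MulAlgo::Simple;
}
```

The thresholds depend heavily on the machine and on how well each routine is optimized, so they should be re-measured rather than copied.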
Karatsuba code repaired thanks to @greybeard
//---------------------------------------------------------------------------
void arbnum::_mul_karatsuba(DWORD *z, DWORD *x, DWORD *y, int n)
{
// Recursion for Karatsuba
// z[2n] = x[n]*y[n];
// n=2^m
int i;
for (i=0; i<n; i++)
if (x[i]) {
i=-1;
break;
} // x==0 ?
if (i < 0)
for (i = 0; i<n; i++)
if (y[i]) {
i = -1;
break;
} // y==0 ?
if (i >= 0) {
for (i = 0; i < n + n; i++)
z[i]=0;
return;
} // 0.? = 0
if (n == 1) {
alu.mul(z[0], z[1], x[0], y[0]);
return;
}
if (n< 1)
return;
int n2 = n>>1;
_mul_karatsuba(z+n, x+n2, y+n2, n2); // z0 = x0.y0
_mul_karatsuba(z , x , y , n2); // z2 = x1.y1
DWORD *q = new DWORD[n<<1], *q0, *q1, *qq;
BYTE cx,cy;
if (q == NULL) {
_error(_arbnum_error_NotEnoughMemory);
return;
}
#define _add { alu.add(qq[i], q0[i], q1[i]); for (i--; i>=0; i--) alu.adc(qq[i], q0[i], q1[i]); } // qq = q0 + q1 ...[i..0]
#define _sub { alu.sub(qq[i], q0[i], q1[i]); for (i--; i>=0; i--) alu.sbc(qq[i], q0[i], q1[i]); } // qq = q0 - q1 ...[i..0]
qq = q;
q0 = x + n2;
q1 = x;
i = n2 - 1;
_add;
cx = alu.cy; // =x0+x1
qq = q + n2;
q0 = y + n2;
q1 = y;
i = n2 - 1;
_add;
cy = alu.cy; // =y0+y1
_mul_karatsuba(q + n, q + n2, q, n2); // =(x0+x1)(y0+y1) mod ((2^N)-1)
if (cx) {
qq = q + n;
q0 = qq;
q1 = q + n2;
i = n2 - 1;
_add;
cx = alu.cy;
}// += cx*(y0 + y1) << n2
if (cy) {
qq = q + n;
q0 = qq;
q1 = q;
i = n2 -1;
_add;
cy = alu.cy;
}// +=cy*(x0+x1)<<n2
qq = q + n; q0 = qq; q1 = z + n; i = n - 1; _sub; // -=z0
qq = q + n; q0 = qq; q1 = z; i = n - 1; _sub; // -=z2
qq = z + n2; q0 = qq; q1 = q + n; i = n - 1; _add; // z1=(x0+x1)(y0+y1)-z0-z2
DWORD ccc=0;
if (alu.cy)
ccc++; // Handle carry from last operation
if (cx || cy)
ccc++; // Handle carry from before last operation
if (ccc)
{
i = n2 - 1;
alu.add(z[i], z[i], ccc);
for (i--; i>=0; i--)
if (alu.cy)
alu.inc(z[i]);
else
break;
}
delete[] q;
#undef _add
#undef _sub
}
//---------------------------------------------------------------------------
void arbnum::mul_karatsuba(const arbnum &x, const arbnum &y)
{
// O(3*(N)^log2(3)) ~ O(3*(N^1.585))
// Karatsuba multiplication
//
int s = x.sig*y.sig;
arbnum a, b;
a = x;
b = y;
a.sig = +1;
b.sig = +1;
int i, n;
for (n = 1; (n < a.siz) || (n < b.siz); n <<= 1)
;
a._realloc(n);
b._realloc(n);
_alloc(n + n);
for (i=0; i < siz; i++)
dat[i]=0;
_mul_karatsuba(dat, a.dat, b.dat, n);
bits = siz << 5;
sig = s;
exp = a.exp + b.exp + ((siz-a.siz-b.siz)<<5) + 1;
// _normalize();
}
//---------------------------------------------------------------------------
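The identity behind the recursion is easier to see at the smallest useful size. The sketch below is not taken from the arbnum code: it multiplies two 32-bit values using three multiplications of their 16-bit halves instead of four, with x0/y0 as the low halves as in the comments above. At this size all intermediate values fit in 64 bits, so the carry handling of the full version is not needed.

```cpp
#include <cstdint>

// One Karatsuba step: 32x32 -> 64 bit product from three half-size products.
std::uint64_t karatsuba32(std::uint32_t x, std::uint32_t y) {
    const std::uint64_t x1 = x >> 16, x0 = x & 0xFFFF;  // high and low halves
    const std::uint64_t y1 = y >> 16, y0 = y & 0xFFFF;
    const std::uint64_t z2 = x1 * y1;                   // high * high
    const std::uint64_t z0 = x0 * y0;                   // low * low
    // (x0+x1)(y0+y1) - z0 - z2 = x0*y1 + x1*y0; the sums have at most 17 bits,
    // so the product has at most 34 bits.
    const std::uint64_t z1 = (x0 + x1) * (y0 + y1) - z0 - z2;
    return (z2 << 32) + (z1 << 16) + z0;
}
```

In the multi-word version the sums x0+x1 and y0+y1 can overflow their half-size buffers, which is what the cx/cy carry corrections in _mul_karatsuba take care of.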
My arbnum number representation:
// dat is MSDW first ... LSDW last
DWORD *dat; int siz,exp,sig,bits;
dat[siz] is the mantissa. LSDW means least significant DWORD.
exp is the exponent of MSB of dat[0]
The first nonzero bit is present in the mantissa!!!
// |-----|---------------------------|---------------|------|
// | sig | MSB mantissa LSB | exponent | bits |
// |-----|---------------------------|---------------|------|
// | +1 | 0.(0 ... 0) | 2^0 | 0 | +zero
// | -1 | 0.(0 ... 0) | 2^0 | 0 | -zero
// |-----|---------------------------|---------------|------|
// | +1 | 1.(dat[0] ... dat[siz-1]) | 2^exp | n | +number
// | -1 | 1.(dat[0] ... dat[siz-1]) | 2^exp | n | -number
// |-----|---------------------------|---------------|------|
// | +1 | 1.0 | 2^+0x7FFFFFFE | 1 | +infinity
// | -1 | 1.0 | 2^+0x7FFFFFFE | 1 | -infinity
// |-----|---------------------------|---------------|------|

If I understand your algorithm correctly, it seems to be O(n^2), where n is the number of digits.
Have you looked at the Karatsuba algorithm?
It speeds up multiplication using a divide-and-conquer approach. It may be worth taking a look at.

Great question, thanks!
I decided to implement a large C++ solution from scratch for you, based on the Number Theoretic Transform (NTT) and the Discrete Fourier Transform.
To summarize in advance: my FFT/NTT code achieves a 330x speedup over naive schoolbook multiplication on an old 2-core laptop, for arrays of 2^16 32-bit words. Even bigger arrays, above 2^20 words, should give a much larger speedup still.
Squaring a number of 2^22 32-bit words (i.e. 4 million words) takes 7 seconds with my NTT and 13 seconds with my FFT, on an old 2 GHz 2-core laptop with SSE2 only.
As a reminder, FFT and NTT give a multiplication time of O(N * log(N)), while the naive schoolbook algorithm takes O(N^2) time. That is the source of the huge speedup described above.
Both transforms are well described, together with code, in this article, which was my main inspiration when writing the code below. Another good reference is Nayuki's NTT article.
I was convinced that for quite large numbers these two transforms would beat any other method, such as Karatsuba.
Besides the basic approach described in the article, I also did dozens of optimizations:
For NTT, computed my own set of primitive roots and moduli, and used the biggest modulus, the one closest to 2^62.
Used multi-threading, through OpenMP, on almost every loop of the NTT and FFT computation.
For squaring, used 2 transforms instead of the 3 needed for a general multiply. This gives a 33% speed boost.
For NTT, used Montgomery reduction on all arrays when computing the modulus. This gave about a 2x-3x speedup.
Used constexpr functions and values and template programming wherever I could. Moving runtime values to compile time where possible gives a lot of speedup.
Re-designed the swap/shuffle function that runs at the start of every FFT/NTT transform. Used a precomputed table and caching to reuse previous results. Also did the swapping in blocks to make reads/writes cache-friendly. Bit twiddling is done not in a loop but with a precomputed bit table.
Inside the main loop of the transform, factored out the computation of the W multiplier into a separate loop, together with precomputation/caching. This gave about a 2x speedup.
Used Intel SIMD instruction sets, currently SSE2 and AVX. These are used only for FFT, because NTT uses 128-bit integer division and multiplication and add/sub-with-carry, which are not available in SIMD. For SIMD in FFT I also designed loop unrolling with special cache-friendly storage of complex numbers in std::array<>.
Measured the time/performance of NTT/FFT multiplication versus naive.
Analyzed the error rate inside FFT. As a reminder, NTT has no errors at all.
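To make the "two transforms for squaring" point concrete, here is a deliberately tiny NTT squaring sketch. It is not the code from the Gist and has none of the optimizations listed above. It assumes the well-known prime 998244353 = 119 * 2^23 + 1 with primitive root 3 and base-256 digits stored least significant first. The result is only exact while (transform length) * 255^2 stays below the modulus, so it is limited to inputs of a few thousand bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::uint64_t kMod = 998244353;  // 119 * 2^23 + 1
constexpr std::uint64_t kRoot = 3;         // primitive root modulo kMod

std::uint64_t pow_mod(std::uint64_t b, std::uint64_t e) {
    std::uint64_t r = 1;
    for (b %= kMod; e; e >>= 1, b = b * b % kMod)
        if (e & 1) r = r * b % kMod;
    return r;
}

// In-place iterative NTT; a.size() must be a power of two.
void ntt(std::vector<std::uint64_t>& a, bool inverse) {
    const std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {      // bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (std::size_t len = 2; len <= n; len <<= 1) {  // butterfly passes
        std::uint64_t w = pow_mod(kRoot, (kMod - 1) / len);
        if (inverse) w = pow_mod(w, kMod - 2);        // modular inverse of w
        for (std::size_t i = 0; i < n; i += len) {
            std::uint64_t wk = 1;
            for (std::size_t k = 0; k < len / 2; ++k, wk = wk * w % kMod) {
                const std::uint64_t u = a[i + k];
                const std::uint64_t v = a[i + k + len / 2] * wk % kMod;
                a[i + k] = (u + v) % kMod;
                a[i + k + len / 2] = (u + kMod - v) % kMod;
            }
        }
    }
    if (inverse) {
        const std::uint64_t n_inv = pow_mod(n, kMod - 2);
        for (auto& x : a) x = x * n_inv % kMod;
    }
}

// Square a little-endian base-256 number: one forward NTT, one inverse NTT.
std::vector<std::uint8_t> square_bytes(const std::vector<std::uint8_t>& x) {
    std::size_t n = 1;
    while (n < 2 * x.size()) n <<= 1;
    std::vector<std::uint64_t> a(n, 0);
    for (std::size_t i = 0; i < x.size(); ++i) a[i] = x[i];
    ntt(a, false);
    for (auto& v : a) v = v * v % kMod;               // pointwise square
    ntt(a, true);
    std::vector<std::uint8_t> out(2 * x.size());
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < out.size(); ++i) {    // carry propagation
        carry += a[i];
        out[i] = static_cast<std::uint8_t>(carry & 0xFF);
        carry >>= 8;
    }
    return out;
}
```

A general multiply would need a second forward transform for the other operand, which is where the 3-versus-2 transform count comes from.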
My code is self-contained: if you compile and run it, it runs tests that measure speed. Inside the test function you can see how to use my library. The test runs FFT/NTT/naive multiplication, measures time, and checks that all multiplication results are correct, i.e. equal to the naive version.
Note: no matter how hard I tried to speed up FFT through SIMD, my NTT is so optimized that it is 1.3-1.8x faster than FFT. As you know, FFT gives errors which grow with the size of the big number. Taking into account that my NTT is also faster, NTT is the only option for you!
It turned out that FFT can be used only for array sizes up to about 2^16 32-bit words; beyond that the error becomes too large and destroys the final result. Alternatively, you can decrease the size of the input 32-bit numbers to 10-12 bits, which helps to reduce errors, yet you still can't go beyond an array size of 2^18 without critical error. You have to determine the error size experimentally to figure out what works best.
The code compiles with Clang/MSVC/GCC, and maybe other compilers too. It has no external library dependencies, except perhaps the OpenMP library, which usually ships with the compiler. Only the computation of primitive roots (NTT modulus) requires the Boost library, and only for MSVC, where it uses just the 128-bit integer type.
The code is 65 KB, so I can't inline it in this post (the Stack Overflow post size limit is 30,000 characters). Instead I'm providing it via the GitHub Gist link below. You can also click the Try it online! link to run the code on Godbolt's online server.
Try it online!
Github Gist source code
Example console output:
Using SIMD SSE2
Test FindNttMod
FindNttEntry<T>{.k = 57, .c = 29, .p = 4179340454199820289, .g = 3, .root = 68630377364883, .plog2 = 61.86},
FindNttEntry<T>{.k = 54, .c = 177, .p = 3188548536178311169, .g = 7, .root = 3055434446054240334, .plog2 = 61.47},
FindNttEntry<T>{.k = 54, .c = 163, .p = 2936346957045563393, .g = 3, .root = 83050791888939419, .plog2 = 61.35},
FindNttEntry<T>{.k = 55, .c = 69, .p = 2485986994308513793, .g = 5, .root = 1700750308946223057, .plog2 = 61.11},
FindNttEntry<T>{.k = 54, .c = 127, .p = 2287828610704211969, .g = 3, .root = 878887558841786394, .plog2 = 60.99},
FindNttEntry<T>{.k = 55, .c = 57, .p = 2053641430080946177, .g = 7, .root = 640559856471874596, .plog2 = 60.83},
FindNttEntry<T>{.k = 56, .c = 27, .p = 1945555039024054273, .g = 5, .root = 1613915479851665306, .plog2 = 60.75},
FindNttEntry<T>{.k = 53, .c = 161, .p = 1450159080013299713, .g = 3, .root = 359678689516082930, .plog2 = 60.33},
FindNttEntry<T>{.k = 53, .c = 143, .p = 1288029493427961857, .g = 3, .root = 531113314168589713, .plog2 = 60.16},
FindNttEntry<T>{.k = 55, .c = 35, .p = 1261007895663738881, .g = 6, .root = 397650301651152680, .plog2 = 60.13},
0.025 sec
Test CompareNttMultWithReg
Time NTT 0.035 FFT 0.081 Reg 11.614 Boost_NTT 333.588x (FFT 142.644)
Swap 0.776 (Slow 0.000) ToMontg 0.079 Main 3.056 (0.399, 2.656) Invert 0.000 All 3.911
MidMul 0.110
Swap 0.510 (Slow 0.000) ToMontg 0.000 Main 2.535 (0.336, 2.198) Invert 0.094 All 3.139
AssignComplex 0.495
Swap 1.373 FromComplex 0.309 Main 4.875 (0.382, 4.493) Invert 0.000 ToComplex 0.224 All 6.781
MidMul 0.147
Swap 1.106 FromComplex 0.296 Main 4.209 (0.277, 3.931) Invert 0.166 ToComplex 0.199 All 5.975
Round 0.143
Time NTT 7.457 FFT 14.097 Boost_NTT 1.891x
Run Time: 33.719 sec

If you're looking to write a new better exponent you might have to write it in assembly. This is the code from golang.
https://code.google.com/p/go/source/browse/src/pkg/math/exp_amd64.s

Clean, efficient algorithm for wrapping integers in C++

/**
* Returns a number between kLowerBound and kUpperBound
* e.g.: Wrap(-1, 0, 4); // Returns 4
* e.g.: Wrap(5, 0, 4); // Returns 0
*/
int Wrap(int const kX, int const kLowerBound, int const kUpperBound)
{
// Suggest an implementation?
}

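Before C++11, the sign of a % b was only guaranteed when a and b were both non-negative (since C++11 the result takes the sign of a), so this version keeps both operands non-negative.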
int Wrap(int kX, int const kLowerBound, int const kUpperBound)
{
int range_size = kUpperBound - kLowerBound + 1;
if (kX < kLowerBound)
kX += range_size * ((kLowerBound - kX) / range_size + 1);
return kLowerBound + (kX - kLowerBound) % range_size;
}

The following should work independently of the implementation of the mod operator:
int range = kUpperBound - kLowerBound + 1;
kX = ((kX - kLowerBound) % range);
if (kX < 0)
return kUpperBound + 1 + kX;
else
return kLowerBound + kX;
An advantage over other solutions is that it uses only a single % (i.e. division), which makes it pretty efficient.
Note (Off Topic):
It's a good example of why it is sometimes wise to define intervals with the upper bound being the first element not in the range (as with STL iterators...). In this case, both "+1" would vanish.
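To illustrate the off-topic note, here is the same single-% idea written for a half-open interval [lower, upper). This is only a sketch: the name `wrap_half_open` is made up, it assumes lower < upper and that `x - lower` does not overflow, and it relies on the C++11 rule that % truncates toward zero.

```cpp
// Wrap x into the half-open interval [lower, upper_exclusive).
int wrap_half_open(int x, int lower, int upper_exclusive) {
    const int range = upper_exclusive - lower;  // no "+1" needed
    const int r = (x - lower) % range;          // lies in (-range, range)
    return (r < 0 ? r + range : r) + lower;     // shift negatives up once
}
```

For the inclusive example in the question, `Wrap(-1, 0, 4)` corresponds to `wrap_half_open(-1, 0, 5)`.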

Fastest solution, least flexible: Take advantage of native datatypes that will do wrapping in the hardware.
The absolute fastest method for wrapping integers is to make sure your data is scaled to int8/int16/int32 or whatever native datatype you use. Then, when you need your data to wrap, the native data type does it for you in hardware! Very painless, and orders of magnitude faster than any software wrapping implementation seen here. (Note that in C and C++ only unsigned overflow is guaranteed to wrap; signed overflow is undefined behavior.)
As an example case study:
I have found this very useful when I need a fast sin/cos implemented with a look-up table. Basically, you scale your data such that INT16_MAX is pi and INT16_MIN is -pi. Then you are set to go.
As a side note, scaling your data adds some up-front, finite computation cost that usually looks something like:
int fixedPoint = (int)( floatingPoint * SCALING_FACTOR + 0.5 );
Feel free to exchange int for something else you want like int8_t / int16_t / int32_t.
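A sketch of the angle case study follows. To stay within defined behavior it uses an unsigned 16-bit value covering the full circle [0, 2*pi), instead of the signed -pi..pi mapping described above. The function names and the constant are illustrative, not from the original answer.

```cpp
#include <cmath>
#include <cstdint>

constexpr double kPi = 3.141592653589793;

// Map an angle in radians onto a 16-bit "binary angle":
// the full circle [0, 2*pi) spans 0..65535.
std::uint16_t to_angle16(double radians) {
    double turns = radians / (2.0 * kPi);
    turns -= std::floor(turns);  // fractional part of the turn count
    const auto scaled = static_cast<std::uint32_t>(turns * 65536.0);
    return static_cast<std::uint16_t>(scaled & 0xFFFFu);  // mask covers turns == 1.0
}

// Adding two angles wraps automatically: the sum is computed as int and
// converting back to uint16_t reduces it modulo 2^16.
std::uint16_t add_angle16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}
```

A look-up table for sin/cos can then be indexed directly with the top bits of the 16-bit angle.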
Next fastest solution, more flexible: the mod operation is slow; if possible, use bit masks instead!
Most of the solutions I skimmed are functionally correct... but they depend on the mod operation.
The mod operation is very slow because it is essentially doing a hardware division. The layman's explanation of why mod and division are slow is to equate the division operation to some pseudo-code for(quotient = 0;inputNum> 0;inputNum -= divisor) { quotient++; } ( def of quotient and divisor ). As you can see, in this picture the division is fast if the input is small relative to the divisor... but it can be horribly slow if the input is much greater than the divisor.
If you can scale your data to a power of two then you can use a bit mask which will execute in one cycle ( on 99% of all platforms ) and your speed improvement will be approximately one order of magnitude ( at the very least 2 or 3 times faster ).
C code to implement wrapping:
#define BIT_MASK (0xFFFF)
int wrappedAddition(int a, int b) {
return ( a + b ) & BIT_MASK;
}
int wrappedSubtraction(int a, int b) {
return ( a - b ) & BIT_MASK;
}
Feel free to replace the #define with a run-time value, and to adjust the bit mask to whatever power of two you need. The mask is the power of two minus one, e.g. 0xFFFFFFFF for 2^32.
p.s. I strongly suggest reading about fixed point processing when messing with wrapping/overflow conditions. I suggest reading:
Fixed-Point Arithmetic: An Introduction by Randy Yates August 23, 2007

Please do not overlook this post. :)
Is this any good?
int Wrap(int N, int L, int H){
H=H-L+1; return (N-L+(N<L)*H)%H+L;
}
This works for negative inputs, and all arguments can be negative so long as L is less than H.
Background... (Note that H here is the reused variable, set to original H-L+1).
I had been using (N-L)%H+L when incrementing, but unlike in Lua, which I used before starting to learn C a few months back, this would NOT work if I used inputs below the lower bound, never mind negative inputs. (Lua is built in C, but I don't know what it's doing, and it likely wouldn't be fast...)
I decided to add +(N<L)*H to make (N-L+(N<L)*H)%H+L, as C seems to be defined such that true=1 and false=0. It works well enough for me, and seems to answer the original question neatly. If anyone knows how to do it without the MOD operator % to make it dazzlingly fast, please do it. I don't need speed right now, but some time I will, no doubt.
EDIT:
That function fails if N is lower than L by more than H-L+1 but this doesn't:
int Wrap(int N, int L, int H){
H-=L; return (N-L+(N<L)*((L-N)/H+1)*++H)%H+L;
}
I think it would break at the negative extreme of the integer range in any system, but should work for most practical situations. It adds an extra multiplication and a division, but is still fairly compact.
(This edit is just for completion, because I came up with a much better way, in a newer post in this thread.)
Crow.

Personally I've found solutions to these types of functions to be cleaner if range is exclusive and divisor is restricted to positive values.
int ifloordiv(int x, int y)
{
if (x > 0)
return x / y;
if (x < 0)
return (x + 1) / y - 1;
return 0;
}
int iwrap(int x, int y)
{ return x - y * ifloordiv(x, y);
}
Integrated.
int iwrap(int x, int y)
{
if (x > 0)
return x % y;
if (x < 0)
return (x + 1) % y + y - 1;
return 0;
}
Same family. Why not?
int ireflect(int x, int y)
{
int z = iwrap(x, y*2);
if (z < y)
return z;
return y*2-1 - z;
}
int ibandy(int x, int y)
{
if (y != 1)
return ireflect(abs(x + x / (y - 1)), y);
return 0;
}
Ranged functionality can be implemented for all functions with,
// output is in the range [min, max).
int func2(int x, int min, int max)
{
// increment max for inclusive behavior.
assert(min < max);
return func(x - min, max - min) + min;
}

Actually, since -1 % 4 returns -1 on every system I've ever been on, the simple mod solution doesn't work. I would try:
int range = kUpperBound - kLowerBound + 1;
kX = ((kX - kLowerBound) % range) + range;
return (kX % range) + kLowerBound;
If kX is positive, you mod, add range, and mod back, undoing the add. If kX is negative, you mod, add range, which makes it positive, then mod again, which doesn't do anything.

My other post got nasty, all that 'corrective' multiplication and division got out of hand. After looking at Martin Stettner's post, and at my own starting conditions of (N-L)%H+L, I came up with this:
int Wrap(int N, int L, int H){
H=H-L+1; N=(N-L)%H+L; if(N<L)N+=H; return N;
}
At the extreme negative end of the integer range it breaks as my other one would, but it will be faster, and is a lot easier to read, and avoids the other nastiness that crept in to it.
Crow.

I would suggest this solution:
int Wrap(int const kX, int const kLowerBound, int const kUpperBound)
{
int d = kUpperBound - kLowerBound + 1;
return kLowerBound + (kX >= 0 ? kX % d : -kX % d ? d - (-kX % d) : 0);
}
The if-then-else logic of the ?: operator makes sure that both operands of % are nonnegative.

I would give an entry point to the most common case, lowerBound=0 and upperBound=N-1, and call this function in the general case. No mod computation is done when i is already in range. It assumes upper>=lower, or n>0.
int wrapN(int i,int n)
{
if (i<0) return (n-1)-(-1-i)%n; // -1-i is >=0
if (i>=n) return i%n;
return i; // In range, no mod
}
int wrapLU(int i,int lower,int upper)
{
return lower+wrapN(i-lower,1+upper-lower);
}

An answer that has some symmetry and also makes it obvious that when kX is in range, it is returned unmodified.
int Wrap(int const kX, int const kLowerBound, int const kUpperBound)
{
int range_size = kUpperBound - kLowerBound + 1;
if (kX < kLowerBound)
return kX + range_size * ((kLowerBound - kX) / range_size + 1);
if (kX > kUpperBound)
return kX - range_size * ((kX - kUpperBound) / range_size + 1);
return kX;
}

I've faced this problem as well. This is my solution.
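#include <cmath> // std::fmod
// Primary template (floating-point types); it must be declared before the specialization.
template <class T> T mod(const T &x, const T &y) {
return std::fmod(x, y);
}
// Specialization for int uses the built-in % operator.
template <> int mod(const int &x, const int &y) {
return x % y;
}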
template <class T> T wrap(const T &x, const T &max, const T &min = 0) {
if(max < min)
return x;
if(x > max)
return min + mod(x - min, max - min + 1);
if(x < min)
return max - mod(min - x, max - min + 1);
return x;
}
I don't know if it's good, but I thought I'd share, since I got directed here when doing a Google search on this problem and found the above solutions lacking for my needs. =)

In the special case where the lower bound is zero, this code avoids division, modulus and multiplication. The upper bound does not have to be a power of two. This code is overly verbose and looks bloated, but compiles into 3 instructions: subtract, shift (by constant), and 'and'.
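#include <climits> // CHAR_BIT
#include <iostream> // std::cout
#include <type_traits> // std::is_signed, std::make_signed, std::make_unsigned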
// -------------------------------------------------------------- allBits
// sign extend a signed integer into an unsigned mask:
// return all zero bits (+0) if arg is positive,
// or all one bits (-0) for negative arg
template <typename SNum>
static inline auto allBits (SNum arg) {
static constexpr auto argBits = CHAR_BIT * sizeof( arg);
static_assert( argBits < 256, "allBits() sign extension may fail");
static_assert( std::is_signed< SNum>::value, "SNum must be signed");
typedef typename std::make_unsigned< SNum>::type UNum;
// signed shift required, but need unsigned result
const UNum mask = UNum( arg >> (argBits - 1));
return mask;
}
// -------------------------------------------------------------- boolWrap
// wrap reset a counter without conditionals:
// return arg >= limit? 0 : arg
template <typename UNum>
static inline auto boolWrap (const UNum arg, const UNum limit) {
static_assert( ! std::is_signed< UNum>::value, "UNum assumed unsigned");
typedef typename std::make_signed< UNum>::type SNum;
const SNum negX = SNum( arg) - SNum( limit);
const auto signX = allBits( negX); // +0 or -0
return arg & signX;
}
// example usage (both arguments must be the same unsigned type):
for (unsigned j = 0; j < 15; ++j) {
std::cout << j << ' ' << boolWrap( j, 11u) << '\n';
}

For negative kX, you can add:
int temp = kUpperBound - kLowerBound + 1;
while (kX < 0) kX += temp;
return kX%temp + kLowerBound;

Why not use extension methods (C#)?
public static class IntExtensions
{
public static int Wrap(this int kX, int kLowerBound, int kUpperBound)
{
int range_size = kUpperBound - kLowerBound + 1;
if (kX < kLowerBound)
kX += range_size * ((kLowerBound - kX) / range_size + 1);
return kLowerBound + (kX - kLowerBound) % range_size;
}
}
Usage: currentInt = (++currentInt).Wrap(0, 2);
