Optimization Headache - removing if's from Look Up Table - c++

I'm trying to optimize the following piece of code, which is a bottleneck in my application.
What it does: It takes the double values value1 and value2 and tries to find the maximum including a correctional factor. If the difference between both values is greater than 5.0 (the LUT is scaled by factor 10), I can just take the max value of those two. If the difference is smaller than 5.0, I can use the correctional factor from the LUT.
Does anyone have an idea what could be a better style for this piece of code? I don't know where I'm losing time - is it the large number of ifs or the multiplication by 10?
double value1, value2;
// Lookup Table scaled by 10 for (ln(1+exp(-abs(x)))), which is almost 0 for x > 5 and symmetrical around 0. LUT[0] is x=0.0, LUT[40] is x=4.0.
const logValue LUT[50] = { ... };
if (value1 > value2)
{
if (value1 - value2 >= 5.0)
{
return value1;
}
else
{
return value1 + LUT[(uint8)((value1 - value2) * 10)];
}
}
else
{
if (value2 - value1 >= 5.0)
{
return value2;
}
else
{
return value2 + LUT[(uint8)((value2 - value1) * 10)];
}
}

A couple of minutes of playing with Excel produces an approximation equation that looks like it might have the accuracy you need, so you can do away with the lookup table altogether. You still need one condition to make sure the parameters to the equation remain within the range that it was optimized for.
double diff = std::fabs(value1 - value2); // std::fabs (from <cmath>) avoids accidentally picking the integer abs
double dmax = (value1 + value2 + diff) * 0.5; // same as (min+max+(max-min))/2
if (diff > 5.0)
return dmax;
return dmax + 4.473865638/(2.611112371+diff) + 0.088190879*diff + -1.015046114;
P.S. I'm not guaranteeing this is faster, only that it's a different enough approach to be worth benchmarking.
P.P.S. It's possible to change the constraints to come up with slightly different constants; there are a lot of variations. Here's another set I did where the difference between your table and the formula will always be less than 0.008, and each value will also be less than the one preceding it.
return dmax + 3.441318133/(2.296924445+diff) + 0.065529678*diff + -0.797081529;
Edit: I tested this code (second formula) with 100 passes against a million random numbers between 0 and 10, along with the original code from the question, MSalters' currently accepted answer, and a brute force implementation max(value1,value2)+log(1.0+exp(-abs(value1-value2))). I tried it on a dual-core AMD Athlon and an Intel quad-core i7 and the results were roughly consistent. Here's a typical run:
Original: 1.32 seconds.
MSalters: 1.13 seconds.
Mine: 0.67 seconds.
Brute force: 4.50 seconds.
Processors have gotten unbelievably fast over the years, so fast now that they can do a couple of floating point multiplies and divides faster than they can look up a value in memory. Not only is this approach faster on a modern x86, it's also more accurate; the approximation errors in the equation are much less than the step errors caused by truncating the input for the lookup.
Naturally results can still vary based on your processor and compiler; benchmarking is still required for your own particular target.
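To get a feel for the accuracy claim before benchmarking, the second set of constants can be compared against the brute-force expression. This is only a sketch: the function names and the max_abs_error helper are made up for illustration, and the grid spacing is an arbitrary choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Rational approximation using the second set of constants quoted above.
double approx_log_sum_exp(double value1, double value2) {
    double diff = std::fabs(value1 - value2);
    double dmax = (value1 + value2 + diff) * 0.5;  // same as max(value1, value2)
    if (diff > 5.0)
        return dmax;
    return dmax + 3.441318133 / (2.296924445 + diff) + 0.065529678 * diff - 0.797081529;
}

// Brute-force reference: max + ln(1 + exp(-|difference|)).
double exact_log_sum_exp(double value1, double value2) {
    return std::max(value1, value2) + std::log1p(std::exp(-std::fabs(value1 - value2)));
}

// Hypothetical helper: largest absolute deviation for differences in [0, max_diff].
double max_abs_error(double max_diff, double step) {
    assert(step > 0.0);
    double worst = 0.0;
    for (double d = 0.0; d <= max_diff; d += step) {
        double err = std::fabs(approx_log_sum_exp(d, 0.0) - exact_log_sum_exp(d, 0.0));
        worst = std::max(worst, err);
    }
    return worst;
}
```

Calling max_abs_error(10.0, 0.01) gives a quick upper estimate of how far the formula strays from the exact value on that grid; a finer step or a different range may be more appropriate for your data.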

It probably goes down both paths equally enough that you're causing a lot of pipe-lining problems for your processor.
Have you tried profiling?
I'd also suggest trying the standard library to see if that helps (for example, if it's able to use any processor-specific instructions):
double diff = std::fabs(value1 - value2);
double maxv = std::max(value1, value2);
return (diff >= 5.0) ? maxv : maxv + LUT[(uint8)((diff) * 10)];

I'd probably have written the code a bit differently to handle the value2<value1 case:
if (value2 < value1) std::swap(value1, value2);
assert(value1 <= value2); // Assertion corrected
int diff = int((value2 - value1) * 10.0);
if (diff >= 50) diff = 49; // Integer comparison instead of floating point
return value2 + LUT[diff];

I am going to assume that when the function is called, you'll most likely hit the part where you have to use the lookup table, rather than the >=5.0 part. In this case, it's better to guide the compiler towards this.
double maxval = value1;
double difference_scaled = (value1-value2)*10;
if (difference_scaled < 0)
{
difference_scaled = -difference_scaled;
maxval = value2;
}
if (difference_scaled < 50)
return maxval+LUT[(int)difference_scaled];
else
return maxval;
Try this and let me know if that improves your program's performance.

The only reason this code would be the bottleneck in your application is because you are calling it many many times. Are you sure you need to? Perhaps the algorithm higher up in the code can be changed to use the comparison less?

I've done some very quick tests but please profile the code yourself to verify the effect.
Changing the LUT[] to a static variable got me a roughly 6x speed-up (from 3.5s to 0.6s), which is close to the absolute minimum for the test I used (0.4s). See if that works and re-profile to determine if any further optimization is needed.
For reference, I was simply timing the execution of this loop (100 million iterations of the inner loop) in VC++ 2010:
int Counter = 0;
for (double j = 0; j < 10; j += 0.001)
{
for (double i = 0; i < 10; i += 0.001)
{
++Counter;
Value1 += TestFunc1(i, j);
}
}
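As a rough illustration of the "static table" idea, the sketch below builds the table once in a function-local static instead of on every call. The table contents are elided in the question, so the values here are computed from the formula in the question's comment, ln(1+exp(-x)) at x = index/10; that choice and the names make_lut and max_with_correction are assumptions, not the original code.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Builds ln(1 + exp(-x)) for x = 0.0, 0.1, ..., 4.9 (index = x * 10).
std::array<double, 50> make_lut() {
    std::array<double, 50> lut{};
    for (std::size_t i = 0; i < lut.size(); ++i)
        lut[i] = std::log1p(std::exp(-static_cast<double>(i) / 10.0));
    return lut;
}

double max_with_correction(double value1, double value2) {
    // Function-local static: initialised on the first call only, then reused.
    static const std::array<double, 50> LUT = make_lut();
    double diff = std::fabs(value1 - value2);
    double maxv = value1 > value2 ? value1 : value2;
    if (diff >= 5.0)
        return maxv;
    int idx = static_cast<int>(diff * 10.0);  // truncates toward zero
    assert(idx >= 0 && idx < 50);
    return maxv + LUT[static_cast<std::size_t>(idx)];
}
```

Whether this is the same change that produced the speed-up reported above depends on how the table was declared in the timed function, so it is worth re-profiling in your own build.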

You are computing value1 - value2 quite a few times in your function. Just do it once.
That cast to a uint8_t may also be problematic. As far as performance is concerned, the best integral type to use when converting from double to integer is an int, as is the best integral type to use as an array index.
max_value = value1;
diff = value1 - value2;
if (diff < 0.0) {
max_value = value2;
diff = -diff;
}
if (diff >= 5.0) {
return max_value;
}
else {
return max_value + LUT[(int)(diff * 10.0)];
}
Note that the above guarantees that the LUT index will be between 0 (inclusive) and 50 (exclusive). There's no need for a uint8_t here.
Edit
After some playing around with some variations, this is a fairly fast LUT-based approximation to log(exp(value1)+exp(value2)):
#include <stdint.h>
// intptr_t *happens* to be fastest on my machine. YMMV.
typedef intptr_t IndexType;
double log_sum_exp (double value1, double value2, double *LUT) {
double diff = value1 - value2;
if (diff < 0.0) {
value1 = value2;
diff = -diff;
}
IndexType idx = diff * 10.0;
if (idx < 50) {
value1 += LUT[idx];
}
return value1;
}
The integral type IndexType is one of the keys to speeding things up. I tested with clang and g++, and both indicated that casting to an intptr_t (long on my computer) and using an intptr_t as an index into the LUT is faster than other integral types. It is considerably faster than some types. For example, unsigned long long and uint8_t are incredibly poor choices on my computer.
The type is not just a hint, at least with the compilers I used. Those compilers did exactly what the code told them to do with regard to the conversion from floating point type to integral type, regardless of the optimization level.
Another speed bump results from comparing the integral type to 50 as opposed to comparing the floating point type to 5.0.
One last speed bump: Not all compilers are created equal. On my computer (YMMV), g++ -O3 generates considerably slower code (25% slower with this problem!) than does clang -O3, which in turn generates code that is a bit slower than that generated by clang -O4.
I also played with a rational function approximation approach (similar to Mark Ransom's answer), but the above obviously does not use such an approach.


Vectorized function to count numbers in an array when a number is a specified power

I am attempting to vectorize this fairly expensive function (scalar version now working!):
template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
return (x - std::trunc(x)) < 0.000001;
}//End of isPower
Here's what I have so far (for 32-bit int only):
template<typename RETURN_T>
inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
{
RETURN_T cnt = 0;
const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(power));
const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);
__m256i _count = _mm256_setzero_si256();
__m256i _N_INT = _mm256_setzero_si256();
__m256 _N_DBL = _mm256_setzero_ps();
__m256 LOG_OF_N = _mm256_setzero_ps();
__m256 DIVIDE_LOG = _mm256_setzero_ps();
__m256 TRUNCATED = _mm256_setzero_ps();
__m256 CMP_MASK = _mm256_setzero_ps();
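for (size_t i = 0uz; (i + 8uz) <= arr.size(); i += 8uz) // full blocks of 8 only; a tail of fewer than 8 elements is not counted
{
//Set Values
_N_INT = _mm256_loadu_si256((const __m256i*) &arr[i]); // unaligned load: std::vector storage is not guaranteed to be 32-byte aligned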
_N_DBL = _mm256_cvtepi32_ps(_N_INT);
LOG_OF_N = _mm256_log_ps(_N_DBL);
DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);
TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));
CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);
_count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
}//End for
cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));
return cnt;
}//End of count_powers_of
The scalar version runs in about 14.1 seconds.
The scalar version called from std::count_if with par_unseq runs in 4.5 seconds.
The vectorized version runs in just 155 milliseconds but produces the wrong result, albeit a vastly closer one now.
Testing:
int64_t count = 0;
for (size_t i = 0; i < vec.size(); ++i)
{
if (isPower(vec[i], 4))
{
++count;
}//End if
}//End for
std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000
std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array
This new, vastly simplified code often produces results that are slightly off the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to __m256i, but when I try to use a conversion (with floor) instead I get a number that's way off (in the billions again).
It could also be this sum function (based off of code by @PeterCordes):
inline uint32_t _mm_sum_epi32(__m128i& x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x);
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32);
}
inline uint32_t _mm256_sum_epi32(__m256i& v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1));
return _mm_sum_epi32(sum128);
}
I know this has got to be a floating-point precision/comparison issue; Is there a better way to approach this?
Thanks for all your insights and suggestions thus far.
A more sensible unit test would be non-random: check all powers in a loop to make sure they're all true, like x *= base;, and count how many powers there are <= n. Then check all numbers from 0..n in a loop, once each, to verify the right total. If both those checks succeed, that means it returned false in all the cases it should have; otherwise the count would be wrong.
Re: the original version:
This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.
Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.
I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.
I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
You might want to special-case integral N, power-of-2 pow. That can go vastly faster by checking that n has a single bit set ((n & (n-1)) == 0) and that it's at a valid position (e.g. for pow=4, (n & 0b...01010101) != 0). You can construct the constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t, e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, can have powers up to 2^31 and still fit in an unsigned int.) log_3(2^31) is 19, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5 so you'd only need 2 vectors to hold every non-overflowing power.
That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!
Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
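A scalar sketch of the two exact approaches described above (compare against every power that fits, and the single-bit trick for base 4) might look like the following. The function names are invented for illustration, and this deliberately leaves out the AVX2 broadcast/compare part; it is meant as a correctness reference to test a vectorized version against, in the non-random style suggested earlier.

```cpp
#include <cassert>
#include <cstdint>

// Exact test: is n == base^k for some k >= 0? Walks the powers that fit.
bool is_power_of(std::int32_t n, std::int32_t base) {
    assert(base >= 2);
    if (n < 1) return false;
    std::int64_t p = 1;              // 64-bit so p * base cannot overflow
    while (p < n) p *= base;
    return p == n;
}

// Bit trick for base 4: exactly one bit set, and it sits at an even position.
bool is_power_of_4(std::uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0 && (n & 0x55555555u) != 0;
}

// Non-random check: count the exact powers of base in [0, limit].
int count_powers(std::int32_t limit, std::int32_t base) {
    assert(limit >= 0 && limit < INT32_MAX);
    int count = 0;
    for (std::int32_t n = 0; n <= limit; ++n)
        if (is_power_of(n, base)) ++count;
    return count;
}

// Hypothetical cross-check of the bit trick against the loop version.
bool bit_trick_agrees(std::int32_t limit) {
    for (std::int32_t n = 0; n <= limit; ++n)
        if (is_power_of(n, 4) != is_power_of_4(static_cast<std::uint32_t>(n)))
            return false;
    return true;
}
```

For inputs drawn from 0 to 1000, count_powers(1000, 4) counts 1, 4, 16, 64 and 256, which gives a known-good total to compare any faster implementation against.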
Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:
_mm256_rem_epi32 is a slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative n. Or if you need to handle negative remainders, you can use the tricks compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division.
Updated version using (x - std::trunc(x)) < 0.000001;
This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.
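The point about float32 not being able to represent every int32_t is easy to check directly. The sketch below (helper names are hypothetical) converts an integer to float and back, and searches for the first value that does not survive the trip.

```cpp
#include <cassert>
#include <cstdint>

// True if converting n to float and back yields n again.
bool survives_float_round_trip(std::int32_t n) {
    float f = static_cast<float>(n);
    // Convert back through a 64-bit integer so a value rounded up to 2^31 stays in range.
    return static_cast<std::int64_t>(f) == static_cast<std::int64_t>(n);
}

// Hypothetical helper: first integer >= start that float cannot hold exactly.
std::int32_t first_unrepresentable_from(std::int32_t start) {
    assert(start >= 0);
    std::int32_t n = start;
    while (survives_float_round_trip(n)) ++n;
    return n;
}
```

With a 24-bit significand, the search starting from any value up to 2^24 stops at 2^24 + 1 = 16777217, so inputs above that range are where a float-based vectorization starts silently merging neighbouring integers.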
I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.
__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's not part of a standard library header so it shouldn't be using a leading underscore.)
Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.

Dividing two integers and rounding up the result, without using floating point

I need to divide two numbers and round the result up. Is there a better way to do this?
int myValue = (int) ceil( (float)myIntNumber / myOtherInt );
I find it overkill to have to cast twice. (The outer int cast is just to silence the warning.)
Note I have to cast internally to float otherwise
int a = ceil(256/11); //> Should be 24, but it is 23
^example
Assuming that both myIntNumber and myOtherInt are positive, you could do:
int myValue = (myIntNumber + myOtherInt - 1) / myOtherInt;
With help from DyP, came up with the following branchless formula:
int idiv_ceil ( int numerator, int denominator )
{
return numerator / denominator
+ (((numerator < 0) ^ (denominator > 0)) && (numerator%denominator));
}
It avoids floating-point conversions and passes a basic suite of unit tests, as shown here:
http://ideone.com/3OrviU
Here's another version that avoids the modulo operator.
int idiv_ceil ( int numerator, int denominator )
{
int truncated = numerator / denominator;
return truncated + (((numerator < 0) ^ (denominator > 0)) &&
(numerator - truncated*denominator));
}
http://ideone.com/Z41G5q
The first one will be faster on processors where IDIV returns both quotient and remainder (and the compiler is smart enough to use that).
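For anyone who wants to convince themselves that the branchless formula handles all sign combinations, here is a small cross-check against std::ceil on doubles. The matches_float_ceil helper is made up for this illustration, and the range is kept small so the double division is exact enough to serve as a reference.

```cpp
#include <cassert>
#include <cmath>

// Branchless ceiling division, as given in the answer above.
int idiv_ceil(int numerator, int denominator) {
    return numerator / denominator
         + (((numerator < 0) ^ (denominator > 0)) && (numerator % denominator));
}

// Hypothetical checker: compares idiv_ceil with std::ceil for all pairs in [lo, hi].
bool matches_float_ceil(int lo, int hi) {
    assert(lo <= hi);
    for (int n = lo; n <= hi; ++n) {
        for (int d = lo; d <= hi; ++d) {
            if (d == 0) continue;  // division by zero is undefined either way
            int expected = static_cast<int>(std::ceil(static_cast<double>(n) / d));
            if (idiv_ceil(n, d) != expected) return false;
        }
    }
    return true;
}
```

The correction term is 1 only when the operands have the same sign (so the true quotient is positive) and the division left a remainder; in every other case truncation toward zero already rounds up.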
Maybe it is just easier to do a:
int result = dividend / divisor;
if(dividend % divisor != 0)
result++;
Benchmarks
Since a lot of different methods are shown in the answers and none of the answers actually prove any advantages in terms of performance I tried to benchmark them myself. My plan was to write an answer that contains a short table and a definite answer which method is the fastest.
Unfortunately it wasn't that simple. (It never is.) It seems that the performance of the rounding formulas depends on the data type, compiler and optimization level used.
In one case there is an increase of speed by 7.5× from one method to another. So the impact can be significant for some people.
TL;DR
For long integers the naive version using a type cast to float and std::ceil was actually the fastest. This was interesting for me personally since I intended to use it with size_t which is usually defined as unsigned long.
For ordinary ints it depends on your optimization level. For lower levels @Jwodder's solution performs best. For the highest levels std::ceil was the optimal one. With one exception: For the clang/unsigned int combination @Jwodder's was still better.
The solutions from the accepted answer never really outperformed the other two. You should keep in mind however that @Jwodder's solution doesn't work with negatives.
Results are at the bottom.
Implementations
To recap here are the four methods I benchmarked and compared:
@Jwodder's version (Jwodder)
template<typename T>
inline T divCeilJwodder(const T& numerator, const T& denominator)
{
return (numerator + denominator - 1) / denominator;
}
@Ben Voigt's version using modulo (VoigtModulo)
template<typename T>
inline T divCeilVoigtModulo(const T& numerator, const T& denominator)
{
return numerator / denominator + (((numerator < 0) ^ (denominator > 0))
&& (numerator%denominator));
}
@Ben Voigt's version without using modulo (VoigtNoModulo)
template<typename T>
inline T divCeilVoigtNoModulo(const T& numerator, const T& denominator)
{
T truncated = numerator / denominator;
return truncated + (((numerator < 0) ^ (denominator > 0))
&& (numerator - truncated*denominator));
}
OP's implementation (TypeCast)
template<typename T>
inline T divCeilTypeCast(const T& numerator, const T& denominator)
{
return (int)std::ceil((double)numerator / denominator);
}
Methodology
In a single batch the division is performed 100 million times. Ten batches are calculated for each combination of Compiler/Optimization level, used data type and used implementation. The values shown below are the averages of all 10 batches in milliseconds. The errors that are given are standard deviations.
The whole source code that was used can be found here. Also you might find this script useful which compiles and executes the source with different compiler flags.
The whole benchmark was performed on an i7-7700K. The compiler versions used were GCC 10.2.0 and clang 11.0.1.
Results
Now without further ado here are the results:
| DataType | Algorithm | GCC -O0 | GCC -O1 | GCC -O2 | GCC -O3 | GCC -Os | GCC -Ofast | GCC -Og | clang -O0 | clang -O1 | clang -O2 | clang -O3 | clang -Ofast | clang -Os | clang -Oz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unsigned | Jwodder | 264.1 ± 0.9 🏆 | 175.2 ± 0.9 🏆 | 153.5 ± 0.7 🏆 | 175.2 ± 0.5 🏆 | 153.3 ± 0.5 | 153.4 ± 0.8 | 175.5 ± 0.6 🏆 | 329.4 ± 1.3 🏆 | 220.0 ± 1.3 🏆 | 146.2 ± 0.6 🏆 | 146.2 ± 0.6 🏆 | 146.0 ± 0.5 🏆 | 153.2 ± 0.3 🏆 | 153.5 ± 0.6 🏆 |
| unsigned | VoigtModulo | 528.5 ± 2.5 | 306.5 ± 1.0 | 175.8 ± 0.7 | 175.2 ± 0.5 🏆 | 175.6 ± 0.7 | 175.4 ± 0.6 | 352.0 ± 1.0 | 588.9 ± 6.4 | 408.7 ± 1.5 | 164.8 ± 1.0 | 164.0 ± 0.4 | 164.1 ± 0.4 | 175.2 ± 0.5 | 175.8 ± 0.9 |
| unsigned | VoigtNoModulo | 375.3 ± 1.5 | 175.7 ± 1.3 🏆 | 192.5 ± 1.4 | 197.6 ± 1.9 | 200.6 ± 7.2 | 176.1 ± 1.5 | 197.9 ± 0.5 | 541.0 ± 1.8 | 263.1 ± 0.9 | 186.4 ± 0.6 | 186.4 ± 1.2 | 187.2 ± 1.1 | 197.2 ± 0.8 | 197.1 ± 0.7 |
| unsigned | TypeCast | 348.5 ± 2.7 | 231.9 ± 3.9 | 234.4 ± 1.3 | 226.6 ± 1.0 | 137.5 ± 0.8 🏆 | 138.7 ± 1.7 🏆 | 243.8 ± 1.4 | 591.2 ± 2.4 | 591.3 ± 2.6 | 155.8 ± 1.9 | 155.9 ± 1.6 | 155.9 ± 2.4 | 214.6 ± 1.9 | 213.6 ± 1.1 |
| unsigned long | Jwodder | 658.6 ± 2.5 | 546.3 ± 0.9 | 549.3 ± 1.8 | 549.1 ± 2.8 | 540.6 ± 3.4 | 548.8 ± 1.3 | 486.1 ± 1.1 | 638.1 ± 1.8 | 613.3 ± 2.1 | 190.0 ± 0.8 🏆 | 182.7 ± 0.5 | 182.4 ± 0.5 | 496.2 ± 1.3 | 554.1 ± 1.0 |
| unsigned long | VoigtModulo | 1,169.0 ± 2.9 | 1,015.9 ± 4.4 | 550.8 ± 2.0 | 504.0 ± 1.4 | 550.3 ± 1.2 | 550.5 ± 1.3 | 1,020.1 ± 2.9 | 1,259.0 ± 9.0 | 1,136.5 ± 4.2 | 187.0 ± 3.4 🏆 | 199.7 ± 6.1 | 197.6 ± 1.0 | 549.4 ± 1.7 | 506.8 ± 4.4 |
| unsigned long | VoigtNoModulo | 768.1 ± 1.7 | 559.1 ± 1.8 | 534.4 ± 1.6 | 533.7 ± 1.5 | 559.5 ± 1.7 | 534.3 ± 1.5 | 571.5 ± 1.3 | 879.5 ± 10.8 | 617.8 ± 2.1 | 223.4 ± 1.3 | 231.3 ± 1.3 | 231.4 ± 1.1 | 594.6 ± 1.9 | 572.2 ± 0.8 |
| unsigned long | TypeCast | 353.3 ± 2.5 🏆 | 267.5 ± 1.7 🏆 | 248.0 ± 1.6 🏆 | 243.8 ± 1.2 🏆 | 154.2 ± 0.8 🏆 | 154.1 ± 1.0 🏆 | 263.8 ± 1.8 🏆 | 365.5 ± 1.6 🏆 | 316.9 ± 1.8 🏆 | 189.7 ± 2.1 🏆 | 156.3 ± 1.8 🏆 | 157.0 ± 2.2 🏆 | 155.1 ± 0.9 🏆 | 176.5 ± 1.2 🏆 |
| int | Jwodder | 307.9 ± 1.3 🏆 | 175.4 ± 0.9 🏆 | 175.3 ± 0.5 🏆 | 175.4 ± 0.6 🏆 | 175.2 ± 0.5 | 175.1 ± 0.6 | 175.1 ± 0.5 🏆 | 307.4 ± 1.2 🏆 | 219.6 ± 0.6 🏆 | 146.0 ± 0.3 🏆 | 153.5 ± 0.5 | 153.6 ± 0.8 | 175.4 ± 0.7 🏆 | 175.2 ± 0.5 🏆 |
| int | VoigtModulo | 528.5 ± 1.9 | 351.9 ± 4.6 | 175.3 ± 0.6 🏆 | 175.2 ± 0.4 🏆 | 197.1 ± 0.6 | 175.2 ± 0.8 | 373.5 ± 1.1 | 598.7 ± 5.1 | 460.6 ± 1.3 | 175.4 ± 0.4 | 164.3 ± 0.9 | 164.0 ± 0.4 | 176.3 ± 1.6 🏆 | 460.0 ± 0.8 |
| int | VoigtNoModulo | 398.0 ± 2.5 | 241.0 ± 0.7 | 199.4 ± 5.1 | 219.2 ± 1.0 | 175.9 ± 1.2 | 197.7 ± 1.2 | 242.9 ± 3.0 | 543.5 ± 2.3 | 350.6 ± 1.0 | 186.6 ± 1.2 | 185.7 ± 0.3 | 186.3 ± 1.1 | 197.1 ± 0.6 | 373.3 ± 1.6 |
| int | TypeCast | 338.8 ± 4.9 | 228.1 ± 0.9 | 230.3 ± 0.8 | 229.5 ± 9.4 | 153.8 ± 0.4 🏆 | 138.3 ± 1.0 🏆 | 241.1 ± 1.1 | 590.0 ± 2.1 | 589.9 ± 0.8 | 155.2 ± 2.4 | 149.4 ± 1.6 🏆 | 148.4 ± 1.2 🏆 | 214.6 ± 2.2 | 211.7 ± 1.6 |
| long | Jwodder | 758.1 ± 1.8 | 600.6 ± 0.9 | 601.5 ± 2.2 | 601.5 ± 2.8 | 581.2 ± 1.9 | 600.6 ± 1.8 | 586.3 ± 3.6 | 745.9 ± 3.6 | 685.8 ± 2.2 | 183.1 ± 1.0 | 182.5 ± 0.5 | 182.6 ± 0.6 | 553.2 ± 1.5 | 488.0 ± 0.8 |
| long | VoigtModulo | 1,360.8 ± 6.1 | 1,202.0 ± 2.1 | 600.0 ± 2.4 | 600.0 ± 3.0 | 607.0 ± 6.8 | 599.0 ± 1.6 | 1,187.2 ± 2.6 | 1,439.6 ± 6.7 | 1,346.5 ± 2.9 | 197.9 ± 0.7 | 208.2 ± 0.6 | 208.0 ± 0.4 | 548.9 ± 1.4 | 1,326.4 ± 3.0 |
| long | VoigtNoModulo | 844.5 ± 6.9 | 647.3 ± 1.3 | 628.9 ± 1.8 | 627.9 ± 1.6 | 629.1 ± 2.4 | 629.6 ± 4.4 | 668.2 ± 2.7 | 1,019.5 ± 3.2 | 715.1 ± 8.2 | 224.3 ± 4.8 | 219.0 ± 1.0 | 219.0 ± 0.6 | 561.7 ± 2.5 | 769.4 ± 9.3 |
| long | TypeCast | 366.1 ± 0.8 🏆 | 246.2 ± 1.1 🏆 | 245.3 ± 1.8 🏆 | 244.6 ± 1.1 🏆 | 154.6 ± 1.6 🏆 | 154.3 ± 0.5 🏆 | 257.4 ± 1.5 🏆 | 591.8 ± 4.1 🏆 | 590.4 ± 1.3 🏆 | 154.5 ± 1.3 🏆 | 135.4 ± 8.3 🏆 | 132.9 ± 0.7 🏆 | 132.8 ± 1.2 🏆 | 177.4 ± 2.3 🏆 |
Now I can finally get on with my life :P
Integer division with round-up.
Only 1 division executed per call, no % or * or conversion to/from floating point, works for positive and negative int. See note (1).
n (numerator) = OPs myIntNumber;
d (denominator) = OPs myOtherInt;
The following approach is simple. int division rounds toward 0. For negative quotients, this is a round up so nothing special is needed. For positive quotients, add d-1 to effect a round up, then perform an unsigned division.
Note (1): The usual divide by 0 blows things up, and MININT/-1 fails as expected on two's complement machines.
int IntDivRoundUp(int n, int d) {
// If n and d are the same sign ...
if ((n < 0) == (d < 0)) {
// If n (and d) are negative ...
if (n < 0) {
n = -n;
d = -d;
}
// Unsigned division rounds down. Adding d-1 to n effects a round up.
return (((unsigned) n) + ((unsigned) d) - 1)/((unsigned) d);
}
else {
return n/d;
}
}
[Edit: test code removed, see earlier rev as needed]
Just use (for a positive dividend and divisor)
int ceil_of_division = ((dividend-1)/divisor)+1;
For example:
for (int i=1;i<20;i++)
std::cout << i << "/8 = " << ((i-1)/8)+1 << std::endl;
A small hack is to do:
int divideUp(int a, int b) {
return (a-1)/b + 1;
}
// Proof (for a > 0, b > 0):
a = b*N + k with 0 <= k < b (always)
if k == 0, then
(a-1) == b*N - 1
(a-1)/b == N - 1
(a-1)/b + 1 == N ---> Good !
if k > 0, then
(a-1) == b*N + (k-1)
(a-1)/b == N
(a-1)/b + 1 == N+1 ---> Good !
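The proof above can be spot-checked by comparing the subtract-one form with the add-(b-1) form over a range of positive inputs. This is a minimal sketch; div_up_add, div_up_sub and formulas_agree are names invented here. It also shows the one place where the two forms part ways: a == 0, where integer division truncating toward zero makes (a-1)/b evaluate to 0 rather than -1.

```cpp
#include <cassert>

// Ceiling division for a >= 0, b > 0 by adding b-1 before dividing.
int div_up_add(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Ceiling division by subtracting 1 first; intended for a > 0, b > 0.
int div_up_sub(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a - 1) / b + 1;
}

// Hypothetical checker: do both forms agree for 1 <= a <= max_a, 1 <= b <= max_b?
bool formulas_agree(int max_a, int max_b) {
    for (int a = 1; a <= max_a; ++a)
        for (int b = 1; b <= max_b; ++b)
            if (div_up_add(a, b) != div_up_sub(a, b)) return false;
    return true;
}
```

If zero is a possible dividend, the add form is the safer of the two, though it can overflow when a + b - 1 exceeds the range of int.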
Instead of using the ceil function before casting to int, you can add a constant which is very nearly (but not quite) equal to 1 - this way, nearly anything (except a value which is exactly or incredibly close to an actual integer) will be increased by one before it is truncated.
Example:
#define EPSILON (0.9999)
int myValue = (int)(((float)myIntNumber)/myOtherInt + EPSILON);
EDIT: after seeing your response to the other post, I want to clarify that this will round up, not away from zero - negative numbers will become less negative, and positive numbers will become more positive.

Fast fixed point pow, log, exp and sqrt

I've got a fixed point class (10.22) and I have a need of a pow, a sqrt, an exp and a log function.
Alas I have no idea where to even start on this. Can anyone provide me with some links to useful articles or, better yet, provide me with some code?
I'm assuming that once I have an exp function then it becomes relatively easy to implement pow and sqrt, as they just become:
pow( x, y ) => exp( y * log( x ) )
sqrt( x ) => pow( x, 0.5 )
It's just those exp and log functions that I'm finding difficult (though I remember a few of my log rules, I can't remember much else about them).
Presumably, there would also be a faster method for sqrt and pow, so any pointers on that front would be appreciated, even if it's just to say use the methods I outline above.
Please note: This HAS to be cross platform and in pure C/C++ code so I cannot use any assembler optimisations.
A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a Runge-Kutta approximation would give reasonable results with ~16 entries IIRC.
Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.
You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).
[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.
[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits by table lookup. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.
[edit 3]
That's still not too smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (integer) by the fixed-point constant log2. Can be as low as 3 instructions.
Using e for the reduction step just gives you a "nice" log(e)=1.0 constant but that's false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32 bits constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for a division.
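A rough sketch of the base-2 range reduction for a Q10.22 value follows. Two things here are stand-ins rather than what the answer prescribes: the shift count is found with plain loops instead of a count-leading-zeros instruction (to stay in portable C++), and the log of the reduced value in [1, 2) is computed with std::log where a real fixed-point implementation would use its lookup table. The names kFracBits, kOne, ln2_q22 and log_q22 are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 22;                    // Q10.22: 22 fractional bits
constexpr std::uint32_t kOne = 1u << kFracBits;  // 1.0 in Q10.22

// log(2) as a Q10.22 constant.
std::int32_t ln2_q22() {
    return static_cast<std::int32_t>(std::lround(0.6931471805599 * kOne));
}

// Natural log of a positive Q10.22 value, result also in Q10.22.
std::int32_t log_q22(std::uint32_t x) {
    assert(x > 0);
    int shift = 0;
    while (x < kOne)      { x <<= 1; --shift; }  // scale up into [1, 2)
    while (x >= 2 * kOne) { x >>= 1; ++shift; }  // scale down into [1, 2)
    // Stand-in for the table lookup on the reduced value in [1, 2).
    double reduced = static_cast<double>(x) / kOne;
    std::int32_t log_reduced =
        static_cast<std::int32_t>(std::lround(std::log(reduced) * kOne));
    // log(a) = shift * log(2) + log(a / 2^shift)
    return shift * ln2_q22() + log_reduced;
}
```

For example, 10.0 is stored as 10 * kOne, reduces to 1.25 with a shift of 3, and the result is 3 * log(2) + log(1.25) in Q10.22.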
[edit 5]
And since you're storing it in Q10.22, you can better store log(65536)=11.09035488. (16 x log(2)). The "x16" means that we've got 4 more bits of precision available.
You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)
[edit 4]
Still had an open end. pow(x, frac(y)) is just pow(sqrt(x), 2 * frac(y)) and we have a decent 1/sqrt(x). That gives us a far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then that means x^0.101 is (x^(1/2) * x^(1/8)). But x^(1/2) is just sqrt(x) and x^(1/8) is sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x), so we calculate 1.0/(NR(x)*NR(NR(NR(x)))). We only invert the end result and don't use the sqrt function directly.
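The bit-by-bit idea for the fractional exponent can be sketched in ordinary doubles. This version uses std::sqrt directly rather than the Newton-Raphson 1/sqrt variant described above, and the function name pow_frac and the default of 22 bits (matching the fractional width of Q10.22) are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cmath>

// x^frac for 0 <= frac < 1, using one square root per binary digit of frac.
double pow_frac(double x, double frac, int bits = 22) {
    assert(x > 0.0 && frac >= 0.0 && frac < 1.0 && bits > 0);
    double result = 1.0;
    double root = x;
    for (int i = 0; i < bits; ++i) {
        root = std::sqrt(root);  // x^(1/2), x^(1/4), x^(1/8), ...
        frac *= 2.0;             // shift the next binary digit above the point
        if (frac >= 1.0) {       // digit is 1: this root contributes
            result *= root;
            frac -= 1.0;
        }
    }
    return result;
}
```

With frac = 0.625 (0.101 in binary) the loop multiplies sqrt(x) and sqrt(sqrt(sqrt(x))) together, which is the x^(1/2) * x^(1/8) product from the text; digits beyond the chosen bit count are simply dropped.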
Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:
logₙ(x) = logₘ(x) / logₘ(n)
where, for this algorithm, m equals 2.
A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.
When using these functions, x is expected to be a fixed-point value scaled according to the specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
#include <errno.h>
#include <stddef.h>
#include "log2fix.h"
#define INV_LOG2_E_Q1DOT31 UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10
int32_t log2fix (uint32_t x, size_t precision)
{
int32_t b;
int32_t y = 0;
if (precision < 1 || precision > 31) {
errno = EINVAL;
return INT32_MAX; // indicates an error
}
b = 1U << (precision - 1); // computed only after precision has been validated
if (x == 0) {
return INT32_MIN; // represents negative infinity
}
while (x < 1U << precision) {
x <<= 1;
y -= 1U << precision;
}
while (x >= 2U << precision) {
x >>= 1;
y += 1U << precision;
}
uint64_t z = x;
for (size_t i = 0; i < precision; i++) {
z = z * z >> precision;
if (z >= 2U << (uint64_t)precision) {
z >>= 1;
y += b;
}
b >>= 1;
}
return y;
}
int32_t logfix (uint32_t x, size_t precision)
{
uint64_t t;
t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;
return t >> 31;
}
int32_t log10fix (uint32_t x, size_t precision)
{
uint64_t t;
t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;
return t >> 31;
}
The code for this implementation also lives at Github, along with a sample/test program that illustrates how to use this function to compute and display logarithms from numbers read from standard input.
[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.
A good starting point is Jack Crenshaw's book, "Math Toolkit for Real-Time Programming". It has a good discussion of algorithms and implementations for various transcendental functions.
Check my fixed point sqrt implementation using only integer operations.
It was fun to invent. Quite old now.
https://groups.google.com/forum/?hl=fr%05aacf5997b615c37&fromgroups#!topic/comp.lang.c/IpwKbw0MAxw/discussion
Otherwise check the CORDIC set of algorithms. That's the way to implement all the functions you listed and the trigonometric functions.
EDIT : I published the reviewed source on GitHub here

How can I safely average two unsigned ints in C++?

Using integer math alone, I'd like to "safely" average two unsigned ints in C++.
What I mean by "safely" is avoiding overflows (and anything else that can be thought of).
For instance, averaging 200 and 5000 is easy:
unsigned int a = 200;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2600 as intended
But in the case of 4294967295 and 5000 then:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2499 instead of 2147486147
The best I've come up with is:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a / 2) + (b / 2); // Equals: 2147486147 as expected
Are there better ways?
Your last approach seems promising. You can improve on that by manually considering the lowest bits of a and b:
unsigned int average = (a / 2) + (b / 2) + (a & b & 1);
This gives the correct results in case both a and b are odd.
If you know ahead of time which one is higher, a very efficient way is possible. Otherwise you're better off using one of the other strategies, instead of conditionally swapping to use this.
unsigned int average = low + ((high - low) / 2);
Here's a related article: http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
Your method is not correct if both numbers are odd, e.g. 5 and 7: the average is 6, but your method #3 returns 5.
Try this:
average = (a>>1) + (b>>1) + (a & b & 1)
with math operators only:
average = a/2 + b/2 + (a%2) * (b%2)
And the correct answer is...
(A&B)+((A^B)>>1)
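To see why the variants in this thread agree, it helps to put them next to each other. The following is a minimal sketch, not taken from any of the answers verbatim: the function names are made up for illustration, and `average_wide` is only a reference that assumes a 64-bit type is available. The bitwise version relies on the identity a + b == 2*(a & b) + (a ^ b), so halving each term separately never overflows.

```cpp
#include <cassert>
#include <cstdint>

// Halve both operands first, then add back the 1 that is lost
// when both operands are odd. Rounds down.
std::uint32_t average_halves(std::uint32_t a, std::uint32_t b) {
    return a / 2 + b / 2 + (a & b & 1u);
}

// a + b == 2*(a & b) + (a ^ b): the shared bits count twice,
// the differing bits once. Halving term by term cannot overflow.
std::uint32_t average_bits(std::uint32_t a, std::uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}

// Order-dependent version; only valid when low <= high.
std::uint32_t average_ordered(std::uint32_t low, std::uint32_t high) {
    assert(low <= high);
    return low + (high - low) / 2;
}

// Reference result computed in a wider type (hypothetical helper for checking).
std::uint32_t average_wide(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) + b) / 2);
}
```

For the example from the question, `average_bits(4294967295u, 5000u)` and `average_halves(4294967295u, 5000u)` both give 2147486147, the same as the 64-bit reference.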
If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.
Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm). But here we go anyway:
// works for 64-bit long as well on x86-64, and doesn't depend on calling convention
unsigned average(unsigned x, unsigned y)
{
unsigned result;
asm("add %[x], %[res]\n\t"
"rcr %[res]"
: [res] "=r" (result) // output
: [y] "%0"(y), // input: in the same reg as results output. Commutative with next operand
[x] "rme"(x) // input: reg, mem, or immediate
: // no clobbers. ("cc" is implicit on x86)
);
return result;
}
The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the case I tried, calling the function with y being a constant or pointer-deref (memory operand). Probably using a matching constraint for an output operand defeats that, since you can't use it with read-write operands.
As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version where we change the operands to unsigned long, with the same inline asm. clang3.9 makes a mess of it, though, and decides to use the "m" option for the "rme" constraint, so it stores to memory and uses a memory operand.
RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2 cycle latency. It's great on AMD CPUs, where RCR has single-cycle latency. (Source: Agner Fog's instruction tables, see also the x86 tag wiki for x86 performance links). It's still better than @sellibitze's version, but worse than @Sheldon's order-dependent version. (See code on Godbolt)
But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better in that case.
What you have is fine, with the minor detail that it will claim that the average of 3 and 3 is 2. I'm guessing that you don't want that; fortunately, there's an easy fix:
unsigned int average = a/2 + b/2 + (a & b & 1);
This just bumps the average back up in the case that both divisions were truncated.
In C++20, you can use std::midpoint:
template <class T>
constexpr T midpoint(T a, T b) noexcept;
The paper P0811R3 that introduced std::midpoint recommended this snippet (slightly adapted to work with C++11):
#include <type_traits>
template <typename Integer>
constexpr Integer midpoint(Integer a, Integer b) noexcept {
using U = typename std::make_unsigned<Integer>::type; // typename is required for the dependent type
return a>b ? a-(U(a)-b)/2 : a+(U(b)-a)/2;
}
For completeness, here is the unmodified C++20 implementation from the paper:
constexpr Integer midpoint(Integer a, Integer b) noexcept {
using U = make_unsigned_t<Integer>;
return a>b ? a-(U(a)-b)/2 : a+(U(b)-a)/2;
}
If the code is for an embedded micro, and if speed is critical, assembly language may be helpful. On many microcontrollers, the result of the add would naturally go into the carry flag, and instructions exist to shift it back into a register. On an ARM, the average operation (source and dest. in registers) could be done in two instructions; any C-language equivalent would likely yield at least 5, and probably a fair bit more than that.
Incidentally, on machines with shorter word sizes, the differences can be even more substantial. On an 8-bit PIC-18 series, averaging two 32-bit numbers would take twelve instructions. Doing the shifts, add, and correction, would take 5 instructions for each shift, eight for the add, and eight for the correction, so 26 (not quite a 2.5x difference, but probably more significant in absolute terms).
int[] array = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
decimal avg = 0;
for (int i = 0; i < array.Length; i++){
avg = (array[i] - avg) / (i+1) + avg;
}
expects avg == 5.0 for this test
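The snippet above keeps a running mean instead of a running sum, so the intermediate value never grows beyond the largest input. A rough C++ equivalent might look like the sketch below; `running_mean` is a hypothetical name, and it uses double rather than the decimal type of the original, so it is subject to ordinary floating-point rounding.

```cpp
#include <cstddef>
#include <vector>

// Incremental mean: after processing i+1 values, avg holds their mean.
// No sum of all elements is ever formed, so nothing can overflow.
double running_mean(const std::vector<int>& values) {
    double avg = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        avg += (values[i] - avg) / static_cast<double>(i + 1);
    }
    return avg;
}
```

Called with the values 1 through 9 this should give a mean of 5, matching the test in the original snippet.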
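((((a&b) << 1) + (a^b)) >> 1) is also a nice way to look at it (note the parentheses around a&b, since << binds tighter than &). However, ((a&b) << 1) + (a^b) is exactly a + b, so this form overflows just like the naive version; (a&b) + ((a^b) >> 1) is the overflow-safe rearrangement.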
Courtesy: http://www.ragestorm.net/blogs/?p=29

How can I write a power function myself?

I was always wondering how I can make a function which calculates the power (e.g. 2^3) myself. In most languages these are included in the standard library, mostly as pow(double x, double y), but how can I write it myself?
I was thinking about for loops, but I think my brain got in a loop (when I wanted to do a power with a non-integer exponent, like 5^4.5, or a negative one, like 2^-21) and I went crazy ;)
So, how can I write a function which calculates the power of a real number? Thanks
Oh, maybe important to note: I cannot use functions which use powers (e.g. exp), which would make this ultimately useless.
Negative powers are not a problem, they're just the inverse (1/x) of the positive power.
Floating point powers are just a little bit more complicated; as you know, a fractional power is equivalent to a root (e.g. x^(1/2) == sqrt(x)), and you also know that multiplying powers with the same base is equivalent to adding their exponents.
With all the above, you can:
Decompose the exponent into an integer part and a fractional part.
Calculate the integer power with a loop (you can optimise it decomposing in factors and reusing partial calculations).
Calculate the root with any algorithm you like (any iterative approximation like bisection or Newton method could work).
Multiply the result.
If the exponent was negative, apply the inverse.
Example:
2^(-3.5) = (2^3 * 2^(1/2))^-1 = 1 / (2*2*2 * sqrt(2))
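The steps above can be sketched in C++ roughly as follows. This is only one possible reading of the recipe, under several assumptions: the base is positive, the integer part is small enough for a plain loop, and the fractional part is handled by writing it in binary (1/2 + 1/4 + ...) and multiplying in the matching repeated square roots, each computed with Newton's method. The names `newton_sqrt` and `pow_decomposed` are made up for this example.

```cpp
#include <cassert>
#include <cmath>  // only std::floor and std::fabs are used

// Square root by Newton's iteration, for v >= 0.
double newton_sqrt(double v) {
    if (v == 0.0) return 0.0;
    double r = v > 1.0 ? v : 1.0;           // start at or above the root
    for (int i = 0; i < 1200; ++i) {
        double next = 0.5 * (r + v / r);
        if (std::fabs(next - r) <= 1e-15 * r) return next;
        r = next;
    }
    return r;
}

// base^exponent for base > 0, following the decomposition in the answer.
double pow_decomposed(double base, double exponent) {
    assert(base > 0.0);
    const bool negative = exponent < 0.0;
    const double e = negative ? -exponent : exponent;
    const double int_part = std::floor(e);
    double frac = e - int_part;             // in [0, 1)

    // Integer part: plain loop (could be replaced by repeated squaring).
    double result = 1.0;
    for (long long i = 0; i < static_cast<long long>(int_part); ++i) {
        result *= base;
    }

    // Fractional part: frac = b1/2 + b2/4 + ..., so base^frac is the
    // product of the 2^k-th roots of base for every set bit bk.
    double root = base;
    for (int k = 0; k < 52 && frac > 0.0; ++k) {
        root = newton_sqrt(root);
        frac *= 2.0;
        if (frac >= 1.0) {
            result *= root;
            frac -= 1.0;
        }
    }
    return negative ? 1.0 / result : result;
}
```

With this sketch, `pow_decomposed(2.0, -3.5)` reproduces the worked example, 1 / (8 * sqrt(2)), which is about 0.0884.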
A^B = Log^-1(Log(A)*B)
Edit: yes, this definition really does provide something useful. For example, on an x86, it translates almost directly to FYL2X (Y * Log2(X)) and F2XM1 (2^x - 1):
fyl2x
fld st(0)
frndint
fsubr st(1),st
fxch st(1)
fchs
f2xm1
fld1
faddp st(1),st
fscale
fstp st(1)
The code ends up a little longer than you might expect, primarily because F2XM1 only works with numbers in the range -1.0..1.0. The fld st(0)/frndint/fsubr st(1),st piece subtracts off the integer part, so we're left with only the fraction. We apply F2XM1 to that, add the 1 back on, then use FSCALE to handle the integer part of the exponentiation.
Typically the implementation of the pow(double, double) function in math libraries is based on the identity:
pow(x,y) = pow(a, y * log_a(x))
Using this identity, you only need to know how to raise a single number a to an arbitrary exponent, and how to take a logarithm base a. You have effectively turned a complicated multi-variable function into two functions of a single variable and a multiplication, which is pretty easy to implement. The most commonly chosen values of a are e or 2 -- e because e^x and log_e(1+x) have some very nice mathematical properties, and 2 because it has some nice properties for implementation in floating-point arithmetic.
The catch of doing it this way is that (if you want to get full accuracy) you need to compute the log_a(x) term (and its product with y) to higher accuracy than the floating-point representation of x and y. For example, if x and y are doubles, and you want to get a high accuracy result, you'll need to come up with some way to store intermediate results (and do arithmetic) in a higher-precision format. The Intel x87 format is a common choice, as are 64-bit integers (though if you really want a top-quality implementation, you'll need to do a couple of 96-bit integer computations, which are a little bit painful in some languages). It's much easier to deal with this if you implement powf(float,float), because then you can just use double for intermediate computations. I would recommend starting with that if you want to use this approach.
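As a small illustration of the powf(float,float) suggestion, the sketch below evaluates the identity with a = 2 and does all intermediate work in double. It deliberately leans on the standard library's std::log2 and std::exp2 just to show the shape of the computation, so it is not a from-scratch implementation and makes no claim about correctly rounded results; `powf_via_log2` is a hypothetical name and x is assumed to be positive.

```cpp
#include <cmath>  // std::log2, std::exp2

// x^y for x > 0, computed as 2^(y * log2(x)) with double intermediates,
// so the product y * log2(x) carries more precision than a float holds.
float powf_via_log2(float x, float y) {
    const double log2_x = std::log2(static_cast<double>(x));
    const double scaled = static_cast<double>(y) * log2_x;
    return static_cast<float>(std::exp2(scaled));
}
```

In a real implementation the two library calls would be replaced by your own single-variable approximations of log2 and 2^x.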
The algorithm that I outlined is not the only possible way to compute pow. It is merely the most suitable for delivering a high-speed result that satisfies a fixed a priori accuracy bound. It is less suitable in some other contexts, and is certainly much harder to implement than the repeated-square[root]-ing algorithm that some others have suggested.
If you want to try the repeated square[root] algorithm, begin by writing an unsigned integer power function that uses repeated squaring only. Once you have a good grasp on the algorithm for that reduced case, you will find it fairly straightforward to extend it to handle fractional exponents.
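For the reduced case mentioned above, an unsigned integer power by repeated squaring could look like this. It is a minimal sketch with a made-up name (`ipow`), and it assumes that silent wrap-around modulo 2^64 is acceptable when the true result does not fit.

```cpp
#include <cstdint>

// base^exponent using repeated squaring: one squaring per bit of the
// exponent, plus one multiplication for every bit that is set.
std::uint64_t ipow(std::uint64_t base, unsigned exponent) {
    std::uint64_t result = 1;
    while (exponent != 0) {
        if (exponent & 1u) {
            result *= base;      // this bit of the exponent contributes
        }
        exponent >>= 1;
        if (exponent != 0) {
            base *= base;        // base now holds the next power-of-two power
        }
    }
    return result;
}
```

For example, `ipow(5, 9)` needs only a handful of multiplications to reach 1953125, instead of the eight a naive loop would use.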
There are two distinct cases to deal with: Integer exponents and fractional exponents.
For integer exponents, you can use exponentiation by squaring.
def pow(base, exponent):
    if exponent == 0:
        return 1
    elif exponent < 0:
        return 1 / pow(base, -exponent)
    elif exponent % 2 == 0:
        half_pow = pow(base, exponent // 2)
        return half_pow * half_pow
    else:
        return base * pow(base, exponent - 1)
The second "elif" is what distinguishes this from the naïve pow function. It allows the function to make O(log n) recursive calls instead of O(n).
For fractional exponents, you can use the identity a^b = C^(b*log_C(a)). It's convenient to take C=2, so a^b = 2^(b * log2(a)). This reduces the problem to writing functions for 2^x and log2(x).
The reason it's convenient to take C=2 is that floating-point numbers are stored in base-2 floating point. log2(a * 2^b) = log2(a) + b. This makes it easier to write your log2 function: You don't need to have it be accurate for every positive number, just on the interval [1, 2). Similarly, to calculate 2^x, you can multiply 2^(integer part of x) * 2^(fractional part of x). The first part is trivial to store in a floating point number, for the second part, you just need a 2^x function over the interval [0, 1).
The hard part is finding a good approximation of 2^x and log2(x). A simple approach is to use Taylor series.
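The range reduction described above can be sketched in C++ as follows. This is an illustration under assumptions, not a tuned implementation: x is assumed positive and finite, std::frexp and std::ldexp are used only to take the floating-point number apart and put it back together, log2 on [1, 2) uses the series ln(m) = 2 * (t + t^3/3 + t^5/5 + ...) with t = (m-1)/(m+1) rather than a plain Taylor expansion, and 2^f on [0, 1) uses the Taylor series of e^(f ln 2). All function names are made up.

```cpp
#include <cmath>  // std::frexp, std::ldexp, std::floor, std::fabs

constexpr double kLn2 = 0.6931471805599453;  // natural log of 2

// log2(x) for finite x > 0: split x = m * 2^e with m in [1, 2),
// then log2(x) = e + log2(m).
double log2_sketch(double x) {
    int e = 0;
    double m = std::frexp(x, &e);   // m in [0.5, 1)
    m *= 2.0;                       // shift to [1, 2)
    e -= 1;
    const double t = (m - 1.0) / (m + 1.0);   // |t| < 1/3 on this interval
    const double t2 = t * t;
    double term = t;
    double sum = 0.0;
    for (int k = 0; k < 40; ++k) {
        sum += term / (2 * k + 1);
        term *= t2;
    }
    return e + 2.0 * sum / kLn2;
}

// 2^x: integer part via the exponent field, fractional part via a series.
double exp2_sketch(double x) {
    const double n = std::floor(x);
    const double y = (x - n) * kLn2;  // 2^f = e^(f * ln 2), f in [0, 1]
    double term = 1.0;
    double sum = 1.0;
    for (int k = 1; k <= 25; ++k) {
        term *= y / k;
        sum += term;
    }
    return std::ldexp(sum, static_cast<int>(n));
}

// a^b = 2^(b * log2(a)) for a > 0.
double pow_sketch(double a, double b) {
    return exp2_sketch(b * log2_sketch(a));
}
```

Because the intermediate product b * log2(a) is only held in double, the last few bits of the result can be off for large exponents, which is the accuracy issue raised in another answer.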
Per definition:
a^b = exp(b ln(a))
where exp(x) = 1 + x + x^2/2 + x^3/3! + x^4/4! + x^5/5! + ...
where n! = 1 * 2 * ... * n.
In practice, you could store an array of the first 10 values of 1/n!, and then approximate
exp(x) = 1 + x + x^2/2 + x^3/3! + ... + x^10/10!
because 10! is a huge number, so 1/10! is very small (2.7557319224⋅10^-7).
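A minimal sketch of that idea, with the reciprocals of the factorials stored in a table and the polynomial evaluated by Horner's rule. The names `kInvFactorial` and `exp_taylor10` are made up, and the truncation after x^10/10! only gives good accuracy for small |x| (roughly |x| <= 1); larger arguments would need range reduction first.

```cpp
// 1/n! for n = 0..10
constexpr double kInvFactorial[11] = {
    1.0,          1.0,          1.0 / 2,     1.0 / 6,
    1.0 / 24,     1.0 / 120,    1.0 / 720,   1.0 / 5040,
    1.0 / 40320,  1.0 / 362880, 1.0 / 3628800};

// exp(x) approximated by 1 + x + x^2/2! + ... + x^10/10!,
// evaluated from the highest term down (Horner's rule).
double exp_taylor10(double x) {
    double sum = kInvFactorial[10];
    for (int n = 9; n >= 0; --n) {
        sum = sum * x + kInvFactorial[n];
    }
    return sum;
}
```

At x = 1 the truncation error is on the order of 1/11!, i.e. a few times 10^-8.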
Wolfram functions gives a wide variety of formulae for calculating powers. Some of them would be very straightforward to implement.
For positive integer powers, look at exponentiation by squaring and addition-chain exponentiation.
Using three self-implemented functions iPow(x, n), Ln(x) and Exp(x), I'm able to compute fPow(x, a), x and a being doubles. None of the functions below uses math library functions, just iteration.
Some explanation about functions implemented:
(1) iPow(x, n): x is double, n is int. This is a simple iteration, as n is an integer.
(2) Ln(x): This function uses the Taylor Series iteration. The series used in iteration is 2 * Σ (from int i = 0 to n) {(1 / (2 * i + 1)) * ((x - 1) / (x + 1)) ^ (2 * i + 1)}. The symbol ^ denotes the power function Pow(x, n) implemented in the 1st function, which uses simple iteration.
(3) Exp(x): This function, again, uses the Taylor Series iteration. The series used in iteration is Σ (from int i = 0 to n) {x^i / i!}. Here, the ^ denotes the power function, but it is not computed by calling the 1st Pow(x, n) function; instead it is implemented within the 3rd function, concurrently with the factorial, using d *= x / i. I felt I had to use this trick, because in this function, iteration takes some more steps relative to the other functions and the factorial (i!) overflows most of the time. In order to make sure the iteration does not overflow, the power function in this part is iterated concurrently with the factorial. This way, I overcame the overflow.
(4) fPow(x, a): x and a are both doubles. This function does nothing but just call the other three functions implemented above. The main idea in this function depends on some calculus: fPow(x, a) = Exp(a * Ln(x)). And now, I have all the functions iPow, Ln and Exp with iteration already.
n.b. I used a constant MAX_DELTA_DOUBLE to decide at which step to stop the iteration. I've set it to 1.0E-15, which seems reasonable for doubles, so the iteration stops when delta < MAX_DELTA_DOUBLE. If you need some more precision, you can use long double and decrease the constant value for MAX_DELTA_DOUBLE, to 1.0E-18 for example (1.0E-18 would be the minimum).
Here is the code, which works for me.
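#include <stdio.h>  /* printf, used for the error message in MathLn_Double */
#include <stdlib.h> /* exit */
#define MAX_DELTA_DOUBLE 1.0E-15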
#define EULERS_NUMBER 2.718281828459045
double MathAbs_Double (double x) {
return ((x >= 0) ? x : -x);
}
int MathAbs_Int (int x) {
return ((x >= 0) ? x : -x);
}
double MathPow_Double_Int(double x, int n) {
double ret;
if ((x == 1.0) || (n == 1)) {
ret = x;
} else if (n < 0) {
ret = 1.0 / MathPow_Double_Int(x, -n);
} else {
ret = 1.0;
while (n--) {
ret *= x;
}
}
return (ret);
}
double MathLn_Double(double x) {
double ret = 0.0, d;
if (x > 0) {
int n = 0;
do {
int a = 2 * n + 1;
d = (1.0 / a) * MathPow_Double_Int((x - 1) / (x + 1), a);
ret += d;
n++;
} while (MathAbs_Double(d) > MAX_DELTA_DOUBLE);
} else {
printf("\nerror: x <= 0 in ln(x)\n");
exit(-1);
}
return (ret * 2);
}
double MathExp_Double(double x) {
double ret;
if (x == 1.0) {
ret = EULERS_NUMBER;
} else if (x < 0) {
ret = 1.0 / MathExp_Double(-x);
} else {
int n = 2;
double d;
ret = 1.0 + x;
do {
d = x;
for (int i = 2; i <= n; i++) {
d *= x / i;
}
ret += d;
n++;
} while (d > MAX_DELTA_DOUBLE);
}
return (ret);
}
double MathPow_Double_Double(double x, double a) {
double ret;
if ((x == 1.0) || (a == 1.0)) {
ret = x;
} else if (a < 0) {
ret = 1.0 / MathPow_Double_Double(x, -a);
} else {
ret = MathExp_Double(a * MathLn_Double(x));
}
return (ret);
}
It's an interesting exercise. Here are some suggestions, which you should try in this order:
Use a loop.
Use recursion (not better, but interesting nonetheless).
Optimize your recursion vastly by using divide-and-conquer techniques.
Use logarithms.
You can write the pow function like this (for non-negative integer exponents):
static double pows (double p_nombre, double p_puissance)
{
double nombre = 1.0; // start from 1 so that an exponent of 0 returns 1
double i=0;
for(i=0; i < p_puissance; i++){
nombre = nombre * p_nombre;
}
return (nombre);
}
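You can write the floor function like this:
static double floors(double p_nomber)
{
double x = p_nomber;
long partent = (long) x; // the cast truncates toward zero
if (x < 0 && x != partent) // negative non-integers were truncated upward, so step down by one
{
return (partent-1);
}
else
{
return (partent);
}
}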
Best regards
A better algorithm to efficiently calculate positive integer powers is to repeatedly square the base while keeping track of the extra remainder multiplicands. Here is a sample solution in Python that should be relatively easy to understand and translate into your preferred language:
def power(base, exponent):
    remaining_multiplicand = 1
    result = base
    while exponent > 1:
        remainder = exponent % 2
        if remainder > 0:
            remaining_multiplicand = remaining_multiplicand * result
        exponent = (exponent - remainder) // 2  # integer division keeps exponent an int
        result = result * result
    return result * remaining_multiplicand
To make it handle negative exponents, all you have to do is calculate the positive version and divide 1 by the result, so that should be a simple modification to the above code. Fractional exponents are considerably more difficult, since it means essentially calculating an nth-root of the base, where n = 1/abs(exponent % 1) and multiplying the result by the result of the integer portion power calculation:
power(base, exponent - (exponent % 1))
You can calculate roots to a desired level of accuracy using Newton's method. Check out the Wikipedia article on the algorithm.
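As a rough illustration of the Newton's method suggestion, here is a C++ sketch of an nth root. It solves r^n = value with the update r_next = ((n-1)*r + value / r^(n-1)) / n. The function name `nth_root`, the starting guess and the stopping tolerance are choices made for this example, and it assumes value > 0 and n >= 1.

```cpp
#include <cassert>
#include <cmath>  // std::fabs

// n-th root of value (value > 0, n >= 1) by Newton's method.
double nth_root(double value, int n) {
    assert(value > 0.0 && n >= 1);
    double r = value > 1.0 ? value : 1.0;   // start at or above the root
    for (int iter = 0; iter < 10000; ++iter) {
        double r_pow = 1.0;                 // r^(n-1), by plain iteration
        for (int i = 0; i < n - 1; ++i) {
            r_pow *= r;
        }
        const double next = ((n - 1) * r + value / r_pow) / n;
        if (std::fabs(next - r) <= 1e-15 * r) {
            return next;                    // converged to about double precision
        }
        r = next;
    }
    return r;                               // iteration cap reached
}
```

The fractional part of a power can then be obtained by taking the appropriate root and multiplying it with the integer-power result, as described above.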
I am using fixed point long arithmetics and my pow is log2/exp2 based. Numbers consist of:
int sig = { -1; +1 } signum
DWORD a[A+B] number
A is number of DWORDs for integer part of number
B is number of DWORDs for fractional part
My simplified solution is this:
//---------------------------------------------------------------------------
longnum exp2 (const longnum &x)
{
int i,j;
longnum c,d;
c.one();
if (x.iszero()) return c;
i=x.bits()-1;
for(d=2,j=_longnum_bits_b;j<=i;j++,d*=d)
if (x.bitget(j))
c*=d;
for(i=0,j=_longnum_bits_b-1;i<_longnum_bits_b;j--,i++)
if (x.bitget(j))
c*=_longnum_log2[i];
if (x.sig<0) {d.one(); c=d/c;}
return c;
}
//---------------------------------------------------------------------------
longnum log2 (const longnum &x)
{
int i,j;
longnum c,d,dd,e,xx;
c.zero(); d.one(); e.zero(); xx=x;
if (xx.iszero()) return c; //**** error: log2(0) = infinity
if (xx.sig<0) return c; //**** error: log2(negative x) ... no result possible
if (d.geq(x,d)==0) {xx=d/xx; xx.sig=-1;}
i=xx.bits()-1;
e.bitset(i); i-=_longnum_bits_b;
for (;i>0;i--,e>>=1) // integer part
{
dd=d*e;
j=dd.geq(dd,xx);
if (j==1) continue; // dd> xx
c+=i; d=dd;
if (j==2) break; // dd==xx
}
for (i=0;i<_longnum_bits_b;i++) // fractional part
{
dd=d*_longnum_log2[i];
j=dd.geq(dd,xx);
if (j==1) continue; // dd> xx
c.bitset(_longnum_bits_b-i-1); d=dd;
if (j==2) break; // dd==xx
}
c.sig=xx.sig;
c.iszero();
return c;
}
//---------------------------------------------------------------------------
longnum pow (const longnum &x,const longnum &y)
{
//x^y = exp2(y*log2(x))
int ssig=+1; longnum c; c=x;
if (y.iszero()) {c.one(); return c;} // ?^0=1
if (c.iszero()) return c; // 0^?=0
if (c.sig<0)
{
c.overflow(); c.sig=+1;
if (y.isreal()) {c.zero(); return c;} //**** error: negative x ^ noninteger y
if (y.bitget(_longnum_bits_b)) ssig=-1;
}
c=exp2(log2(c)*y); c.sig=ssig; c.iszero();
return c;
}
//---------------------------------------------------------------------------
where:
_longnum_bits_a = A*32
_longnum_bits_b = B*32
_longnum_log2[i] = 2 ^ (1/(2^i)) ... precomputed sqrt table
_longnum_log2[0]=sqrt(2)
_longnum_log2[1]=sqrt(tab[0])
_longnum_log2[i]=sqrt(tab[i-1])
longnum::zero() sets *this=0
longnum::one() sets *this=+1
bool longnum::iszero() returns (*this==0)
bool longnum::isnonzero() returns (*this!=0)
bool longnum::isreal() returns (true if fractional part !=0)
bool longnum::isinteger() returns (true if fractional part ==0)
int longnum::bits() return num of used bits in number counted from LSB
longnum::bitget()/bitset()/bitres()/bitxor() are bit access
longnum.overflow() rounds the number if there was an overflow X.FFFFFFFFFF...FFFFFFFFF??h -> (X+1).0000000000000...000000000h
int longnum::geq(x,y) is a comparison of |x|,|y| that returns 0,1,2 for (<,>,==)
All you need to understand this code is that a number in binary form is a sum of powers of 2, so when you need to compute 2^num it can be rewritten as
2^(b(-n)*2^(-n) + ... + b(+m)*2^(+m))
where n is the number of fractional bits and m is the number of integer bits. Multiplication/division by 2 in binary form is simple bit shifting, so if you put it all together you get code for exp2 similar to mine. log2 is based on binary search: it changes the result bits from MSB to LSB until the result matches the searched value (a very similar algorithm to the one for fast sqrt computation). Hope this helps clarify things...
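The same bit-by-bit idea can be shown with ordinary doubles instead of the longnum class. The sketch below is an assumption-laden simplification, not the answer's code: it keeps only 32 fractional bits, builds the table 2^(1/2), 2^(1/4), ... with std::sqrt on first use, and handles the integer part with std::ldexp rather than by repeated squaring. The names `exp2_by_bits` and `kFracBits` are made up.

```cpp
#include <cmath>    // std::sqrt, std::floor, std::ldexp, std::fabs
#include <cstdint>

constexpr int kFracBits = 32;  // number of fractional bits considered

// 2^x: the fractional part of x is written in binary, and every set bit
// with weight 2^-(i+1) multiplies the result by the table entry 2^(2^-(i+1)).
double exp2_by_bits(double x) {
    static double table[kFracBits];
    static bool ready = false;
    if (!ready) {                      // precomputed sqrt table
        double r = 2.0;
        for (int i = 0; i < kFracBits; ++i) {
            r = std::sqrt(r);          // table[i] = 2^(1 / 2^(i+1))
            table[i] = r;
        }
        ready = true;
    }
    double n = std::floor(x);
    double f = x - n;                  // fractional part, normally in [0, 1)
    if (f >= 1.0) {                    // guard against rounding up to 1.0
        f = 0.0;
        n += 1.0;
    }
    const std::uint32_t bits = static_cast<std::uint32_t>(f * 4294967296.0);
    double result = 1.0;
    for (int i = 0; i < kFracBits; ++i) {
        if (bits & (0x80000000u >> i)) {
            result *= table[i];
        }
    }
    return std::ldexp(result, static_cast<int>(n));  // apply the integer part
}
```

Since the fraction is truncated to 32 bits, the relative error is on the order of 10^-10, which is far from full double precision but enough to show the mechanism.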
A lot of approaches are given in other answers. Here is something that I thought may be useful in case of integral powers.
In the case of an integer power n^x, the straightforward approach would take x-1 multiplications. In order to optimize this, we can use dynamic programming and reuse an earlier multiplication result to avoid performing all x-1 multiplications. For example, in 5^9, we can, say, make batches of 3, i.e. calculate 5^3 once, get 125 and then cube 125 using the same logic, taking only 4 multiplications in the process, instead of 8 multiplications with the straightforward way.
The question is what is the ideal number of batches b so that the number of multiplications is minimum. So let's write the equation for this. If f(x,b) is the function representing the number of multiplications entailed in calculating n^x using the above method, then
f(x,b) = (x/b - 1) + (b - 1)
Explanation: A product of a batch of p numbers will take p-1 multiplications. If we divide x multiplications into b batches, there would be (x/b)-1 multiplications required inside each batch, and b-1 multiplications required for all b batches.
Now we can calculate the first derivative of this function with respect to b and equate it to 0 to get the b for the least number of multiplications.
df/db = -x/b^2 + 1 = 0, which gives b = sqrt(x)
Now put back this value of b into the function f(x,b) to get the least number of multiplications:
f(x, sqrt(x)) = 2*sqrt(x) - 2
For every x > 1, this value is less than the x-1 multiplications of the straightforward way.
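To make the counting concrete, here is a small C++ sketch that computes n^x in b batches and counts the multiplications it performs: (x/b)-1 inside the batch and b-1 to combine the batches, as described above. The struct and function names are hypothetical, x is assumed to be divisible by b, and overflow of the 64-bit result is not checked.

```cpp
#include <cassert>
#include <cstdint>

struct BatchedPower {
    std::uint64_t value;     // n^x
    int multiplications;     // how many multiplications were performed
};

// Computes n^x as (n^(x/b))^b and counts the multiplications.
BatchedPower power_in_batches(std::uint64_t n, int x, int b) {
    assert(b >= 1 && x >= b && x % b == 0);
    int count = 0;
    std::uint64_t batch = n;
    for (int i = 1; i < x / b; ++i) {   // n^(x/b): (x/b) - 1 multiplications
        batch *= n;
        ++count;
    }
    std::uint64_t result = batch;
    for (int i = 1; i < b; ++i) {       // (n^(x/b))^b: b - 1 multiplications
        result *= batch;
        ++count;
    }
    return {result, count};
}
```

For the 5^9 example with b = 3 this performs 4 multiplications, while b = 1 (the straightforward way) performs 8.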
Maybe you can use a Taylor series expansion. The Taylor series of a function is an infinite sum of terms that are expressed in terms of the function's derivatives at a single point. For most common functions, the function and the sum of its Taylor series are equal near this point. Taylor series are named after Brook Taylor, who introduced them in 1715.