How to quickly divide a big unsigned integer by a word? - c++

I am currently developing a class to work with big unsigned integers. However, I only need a limited set of operations, namely:
bi_uint += bi_uint - Already implemented. No complaints.
bi_uint *= std::uint_fast64_t - Already implemented. No complaints.
bi_uint /= std::uint_fast64_t - Implemented, but it works very slowly and also requires a type that is twice as wide as uint_fast64_t. In my test case, division was 35 times slower than multiplication.
Next, I will give my implementation of division, which is based on the simple long-division algorithm:
#include <climits>
#include <cstdint>
#include <limits>
#include <vector>

class bi_uint
{
public:
    using u64_t = std::uint_fast64_t;
    constexpr static std::size_t u64_bits = CHAR_BIT * sizeof(u64_t);
    using u128_t = unsigned __int128;
    static_assert(sizeof(u128_t) >= sizeof(u64_t) * 2);

    //little-endian
    std::vector<u64_t> data;

    //User should guarantee data.size()>0 and val>0
    void self_div(const u64_t val)
    {
        auto it = data.rbegin();
        if(data.size() == 1) {
            *it /= val;
            return;
        }
        u128_t rem = 0;
        if(*it < val) {
            rem = *it++;
            data.pop_back();
        }
        u128_t r = rem % val;
        while(it != data.rend()) {
            rem = (r << u64_bits) + *it;
            const u128_t q = rem / val;
            r = rem % val;
            *it++ = static_cast<u64_t>(q);
        }
    }
};
You can see that the unsigned __int128 type was used; therefore, this option is not portable and is tied to a single compiler, GCC, and also requires an x64 platform.
After reading the page about division algorithms, I feel the appropriate one would be "Newton–Raphson division". However, the Newton–Raphson algorithm seems complicated to me. I suspect there is a simpler algorithm for "big_uint / uint" division that would have almost the same performance.
Q: How can I quickly divide a bi_uint by a u64_t?
I have about 10^6 iterations, and each iteration uses all the operations listed above.
If this is easily achievable, then I would like to have portability and not use unsigned __int128. Otherwise, I prefer to abandon portability in favor of a simpler approach.
EDIT1:
This is an academic project, I am not able to use third-party libraries.

Part 1 (See Part 2 below)
I managed to speed up your division code 5x on my old laptop (and even 7.5x on the GodBolt servers) using Barrett reduction, a technique that replaces a single division with several multiplications and additions. I implemented the whole code from scratch just today.
If you want, you can jump directly to the code at the end of my post, skipping the long description; the code is fully runnable without any extra knowledge or dependencies.
The code below is for Intel x64 only, because I used Intel-only instructions and only their 64-bit variants. It can certainly be rewritten for x86 32-bit and for other processors, because the Barrett algorithm is generic.
To explain the whole of Barrett reduction in short pseudo-code, I'll write it in Python, as it is the simplest language suitable for understandable pseudo-code:
# https://www.nayuki.io/page/barrett-reduction-algorithm
def BarrettR(n, s):
    return (1 << s) // n

def BarrettDivMod(x, n, r, s):
    q = (x * r) >> s
    t = x - q * n
    return (q, t) if t < n else (q + 1, t - n)
Basically, in the pseudo-code above, BarrettR() is done only a single time for a given divisor (you use the same single-word divisor for the whole big-integer division). BarrettDivMod() is used each time you want to do a division or modulus operation: given input x and divisor n, it returns the tuple (x / n, x % n), nothing else, but does it faster than a regular division instruction.
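To make the pseudo-code concrete, here is a tiny self-contained C++ demonstration of the same two steps at 32-bit scale (my own toy example, not part of the final code below; it is valid for dividends x below 2^32, so that x * r fits into 64 bits):

#include <cstdint>
#include <cassert>

int main() {
    uint64_t const n = 97;                      // divisor, not a power of 2
    int const s = 32;                           // shift; q is off by at most 1 for x < 2^32
    uint64_t const r = (uint64_t(1) << s) / n;  // BarrettR: computed once per divisor
    uint64_t const x = 1234567890;              // dividend
    uint64_t q = (x * r) >> s;                  // tentative quotient
    uint64_t t = x - q * n;                     // tentative remainder
    if (t >= n) { ++q; t -= n; }                // single correction step
    assert(q == x / n && t == x % n);
}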
In the C++ code below I implement the same two functions of Barrett, but apply some C++-specific optimizations to make it even faster. The optimizations are possible due to the fact that the divisor n is always 64-bit and x is 128-bit, but the higher half of x is always smaller than n (this last assumption holds because, in your big-integer division, the higher half is always a remainder modulo n).
The Barrett algorithm works with a divisor n that is NOT a power of 2, so divisors like 1, 2, 4, 8, 16, ... are not allowed. You can cover this trivial case of divisor just by right bit-shifting the big integer, because dividing by a power of 2 is just a bit shift (see the sketch below). Any other divisor is allowed, including even divisors that are not powers of 2.
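Here is a minimal sketch of that power-of-2 case, assuming the same little-endian limb layout as your bi_uint::data (self_shr is my own name; a shift of 64 bits or more would additionally drop whole limbs first):

#include <cstdint>
#include <vector>

// Divide a little-endian limb vector by 2^k, for 0 <= k < 64.
void self_shr(std::vector<uint64_t> & data, unsigned k) {
    uint64_t carry = 0;
    for (std::size_t i = data.size(); i-- > 0; ) {
        uint64_t const cur = data[i];
        data[i] = (cur >> k) | carry;
        carry = k == 0 ? 0 : cur << (64 - k); // bits that fall into the next lower limb
    }
    if (data.size() > 1 && data.back() == 0) // keep the invariant data.size() > 0
        data.pop_back();
}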
Also, it is important to note that my BarrettDivMod() accepts ONLY a dividend x that is strictly smaller than divisor * 2^64; in other words, the higher half of the 128-bit dividend x should be smaller than the divisor. This is always true in your big-integer division function, as the higher half is always a remainder modulo the divisor. This rule for x should be checked by you; in my BarrettDivMod() it is checked only as a DEBUG assertion that is removed in release.
You can notice that BarrettDivMod() has two big branches: these are two variants of the same algorithm. The first uses the CLang/GCC-only type unsigned __int128; the second uses only 64-bit instructions and hence is suitable for MSVC.
I tried to target three compilers, CLang/GCC/MSVC, but somehow the MSVC version got only 2x faster with Barrett, while CLang/GCC are 5x faster. Maybe I introduced some bug in the MSVC code.
You can see that I used your class bi_uint for the time measurement of two versions of code, with regular division and with Barrett. It is important to note that I changed your code quite significantly: first, to not use u128 (so that the MSVC version, which has no u128, compiles); second, to not modify the data vector, so it does read-only division and doesn't store the final result (this read-only behavior is needed for me to run speed tests very fast without copying the data vector on each test iteration). So your code is broken in my snippet: it can't be copy-pasted and used straight away; I only used your code for speed measurement.
Barrett reduction works faster not only because division is slower than multiplication, but also because multiplication and addition are both very well pipelined on modern CPUs. A modern CPU can execute several mul or add instructions within one cycle, but only if these muls/adds don't depend on each other's results; in other words, the CPU can run several instructions in parallel within one cycle. As far as I know, division can't be run in parallel, because there is only a single division unit within the CPU, but it is still somewhat pipelined: after about 50% of the first division is done, a second division can be started in parallel at the beginning of the CPU pipeline.
On some computers you may notice that the regular divide version is sometimes much slower. This happens because CLang/GCC fall back to a library-based soft implementation of division even for a 128-bit dividend. In this case my Barrett may show even a 7-10x speedup, as it doesn't use library functions.
To overcome the issue described above with soft division, it is better to add assembly code that uses the DIV instruction directly, or to find some intrinsic function that implements this in your compiler (I think CLang/GCC have such an intrinsic). I can also write this assembly implementation if needed, just tell me in the comments.
Update. As promised, I implemented an assembly variant of the 128-bit division for CLang/GCC, the function UDiv128Asm(). After this change it is used as the main implementation of CLang/GCC 128-bit division instead of the regular u128(a) / b. You may come back to the regular u128 implementation by replacing #if 0 with #if 1 inside the body of the UDiv128() function.
Try it online!
#include <cstdint>
#include <bit>
#include <stdexcept>
#include <string>
#include <immintrin.h>
#if defined(_MSC_VER) && !defined(__clang__)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if IS_MSVC
#include <intrin.h>
#endif
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#ifdef _DEBUG
#define DASSERT_MSG(cond, msg) ASSERT_MSG(cond, msg)
#else
#define DASSERT_MSG(cond, msg)
#endif
#define DASSERT(cond) DASSERT_MSG(cond, "")
using u16 = uint16_t;
using u32 = uint32_t;
using i64 = int64_t;
using u64 = uint64_t;
using UllPtr = unsigned long long *;
inline int GetExp(double x) {
return int((std::bit_cast<uint64_t>(x) >> 52) & 0x7FF) - 1023;
}
inline size_t BitSizeWrong(uint64_t x) {
return x == 0 ? 0 : (GetExp(x) + 1);
}
inline size_t BitSize(u64 x) {
size_t r = 0;
if (x >= (u64(1) << 32)) {
x >>= 32;
r += 32;
}
while (x >= 0x100) {
x >>= 8;
r += 8;
}
while (x) {
x >>= 1;
++r;
}
return r;
}
#if !IS_MSVC
inline u64 UDiv128Asm(u64 h, u64 l, u64 d, u64 * r) {
u64 q;
asm (R"(
.intel_syntax
mov rdx, %V[h]
mov rax, %V[l]
div %V[d]
mov %V[r], rdx
mov %V[q], rax
)"
: [q] "=r" (q), [r] "=r" (*r)
: [h] "r" (h), [l] "r" (l), [d] "r" (d)
: "rax", "rdx"
);
return q;
}
#endif
inline std::pair<u64, u64> UDiv128(u64 hi, u64 lo, u64 d) {
#if IS_MSVC
u64 r, q = _udiv128(hi, lo, d, &r);
return std::make_pair(q, r);
#else
#if 0
using u128 = unsigned __int128;
auto const dnd = (u128(hi) << 64) | lo;
return std::make_pair(u64(dnd / d), u64(dnd % d));
#else
u64 r, q = UDiv128Asm(hi, lo, d, &r);
return std::make_pair(q, r);
#endif
#endif
}
inline std::pair<u64, u64> UMul128(u64 a, u64 b) {
#if IS_MSVC
u64 hi, lo = _umul128(a, b, &hi);
return std::make_pair(hi, lo);
#else
using u128 = unsigned __int128;
auto const x = u128(a) * b;
return std::make_pair(u64(x >> 64), u64(x));
#endif
}
inline std::pair<u64, u64> USub128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_subborrow_u64(_subborrow_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline std::pair<u64, u64> UAdd128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_addcarry_u64(_addcarry_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline int UCmp128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
if (a_hi != b_hi)
return a_hi < b_hi ? -1 : 1;
return a_lo == b_lo ? 0 : a_lo < b_lo ? -1 : 1;
}
std::pair<u64, size_t> BarrettRS64(u64 n) {
// https://www.nayuki.io/page/barrett-reduction-algorithm
ASSERT_MSG(n >= 3 && (n & (n - 1)) != 0, "n " + std::to_string(n))
size_t const nbits = BitSize(n);
// 2^s = q * n + r; 2^s = (2^64 + q0) * n + r; 2^s - n * 2^64 = q0 * n + r
u64 const dnd_hi = (nbits >= 64 ? 0ULL : (u64(1) << nbits)) - n;
auto const q0 = UDiv128(dnd_hi, 0, n).first;
return std::make_pair(q0, nbits);
}
template <bool Use128 = true, bool Adjust = true>
std::pair<u64, u64> BarrettDivMod64(u64 x_hi, u64 x_lo, u64 n, u64 r, size_t s) {
// ((x_hi * 2^64 + x_lo) * (2^64 + r)) >> (64 + s)
DASSERT(x_hi < n);
#if !IS_MSVC
if constexpr(Use128) {
using u128 = unsigned __int128;
u128 const xf = (u128(x_hi) << 64) | x_lo;
u64 q = u64((u128(x_hi) * r + xf + u64((u128(x_lo) * r) >> 64)) >> s);
if (s < 64) {
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u128 t = xf - u128(q) * n;
return t < n ? std::make_pair(q, u64(t)) : std::make_pair(q + 1, u64(t) - n);
}
} else
#endif
{
auto const w1a = UMul128(x_lo, r).first;
auto const [w2b, w1b] = UMul128(x_hi, r);
auto const w2c = x_hi, w1c = x_lo;
u64 w1, w2 = _addcarry_u64(0, w1a, w1b, (UllPtr)&w1);
w2 += _addcarry_u64(0, w1, w1c, (UllPtr)&w1);
w2 += w2b + w2c;
if (s < 64) {
u64 q = (w2 << (64 - s)) | (w1 >> s);
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u64 const q = w2;
auto const [b_hi, b_lo] = UMul128(q, n);
auto const [t_hi, t_lo] = USub128(x_hi, x_lo, b_hi, b_lo);
return t_hi != 0 || t_lo >= n ? std::make_pair(q + 1, t_lo - n) : std::make_pair(q, t_lo);
}
}
}
#include <random>
#include <iomanip>
#include <iostream>
#include <chrono>
void TestBarrett() {
std::mt19937_64 rng{123}; //{std::random_device{}()};
for (size_t i = 0; i < (1 << 11); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
auto const [br, bs] = BarrettRS64(n);
for (size_t j = 0; j < (1 << 6); ++j) {
u64 const hi = rng() % n, lo = rng();
auto const [ref_q, ref_r] = UDiv128(hi, lo, n);
u64 bar_q = 0, bar_r = 0;
for (size_t k = 0; k < 2; ++k) {
bar_q = 0; bar_r = 0;
if (k == 0)
std::tie(bar_q, bar_r) = BarrettDivMod64<true>(hi, lo, n, br, bs);
else
std::tie(bar_q, bar_r) = BarrettDivMod64<false>(hi, lo, n, br, bs);
ASSERT_MSG(bar_q == ref_q && bar_r == ref_r, "i " + std::to_string(i) + ", j " + std::to_string(j) + ", k " + std::to_string(k) +
", nbits " + std::to_string(nbits) + ", n " + std::to_string(n) + ", bar_q " + std::to_string(bar_q) +
", ref_q " + std::to_string(ref_q) + ", bar_r " + std::to_string(bar_r) + ", ref_r " + std::to_string(ref_r));
}
}
}
}
class bi_uint
{
public:
using u64_t = std::uint64_t;
constexpr static std::size_t u64_bits = 8 * sizeof(u64_t);
//little-endian
std::vector<u64_t> data;
static auto constexpr DefPrep = [](auto n){
return std::make_pair(false, false);
};
static auto constexpr DefDivMod = [](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return UDiv128(dnd_hi, dnd_lo, dsr);
};
//User should guarantee data.size()>0 and val>0
template <typename PrepT = decltype(DefPrep), typename DivModT = decltype(DefDivMod)>
void self_div(const u64_t val, PrepT const & Prep = DefPrep, DivModT const & DivMod = DefDivMod)
{
auto it = data.rbegin();
if(data.size() == 1) {
*it /= val;
return;
}
u64_t rem_hi = 0, rem_lo = 0;
if(*it < val) {
rem_lo = *it++;
//data.pop_back();
}
auto const prep = Prep(val);
u64_t r = rem_lo % val;
u64_t q = 0;
while(it != data.rend()) {
rem_hi = r;
rem_lo = *it;
std::tie(q, r) = DivMod(rem_hi, rem_lo, val, prep);
//*it++ = static_cast<u64_t>(q);
it++;
auto volatile out = static_cast<u64_t>(q);
}
}
};
void TestSpeed() {
auto Time = []{
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
};
std::mt19937_64 rng{123};
std::vector<u64> limbs, divisors;
for (size_t i = 0; i < (1 << 17); ++i)
limbs.push_back(rng());
for (size_t i = 0; i < (1 << 8); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
divisors.push_back(n);
}
std::cout << std::fixed << std::setprecision(3);
double div_time = 0;
{
bi_uint x;
x.data = limbs;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr);
div_time = Time() - tim;
std::cout << "Divide time " << div_time << " sec" << std::endl;
}
{
bi_uint x;
x.data = limbs;
for (size_t i = 0; i < 2; ++i) {
if (IS_MSVC && i == 0)
continue;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr, [](auto n){ return BarrettRS64(n); },
[i](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return i == 0 ? BarrettDivMod64<true>(dnd_hi, dnd_lo, dsr, prep.first, prep.second) :
BarrettDivMod64<false>(dnd_hi, dnd_lo, dsr, prep.first, prep.second);
});
double const bar_time = Time() - tim;
std::cout << "Barrett" << (i == 0 ? "128" : "64 ") << " time " << bar_time << " sec, boost " << div_time / bar_time << std::endl;
}
}
}
int main() {
try {
TestBarrett();
TestSpeed();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
Divide time 3.171 sec
Barrett128 time 0.675 sec, boost 4.695
Barrett64 time 0.642 sec, boost 4.937
Part 2
As you have a very interesting question, a few days after I first published this post I decided to implement all the big-integer math from scratch.
The code below implements the math operations +, -, *, /, <<, >> for natural big numbers (positive integers), and +, -, *, / for floating-point big numbers. Both types of numbers are of arbitrary size (even millions of bits). Besides those, as you requested, I fully implemented Newton-Raphson (both square and cubic variants) and Goldschmidt fast division algorithms.
Here is the code snippet only for the Newton-Raphson/Goldschmidt functions; the remaining code, which is very large, is linked below on an external server:
BigFloat & DivNewtonRaphsonSquare(BigFloat b) {
    // https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
    auto a = *this;
    a.exp_ += b.SetScale(0);
    if (b.sign_) {
        a.sign_ = !a.sign_;
        b.sign_ = false;
    }
    thread_local BigFloat two, c_48_17, c_32_17;
    thread_local size_t static_prec = 0;
    if (static_prec != BigFloat::prec_) {
        two = 2;
        c_48_17 = BigFloat(48) / BigFloat(17);
        c_32_17 = BigFloat(32) / BigFloat(17);
        static_prec = BigFloat::prec_;
    }
    BigFloat x = c_48_17 - c_32_17 * b;
    for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
            / std::log2(17.0))) + 0.1; i < num_iters; ++i)
        x = x * (two - b * x);
    *this = a * x;
    return BitNorm();
}

BigFloat & DivNewtonRaphsonCubic(BigFloat b) {
    // https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
    auto a = *this;
    a.exp_ += b.SetScale(0);
    if (b.sign_) {
        a.sign_ = !a.sign_;
        b.sign_ = false;
    }
    thread_local BigFloat one, c_140_33, c_m64_11, c_256_99;
    thread_local size_t static_prec = 0;
    if (static_prec != BigFloat::prec_) {
        one = 1;
        c_140_33 = BigFloat(140) / BigFloat(33);
        c_m64_11 = BigFloat(-64) / BigFloat(11);
        c_256_99 = BigFloat(256) / BigFloat(99);
        static_prec = BigFloat::prec_;
    }
    BigFloat e, y, x = c_140_33 + b * (c_m64_11 + b * c_256_99);
    for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
            / std::log2(99.0)) / std::log2(3.0)) + 0.1; i < num_iters; ++i) {
        e = one - b * x;
        y = x * e;
        x = x + y + y * e;
    }
    *this = a * x;
    return BitNorm();
}

BigFloat & DivGoldschmidt(BigFloat b) {
    // https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division
    auto a = *this;
    a.exp_ += b.SetScale(0);
    if (b.sign_) {
        a.sign_ = !a.sign_;
        b.sign_ = false;
    }
    BigFloat one = 1, two = 2, f;
    for (size_t i = 0;; ++i) {
        f = two - b;
        a *= f;
        b *= f;
        if (i % 3 == 0 && (one - b).GetScale() < -i64(prec_) + i64(bit_sizeof(Word)))
            break;
    }
    *this = a;
    return BitNorm();
}
See the Output: below; it shows that the Newton-Raphson and Goldschmidt methods are actually 10x slower than the regular school-grade algorithm (called Reference in the output). Relative to each other these 3 advanced algorithms are about the same speed. Raphson/Goldschmidt would probably be faster with Fast Fourier Transform multiplication, because multiplying two large numbers takes 95% of the time of these algorithms. In the code below, all results of the Raphson/Goldschmidt algorithms are not only time-measured but also checked for correctness against the school-grade (Reference) algorithm (see diff 2^... in the console output, which shows how large the difference of the result is compared to school-grade).
FULL SOURCE CODE HERE. The full code is so huge that it didn't fit into this StackOverflow post due to the SO limit of 30,000 characters per post, although I wrote this code from scratch specifically for this post. That's why I'm providing an external download link (PasteBin server); also click the Try it online! link below, which is the same copy of the code run live on GodBolt's Linux servers:
Try it online!
Output:
========== 1 K bits ==========
Reference 0.000029 sec
Raphson2 0.000066 sec, boost 0.440x, diff 2^-8192
Raphson3 0.000092 sec, boost 0.317x, diff 2^-8192
Goldschmidt 0.000080 sec, boost 0.365x, diff 2^-1022
========== 2 K bits ==========
Reference 0.000071 sec
Raphson2 0.000177 sec, boost 0.400x, diff 2^-16384
Raphson3 0.000283 sec, boost 0.250x, diff 2^-16384
Goldschmidt 0.000388 sec, boost 0.182x, diff 2^-2046
========== 4 K bits ==========
Reference 0.000319 sec
Raphson2 0.000875 sec, boost 0.365x, diff 2^-4094
Raphson3 0.001122 sec, boost 0.285x, diff 2^-32768
Goldschmidt 0.000881 sec, boost 0.362x, diff 2^-32768
========== 8 K bits ==========
Reference 0.000484 sec
Raphson2 0.002281 sec, boost 0.212x, diff 2^-65536
Raphson3 0.002341 sec, boost 0.207x, diff 2^-65536
Goldschmidt 0.002432 sec, boost 0.199x, diff 2^-8189
========== 16 K bits ==========
Reference 0.001199 sec
Raphson2 0.009042 sec, boost 0.133x, diff 2^-16382
Raphson3 0.009519 sec, boost 0.126x, diff 2^-131072
Goldschmidt 0.009047 sec, boost 0.133x, diff 2^-16380
========== 32 K bits ==========
Reference 0.004311 sec
Raphson2 0.039151 sec, boost 0.110x, diff 2^-32766
Raphson3 0.041058 sec, boost 0.105x, diff 2^-262144
Goldschmidt 0.045517 sec, boost 0.095x, diff 2^-32764
========== 64 K bits ==========
Reference 0.016273 sec
Raphson2 0.165656 sec, boost 0.098x, diff 2^-524288
Raphson3 0.210301 sec, boost 0.077x, diff 2^-65535
Goldschmidt 0.208081 sec, boost 0.078x, diff 2^-65534
========== 128 K bits ==========
Reference 0.059469 sec
Raphson2 0.725865 sec, boost 0.082x, diff 2^-1048576
Raphson3 0.735530 sec, boost 0.081x, diff 2^-1048576
Goldschmidt 0.703991 sec, boost 0.084x, diff 2^-131069
========== 256 K bits ==========
Reference 0.326368 sec
Raphson2 3.007454 sec, boost 0.109x, diff 2^-2097152
Raphson3 2.977631 sec, boost 0.110x, diff 2^-2097152
Goldschmidt 3.363632 sec, boost 0.097x, diff 2^-262141
========== 512 K bits ==========
Reference 1.138663 sec
Raphson2 12.827783 sec, boost 0.089x, diff 2^-524287
Raphson3 13.799401 sec, boost 0.083x, diff 2^-524287
Goldschmidt 15.836072 sec, boost 0.072x, diff 2^-524286

On most modern CPUs, division is indeed much slower than multiplication.
Referring to
https://agner.org/optimize/instruction_tables.pdf
we see that on Intel Skylake, MUL/IMUL has a latency of 3-4 cycles, while DIV/IDIV can take 26-90 cycles, which is 7-23 times slower than MUL; so your initial benchmark result isn't really a surprise.
If you happen to be on an x86 CPU and this is indeed the bottleneck, you could try to utilize AVX/SSE instructions, as shown in the answer below. Basically you'd need to rely on special instructions rather than a general one like DIV/IDIV:
How to divide a __m256i vector by an integer variable?
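If you want to verify the MUL/DIV latency gap from the tables above on your own machine, here is a rough micro-benchmark sketch (my own illustration, not from the cited tables; the chained operations expose latency rather than throughput):

#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    constexpr int N = 100000000;
    volatile uint64_t seed = 12345;               // volatile so the compiler can't constant-fold
    uint64_t x = seed, d = seed | 1;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        x = x * 6364136223846793005ULL + 1;       // chain of dependent MULs
    auto t1 = std::chrono::steady_clock::now();
    uint64_t y = ~seed;
    for (int i = 0; i < N; ++i)
        y = y / d + 0x9E3779B97F4A7C15ULL;        // chain of dependent DIVs
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "mul: " << std::chrono::duration<double>(t1 - t0).count() << " s, "
              << "div: " << std::chrono::duration<double>(t2 - t1).count() << " s "
              << "(ignore: " << (x ^ y) << ")\n"; // print results to keep the loops alive
}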

Related

If n is an integer how to find k such that |n-2^k| is the smallest possible [duplicate]

This question already has answers here:
Rounding up to next power of 2
(31 answers)
What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?
(31 answers)
Closed 5 years ago.
I came up with three solutions so far:
The extremely inefficient standard library pow and log2 functions:

int_fast16_t powlog(uint_fast16_t n)
{
    return static_cast<uint_fast16_t>(pow(2, floor(log2(n))));
}

Far more efficient: counting successive powers of 2 until I exceed the given number:

uint_fast16_t multiply(uint_fast16_t n)
{
    uint_fast16_t maxpow = 1;
    while(2*maxpow <= n)
        maxpow *= 2;
    return maxpow;
}

The most efficient so far: binary-searching a precomputed table of powers of 2:

uint_fast16_t binsearch(uint_fast16_t n)
{
    static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
        1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};
    return *(upper_bound(pows.begin(), pows.end(), n)-1);
}

Can this be optimized even more? Any tricks that could be used here?
Full benchmark I used:
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <array>
#include <algorithm>
using namespace std;
using namespace chrono;

uint_fast16_t powlog(uint_fast16_t n)
{
    return static_cast<uint_fast16_t>(pow(2, floor(log2(n))));
}

uint_fast16_t multiply(uint_fast16_t n)
{
    uint_fast16_t maxpow = 1;
    while(2*maxpow <= n)
        maxpow *= 2;
    return maxpow;
}

uint_fast16_t binsearch(uint_fast16_t n)
{
    static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
        1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};
    return *(upper_bound(pows.begin(), pows.end(), n)-1);
}

high_resolution_clock::duration test(uint_fast16_t(powfunct)(uint_fast16_t))
{
    auto tbegin = high_resolution_clock::now();
    volatile uint_fast16_t sink;
    for(uint_fast8_t i = 0; i < UINT8_MAX; ++i)
        for(uint_fast16_t n = 1; n <= 999999; ++n)
            sink = powfunct(n);
    auto tend = high_resolution_clock::now();
    return tend - tbegin;
}

int main()
{
    cout << "Pow and log took " << duration_cast<milliseconds>(test(powlog)).count() << " milliseconds." << endl;
    cout << "Multiplying by 2 took " << duration_cast<milliseconds>(test(multiply)).count() << " milliseconds." << endl;
    cout << "Binsearching precomputed table of powers took " << duration_cast<milliseconds>(test(binsearch)).count() << " milliseconds." << endl;
}
Compiled with -O2 this gave the following results on my laptop:
Pow and log took 19294 milliseconds.
Multiplying by 2 took 2756 milliseconds.
Binsearching precomputed table of powers took 2278 milliseconds.
Versions with intrinsics have already been suggested in the comments, so here's a version that does not rely on them:
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x ^ (x >> 1);
}
This works by first "smearing" the highest set bit to the right; then x ^ (x >> 1) keeps only the bits that differ from the bit directly to their left (the msb is considered to have a 0 to its left), which is only the highest set bit, because thanks to the smearing the number has the form 0...01...1. For example, x = 0b00101100 smears to 0b00111111, and the xor then leaves 0b00100000.
Since no one is actually posting it, with intrinsics you could write (GCC, Clang)
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    return 0x80000000 >> __builtin_clz(x);
}
Or (MSVC, probably, not tested)
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    unsigned long index;
    // ignoring return value, assume x != 0
    _BitScanReverse(&index, x);
    return 1u << index;
}
Which, when directly supported by the target hardware, should be better.
Results on coliru, and latency results on coliru (compare with the baseline too, which should be roughly indicative of the overhead). In the latency result, the first version of highestPowerOfTwoIn doesn't look so good anymore (still OK, but it is a long chain of dependent instructions so it's not a big surprise that it widens the gap with the intrinsics version). Which one of these is the most relevant comparison depends on your actual usage.
If you have some odd hardware with a fast bit-reversal operation (but maybe slow shifts or slow clz), let's call it _rbit, then you can do
uint32_t highestPowerOfTwoIn(uint32_t x)
{
    x = _rbit(x);
    return _rbit(x & -x);
}
This is of course based on the old x & -x trick, which isolates the lowest set bit; surrounded by bit reversals, it isolates the highest set bit instead.
The lookup table looks like the best option here. Hence, to answer
Can this be optimized even more? Any tricks that could be used here?
Yes we can! Let us beat the standard library binary search!
template <class T>
inline size_t
choose(T const& a, T const& b, size_t const& src1, size_t const& src2)
{
    return b >= a ? src2 : src1;
}

template <class Container>
inline typename Container::const_iterator
fast_upper_bound(Container const& cont, typename Container::value_type const& value)
{
    auto size = cont.size();
    size_t low = 0;
    while (size > 0) {
        size_t half = size / 2;
        size_t other_half = size - half;
        size_t probe = low + half;
        size_t other_low = low + other_half;
        auto v = cont[probe];
        size = half;
        low = choose(v, value, low, other_low);
    }
    return begin(cont) + low;
}
Using this implementation of upper_bound gives me a substantial improvement:
g++ -std=c++14 -O2 -Wall -Wno-unused-but-set-variable -Werror main.cpp && ./a.out
Pow and log took 2536 milliseconds.
Multiplying by 2 took 320 milliseconds.
Binsearching precomputed table of powers took 349 milliseconds.
Binsearching (opti) precomputed table of powers took 167 milliseconds.
(live on coliru)
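For reference, the drop-in use inside the benchmark's binsearch would look like this (a sketch; binsearch_opt is my name for it):

uint_fast16_t binsearch_opt(uint_fast16_t n)
{
    static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
        1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};
    return *(fast_upper_bound(pows, n) - 1);
}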
Note that I've improved your benchmark to use random values; by doing so I removed the branch prediction bias.
Now, if you really need to push harder, you can optimize the choose function with x86_64 asm for clang:
template <class T> inline size_t choose(T const& a, T const& b, size_t const& src1, size_t const& src2)
{
#if defined(__clang__) && defined(__x86_64)
    size_t res = src1;
    asm("cmpq %1, %2; cmovaeq %4, %0"
        :
        "=q" (res)
        :
        "q" (a),
        "q" (b),
        "q" (src1),
        "q" (src2),
        "0" (res)
        :
        "cc");
    return res;
#else
    return b >= a ? src2 : src1;
#endif
}
With output:
clang++ -std=c++14 -O2 -Wall -Wno-unused-variable -Wno-missing-braces -Werror main.cpp && ./a.out
Pow and log took 1408 milliseconds.
Multiplying by 2 took 351 milliseconds.
Binsearching precomputed table of powers took 359 milliseconds.
Binsearching (opti) precomputed table of powers took 153 milliseconds.
(Live on coliru)
This version climbs faster (by repeated squaring) but falls back at the same speed:

uint multiply_quick(uint n)
{
    if (n < 2u) return 1u;
    uint maxpow = 1u;
    if (n > 256u)
    {
        maxpow = 256u * 128u;
        // fast fixing the overshoot
        while (maxpow > n)
            maxpow = maxpow >> 2;
        // fixing the undershoot
        while (2u * maxpow <= n)
            maxpow *= 2u;
    }
    else
    {
        // quicker scan by repeated squaring; start at 2, since 1*1 would never grow
        maxpow = 2u;
        while (maxpow < n && maxpow != 256u)
            maxpow *= maxpow;
        // fast fixing the overshoot
        while (maxpow > n)
            maxpow = maxpow >> 2;
        // fixing the undershoot
        while (2u * maxpow <= n)
            maxpow *= 2u;
    }
    return maxpow;
}
Maybe this is better suited for 32-bit variables, using a 65536 constant literal instead of 256.
Just set to 0 all bits but the first one. This should be very fast and efficient
As @Jack already mentioned, you can simply set to 0 all bits except the first one.
And here is a solution:
#include <iostream>

uint16_t bit_solution(uint16_t num)
{
    if ( num == 0 )
        return 0;
    uint16_t ret = 1;
    while (num >>= 1)
        ret <<= 1;
    return ret;
}

int main()
{
    std::cout << bit_solution(1024) << std::endl; //1024
    std::cout << bit_solution(1025) << std::endl; //1024
    std::cout << bit_solution(1023) << std::endl; //512
    std::cout << bit_solution(1) << std::endl;    //1
    std::cout << bit_solution(0) << std::endl;    //0
}
Well, it's still a loop (and its loop count depends on the number of set bits since they are reset one by one), so its worst case is likely to be worse than the approaches using block bit manipulations.
But it's cute.
uint_fast16_t bitunsetter(uint_fast16_t n)
{
    while (uint_fast16_t k = n & (n-1))
        n = k;
    return n;
}

Fast SSE low precision exponential using double precision operations

I am looking for a fast, SSE, low-precision (~1e-3) exponential function.
I came across this great answer:
/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
__m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
__m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
__m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
return _mm_castsi128_ps (t);
}
Based on the work of Nicol N. Schraudolph: N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862.
Now I would need a "double precision" version: __m128d FastExpSSE (__m128d x).
This is because I don't control the input and output precision, which happen to be double precision, and the two conversions, double -> float and then float -> double, are eating 50% of the CPU resources.
What changes would be needed?
I naively tried this:
__m128i double_to_uint64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

__m128d FastExpSseDouble(__m128d x) {
#define S 52
#define C (1llu << S) / log(2)
    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);
    auto y = double_to_uint64(_mm_mul_pd(a, x));
    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}
Of course this returns garbage as I don't know what I'm doing...
edit:
About the 50% factor, it is a very rough estimation, comparing the speedup (with respect to std::exp) converting a vector of single precision numbers (great) to the speedup with a list of double precision numbers (not so great).
Here is the code I used:
// gives the result in place
void FastExpSseVector(std::vector<double> & v) { //vector with several millions elements
const auto I = v.size();
const auto N = (I / 4) * 4;
for (int n = 0; n < N; n += 4) {
float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) };
__m128 x;
x = _mm_load_ps(a);
auto r = FastExpSse(x);
_mm_store_ps(a, r);
v[n] = a[0];
v[n + 1] = a[1];
v[n + 2] = a[2];
v[n + 3] = a[3];
}
for (int n = N; n < I; ++n) {
v[n] = FastExp(v[n]);
}
}
And here is what I would do if I had this "double precision" version:
void FastExpSseVectorDouble(std::vector<double> & v) {
    const auto I = v.size();
    const auto N = (I / 2) * 2;
    for (int n = 0; n < N; n += 2) {
        __m128d x;
        x = _mm_load_pd(&v[n]);
        auto r = FastExpSseDouble(x);
        _mm_store_pd(&v[n], r);
    }
    for (int n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}
Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error; I'm too lazy to do that:
#include <emmintrin.h>
#include <cmath>

__m128d fastexp(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0))), _mm_set1_pd(3*1024.0-1.05));
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
This only gets about 2.5% relative precision; for better precision you may need to add a second term.
Also, for values which overflow or underflow this will produce unspecified values; you can avoid this by clamping the scaled value to some bounds, as in the sketch below.
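For example, a clamped variant might look like this (a sketch, untested; the bounds are my own assumption: the scaled value must stay in [2048, 4096) for the shifted bit pattern to remain a finite non-negative double, so underflow returns +0.0 and overflow returns roughly DBL_MAX):

#include <emmintrin.h>
#include <cmath>

__m128d fastexp_clamped(const __m128d &x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0))),
                                _mm_set1_pd(3*1024.0-1.05));
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));                      // underflow -> +0.0
    scaled = _mm_min_pd(scaled, _mm_set1_pd(std::nextafter(4096.0, 0.0))); // overflow -> ~DBL_MAX
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}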

To find combination value of large numbers

I want to find (n choose r) for large integers, and I also have to output that number modulo 10000007.
long long int choose(int a, int b)
{
    if (b > a)
        return (-1);
    if (b == 0 || a == 1 || b == a)
        return (1);
    else
    {
        long long int r = ((choose(a-1, b)) % 10000007 + (choose(a-1, b-1)) % 10000007) % 10000007;
        return r;
    }
}
I am using this piece of code, but I keep getting TLE (time limit exceeded). If there is some other method to do this, please tell me.
I don't have the reputation to comment yet, but I wanted to point out that the answer by rock321987 works pretty well: it is fast and correct up to and including C(62, 31), but it cannot handle all inputs whose output fits in a uint64_t. As proof, try:
C(67, 33) = 14,226,520,737,620,288,370 (verify correctness and size)
Unfortunately, the other implementation spits out 8,829,174,638,479,413, which is incorrect. There are other ways to calculate nCr which won't break like this; however, the real problem here is that there is no attempt to take advantage of the modulus.
The technique below leverages the fact that when the modulus p is prime, every nonzero integer has an inverse mod p, that inverse is unique, and we can find it quite quickly. (Note, though, that 10000007 itself is not actually prime: 10000007 = 941 * 10627. The common contest modulus is the prime 1000000007, so substitute whatever prime your problem really specifies.) Another question has an answer on how to find the inverse here, which I've replicated below.
This is handy since:
x/y mod p == x*(y inverse) mod p; and
xy mod p == (x mod p)(y mod p)
For example, with p = 7: 3/5 ≡ 3 * inv(5) ≡ 3 * 3 ≡ 2 (mod 7), and indeed 5 * 2 ≡ 3 (mod 7).
Modifying the other code a bit, and generalizing the problem we have the following:
#include <iostream>
#include <cstdint>
#include <assert.h>

// p MUST be prime and less than 2^63
uint64_t inverseModp(uint64_t a, uint64_t p) {
    assert(p < (1ull << 63));
    assert(a < p);
    assert(a != 0);
    uint64_t ex = p-2, result = 1;
    while (ex > 0) {
        if (ex % 2 == 1) {
            result = (result*a) % p;
        }
        a = (a*a) % p;
        ex /= 2;
    }
    return result;
}

// p MUST be prime
uint32_t nCrModp(uint32_t n, uint32_t r, uint32_t p)
{
    assert(r <= n);
    if (r > n-r) r = n-r;
    if (r == 0) return 1;
    if (n/p - (n-r)/p > r/p) return 0; // fast path: a base-p carry means p divides nCr
    uint64_t result = 1; // intermediary results may overflow 32 bits
    int64_t pPow = 0;    // net exponent of p left in the fraction
    for (uint32_t i = n, x = 1; i > r; --i, ++x) {
        uint32_t num = i, den = x;
        while (num % p == 0) { num /= p; ++pPow; } // strip factors of p from the numerator
        while (den % p == 0) { den /= p; --pPow; } // ...and from the denominator
        result = result * (num % p) % p;
        result = result * inverseModp(den % p, p) % p;
    }
    if (pPow > 0) return 0; // p still divides nCr (e.g. nCrModp(9, 3, 3))
    return result;
}

int main() {
    uint32_t smallPrime = 17;
    uint32_t medNum = 3001;
    uint32_t halfMedNum = medNum >> 1;
    std::cout << nCrModp(medNum, halfMedNum, smallPrime) << std::endl;
    uint32_t bigPrime = 4294967291ul; // 2^32-5 is the largest prime < 2^32
    uint32_t bigNum = 1ul << 24;
    uint32_t halfBigNum = bigNum >> 1;
    std::cout << nCrModp(bigNum, halfBigNum, bigPrime) << std::endl;
}
This should produce results for any set of 32-bit inputs if you are willing to wait. To prove the point, I've included the calculation for a 24-bit n and the maximum 32-bit prime. My modest PC took ~13 seconds to calculate this. Check the answer against Wolfram Alpha, but beware that it may exceed the 'standard computation time' there.
There is still room for improvement if p is much smaller than n-r (where r <= n-r). For example, we could precalculate all the inverses mod p instead of computing them on demand several times over.
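A sketch of that precalculation (inverseTable is my own name; it assumes 1 <= k < p and p < 2^32 so the products fit into 64 bits), building all inverses of 1..k in O(k) total via the identity inv[i] = -(p/i) * inv[p mod i] mod p:

#include <cstdint>
#include <vector>

// Inverses of 1..k modulo a prime p, computed in O(k) total.
std::vector<uint64_t> inverseTable(uint32_t k, uint64_t p) {
    std::vector<uint64_t> inv(k + 1);
    inv[1] = 1;
    for (uint64_t i = 2; i <= k; ++i)
        inv[i] = (p - p / i) * inv[p % i] % p; // from p = (p/i)*i + p%i, solved for i^-1
    return inv;
}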
nCr = n! / (r! * (n-r)!) {! = factorial}
now choose whichever of r and n-r is smaller:
#include <cstdio>
#include <cmath>

#define MOD 10000007

int main()
{
    int n, r, i, x = 1;
    long long int res = 1;
    scanf("%d%d", &n, &r);
    int mini = fmin(r, (n - r)); // minimum of r, n-r
    for (i = n; i > mini; i--) {
        res = (res * i) / x;
        x++;
    }
    printf("%lld\n", res % MOD);
    return 0;
}
It will work for most cases as required by programming competitions, provided the values of n and r are not too high.
Time complexity: O(min(r, n - r))
Limitation: for languages like C/C++ there will be overflow if
n > 60 (approximately)
as no built-in datatype can store the final value.
The expansion of nCr can always be reduced to a product of integers. This is done by canceling out terms in the denominator. This approach is applied in the function given below.
This function has a time complexity of O(n^2 * log(n)). It will calculate nCr % m for n <= 10000 in under 1 sec.
#include <numeric>
#include <algorithm>
#include <vector>

const int M = 1e7 + 7;

int ncr(int n, int r)
{
    r = std::min(r, n - r);
    std::vector<int> A(r), B(r);
    std::iota(A.begin(), A.end(), n - r + 1); // A holds the numerator terms n-r+1 ... n
    std::iota(B.begin(), B.end(), 1);         // B holds the denominator terms 1 ... r
    // Cancel common factors until every denominator term collapses to 1
    for (int i = 0; i < r; i++)
        for (int j = 0; j < r; j++)
        {
            if (B[i] == 1)
                break;
            int g = std::gcd(B[i], A[j]);
            A[j] /= g;
            B[i] /= g;
        }
    long long ans = 1;
    for (int i = 0; i < r; i++)
        ans = (ans * A[i]) % M;
    return ans;
}

How can I improve this Pollard's rho algorithm to handle products of semi-large primes?

Below is my implementation of Pollard's rho algorithm for prime factorization:
#include <vector>
#include <queue>
#include <gmpxx.h>

// Interface to the GMP random number functions.
gmp_randclass rng(gmp_randinit_default);

// Returns a divisor of N using Pollard's rho method.
mpz_class getDivisor(const mpz_class &N)
{
    mpz_class c = rng.get_z_range(N);
    mpz_class x = rng.get_z_range(N);
    mpz_class y = x;
    mpz_class d = 1;
    mpz_class z;
    while (d == 1) {
        x = (x*x + c) % N;
        y = (y*y + c) % N;
        y = (y*y + c) % N;
        z = x - y;
        mpz_gcd(d.get_mpz_t(), z.get_mpz_t(), N.get_mpz_t());
    }
    return d;
}

// Adds the prime factors of N to the given vector.
void factor(const mpz_class &N, std::vector<mpz_class> &factors)
{
    std::queue<mpz_class> to_factor;
    to_factor.push(N);
    while (!to_factor.empty()) {
        mpz_class n = to_factor.front();
        to_factor.pop();
        if (n == 1) {
            continue; // Trivial factor.
        } else if (mpz_probab_prime_p(n.get_mpz_t(), 5)) {
            // n is a prime.
            factors.push_back(n);
        } else {
            // n is a composite, so push its factors on the queue.
            mpz_class d = getDivisor(n);
            to_factor.push(d);
            to_factor.push(n/d);
        }
    }
}
It's essentially a straight translation of the pseudocode on Wikipedia, and relies on GMP for big numbers and for primality testing. The implementation works well and can factor numbers such as
1000036317378699858851366323 = 1000014599 * 1000003357 * 1000018361
but will choke on e.g.
1000000000002322140000000048599822299 = 1000000000002301019 * 1000000000000021121
My question is: Is there anything I can do to improve on this, short of switching to a more complex factorization algorithm such as Quadratic sieve?
I know one improvement could be to first do some trial divisions by pre-computed primes, but that would not help for products of a few large primes such as the above.
I'm interested in any tips on improvements to the basic Pollard's rho method to get it to handle larger composites of only a few prime factors. Of course if you find any stupidities in the code above, I'm interested in those as well.
For full disclosure: This is a homework assignment, so general tips and pointers are better than fully coded solutions. With this very simple approach I already get a passing grade on the assignment, but would of course like to improve.
Thanks in advance.
You are using the original version of the rho algorithm due to Pollard. Brent's variant makes two improvements: Floyd's tortoise-and-hare cycle-finding algorithm is replaced by a cycle-finding algorithm developed by Brent, and the gcd calculation is delayed so it is performed only once every hundred or so times through the loop instead of every time. But those changes only get a small improvement, maybe 25% or so, and won't allow you to factor the large numbers you are talking about. Thus, you will need a better algorithm: SQUFOF might work for semiprimes of the size that you mention, or you could implement the quadratic sieve or the elliptic curve method. I have discussion and implementation of all those algorithms at my blog.
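To make the two improvements concrete, here is a sketch of how the questioner's getDivisor could adopt them (my own untested adaptation of Brent's published algorithm, reusing the rng above; it additionally needs <algorithm> for std::min):

// Sketch: Pollard's rho with Brent's cycle finding and a batched gcd.
// Up to m differences are multiplied together before each gcd call.
mpz_class getDivisorBrent(const mpz_class &N)
{
    mpz_class c = rng.get_z_range(N);
    mpz_class y = rng.get_z_range(N);
    mpz_class x, ys, q = 1, d = 1;
    const unsigned long m = 128;
    for (unsigned long r = 1; d == 1; r *= 2) {
        x = y;
        for (unsigned long i = 0; i < r; ++i)
            y = (y*y + c) % N;
        for (unsigned long k = 0; k < r && d == 1; k += m) {
            ys = y; // remember the block start in case we must backtrack
            for (unsigned long i = 0; i < std::min(m, r - k); ++i) {
                y = (y*y + c) % N;
                q = q * abs(x - y) % N; // accumulate differences instead of gcd-ing each one
            }
            mpz_gcd(d.get_mpz_t(), q.get_mpz_t(), N.get_mpz_t());
        }
    }
    if (d == N) // the batch overshot: redo the last block one gcd at a time
        do {
            ys = (ys*ys + c) % N;
            mpz_class z = abs(x - ys);
            mpz_gcd(d.get_mpz_t(), z.get_mpz_t(), N.get_mpz_t());
        } while (d == 1);
    return d; // may still equal N for an unlucky c; the caller can simply retry
}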
Part 1
Very interesting question you have, thanks!
I decided to implement my own very complex C++ solution of your task from scratch. Although you asked not to write code, I still did it fully, only to have a proof of concept and to check real speed and timings.
To tell you in advance, I improved the speed of your program 250-500x times (see Part 2).
Besides the well-known algorithm described on Wikipedia, I did the following extra optimizations:
Made most of the computations at compile time. This is the main feature of my code. The input number N is provided at compile time as a long macro constant. This ensures that the compiler does half of the optimizations at compile time, like inlining constants, optimizing division, and other arithmetic. As a drawback, you have to re-compile the program every time you change the number that you want to factor.
Additionally to 1., I also support a runtime-only value of N. This is needed to do a real comparison of speed in different environments.
One more very important speed improvement is that I used Montgomery reduction to speed up the modulus division. Montgomery speeds up all computations 2-2.5x times. Besides Montgomery you can also use Barrett reduction. Both Montgomery and Barrett replace a single expensive division with several multiplications and additions, which makes division very fast.
Unlike your code, mine computes the GCD (Greatest Common Divisor) very rarely, once in 8,000-16,000 iterations, because GCD is very expensive: it needs around 128 expensive divisions for 128-bit numbers. Instead of computing GCD(x - y, N) every time, you can notice that it is enough to accumulate the product prod = (prod * (x - y)) % N and only after thousands of such iterations compute GCD(prod, N). This is easily derived from the fact that GCD((a * b * c) % N, N) = GCD(GCD(a, N) * GCD(b, N) * GCD(c, N), N).
One very advanced and fast optimization is that I implemented my own uint128 and uint256, with all the necessary sub-optimizations needed for my task. This optimization is only posted in the code in Part 2 of my answer; see that part after the first code block.
As a result of the above steps, the total speed of Pollard's rho is increased 50-100x times, especially due to doing the GCD only once in thousands of steps. This speed is enough even to factor the biggest number provided in your question.
Besides the algorithms described above, I also used the following extra algorithms: the Extended Euclidean Algorithm (for computing the coefficients for the modular inverse), Modular Multiplicative Inverse, the Euclidean Algorithm (for computing GCD), Modular Binary Exponentiation, Trial Division (for checking primality of small numbers), the Fermat Primality Test, the already mentioned Montgomery Reduction, and Pollard's rho itself.
I did following timings and speed measurements:
N 1000036317378699858851366323 (90 bits)
1000003357 (30 bits) * 1000014599 (30 bits) * 1000018361 (30 bits)
PollardRho time 0.1 secs, tries 1 iterations 25599 (2^14.64)
N 780002082420246798979794021150335143 (120 bits)
244300526707007 (48 bits) * 3192797383346267127449 (72 bits)
PollardRho time 32 secs, tries 1 iterations 25853951 (2^24.62)
NO-Montgomery time 70 secs
N 614793320656537415355785711660734447 (120 bits)
44780536225818373 (56 bits) * 13729029897191722339 (64 bits)
PollardRho time 310 secs, tries 1 iterations 230129663 (2^27.78)
N 1000000000002322140000000048599822299 (120 bits)
1000000000000021121 (60 bits) * 1000000000002301019 (60 bits)
PollardRho time 2260 secs, tries 1 iterations 1914068991 (2^30.83)
As you can see above, your smaller number takes just 0.1 seconds to factor, while your bigger number (the one you failed to factor at all) takes quite an affordable time, 2260 seconds (a bit more than half an hour). Also you can see that I created myself a number with a 48-bit smallest factor, and another number with a 56-bit smallest factor.
In general the rule is: if the smallest factor has K bits, then it takes about 2^(K/2) iterations of Pollard's rho to find this factor; for example, the 60-bit factors above needed about 2^30 iterations, matching the measured 2^30.83. Compare this to the trial division algorithm, which needs a quadratically larger 2^K iterations for a K-bit factor.
In my code, see the very start of the file: there is a bunch of lines #define NUM, each defining a compile-time constant containing a number. You can comment out any line, change the value of a number, or add a new line with a new number. Then re-compile the program and run it to see the results.
Before the code below, don't forget to click the Try it online! link to check the code running on the GodBolt server. Also see the example console output after the code.
Try it online!
#include <cstdint>
#include <tuple>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <random>
#include <stdexcept>
#include <string>
#include <mutex>
#include <cmath>
#include <type_traits>
#include <boost/multiprecision/cpp_int.hpp>
//#define NUM "1000036317378699858851366323" // 90 bits, 1000003357 (30 bits) * 1000014599 (30 bits) * 1000018361 (30 bits), PollardRho time 0.1 secs, tries 1 iterations 25599 (2^14.64)
#define NUM "780002082420246798979794021150335143" // 120 bits, 244300526707007 (48 bits) * 3192797383346267127449 (72 bits), PollardRho time 32 secs, tries 1 iterations 25853951 (2^24.62), NO-Montgomery time 70 secs
//#define NUM "614793320656537415355785711660734447" // 120 bits, 44780536225818373 (56 bits) * 13729029897191722339 (64 bits), PollardRho time 310 secs, tries 1 iterations 230129663 (2^27.78)
//#define NUM "1000000000002322140000000048599822299" // 120 bits, 1000000000000021121 (60 bits) * 1000000000002301019 (60 bits), PollardRho time 2260 secs, tries 1 iterations 1914068991 (2^30.83)
#define IS_DEBUG 0
#define IS_COMPILE_TIME 1
bool constexpr use_montg = 1;
size_t constexpr gcd_per_nloops = 1 << 14;
#if defined(_MSC_VER) && !defined(__clang__)
#define HAS_INT128 0
#else
#define HAS_INT128 1
#endif
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#define COUT(code) { std::unique_lock<std::mutex> lock(cout_mux); std::cout code; std::cout << std::flush; }
#if IS_DEBUG
#define LN { COUT(<< "LN " << __LINE__ << std::endl); }
#define DUMP(var) { COUT(<< __LINE__ << " : " << #var << " = (" << (var) << ")" << std::endl); }
#else
#define LN
#define DUMP(var)
#endif
#define bisizeof(x) (sizeof(x) * 8)
using u32 = uint32_t;
using i64 = int64_t;
using u64 = uint64_t;
using u128 = boost::multiprecision::uint128_t;
using i128 = boost::multiprecision::number<boost::multiprecision::cpp_int_backend<128, 128, boost::multiprecision::signed_magnitude, boost::multiprecision::unchecked, void>>;
using u192 = boost::multiprecision::number<boost::multiprecision::cpp_int_backend<192, 192, boost::multiprecision::unsigned_magnitude, boost::multiprecision::unchecked, void>>;
using i192 = boost::multiprecision::number<boost::multiprecision::cpp_int_backend<192, 192, boost::multiprecision::signed_magnitude, boost::multiprecision::unchecked, void>>;
using u256 = boost::multiprecision::uint256_t;
using i256 = boost::multiprecision::number<boost::multiprecision::cpp_int_backend<256, 256, boost::multiprecision::signed_magnitude, boost::multiprecision::unchecked, void>>;
using u384 = boost::multiprecision::number<boost::multiprecision::cpp_int_backend<384, 384, boost::multiprecision::unsigned_magnitude, boost::multiprecision::unchecked, void>>;
using i384 = boost::multiprecision::number<boost::multiprecision::cpp_int_backend<384, 384, boost::multiprecision::signed_magnitude, boost::multiprecision::unchecked, void>>;
#if HAS_INT128
using u128_cl = unsigned __int128;
using i128_cl = signed __int128;
#endif
template <typename T> struct DWordOf;
template <> struct DWordOf<u64> : std::type_identity<u128> {};
template <> struct DWordOf<i64> : std::type_identity<i128> {};
template <> struct DWordOf<u128> : std::type_identity<u256> {};
template <> struct DWordOf<i128> : std::type_identity<i256> {};
#if HAS_INT128
template <> struct DWordOf<u128_cl> : std::type_identity<u256> {};
template <> struct DWordOf<i128_cl> : std::type_identity<i256> {};
#endif
template <typename T>
using DWordOfT = typename DWordOf<T>::type;
template <typename T> struct SignedOf;
template <> struct SignedOf<u64> : std::type_identity<i64> {};
template <> struct SignedOf<i64> : std::type_identity<i64> {};
template <> struct SignedOf<u128> : std::type_identity<i128> {};
template <> struct SignedOf<i128> : std::type_identity<i128> {};
#if HAS_INT128
template <> struct SignedOf<u128_cl> : std::type_identity<i128> {};
template <> struct SignedOf<i128_cl> : std::type_identity<i128> {};
#endif
template <typename T>
using SignedOfT = typename SignedOf<T>::type;
template <typename T> struct BiSizeOf;
template <> struct BiSizeOf<u64> : std::integral_constant<size_t, 64> {};
template <> struct BiSizeOf<u128> : std::integral_constant<size_t, 128> {};
template <> struct BiSizeOf<u192> : std::integral_constant<size_t, 192> {};
template <> struct BiSizeOf<u256> : std::integral_constant<size_t, 256> {};
template <> struct BiSizeOf<u384> : std::integral_constant<size_t, 384> {};
#if HAS_INT128
template <> struct BiSizeOf<u128_cl> : std::integral_constant<size_t, 128> {};
#endif
template <typename T>
size_t constexpr BiSizeOfT = BiSizeOf<T>::value;
static std::mutex cout_mux;
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
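// Multiplies two single-word values into a double-word result, so the product never overflows.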
template <typename T, typename DT = DWordOfT<T>>
constexpr DT MulD(T const & a, T const & b) {
return DT(a) * b;
}
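// Extended Euclidean algorithm: returns (g, s, t) with a * s + b * t == g == gcd(a, b).
// Only the s coefficient is tracked inside the loop; t is recovered afterwards.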
template <typename T>
constexpr auto EGCD(T const & a, T const & b) {
using ST = SignedOfT<T>;
using DST = DWordOfT<ST>;
T ro = 0, r = 0, qu = 0, re = 0;
ST so = 0, s = 0;
std::tie(ro, r, so, s) = std::make_tuple(a, b, 1, 0);
while (r != 0) {
std::tie(qu, re) = std::make_tuple(ro / r, ro % r);
std::tie(ro, r) = std::make_tuple(r, re);
std::tie(so, s) = std::make_tuple(s, ST(so - DST(s) * ST(qu)));
}
ST const to = ST((DST(ro) - DST(a) * so) / ST(b));
return std::make_tuple(ro, so, to);
}
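// Modular inverse via the extended Euclid above: returns s in [0, mod) with (x * s) % mod == 1.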
template <typename T>
constexpr T ModInv(T x, T mod) {
using ST = SignedOfT<T>;
using DT = DWordOfT<T>;
x %= mod;
auto [g, s, t] = EGCD(x, mod);
//ASSERT(g == 1);
if (s < 0) {
//ASSERT(ST(mod) + s >= 0);
s += mod;
} else {
//ASSERT(s < mod);
}
//ASSERT((DT(x) * s) % mod == 1);
return T(s);
}
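// Precomputes the Montgomery constants for modulus n with R = 2^ST_bisize:
// k = -n^-1 mod R, rmod = R mod n, rmod2 = R^2 mod n, rinv = R^-1 mod n.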
template <typename ST>
constexpr std::tuple<ST, ST, ST, ST> MontgKRR(ST n) {
size_t constexpr ST_bisize = BiSizeOfT<ST>;
using DT = DWordOfT<ST>;
DT constexpr r = DT(1) << ST_bisize;
ST const rmod = ST(r % n), rmod2 = ST(MulD<ST>(rmod, rmod) % n), rinv = ModInv<ST>(rmod, n);
DT const k0 = (r * DT(rinv) - 1) / n;
//ASSERT(k0 < (DT(1) << ST_bisize));
ST const k = ST(k0);
return std::make_tuple(k, rmod, rmod2, rinv);
}
template <typename T>
constexpr T GCD(T a, T b) {
while (b != 0)
std::tie(a, b) = std::make_tuple(b, a % b);
return a;
}
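// Square-and-multiply modular exponentiation: computes a^b mod c.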
template <typename T>
T PowMod(T a, T b, T const & c) {
// https://en.wikipedia.org/wiki/Modular_exponentiation
using DT = DWordOfT<T>;
T r = 1;
while (b != 0) {
if (u32(b) & 1)
r = T(MulD<T>(r, a) % c);
a = T(MulD<T>(a, a) % c);
b >>= 1;
}
return r;
}
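// Returns {is_probably_prime, is_certain}: the second flag is false when the
// trial limit was reached before d * d exceeded n.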
template <typename T>
std::pair<bool, bool> IsProbablyPrime_TrialDiv(T const n, u64 limit = u64(-1)) {
// https://en.wikipedia.org/wiki/Trial_division
if (n <= 16)
return {n == 2 || n == 3 || n == 5 || n == 7 || n == 11 || n == 13, true};
if ((n & 1) == 0)
return {false, true};
u64 d = 0;
for (d = 3; d < limit && d * d <= n; d += 2)
if (n % d == 0)
return {false, true};
return {true, d * d > n};
}
template <typename T>
bool IsProbablyPrime_Fermat(T const n, size_t ntrials = 32) {
// https://en.wikipedia.org/wiki/Fermat_primality_test
if (n <= 16)
return n == 2 || n == 3 || n == 5 || n == 7 || n == 11 || n == 13;
thread_local std::mt19937_64 rng{123};
u64 const rand_end = n - 3 <= u64(-5) ? u64(n - 3) : u64(-5);
for (size_t trial = 0; trial < ntrials; ++trial)
if (PowMod<T>(rng() % rand_end + 2, n - 1, n) != 1)
return false;
return true;
}
template <typename T>
bool IsProbablyPrime(T const n) {
if (n < (1 << 12))
return IsProbablyPrime_TrialDiv(n).first;
return IsProbablyPrime_Fermat(n);
}
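// Prints a wide integer in decimal by peeling off 9-digit chunks (base 10^9) from the low end.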
template <typename T>
std::string IntToStr(T n) {
if (n == 0)
return "0";
std::string r;
while (n != 0) {
u32 constexpr mod = 1'000'000'000U;
std::ostringstream ss;
auto const nm = u32(n % mod);
n /= mod;
if (n != 0)
ss << std::setw(9) << std::setfill('0');
ss << nm;
r = ss.str() + r;
}
return r;
}
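// constexpr decimal parser, so that NUM can be baked in at compile time.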
template <typename T>
constexpr T ParseNum(char const * s) {
size_t len = 0;
for (len = 0; s[len]; ++len);
T r = 0;
for (size_t i = 0; i < len; ++i) {
r *= 10;
r += s[i] - '0';
}
return r;
}
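// Pollard's Rho factorization using Brent's cycle detection and Montgomery arithmetic.
// With IS_COMPILE_TIME enabled, NUM and all Montgomery constants are constexpr.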
template <typename T>
std::tuple<T, std::vector<T>, std::vector<T>> Factor_PollardRho(
#if !IS_COMPILE_TIME
T const & n,
#endif
u64 limit = u64(-1), size_t ntrials = 6) {
size_t constexpr T_bisize = BiSizeOfT<T>;
// https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm
using DT = DWordOfT<T>;
#if IS_COMPILE_TIME
static auto constexpr n = ParseNum<T>(NUM);
#endif
if (n <= 1)
return {n, {}, {}};
if (IsProbablyPrime<T>(n))
return {n, {n}, {}};
#if IS_COMPILE_TIME
static auto constexpr montg_krr = MontgKRR(n);
static T constexpr mk = std::get<0>(montg_krr), mrm = std::get<1>(montg_krr), mrm2 = std::get<2>(montg_krr), mri = std::get<3>(montg_krr),
mone = use_montg ? mrm : 1, mone2 = use_montg ? mrm2 : 1;
#else
static auto const montg_krr = MontgKRR(n);
static T const mk = std::get<0>(montg_krr), mrm = std::get<1>(montg_krr), mrm2 = std::get<2>(montg_krr), mri = std::get<3>(montg_krr),
mone = use_montg ? mrm : 1, mone2 = use_montg ? mrm2 : 1;
#endif
auto AdjustL = [&](T x) -> T {
if constexpr(1) {
while (x >= n)
x -= n;
return x;
} else {
using SiT = SignedOfT<T>;
return x - (n & (~T(SiT(x - n) >> (T_bisize - 1))));
}
};
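// Montgomery REDC: returns x * R^-1 mod n, reduced only into [0, 2n);
// AdjustL above performs the final conditional subtraction.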
auto MontgModL = [&](DT const & x) -> T {
if constexpr(!use_montg)
return T(x % n);
else
return T((x + MulD<T>(n, T(x) * mk)) >> T_bisize);
};
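// ToMontgL converts x into Montgomery form x * R mod n (REDC of x * R^2);
// FromMontgL converts back out of Montgomery form and fully reduces.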
auto ToMontgL = [&](T const & x) -> T {
if constexpr(!use_montg)
return x;
else
return MontgModL(MulD<T>(x, mrm2));
};
auto FromMontgL = [&](T const & x) -> T {
if constexpr(!use_montg)
return x;
else
return AdjustL(MontgModL(x));
};
auto DumpMontgX = [&](char const * name, T const & x, bool from = true){
if constexpr(1) {
COUT(<< __LINE__ << " : " << name << " = " << IntToStr(from ? FromMontgL(x) : x) << std::endl);
}
};
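// Rho iteration x -> x^2 + 1 (mod n) in Montgomery form; adding mone2 = R^2 mod n
// before the REDC is what makes the +1 come out as 1 * R afterwards.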
auto f = [&](T x){ return MontgModL(MulD<T>(x, x) + mone2); };
#if IS_DEBUG
#define DUMPM(x) DumpMontgX(#x, x)
#define DUMPI(x) DumpMontgX(#x, x, false)
#else
#define DUMPM(x)
#define DUMPI(x)
#endif
ASSERT(3 <= n);
size_t cnt = 0;
u64 const distr_end = n - 2 <= u64(-5) ? u64(n - 2) : u64(-5);
thread_local std::mt19937_64 rng{123};
for (size_t itry = 0; itry < ntrials; ++itry) {
bool failed = false;
u64 const rnd = rng() % distr_end + 1;
T x = ToMontgL(rnd);
u64 sum_cycles = 0;
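// Brent's variant: the cycle length doubles each round and y is frozen at the
// cycle start. GCDs are batched: the product m of (x_i - y) terms is accumulated
// and a single GCD with n is taken only every gcd_per_nloops iterations; on a
// hit, the batch is replayed one step at a time to locate the exact factor.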
for (u64 cycle = 1;; cycle <<= 1) {
T y = x, m = mone, xstart = x, ny = 0;
while (ny < y)
ny += n;
ny -= y;
auto ILast = [&](auto istart){
size_t ri = istart + gcd_per_nloops - 1;
if (ri < cycle)
return ri;
else
return cycle - 1;
};
for (u64 i = 0, istart = 0, ilast = ILast(istart); i < cycle; ++i) {
x = f(x);
m = MontgModL(MulD<T>(m, ny + x));
if (i < ilast)
continue;
cnt += ilast + 1 - istart;
if (cnt >= limit)
return {n, {}, {n}};
if (GCD<T>(n, FromMontgL(m)) == 1) {
istart = i + 1;
ilast = ILast(istart);
xstart = x;
continue;
}
T x2 = xstart;
for (u64 i2 = istart; i2 <= i; ++i2) {
x2 = f(x2);
auto const g = GCD<T>(n, FromMontgL(ny + x2));
if (g == 1) {
continue;
}
sum_cycles += i + 1;
if (g == n) {
failed = true;
break;
}
#if 0
auto res0 = Factor_PollardRho<T>(g, limit, ntrials);
auto res1 = Factor_PollardRho<T>(n / g, limit, ntrials);
res0.first.insert(res0.first.end(), res1.first.begin(), res1.first.end());
res0.second.insert(res0.second.end(), res1.second.begin(), res1.second.end());
#endif
ASSERT(n % g == 0);
COUT(<< "PollardRho tries " << (itry + 1) << " iterations " << sum_cycles << " (2^" << std::fixed << std::setprecision(2) << std::log2(std::max<size_t>(1, sum_cycles)) << ")" << std::endl);
if (IsProbablyPrime<T>(n / g))
return {n, {g, n / g}, {}};
else
return {n, {g}, {n / g}};
}
if (failed)
break;
ASSERT(false);
}
sum_cycles += cycle;
if (failed)
break;
}
}
return {n, {}, {n}};
}
template <typename T>
void ShowFactors(std::tuple<T, std::vector<T>, std::vector<T>> fs) {
auto [N, a, b] = fs;
std::cout << "Factors of " << IntToStr(N) << " (2^" << std::fixed << std::setprecision(3) << std::log2(double(std::max<T>(1, N))) << "):" << std::endl;
std::sort(a.begin(), a.end());
std::sort(b.begin(), b.end());
for (auto const & x: a)
std::cout << x << " ";
std::cout << std::endl;
if (!b.empty()) {
std::cout << "Unfactored:" << std::endl;
for (auto const & x: b)
std::cout << x << " ";
std::cout << std::endl;
}
}
int main() {
try {
using T = u128;
#if !IS_COMPILE_TIME
std::string s;
COUT(<< "Enter number: ");
std::cin >> s;
auto const N = ParseNum<T>(s.c_str());
#endif
auto const tim = Time();
ShowFactors(Factor_PollardRho<T>(
#if !IS_COMPILE_TIME
N
#endif
));
COUT(<< "Time " << std::fixed << std::setprecision(3) << (Time() - tim) << " sec" << std::endl);
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Console Output:
PollardRho tries 1 iterations 25853951 (2^24.62)
Factors of 780002082420246798979794021150335143 (2^119.231):
244300526707007 3192797383346267127449
Time 35.888 sec
Part 2
I decided to push the improvements even further by implementing my own highly optimized uint128 and uint256 types, i.e. fixed-width long arithmetic of the kind that Boost or GMP provide.
Without dwelling on all the details: I optimized every line and every method of these classes, especially the methods that perform the operations needed for factorization.
This improved version is about 6x faster than Part 1 of my answer: where the first version takes 30 seconds to finish, this second version takes 5 seconds.
As you can see in the Console Output at the very end of my post, your biggest number now takes just 420 seconds to factor, about 5.4x faster than the 2260 seconds of the first part of this answer.
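To give a flavor of what such hand-rolled fixed-width arithmetic looks like, here is a minimal sketch (illustrative only, not the actual code from the Gist; the struct name and layout are my own) of the core building block these classes rest on, manual carry propagation between 64-bit limbs:
#include <cstdint>
struct sketch_u128 {
    std::uint64_t lo = 0, hi = 0;
    // Add limb by limb; the carry out of the low limb is detected
    // through the unsigned wrap-around test (r.lo < a.lo).
    friend sketch_u128 operator+(sketch_u128 a, sketch_u128 b) {
        sketch_u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);
        return r;
    }
};
Multiplication, shifts and division are assembled the same way from 64x64 -> 128 partial products, which is where most of the low-level optimization effort in such classes goes.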
Because of StackOverflow's limit of 30,000 characters per post, I can't inline the second version's code here: it alone is 26 KB in size. Instead I'm providing it through the Gist link below and also through a Try it online! link (to run it online on the GodBolt server):
Github Gist source code
Try it online!
Console Output:
PollardRho tries 1 iterations 32767 (2^15.00)
Factors of 1000036317378699858851366323 (2^89.692):
1000014599
Unfactored:
1000021718061637877
Time 0.086 sec
PollardRho tries 1 iterations 25853951 (2^24.62)
Factors of 780002082420246798979794021150335143 (2^119.231):
244300526707007 3192797383346267127449
Time 5.830 sec
PollardRho tries 1 iterations 230129663 (2^27.78)
Factors of 614793320656537415355785711660734447 (2^118.888):
44780536225818373 13729029897191722339
Time 49.446 sec
PollardRho tries 1 iterations 1914077183 (2^30.83)
Factors of 1000000000002322140000000048599822299 (2^119.589):
1000000000000021121 1000000000002301019
Time 419.680 sec