I am currently developing a class to work with big unsigned integers. However, the functionality I need is still incomplete, namely:
bi_uint+=bi_uint - Already implemented. No complaints.
bi_uint*=std::uint_fast64_t - Already implemented. No complaints.
bi_uint/=std::uint_fast64_t - Implemented, but it works very slowly and requires a type twice as wide as uint_fast64_t. In my test case, division was 35 times slower than multiplication.
Next, I will give my implementation of division, which is based on a simple long division algorithm:
#include <climits>
#include <cstdint>
#include <limits>
#include <vector>
class bi_uint
{
public:
using u64_t = std::uint_fast64_t;
constexpr static std::size_t u64_bits = CHAR_BIT * sizeof(u64_t);
using u128_t = unsigned __int128;
static_assert(sizeof(u128_t) >= sizeof(u64_t) * 2);
//little-endian
std::vector<u64_t> data;
//User should guarantee data.size()>0 and val>0
void self_div(const u64_t val)
{
auto it = data.rbegin();
if(data.size() == 1) {
*it /= val;
return;
}
u128_t rem = 0;
if(*it < val) {
rem = *it++;
data.pop_back();
}
u128_t r = rem % val;
while(it != data.rend()) {
rem = (r << u64_bits) + *it;
const u128_t q = rem / val;
r = rem % val;
*it++ = static_cast<u64_t>(q);
}
}
};
You can see that the unsigned __int128 type was used; therefore, this option is not portable, is tied to a single compiler - GCC - and also requires an x64 platform.
After reading the page about division algorithms, I feel that the appropriate algorithm would be "Newton-Raphson division". However, the Newton-Raphson algorithm seems complicated to me. I suspect there is a simpler algorithm for big_uint / uint division that would give almost the same performance.
Q: How can I quickly divide a bi_uint by a u64_t?
I have about 10^6 iterations; each iteration uses all the operations listed above.
If this is easily achievable, then I would like to have portability and not use unsigned __int128. Otherwise, I prefer to abandon portability in favor of an easier way.
EDIT1:
This is an academic project, I am not able to use third-party libraries.
Part 1 (See Part 2 below)
I managed to speed up your division code 5x on my old laptop (and even 7.5x on GodBolt's servers) using Barrett reduction, a technique that replaces a single division with several multiplications and additions. I implemented the whole code from scratch just today.
If you want, you can jump directly to the code at the end of my post without reading the long description, as the code is fully runnable without any prior knowledge or dependencies.
The code below is only for Intel x64, because I used Intel-only instructions and only their 64-bit variants. Of course it can be rewritten for 32-bit and for other processors, because the Barrett algorithm itself is generic.
To explain the whole of Barrett reduction in short pseudo-code, I'll write it in Python, as it is the simplest language for understandable pseudo-code:
# https://www.nayuki.io/page/barrett-reduction-algorithm
def BarrettR(n, s):
    return (1 << s) // n

def BarrettDivMod(x, n, r, s):
    q = (x * r) >> s
    t = x - q * n
    return (q, t) if t < n else (q + 1, t - n)
Basically, in the pseudo-code above, BarrettR() is computed only once for a given divisor (you use the same single-word divisor for the whole big-integer division). BarrettDivMod() is used every time you want to do a division or modulus operation: given an input x and divisor n, it returns the tuple (x / n, x % n), nothing else, but does it faster than a regular division instruction.
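To make the pseudo-code concrete, here is a tiny self-contained check of those same two steps (this is just my illustration: it uses unsigned __int128 purely for the demo, and it assumes the classical parameter choice n < 2^(s/2) and x < n^2; the full code below uses a related but differently tuned choice of r and s for the 128-by-64-bit case):
#include <cassert>
#include <cstdint>
int main() {
    using u128 = unsigned __int128;
    uint64_t const n = 1000003;             // divisor, NOT a power of two
    int const s = 64;                       // valid while n < 2^32 and x < n^2
    u128 const r = (u128(1) << s) / n;      // BarrettR(n, s)
    uint64_t const x = 123456789012ULL;
    // BarrettDivMod(x, n, r, s)
    uint64_t q = uint64_t((u128(x) * r) >> s);
    uint64_t t = uint64_t(x - u128(q) * n);
    if (t >= n) { ++q; t -= n; }            // at most one correction is ever needed
    assert(q == x / n && t == x % n);       // agrees with the regular division
}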
In the C++ code below I implement the same two Barrett functions, but apply some C++-specific optimizations to make them even faster. The optimizations are possible because the divisor n is always 64-bit, while x is 128-bit but its higher half is always smaller than n (that last assumption holds because in your big-integer division the higher half is always a remainder modulo n).
The Barrett algorithm only works with a divisor n that is NOT a power of 2, so divisors like 1, 2, 4, 8, 16, ... are not allowed (and 0 is never a valid divisor). You can cover this trivial case just by right bit-shifting the big integer, because dividing by a power of 2 is just a bit shift. Any other divisor is allowed, including even divisors that are not a power of 2.
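For completeness, here is a minimal sketch of that bit-shift path (my own addition, not part of the measured code; it assumes the same non-empty little-endian std::vector<uint64_t> limb layout as in the question and 0 < k < 64; dividing by 2^(64*m+k) would additionally erase the m lowest limbs first):
#include <cstdint>
#include <vector>
// Divide a little-endian multi-limb number by 2^k in place, for 0 < k < 64.
void shift_right(std::vector<uint64_t>& data, unsigned k) {
    for (std::size_t i = 0; i + 1 < data.size(); ++i)
        data[i] = (data[i] >> k) | (data[i + 1] << (64 - k));
    data.back() >>= k;
    if (data.size() > 1 && data.back() == 0)
        data.pop_back();   // drop a now-empty top limb
}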
It is also important to note that my BarrettDivMod() accepts ONLY a dividend x that is strictly smaller than divisor * 2^64; in other words, the higher half of the 128-bit dividend x must be smaller than the divisor. This is always true in your big-integer division function, as the higher half is always a remainder modulo the divisor. This rule for x has to be checked by you; it is checked inside my BarrettDivMod() only as a DEBUG assertion that is removed in release builds.
You may notice that BarrettDivMod() has two big branches; these are two variants of the same algorithm. The first uses the CLang/GCC-only type unsigned __int128, the second uses only 64-bit instructions and is hence suitable for MSVC.
I tried to target three compilers, CLang/GCC/MSVC, but somehow the MSVC version got only 2x faster with Barrett, while CLang/GCC are 5x faster. Maybe I made some bug in the MSVC code.
You can see that I used your class bi_uint to time two versions of the code, one with the regular divide and one with Barrett. It is important to note that I changed your code quite significantly: first, to not use u128 (so that the MSVC version, which has no u128, compiles); second, to not modify the data vector, so it does a read-only division and doesn't store the final result (this read-only behaviour is what lets me run the speed tests very fast without copying the data vector on each test iteration). So your code is broken in my snippet; it can't be copy-pasted and used straight away. I only used your code for speed measurement.
Barrett reduction works faster not only because division is slower than multiplication, but also because multiplication and addition are both very well pipelined on modern CPUs: a modern CPU can execute several mul or add instructions within one cycle, but only if those mul/add instructions don't depend on each other's results; in other words, the CPU can run several instructions in parallel within one cycle. As far as I know, division can't be run in parallel, because there is only a single divider unit within the CPU, but it is still somewhat pipelined, because after about 50% of the first division is done, a second division can be started at the beginning of the divider's pipeline.
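If you want to observe this instruction-level parallelism yourself, a rough sketch like the following (my own illustration, not part of the Barrett code) compares one long dependent multiply chain against four independent chains; both do the same number of multiplications, but the independent version usually finishes several times faster, with the exact ratio depending on the CPU:
#include <chrono>
#include <cstdint>
#include <iostream>
int main() {
    constexpr uint64_t N = 200000000;
    auto now = [] { return std::chrono::steady_clock::now(); };
    uint64_t a = 0x9E3779B97F4A7C15ULL;
    auto t0 = now();
    for (uint64_t i = 0; i < N; ++i)
        a = a * 0x2545F4914F6CDD1DULL + i;   // each mul waits for the previous one
    auto t1 = now();
    uint64_t b0 = 1, b1 = 2, b2 = 3, b3 = 4;
    auto t2 = now();
    for (uint64_t i = 0; i < N; i += 4) {    // four chains in flight at once
        b0 = b0 * 0x2545F4914F6CDD1DULL + i;
        b1 = b1 * 0x2545F4914F6CDD1DULL + i;
        b2 = b2 * 0x2545F4914F6CDD1DULL + i;
        b3 = b3 * 0x2545F4914F6CDD1DULL + i;
    }
    auto t3 = now();
    std::chrono::duration<double> dep = t1 - t0, indep = t3 - t2;
    std::cout << "dependent:   " << dep.count() << " sec\n"
              << "independent: " << indep.count() << " sec\n"
              << "(sink: " << (a ^ b0 ^ b1 ^ b2 ^ b3) << ")\n";
}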
On some computers you may notice that the regular Divide version is sometimes much slower; this happens because CLang/GCC fall back to a library-based soft implementation of division even for a 128-bit dividend. In that case my Barrett version may show even a 7-10x speedup, as it doesn't use library functions.
To overcome the soft-division issue described above, it is better to add assembly code that uses the DIV instruction directly, or to find an intrinsic function in your compiler that implements this (I think CLang/GCC have such an intrinsic). I can also write this assembly implementation if needed; just tell me in the comments.
Update. As promised, I implemented the assembly variant of 128-bit division for CLang/GCC, the function UDiv128Asm(). After this change it is used as the main implementation of 128-bit division for CLang/GCC instead of the regular u128(a) / b. You may go back to the regular u128 implementation by replacing #if 0 with #if 1 inside the body of the UDiv128() function.
Try it online!
#include <cstdint>
#include <bit>
#include <stdexcept>
#include <string>
#include <immintrin.h>
#if defined(_MSC_VER) && !defined(__clang__)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if IS_MSVC
#include <intrin.h>
#endif
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#ifdef _DEBUG
#define DASSERT_MSG(cond, msg) ASSERT_MSG(cond, msg)
#else
#define DASSERT_MSG(cond, msg)
#endif
#define DASSERT(cond) DASSERT_MSG(cond, "")
using u16 = uint16_t;
using u32 = uint32_t;
using i64 = int64_t;
using u64 = uint64_t;
using UllPtr = unsigned long long *;
inline int GetExp(double x) {
return int((std::bit_cast<uint64_t>(x) >> 52) & 0x7FF) - 1023;
}
inline size_t BitSizeWrong(uint64_t x) {
return x == 0 ? 0 : (GetExp(x) + 1);
}
inline size_t BitSize(u64 x) {
size_t r = 0;
if (x >= (u64(1) << 32)) {
x >>= 32;
r += 32;
}
while (x >= 0x100) {
x >>= 8;
r += 8;
}
while (x) {
x >>= 1;
++r;
}
return r;
}
#if !IS_MSVC
inline u64 UDiv128Asm(u64 h, u64 l, u64 d, u64 * r) {
u64 q;
asm (R"(
.intel_syntax
mov rdx, %V[h]
mov rax, %V[l]
div %V[d]
mov %V[r], rdx
mov %V[q], rax
)"
: [q] "=r" (q), [r] "=r" (*r)
: [h] "r" (h), [l] "r" (l), [d] "r" (d)
: "rax", "rdx"
);
return q;
}
#endif
inline std::pair<u64, u64> UDiv128(u64 hi, u64 lo, u64 d) {
#if IS_MSVC
u64 r, q = _udiv128(hi, lo, d, &r);
return std::make_pair(q, r);
#else
#if 0
using u128 = unsigned __int128;
auto const dnd = (u128(hi) << 64) | lo;
return std::make_pair(u64(dnd / d), u64(dnd % d));
#else
u64 r, q = UDiv128Asm(hi, lo, d, &r);
return std::make_pair(q, r);
#endif
#endif
}
inline std::pair<u64, u64> UMul128(u64 a, u64 b) {
#if IS_MSVC
u64 hi, lo = _umul128(a, b, &hi);
return std::make_pair(hi, lo);
#else
using u128 = unsigned __int128;
auto const x = u128(a) * b;
return std::make_pair(u64(x >> 64), u64(x));
#endif
}
inline std::pair<u64, u64> USub128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_subborrow_u64(_subborrow_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline std::pair<u64, u64> UAdd128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_addcarry_u64(_addcarry_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline int UCmp128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
if (a_hi != b_hi)
return a_hi < b_hi ? -1 : 1;
return a_lo == b_lo ? 0 : a_lo < b_lo ? -1 : 1;
}
std::pair<u64, size_t> BarrettRS64(u64 n) {
// https://www.nayuki.io/page/barrett-reduction-algorithm
ASSERT_MSG(n >= 3 && (n & (n - 1)) != 0, "n " + std::to_string(n))
size_t const nbits = BitSize(n);
// 2^s = q * n + r; 2^s = (2^64 + q0) * n + r; 2^s - n * 2^64 = q0 * n + r
u64 const dnd_hi = (nbits >= 64 ? 0ULL : (u64(1) << nbits)) - n;
auto const q0 = UDiv128(dnd_hi, 0, n).first;
return std::make_pair(q0, nbits);
}
template <bool Use128 = true, bool Adjust = true>
std::pair<u64, u64> BarrettDivMod64(u64 x_hi, u64 x_lo, u64 n, u64 r, size_t s) {
// ((x_hi * 2^64 + x_lo) * (2^64 + r)) >> (64 + s)
DASSERT(x_hi < n);
#if !IS_MSVC
if constexpr(Use128) {
using u128 = unsigned __int128;
u128 const xf = (u128(x_hi) << 64) | x_lo;
u64 q = u64((u128(x_hi) * r + xf + u64((u128(x_lo) * r) >> 64)) >> s);
if (s < 64) {
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u128 t = xf - u128(q) * n;
return t < n ? std::make_pair(q, u64(t)) : std::make_pair(q + 1, u64(t) - n);
}
} else
#endif
{
auto const w1a = UMul128(x_lo, r).first;
auto const [w2b, w1b] = UMul128(x_hi, r);
auto const w2c = x_hi, w1c = x_lo;
u64 w1, w2 = _addcarry_u64(0, w1a, w1b, (UllPtr)&w1);
w2 += _addcarry_u64(0, w1, w1c, (UllPtr)&w1);
w2 += w2b + w2c;
if (s < 64) {
u64 q = (w2 << (64 - s)) | (w1 >> s);
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u64 const q = w2;
auto const [b_hi, b_lo] = UMul128(q, n);
auto const [t_hi, t_lo] = USub128(x_hi, x_lo, b_hi, b_lo);
return t_hi != 0 || t_lo >= n ? std::make_pair(q + 1, t_lo - n) : std::make_pair(q, t_lo);
}
}
}
#include <random>
#include <iomanip>
#include <iostream>
#include <chrono>
void TestBarrett() {
std::mt19937_64 rng{123}; //{std::random_device{}()};
for (size_t i = 0; i < (1 << 11); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
auto const [br, bs] = BarrettRS64(n);
for (size_t j = 0; j < (1 << 6); ++j) {
u64 const hi = rng() % n, lo = rng();
auto const [ref_q, ref_r] = UDiv128(hi, lo, n);
u64 bar_q = 0, bar_r = 0;
for (size_t k = 0; k < 2; ++k) {
bar_q = 0; bar_r = 0;
if (k == 0)
std::tie(bar_q, bar_r) = BarrettDivMod64<true>(hi, lo, n, br, bs);
else
std::tie(bar_q, bar_r) = BarrettDivMod64<false>(hi, lo, n, br, bs);
ASSERT_MSG(bar_q == ref_q && bar_r == ref_r, "i " + std::to_string(i) + ", j " + std::to_string(j) + ", k " + std::to_string(k) +
", nbits " + std::to_string(nbits) + ", n " + std::to_string(n) + ", bar_q " + std::to_string(bar_q) +
", ref_q " + std::to_string(ref_q) + ", bar_r " + std::to_string(bar_r) + ", ref_r " + std::to_string(ref_r));
}
}
}
}
class bi_uint
{
public:
using u64_t = std::uint64_t;
constexpr static std::size_t u64_bits = 8 * sizeof(u64_t);
//little-endian
std::vector<u64_t> data;
static auto constexpr DefPrep = [](auto n){
return std::make_pair(false, false);
};
static auto constexpr DefDivMod = [](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return UDiv128(dnd_hi, dnd_lo, dsr);
};
//User should guarantee data.size()>0 and val>0
template <typename PrepT = decltype(DefPrep), typename DivModT = decltype(DefDivMod)>
void self_div(const u64_t val, PrepT const & Prep = DefPrep, DivModT const & DivMod = DefDivMod)
{
auto it = data.rbegin();
if(data.size() == 1) {
*it /= val;
return;
}
u64_t rem_hi = 0, rem_lo = 0;
if(*it < val) {
rem_lo = *it++;
//data.pop_back();
}
auto const prep = Prep(val);
u64_t r = rem_lo % val;
u64_t q = 0;
while(it != data.rend()) {
rem_hi = r;
rem_lo = *it;
std::tie(q, r) = DivMod(rem_hi, rem_lo, val, prep);
//*it++ = static_cast<u64_t>(q);
it++;
auto volatile out = static_cast<u64_t>(q);
}
}
};
void TestSpeed() {
auto Time = []{
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
};
std::mt19937_64 rng{123};
std::vector<u64> limbs, divisors;
for (size_t i = 0; i < (1 << 17); ++i)
limbs.push_back(rng());
for (size_t i = 0; i < (1 << 8); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
divisors.push_back(n);
}
std::cout << std::fixed << std::setprecision(3);
double div_time = 0;
{
bi_uint x;
x.data = limbs;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr);
div_time = Time() - tim;
std::cout << "Divide time " << div_time << " sec" << std::endl;
}
{
bi_uint x;
x.data = limbs;
for (size_t i = 0; i < 2; ++i) {
if (IS_MSVC && i == 0)
continue;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr, [](auto n){ return BarrettRS64(n); },
[i](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return i == 0 ? BarrettDivMod64<true>(dnd_hi, dnd_lo, dsr, prep.first, prep.second) :
BarrettDivMod64<false>(dnd_hi, dnd_lo, dsr, prep.first, prep.second);
});
double const bar_time = Time() - tim;
std::cout << "Barrett" << (i == 0 ? "128" : "64 ") << " time " << bar_time << " sec, boost " << div_time / bar_time << std::endl;
}
}
}
int main() {
try {
TestBarrett();
TestSpeed();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
Divide time 3.171 sec
Barrett128 time 0.675 sec, boost 4.695
Barrett64 time 0.642 sec, boost 4.937
Part 2
As you have a very interesting question, a few days after I first published this post I decided to implement all the big-integer math from scratch.
The code below implements the math operations +, -, *, /, <<, >> for natural big numbers (positive integers), and +, -, *, / for floating-point big numbers. Both types of numbers are of arbitrary size (even millions of bits). Besides those, as you requested, I fully implemented the Newton-Raphson (both the square and cubic variants) and Goldschmidt fast division algorithms.
Here is a code snippet with only the Newton-Raphson/Goldschmidt functions; the remaining code, as it is very large, is linked below on an external server:
BigFloat & DivNewtonRaphsonSquare(BigFloat b) {
// https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
auto a = *this;
a.exp_ += b.SetScale(0);
if (b.sign_) {
a.sign_ = !a.sign_;
b.sign_ = false;
}
thread_local BigFloat two, c_48_17, c_32_17;
thread_local size_t static_prec = 0;
if (static_prec != BigFloat::prec_) {
two = 2;
c_48_17 = BigFloat(48) / BigFloat(17);
c_32_17 = BigFloat(32) / BigFloat(17);
static_prec = BigFloat::prec_;
}
BigFloat x = c_48_17 - c_32_17 * b;
for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
/ std::log2(17.0))) + 0.1; i < num_iters; ++i)
x = x * (two - b * x);
*this = a * x;
return BitNorm();
}
BigFloat & DivNewtonRaphsonCubic(BigFloat b) {
// https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
auto a = *this;
a.exp_ += b.SetScale(0);
if (b.sign_) {
a.sign_ = !a.sign_;
b.sign_ = false;
}
thread_local BigFloat one, c_140_33, c_m64_11, c_256_99;
thread_local size_t static_prec = 0;
if (static_prec != BigFloat::prec_) {
one = 1;
c_140_33 = BigFloat(140) / BigFloat(33);
c_m64_11 = BigFloat(-64) / BigFloat(11);
c_256_99 = BigFloat(256) / BigFloat(99);
static_prec = BigFloat::prec_;
}
BigFloat e, y, x = c_140_33 + b * (c_m64_11 + b * c_256_99);
for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
/ std::log2(99.0)) / std::log2(3.0)) + 0.1; i < num_iters; ++i) {
e = one - b * x;
y = x * e;
x = x + y + y * e;
}
*this = a * x;
return BitNorm();
}
BigFloat & DivGoldschmidt(BigFloat b) {
// https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division
auto a = *this;
a.exp_ += b.SetScale(0);
if (b.sign_) {
a.sign_ = !a.sign_;
b.sign_ = false;
}
BigFloat one = 1, two = 2, f;
for (size_t i = 0;; ++i) {
f = two - b;
a *= f;
b *= f;
if (i % 3 == 0 && (one - b).GetScale() < -i64(prec_) + i64(bit_sizeof(Word)))
break;
}
*this = a;
return BitNorm();
}
See the Output below; it shows that the Newton-Raphson and Goldschmidt methods are actually about 10x slower than the regular school-grade algorithm (called Reference in the output). Relative to each other, these 3 advanced algorithms are about the same speed. Raphson/Goldschmidt could probably be faster if the Fast Fourier Transform were used for multiplication, because multiplying two large numbers takes 95% of the time of these algorithms. In the code below, all results of the Raphson/Goldschmidt algorithms are not only time-measured but also checked for correctness against the school-grade (Reference) algorithm (see diff 2^... in the console output, which shows how large the difference of the result is compared to the school-grade one).
FULL SOURCE CODE HERE. The full code is so huge that it didn't fit into this StackOverflow post due to SO's limit of 30,000 characters per post, even though I wrote the code from scratch specifically for this post. That's why I am providing an external download link (PasteBin server); also click the Try it online! link below, which is the same copy of the code run live on GodBolt's Linux servers:
Try it online!
Output:
========== 1 K bits ==========
Reference 0.000029 sec
Raphson2 0.000066 sec, boost 0.440x, diff 2^-8192
Raphson3 0.000092 sec, boost 0.317x, diff 2^-8192
Goldschmidt 0.000080 sec, boost 0.365x, diff 2^-1022
========== 2 K bits ==========
Reference 0.000071 sec
Raphson2 0.000177 sec, boost 0.400x, diff 2^-16384
Raphson3 0.000283 sec, boost 0.250x, diff 2^-16384
Goldschmidt 0.000388 sec, boost 0.182x, diff 2^-2046
========== 4 K bits ==========
Reference 0.000319 sec
Raphson2 0.000875 sec, boost 0.365x, diff 2^-4094
Raphson3 0.001122 sec, boost 0.285x, diff 2^-32768
Goldschmidt 0.000881 sec, boost 0.362x, diff 2^-32768
========== 8 K bits ==========
Reference 0.000484 sec
Raphson2 0.002281 sec, boost 0.212x, diff 2^-65536
Raphson3 0.002341 sec, boost 0.207x, diff 2^-65536
Goldschmidt 0.002432 sec, boost 0.199x, diff 2^-8189
========== 16 K bits ==========
Reference 0.001199 sec
Raphson2 0.009042 sec, boost 0.133x, diff 2^-16382
Raphson3 0.009519 sec, boost 0.126x, diff 2^-131072
Goldschmidt 0.009047 sec, boost 0.133x, diff 2^-16380
========== 32 K bits ==========
Reference 0.004311 sec
Raphson2 0.039151 sec, boost 0.110x, diff 2^-32766
Raphson3 0.041058 sec, boost 0.105x, diff 2^-262144
Goldschmidt 0.045517 sec, boost 0.095x, diff 2^-32764
========== 64 K bits ==========
Reference 0.016273 sec
Raphson2 0.165656 sec, boost 0.098x, diff 2^-524288
Raphson3 0.210301 sec, boost 0.077x, diff 2^-65535
Goldschmidt 0.208081 sec, boost 0.078x, diff 2^-65534
========== 128 K bits ==========
Reference 0.059469 sec
Raphson2 0.725865 sec, boost 0.082x, diff 2^-1048576
Raphson3 0.735530 sec, boost 0.081x, diff 2^-1048576
Goldschmidt 0.703991 sec, boost 0.084x, diff 2^-131069
========== 256 K bits ==========
Reference 0.326368 sec
Raphson2 3.007454 sec, boost 0.109x, diff 2^-2097152
Raphson3 2.977631 sec, boost 0.110x, diff 2^-2097152
Goldschmidt 3.363632 sec, boost 0.097x, diff 2^-262141
========== 512 K bits ==========
Reference 1.138663 sec
Raphson2 12.827783 sec, boost 0.089x, diff 2^-524287
Raphson3 13.799401 sec, boost 0.083x, diff 2^-524287
Goldschmidt 15.836072 sec, boost 0.072x, diff 2^-524286
On most modern CPUs, division is indeed much slower than multiplication.
Referring to
https://agner.org/optimize/instruction_tables.pdf
On Intel Skylake, a MUL/IMUL has a latency of 3-4 cycles, while a DIV/IDIV can take 26-90 cycles, which is 7-23 times slower than MUL, so your initial benchmark result isn't really a surprise.
If you happen to be on an x86 CPU, and if this is indeed the bottleneck, you could try to utilize AVX/SSE instructions, as shown in the answer below. Basically you'd need to rely on specialized instructions rather than a general one like DIV/IDIV.
How to divide a __m256i vector by an integer variable?
Hardware: Intel Skylake
This is based on: Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size
I tried creating some benchmarks to test this and see how I could best use it, but the results have been weird:
Here is the benchmark code:
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <x86intrin.h>
#define NLOOPS 5
#ifndef PARTITION_PER_CORE
#define PARTITION_PER_CORE 1
#endif
#ifndef NTHREADS
#define NTHREADS 8
#endif
#ifndef PARTITION
#define PARTITION 64
#endif
#ifndef TEST_SIZE
#define TEST_SIZE (1000000u)
#endif
#define XOR_ATOMIC(X, Y) __atomic_fetch_xor(X, Y, __ATOMIC_RELAXED)
#define XOR_CAS_LOOP(X, Y) \
do { \
volatile uint64_t _sink = __atomic_fetch_xor(X, Y, __ATOMIC_RELAXED); \
} while (0)
#define ADD_ATOMIC(X, Y) __atomic_fetch_add(X, Y, __ATOMIC_RELAXED)
#define XADD_ATOMIC(X, Y) \
do { \
volatile uint64_t _sink = __atomic_fetch_add(X, Y, __ATOMIC_RELAXED); \
} while (0)
#define XOR(X, Y) *(X) ^= Y
#define ADD(X, Y) *(X) += (Y)
#ifndef OP1
#define OP1(X, Y) ADD(X, Y)
#endif
#ifndef OP2
#define OP2(X, Y) ADD(X, Y)
#endif
#define bench_flush_all_pending() asm volatile("" : : : "memory");
#define bench_do_not_optimize_out(X) asm volatile("" : : "r,m"(X) : "memory")
uint64_t * r;
pthread_barrier_t b;
uint64_t total_cycles;
void
init() {
#if PARTITION_PER_CORE
assert(sysconf(_SC_NPROCESSORS_ONLN) * PARTITION <= sysconf(_SC_PAGE_SIZE));
#else
assert(NTHREADS * PARTITION <= sysconf(_SC_PAGE_SIZE));
#endif
r = (uint64_t *)mmap(NULL,
sysconf(_SC_PAGE_SIZE),
(PROT_READ | PROT_WRITE),
(MAP_ANONYMOUS | MAP_PRIVATE),
(-1),
0);
assert(r != NULL);
assert(!pthread_barrier_init(&b, NULL, NTHREADS));
total_cycles = 0;
}
void *
run1(void * targ) {
uint64_t * t_region = (uint64_t *)targ;
const uint64_t y = rand();
// page in memory / get in cache
*t_region = 0;
pthread_barrier_wait(&b);
uint64_t start_cycles = __rdtsc();
for (uint32_t i = 0; i < TEST_SIZE; ++i) {
OP1(t_region, y);
bench_do_not_optimize_out(t_region);
}
bench_flush_all_pending();
uint64_t end_cycles = __rdtsc();
__atomic_fetch_add(&total_cycles,
(end_cycles - start_cycles),
__ATOMIC_RELAXED);
return NULL;
}
void *
run2(void * targ) {
uint64_t * t_region = (uint64_t *)targ;
const uint64_t y = rand();
// page in memory / get in cache
*t_region = 0;
pthread_barrier_wait(&b);
uint64_t start_cycles = __rdtsc();
for (uint32_t i = 0; i < TEST_SIZE; ++i) {
OP2(t_region, y);
bench_do_not_optimize_out(t_region);
}
bench_flush_all_pending();
uint64_t end_cycles = __rdtsc();
__atomic_fetch_add(&total_cycles,
(end_cycles - start_cycles),
__ATOMIC_RELAXED);
return NULL;
}
void
test() {
init();
pthread_t * tids = (pthread_t *)calloc(NTHREADS, sizeof(pthread_t));
assert(tids != NULL);
cpu_set_t cset;
CPU_ZERO(&cset);
const uint32_t stack_size = (1 << 18);
uint32_t ncores = sysconf(_SC_NPROCESSORS_ONLN);
for (uint32_t i = 0; i < NTHREADS; ++i) {
CPU_SET(i % ncores, &cset);
pthread_attr_t attr;
assert(!pthread_attr_init(&attr));
assert(!pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cset));
assert(!pthread_attr_setstacksize(&attr, stack_size));
#if PARTITION_PER_CORE
uint64_t * t_region = r + (i % ncores) * (PARTITION / sizeof(uint64_t));
#else
uint64_t * t_region = r + (i) * (PARTITION / sizeof(uint64_t));
#endif
if (i % 2) {
assert(!pthread_create(tids + i, &attr, run1, (void *)t_region));
}
else {
assert(!pthread_create(tids + i, &attr, run2, (void *)t_region));
}
CPU_ZERO(&cset);
assert(!pthread_attr_destroy(&attr));
}
for (uint32_t i = 0; i < NTHREADS; ++i) {
pthread_join(tids[i], NULL);
}
free(tids);
}
int
main(int argc, char ** argv) {
double results[NLOOPS];
for(uint32_t i = 0; i < NLOOPS; ++i) {
test();
double cycles_per_op = total_cycles;
cycles_per_op /= NTHREADS * TEST_SIZE;
results[i] = cycles_per_op;
}
char buf[64] = "";
strcpy(buf, argv[0]);
uint32_t len = strlen(buf);
uint32_t start_op1 = 0, start_op2 = 0;
for (uint32_t i = 0; i < len; ++i) {
if (start_op1 == 0 && buf[i] == '-') {
start_op1 = i + 1;
}
else if (buf[i] == '-') {
start_op2 = i + 1;
buf[i] = 0;
}
}
fprintf(stderr,
"Results: %s\n\t"
"nthreads : %d\n\t"
"partition size : %d\n\t"
"partion_per_core : %s\n\t"
"Op1 : %s\n\t"
"Op2 : %s\n\t"
"Cycles Per Op : %.3lf",
argv[0],
NTHREADS,
PARTITION,
PARTITION_PER_CORE ? "true" : "false",
buf + start_op1,
buf + start_op2,
results[0]);
for(uint32_t i = 1; i < NLOOPS; ++i) {
fprintf(stderr, ", %.3lf", results[i]);
}
fprintf(stderr, "\n\n");
assert(!munmap(r, sysconf(_SC_PAGE_SIZE)));
assert(!pthread_barrier_destroy(&b));
}
and then I've been using:
#! /bin/bash
CFLAGS="-O3 -march=native -mtune=native"
LDFLAGS="-lpthread"
# NOTE: nthreads * partion must be <= PAGE_SIZE (test intentionally
# fails assertion if memory spans multiple pages
# nthreads : number of threads performing operations
# op1 : even tids will perform op1 (tid is order they where created
# starting from 0)
# partition_per_core: boolean, if true multiple threads pinned to the
# same core will share the same destination, if
# false each thread will have a unique memory
# destination (this can be ignored in nthreads <= ncores)
# op2 : odd tids will perform op2 (tid is order they where created
# starting from 0)
# partition : space between destinations (i.e partition
# destinations by cache line size or 2 * cache
# line size)
for nthreads in 8; do
for partition_per_core in 1; do
for op1 in ADD XOR ADD_ATOMIC XADD_ATOMIC XOR_ATOMIC XOR_CAS_LOOP; do
for op2 in ADD XOR ADD_ATOMIC XADD_ATOMIC XOR_ATOMIC XOR_CAS_LOOP; do
for partition in 64 128; do
g++ ${CFLAGS} test_atomics.cc -o test-${op1}-${op2} -DPARTITION=${partition} -DNTHREADS=${nthreads} -DPARTITION_PER_CORE=${partition_per_core} -DOP1=${op1} -DOP2=${op2} ${LDFLAGS};
./test-${op1}-${op2};
rm -f test-${op1}-${op2};
done
echo "--------------------------------------------------------------------------------"
done
echo "--------------------------------------------------------------------------------"
echo "--------------------------------------------------------------------------------"
done
echo "--------------------------------------------------------------------------------"
echo "--------------------------------------------------------------------------------"
echo "--------------------------------------------------------------------------------"
done
echo "--------------------------------------------------------------------------------"
echo "--------------------------------------------------------------------------------"
echo "--------------------------------------------------------------------------------"
echo "--------------------------------------------------------------------------------"
done
To run it.
Some results make sense:
i.e. with lock xaddq I get:
Results: ./test-XADD_ATOMIC-XADD_ATOMIC
nthreads : 8
partition size : 64
partion_per_core : true
Op1 : XADD_ATOMIC
Op2 : XADD_ATOMIC
Cycles Per Op : 21.547, 20.089, 38.852, 26.723, 25.934
Results: ./test-XADD_ATOMIC-XADD_ATOMIC
nthreads : 8
partition size : 128
partion_per_core : true
Op1 : XADD_ATOMIC
Op2 : XADD_ATOMIC
Cycles Per Op : 19.607, 19.187, 19.483, 18.857, 18.721
which basically shows an improvement across the board with the 128-byte partition. In general I have been able to see that a 128-byte partition is beneficial with atomic operations, EXCEPT for CAS loops, i.e.:
Results: ./test-XOR_CAS_LOOP-XOR_CAS_LOOP
nthreads : 8
partition size : 64
partion_per_core : true
Op1 : XOR_CAS_LOOP
Op2 : XOR_CAS_LOOP
Cycles Per Op : 20.273, 20.061, 20.737, 21.240, 21.747
Results: ./test-XOR_CAS_LOOP-XOR_CAS_LOOP
nthreads : 8
partition size : 128
partion_per_core : true
Op1 : XOR_CAS_LOOP
Op2 : XOR_CAS_LOOP
Cycles Per Op : 20.632, 20.432, 21.710, 22.627, 23.070
gives basically the opposite of the expected result. Likewise, I have been unable to see any difference between 64-byte and 128-byte partition sizes when mixing atomic with non-atomic operations, i.e.:
Results: ./test-XADD_ATOMIC-ADD
nthreads : 8
partition size : 64
partion_per_core : true
Op1 : XADD_ATOMIC
Op2 : ADD
Cycles Per Op : 11.117, 11.186, 11.223, 11.169, 11.138
Results: ./test-XADD_ATOMIC-ADD
nthreads : 8
partition size : 128
partion_per_core : true
Op1 : XADD_ATOMIC
Op2 : ADD
Cycles Per Op : 11.126, 11.157, 11.072, 11.227, 11.080
What I am curious about is:
Why does the CAS loop not behave like the rest of the atomics in terms of the best partition size?
If my understanding is correct that sharing in the L2 cache is what makes the 128-byte partition more effective, why does this only seem to affect atomic operations? Non-atomics should still be constantly invalidating cache lines.
Is there a general rule for understanding when a 128-byte partition is needed versus when you can save memory with a 64-byte partition?
Note: I am not asking, nor do I expect, anyone to read my code. I only included it as context for the question and to show where I got my numbers. If you have questions or concerns about the benchmarking method, comment and I'll reply.
I am trying some benchmarks using Intel AVX2 and Posix threads.
Let's suppose that I am trying to find the minimum value in a sample.
When I create a simple program I run the avx_min function.
When I create a program that creates a Posix thread inside, I change the implementation of avx_min to avx_min_t as shown below, but the actual implementation remains the same. This function can be used by more than one thread, and it does not need synchronization, as the threads do not "conflict" (tid = 0, 1, 2, etc.).
When I compile both implementations without specifying any optimization flag, they both give me the same time result. On the other hand, when I compile them using the -O3 flag, they result in different execution times, and I cannot figure out why this happens.
P.S: I compile them using:
case 1 (without creating a thread): g++ -mavx2 -O3 -o avxMinO3 avxMinO3.cpp
case 2 (creating a posix thread inside): g++ -mavx2 avxMinO3_t.cpp -lpthread -O3 -o avxMinO3_t
P.S 2:
1st case execution time: 0.34 sec
2nd case execution time: 0.049 sec
Case 1:
double initialize_input(int32_t** relation, int32_t value_bound, int32_t input_size){
clock_t t;
srand(time(NULL));
t = clock();
for(int32_t i = 0 ; i < input_size ; i++){
(*relation)[i] = rand() % value_bound;
}
t = clock() - t;
return ((double) t) / CLOCKS_PER_SEC;
}
int* avx_min(int32_t** relation, int32_t rel_size, double* function_time){
clock_t tic, tac;
__m256i input_buffer;
int32_t* rel = (*relation);
__m256i min = _mm256_set_epi32(INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX);
tic = clock();
for(int i = 0 ; i < rel_size ; i += 8){
input_buffer = _mm256_stream_load_si256((__m256i*)&rel[i]);
min = _mm256_min_epi32(min, input_buffer);
}
tac = clock();
double time_diff = (double)(tac - tic);
(*function_time) = time_diff / CLOCKS_PER_SEC;
int* temp = (int*)&min;
return temp;
}
int main(int argc, char** argv) {
int32_t* relation;
double* function_time;
int32_t input_size = 1024 * 1024 * 1024;
int32_t value_bound = 1000;
int alloc_time = initialize_input(&relation, value_bound, input_size);
int* res = avx_min(&relation, input_size, function_time);
return 0;
}
Case 2:
template<typename T>
struct thread_input {
T* relation;
T rel_size;
double function_time;
short numberOfThreads;
short tid;
};
template<typename T, typename S, typename I, typename RELTYPE>
T** createAndInitInputPtr(S numberOfThreads, I rel_size, S value_bound, RELTYPE** relation ){
T **result = new T*[numberOfThreads];
for (int i = 0; i < numberOfThreads; i++) {
result[i] = new T;
result[i]->rel_size = rel_size;
result[i]->relation = (*relation);
result[i]->numberOfThreads = numberOfThreads;
result[i]->tid = i;
}
return result;
}
void* avx_min_t(void* input){
clock_t tic, tac;
struct thread_input<int32_t> *input_ptr;
input_ptr = (struct thread_input<int32_t>*) input;
int32_t* relation = input_ptr->relation;
int32_t rel_size = input_ptr->rel_size;
int32_t start = input_ptr->tid * 8;
int32_t offset = input_ptr->numberOfThreads * 8;
__m256i input_buffer;
__m256i min = _mm256_set_epi32(INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX);
tic = clock();
for(int i = start ; i < rel_size ; i += offset){
input_buffer = _mm256_stream_load_si256((__m256i*)&relation[i]);
min = _mm256_min_epi32(min, input_buffer);
}
tac = clock();
double time_diff = (double)(tac-tic);
time_diff = time_diff / CLOCKS_PER_SEC;
input_ptr->function_time = time_diff;
}
int main(int argc, char* argv[]){
int rel_size = 1024 * 1024 * 1024;
short numberOfThreads = 1;
short value_bound = 1000;
pthread_t* threads = new pthread_t[numberOfThreads];
short flag = 1; // flag to check proper aligned memory allocations
int32_t* relation;
double alloc_time = 0.0;
flag = posix_memalign((void**)&relation, 32, rel_size * sizeof(int32_t));
if(flag) {
std::cout << "Memory allocation problem. Exiting..." << std::endl;
exit(1);
}
alloc_time += initialize_input(&relation, value_bound, rel_size);
struct thread_input<int32_t> **input_ptr = createAndInitInputPtr<struct thread_input<int32_t>, short, int, int32_t>(numberOfThreads, rel_size, value_bound, &relation);
clock_t tic = clock();
for (int i = 0; i < numberOfThreads; i++) {
pthread_create(&threads[i], NULL, avx_min_t,(void*) input_ptr[i]);
}
for (int i = 0; i < numberOfThreads; i++) {
pthread_join(threads[i], NULL);
}
tic = clock()-tic;
double time = tic / CLOCKS_PER_SEC;
std::cout << time << std::endl;
return 0;
}
void* avx_min_t(void* input) doesn't do anything with min, so the SIMD work of loading from the array optimizes away.
Its inner loop compiles to this with gcc -O3 -march=haswell, and clang is basically the same.
.L3:
add ebx, r12d
cmp r13d, ebx
jg .L3
So it's literally just an empty loop in asm, taking 0.04 seconds to run the loop counter through 4GB / 32 bytes iterations.
for(int i = start ; i < rel_size ; i += offset){
}
I think you meant to return something, because the function is declared void* and has undefined behaviour from falling off the end of a non-void function. GCC and clang warn about this by default without even needing -Wall. https://godbolt.org/z/Z1GWpU
<source>: In function 'void* avx_min_t(void*)':
<source>:66:1: warning: no return statement in function returning non-void [-Wreturn-type]
66 | }
| ^
Always check your compiler warnings, especially when your code behaves strangely. Enable -Wall and fix any warnings, too.
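As a sketch of one possible fix (my own suggestion, not code from the question): reduce the vector to a scalar after the loop and hand the result back through the void* return value, so the loads can no longer be optimized away. Compile with -mavx2:
#include <immintrin.h>
#include <algorithm>
#include <cstdint>
#include <iostream>
// Reduce the 8 lanes of an AVX2 vector to a single scalar minimum.
static int32_t hmin_epi32(__m256i v) {
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), v);
    return *std::min_element(lanes, lanes + 8);
}
int main() {
    __m256i v = _mm256_set_epi32(7, 3, 9, -2, 5, 11, 0, 4);
    std::cout << hmin_epi32(v) << '\n';   // prints -2
    // Inside avx_min_t the tail could then be:
    //     input_ptr->function_time = time_diff;
    //     return new int32_t(hmin_epi32(min));   // joining thread reads and deletes it
}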
I'm storing the IP address in a sockaddr_in6, which holds it as an array of four 32-bit values, addr[4]. Essentially a 128-bit number.
I'm trying to calculate the number of IPs in a given IPv6 range (how many IPs lie between two addresses). So it's a matter of subtracting one from the other using two arrays of length four.
The problem is that since there's no 128-bit data type, I can't convert it into decimal.
Thanks a ton!
You could use some kind of big-int library (if you can tolerate the LGPL, GMP is the choice). Fortunately, 128-bit subtraction is easy to simulate by hand if necessary. Here is a quick and dirty demonstration of computing the absolute value of (a - b) for 128-bit values:
#include <iostream>
#include <iomanip>
struct U128
{
unsigned long long hi;
unsigned long long lo;
};
bool subtract(U128& a, U128 b)
{
unsigned long long carry = b.lo > a.lo;
a.lo -= b.lo;
unsigned long long carry2 = b.hi > a.hi || a.hi == b.hi && carry;
a.hi -= carry;
a.hi -= b.hi;
return carry2 != 0;
}
int main()
{
U128 ipAddressA = { 45345, 345345 };
U128 ipAddressB = { 45345, 345346 };
bool carry = subtract(ipAddressA, ipAddressB);
// Carry being set means that we underflowed; that ipAddressB was > ipAddressA.
// Lets just compute 0 - ipAddressA as a means to calculate the negation
// (0-x) of our current value. This gives us the absolute value of the
// difference.
if (carry)
{
ipAddressB = ipAddressA;
ipAddressA = { 0, 0 };
subtract(ipAddressA, ipAddressB);
}
// Print gigantic hex string of the 128-bit value
std::cout.fill ('0');
std::cout << std::hex << std::setw(16) << ipAddressA.hi << std::setw(16) << ipAddressA.lo << std::endl;
}
This gives you the absolute value of the difference. If the range is not huge (64 bits or less), then ipAddressA.lo can be your answer as a simple unsigned long long.
If you have performance concerns, you can make use of compiler intrinsics to take advantage of certain architectures, such as amd64, if you want it to be optimal on that processor. _subborrow_u64 is the amd64 intrinsic for the necessary subtraction work.
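A minimal sketch of that intrinsic variant (my own addition, x86-64 only): the same subtract() as above, but letting _subborrow_u64 propagate the borrow between the two halves. The struct is repeated here only so the snippet stands alone:
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <immintrin.h>
#endif
struct U128 { unsigned long long hi, lo; };   // same layout as above
bool subtract_intrin(U128& a, U128 b)
{
    unsigned char borrow = _subborrow_u64(0, a.lo, b.lo, &a.lo);
    borrow = _subborrow_u64(borrow, a.hi, b.hi, &a.hi);
    return borrow != 0;   // set when b > a (the subtraction wrapped)
}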
The in6_addr structure stores the address in network byte order - or 'big endian' - with the most significant byte at s6_addr[0]. You can't count on the other union members being consistently named or defined. Even if you accessed the union through a (non-portable) uint32_t field, the values would have to be converted with ntohl. So a portable method of finding the difference needs some work.
You can convert the in6_addr to uint64_t[2]. Sticking with typical 'bignum' conventions, we use [0] for the low 64-bits and [1] for the high 64-bits:
static inline void
in6_to_u64 (uint64_t dst[2], const struct in6_addr *src)
{
uint64_t hi = 0, lo = 0;
for (unsigned int i = 0; i < 8; i++)
{
hi = (hi << 8) | src->s6_addr[i];
lo = (lo << 8) | src->s6_addr[i + 8];
}
dst[0] = lo, dst[1] = hi;
}
and the difference:
static inline unsigned int
u64_diff (uint64_t d[2], const uint64_t x[2], const uint64_t y[2])
{
unsigned int b = 0, bi;
for (unsigned int i = 0; i < 2; i++)
{
uint64_t di, xi, yi, tmp;
xi = x[i], yi = y[i];
tmp = xi - yi;
di = tmp - b, bi = tmp > xi;
d[i] = di, b = bi | (di > tmp);
}
return b; /* borrow flag = (x < y) */
}
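A small usage sketch (my own addition, assuming the two helpers above sit in the same file): parse two addresses with inet_pton, convert them, and take the difference; for this pair the low word comes out as 0xfffe = 65534:
#include <arpa/inet.h>
#include <netinet/in.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    struct in6_addr a, b;
    uint64_t ua[2], ub[2], d[2];
    inet_pton(AF_INET6, "2001:db8::ffff", &a);
    inet_pton(AF_INET6, "2001:db8::1", &b);
    in6_to_u64(ua, &a);
    in6_to_u64(ub, &b);
    unsigned int borrow = u64_diff(d, ua, ub);   /* borrow == 1 would mean a < b */
    printf("borrow=%u hi=%" PRIu64 " lo=%" PRIu64 "\n", borrow, d[1], d[0]);
    return 0;
}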