Modern C++: initialize constexpr tables

Modern C++: initialize constexpr tables - c++

Suppose I have a class X, which functionality requires a lot of constant table values, say an array A[1024]. I have a recurrent function f that computes its values, smth like
A[x] = f(A[x - 1]);
Suppose that A[0] is a known constant, therefore the rest of the array is constant too. What is the best way to calculate these values beforehand, using features of modern C++, and without storaging file with hardcoded values of this array? My workaround was a const static dummy variable:
const bool X::dummy = X::SetupTables();
bool X::SetupTables() {
A[0] = 1;
for (size_t i = 1; i <= A.size(); ++i)
A[i] = f(A[i - 1]);
}
But I believe, it’s not the most beautiful way to go.
Note: I emphasize that array is rather big and I want to avoid monstrosity of the code.

Since C++14, for loops are allowed in constexpr functions. Moreover, since C++17, std::array::operator[] is constexpr too.
So you can write something like this:
template<class T, size_t N, class F>
constexpr auto make_table(F func, T first)
{
std::array<T, N> a {first};
for (size_t i = 1; i < N; ++i)
{
a[i] = func(a[i - 1]);
}
return a;
}
Example: https://godbolt.org/g/irrfr2

I think this way is more readable:
#include <array>
constexpr int f(int a) { return a + 1; }
constexpr void init(auto &A)
{
A[0] = 1;
for (int i = 1; i < A.size(); i++) {
A[i] = f(A[i - 1]);
}
}
int main() {
std::array<int, 1024> A;
A[0] = 1;
init(A);
}
I need to make a disclaimer, that for big array sizes it is not guaranteed to generate array in constant time. And the accepted answer is more likely to generate the full array during template expansion.
But the way I propose has number of advantages:
It is quite safe that the compiler will not eat up all your memory and fails to expand the template.
The compilation speed is significantly faster
You use C++-ish interface when you use an array
The code is in general more readable
In a particular example when you need only one value, the variant with templates generated for me only a single number, while the variant with std::array generated a loop.
Update
Thanks to Navin I found a way to force compile time evaluation of the array.
You can force it to run at compile time if you return by value: std::array A = init();
So with slight modification the code looks as follows:
#include <array>
constexpr int f(int a) { return a + 1; }
constexpr auto init()
{
// Need to initialize the array
std::array<int, SIZE> A = {0};
A[0] = 1;
for (unsigned i = 1; i < A.size(); i++) {
A[i] = f(A[i - 1]);
}
return A;
}
int main() {
auto A = init();
return A[SIZE - 1];
}
To have this compiled one needs C++17 support, otherwise operator [] from std::array is not constexpr. I also update the measurements.
On assembly output
As I mentioned earlier the template variant is more concise. Please look here for more detail.
In the template variant, when I just pick the last value of the array, the whole assembly looks as follows:
main:
mov eax, 1024
ret
While for std::array variant I have a loop:
main:
subq $3984, %rsp
movl $1, %eax
.L2:
leal 1(%rax), %edx
movl %edx, -120(%rsp,%rax,4)
addq $1, %rax
cmpq $1024, %rax
jne .L2
movl 3972(%rsp), %eax
addq $3984, %rsp
ret
With std::array and return by value the assemble is identical to version with templates:
main:
mov eax, 1024
ret
On compilation speed
I compared these two variants:
test2.cpp:
#include <utility>
constexpr int f(int a) { return a + 1; }
template<int... Idxs>
constexpr void init(int* A, std::integer_sequence<int, Idxs...>) {
auto discard = {A[Idxs] = f(A[Idxs - 1])...};
static_cast<void>(discard);
}
int main() {
int A[SIZE];
A[0] = 1;
init(A + 1, std::make_integer_sequence<int, sizeof A / sizeof *A - 1>{});
}
test.cpp:
#include <array>
constexpr int f(int a) { return a + 1; }
constexpr void init(auto &A)
{
A[0] = 1;
for (int i = 1; i < A.size(); i++) {
A[i] = f(A[i - 1]);
}
}
int main() {
std::array<int, SIZE> A;
A[0] = 1;
init(A);
}
The results are:
| Size | Templates (s) | std::array (s) | by value |
|-------+---------------+----------------+----------|
| 1024 | 0.32 | 0.23 | 0.38s |
| 2048 | 0.52 | 0.23 | 0.37s |
| 4096 | 0.94 | 0.23 | 0.38s |
| 8192 | 1.87 | 0.22 | 0.46s |
| 16384 | 3.93 | 0.22 | 0.76s |
How I generated:
for SIZE in 1024 2048 4096 8192 16384
do
echo $SIZE
time g++ -DSIZE=$SIZE test2.cpp
time g++ -DSIZE=$SIZE test.cpp
time g++ -std=c++17 -DSIZE=$SIZE test3.cpp
done
And if you enable optimizations, the speed of code with template is even worse:
| Size | Templates (s) | std::array (s) | by value |
|-------+---------------+----------------+----------|
| 1024 | 0.92 | 0.26 | 0.29s |
| 2048 | 2.81 | 0.25 | 0.33s |
| 4096 | 10.94 | 0.23 | 0.36s |
| 8192 | 52.34 | 0.24 | 0.39s |
| 16384 | 211.29 | 0.24 | 0.56s |
How I generated:
for SIZE in 1024 2048 4096 8192 16384
do
echo $SIZE
time g++ -O3 -march=native -DSIZE=$SIZE test2.cpp
time g++ -O3 -march=native -DSIZE=$SIZE test.cpp
time g++ -O3 -std=c++17 -march=native -DSIZE=$SIZE test3.cpp
done
My gcc version:
$ g++ --version
g++ (Debian 7.2.0-1) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

One example:
#include <utility>
constexpr int f(int a) { return a + 1; }
template<int... Idxs>
constexpr void init(int* A, std::integer_sequence<int, Idxs...>) {
auto discard = {A[Idxs] = f(A[Idxs - 1])...};
static_cast<void>(discard);
}
int main() {
int A[1024];
A[0] = 1;
init(A + 1, std::make_integer_sequence<int, sizeof A / sizeof *A - 1>{});
}
Requires -ftemplate-depth=1026 g++ command line switch.
Example how to make it a static member:
struct B
{
int A[1024];
B() {
A[0] = 1;
init(A + 1, std::make_integer_sequence<int, sizeof A / sizeof *A - 1>{});
};
};
struct C
{
static B const b;
};
B const C::b;

just for fun, a c++17 compact one-liner might be ( requires an std::array A, or some other memory-contiguous tuple-like ):
std::apply( [](auto, auto&... x){ ( ( x = f((&x)[-1]) ), ... ); }, A );
note that this can be used in a constexpr function too.
That said, from c++14 we can use loops in constexpr functions, so we can write a constexpr function returning an std::array directly, written (almost) the usual way.

Related

How to quickly divide a big uinteger into a word?

I am currently developing a class to work with big unsigned integers. However, I need incomplete functionality, namely:
bi_uint+=bi_uint - Already implemented. No complaints.
bi_uint*=std::uint_fast64_t - Already implemented. No complaints.
bi_uint/=std::uint_fast64_t - Implemented but works very slowly, also requires a type that is twice as wide as uint_fast64_t. In the test case, division was 35 times slower than multiplication
Next, I will give my implementation of division, which is based on a simple long division algorithm:
#include <climits>
#include <cstdint>
#include <limits>
#include <vector>
class bi_uint
{
public:
using u64_t = std::uint_fast64_t;
constexpr static std::size_t u64_bits = CHAR_BIT * sizeof(u64_t);
using u128_t = unsigned __int128;
static_assert(sizeof(u128_t) >= sizeof(u64_t) * 2);
//little-endian
std::vector<u64_t> data;
//User should guarantee data.size()>0 and val>0
void self_div(const u64_t val)
{
auto it = data.rbegin();
if(data.size() == 1) {
*it /= val;
return;
}
u128_t rem = 0;
if(*it < val) {
rem = *it++;
data.pop_back();
}
u128_t r = rem % val;
while(it != data.rend()) {
rem = (r << u64_bits) + *it;
const u128_t q = rem / val;
r = rem % val;
*it++ = static_cast<u64_t>(q);
}
}
};
You can see that the unsigned __int128 type was used, thefore, this option is not portable and is tied to a single compiler - GCC and also require x64 platform.
After reading the page about division algorithms, I feel the appropriate algorithm would be "Newton-Raphson division". However, the "Newton–Raphson division" algorithm seems complicated to me. I guess there is a simpler algorithm for dividing the type "big_uint/uint" that would have almost the same performance.
Q: How to fast divide a bi_uint into a u64_t?
I have about 10^6 iterations, each iteration uses all the operations listed
If this is easily achievable, then I would like to have portability and not use unsigned __int128. Otherwise, I prefer to abandon portability in favor of an easier way.
EDIT1:
This is an academic project, I am not able to use third-party libraries.

Part 1 (See Part 2 below)
I managed to speedup your division code 5x times on my old laptop (and even 7.5x times on GodBolt servers) using Barrett Reduction, this is a technique that allows to replace single division by several multiplications and additions. Implemented whole code from sctracth just today.
If you want you can jump directly to code location at the end of my post, without reading long description, as code is fully runnable without any knowledge or dependency.
Code below is only for Intel x64, because I used Intel only instructions and only 64-bit variants of them. Sure it can be re-written for x32 too and for other processors, because Barrett algorithm is generic.
To explain whole Barrett Reduction in short pseudo-code I'll write it in Python as it is simplest language suitable for understandable pseudo-code:
# https://www.nayuki.io/page/barrett-reduction-algorithm
def BarrettR(n, s):
return (1 << s) // n
def BarrettDivMod(x, n, r, s):
q = (x * r) >> s
t = x - q * n
return (q, t) if t < n else (q + 1, t - n)
Basically in pseudo code above BarrettR() is done only single time for same divisor (you use same single-word divisor for whole big integer division). BarrettDivMod() is used each time when you want to make division or modulus operations, basically given input x and divisor n it returns tuple (x / n, x % n), nothing else, but does it faster than regular division instruction.
In below C++ code I implement same two functions of Barrett, but do some C++ specific optimizations to make it even more faster. Optimizations are possible due to fact that divisor n is always 64-bit, x is 128-bit but higher half is always smaller than n (last assumption happens because higher half in your big integer division is always a remainder modulus n).
Barrett algorithm works with divisor n that is NOT a power of 2, so divisors like 0, 1, 2, 4, 8, 16, ... are not allowed. This trivial case of divisor you can cover just by doing right bit-shift of big integer, because dividing by power of 2 is just a bit-shift. Any other divisor is allowed, including even divisors that are not power of 2.
Also it is important to note that my BarrettDivMod() accepts ONLY dividend x that is strictly smaller than divisor * 2^64, in other words higher half of 128-bit dividend x should be smaller than divisor. This is always true in your case of your big integer divison function, as higher half is always a remainder modulus divisor. This rule for x should be checked by you, it is checked in my BarrettDivMod() only as DEBUG assertion that is removed in release.
You can notice that BarrettDivMod() has two big branches, these are two variants of same algorithm, first uses CLang/GCC only type unsigned __int128, second uses only 64-bit instructions and hence suitable for MSVC.
I tried to target three compilers CLang/GCC/MSVC, but some how MSVC version got only 2x faster with Barrett, while CLang/GCC are 5x faster. Maybe I did some bug in MSVC code.
You can see that I used your class bi_uint for time measurement of two versions of code - with regular divide and with Barrett. Important to note that I changed your code quite significantly, first to not use u128 (so that MSVC version compiles that has no u128), second not to modify data vector, so it does read only division and doesn't store final result (this read-only is needed for me to run speed tests very fast without copying data vector on each test iteration). So your code is broken in my snippet, it can't-be copy pasted to be used straight away, I only used your code for speed measurement.
Barrett reduction works faster not only because division is slower than multiplication, but also because multiplication and addition are both very well pipelined on moder CPUs, modern CPU can execute several mul or add instructions within one cycle, but only if these several mul/add don't depend on each other's result, in other words CPU can run several instructions in parallel within one cycle. As far as I know division can't be run in parallel, because there is only single module within CPU to make division, but still it is a bit pipelined, because after 50% of first division is done second division can be started in parallel at beginning of CPU pipeline.
On some computers you may notice that regular Divide version is much slower sometimes, this happens because CLang/GCC do fallback to library-based Soft implementation of Division even for 128 bit dividend. In this case my Barrett may show even 7-10x times speedup, as it doesn't use library functions.
To overcome issue described above, about Soft division, it is better to add Assembly code with usage of DIV instruction directly, or to find some Intrinsic function that implements this inside your compiler (I think CLang/GCC have such intrinsic). Also I can write this Assembly implementation if needed, just tell me in comments.
Update. As promised, implemented Assembly variant of 128 bit division for CLang/GCC, function UDiv128Asm(). After this change it is used as a main implementation for CLang/GCC 128 division instead of regular u128(a) / b. You may come back to regular u128 impementation by replacing #if 0 with #if 1 inside body of UDiv128() function.
Try it online!
#include <cstdint>
#include <bit>
#include <stdexcept>
#include <string>
#include <immintrin.h>
#if defined(_MSC_VER) && !defined(__clang__)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if IS_MSVC
#include <intrin.h>
#endif
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#ifdef _DEBUG
#define DASSERT_MSG(cond, msg) ASSERT_MSG(cond, msg)
#else
#define DASSERT_MSG(cond, msg)
#endif
#define DASSERT(cond) DASSERT_MSG(cond, "")
using u16 = uint16_t;
using u32 = uint32_t;
using i64 = int64_t;
using u64 = uint64_t;
using UllPtr = unsigned long long *;
inline int GetExp(double x) {
return int((std::bit_cast<uint64_t>(x) >> 52) & 0x7FF) - 1023;
}
inline size_t BitSizeWrong(uint64_t x) {
return x == 0 ? 0 : (GetExp(x) + 1);
}
inline size_t BitSize(u64 x) {
size_t r = 0;
if (x >= (u64(1) << 32)) {
x >>= 32;
r += 32;
}
while (x >= 0x100) {
x >>= 8;
r += 8;
}
while (x) {
x >>= 1;
++r;
}
return r;
}
#if !IS_MSVC
inline u64 UDiv128Asm(u64 h, u64 l, u64 d, u64 * r) {
u64 q;
asm (R"(
.intel_syntax
mov rdx, %V[h]
mov rax, %V[l]
div %V[d]
mov %V[r], rdx
mov %V[q], rax
)"
: [q] "=r" (q), [r] "=r" (*r)
: [h] "r" (h), [l] "r" (l), [d] "r" (d)
: "rax", "rdx"
);
return q;
}
#endif
inline std::pair<u64, u64> UDiv128(u64 hi, u64 lo, u64 d) {
#if IS_MSVC
u64 r, q = _udiv128(hi, lo, d, &r);
return std::make_pair(q, r);
#else
#if 0
using u128 = unsigned __int128;
auto const dnd = (u128(hi) << 64) | lo;
return std::make_pair(u64(dnd / d), u64(dnd % d));
#else
u64 r, q = UDiv128Asm(hi, lo, d, &r);
return std::make_pair(q, r);
#endif
#endif
}
inline std::pair<u64, u64> UMul128(u64 a, u64 b) {
#if IS_MSVC
u64 hi, lo = _umul128(a, b, &hi);
return std::make_pair(hi, lo);
#else
using u128 = unsigned __int128;
auto const x = u128(a) * b;
return std::make_pair(u64(x >> 64), u64(x));
#endif
}
inline std::pair<u64, u64> USub128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_subborrow_u64(_subborrow_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline std::pair<u64, u64> UAdd128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_addcarry_u64(_addcarry_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline int UCmp128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
if (a_hi != b_hi)
return a_hi < b_hi ? -1 : 1;
return a_lo == b_lo ? 0 : a_lo < b_lo ? -1 : 1;
}
std::pair<u64, size_t> BarrettRS64(u64 n) {
// https://www.nayuki.io/page/barrett-reduction-algorithm
ASSERT_MSG(n >= 3 && (n & (n - 1)) != 0, "n " + std::to_string(n))
size_t const nbits = BitSize(n);
// 2^s = q * n + r; 2^s = (2^64 + q0) * n + r; 2^s - n * 2^64 = q0 * n + r
u64 const dnd_hi = (nbits >= 64 ? 0ULL : (u64(1) << nbits)) - n;
auto const q0 = UDiv128(dnd_hi, 0, n).first;
return std::make_pair(q0, nbits);
}
template <bool Use128 = true, bool Adjust = true>
std::pair<u64, u64> BarrettDivMod64(u64 x_hi, u64 x_lo, u64 n, u64 r, size_t s) {
// ((x_hi * 2^64 + x_lo) * (2^64 + r)) >> (64 + s)
DASSERT(x_hi < n);
#if !IS_MSVC
if constexpr(Use128) {
using u128 = unsigned __int128;
u128 const xf = (u128(x_hi) << 64) | x_lo;
u64 q = u64((u128(x_hi) * r + xf + u64((u128(x_lo) * r) >> 64)) >> s);
if (s < 64) {
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u128 t = xf - u128(q) * n;
return t < n ? std::make_pair(q, u64(t)) : std::make_pair(q + 1, u64(t) - n);
}
} else
#endif
{
auto const w1a = UMul128(x_lo, r).first;
auto const [w2b, w1b] = UMul128(x_hi, r);
auto const w2c = x_hi, w1c = x_lo;
u64 w1, w2 = _addcarry_u64(0, w1a, w1b, (UllPtr)&w1);
w2 += _addcarry_u64(0, w1, w1c, (UllPtr)&w1);
w2 += w2b + w2c;
if (s < 64) {
u64 q = (w2 << (64 - s)) | (w1 >> s);
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u64 const q = w2;
auto const [b_hi, b_lo] = UMul128(q, n);
auto const [t_hi, t_lo] = USub128(x_hi, x_lo, b_hi, b_lo);
return t_hi != 0 || t_lo >= n ? std::make_pair(q + 1, t_lo - n) : std::make_pair(q, t_lo);
}
}
}
#include <random>
#include <iomanip>
#include <iostream>
#include <chrono>
void TestBarrett() {
std::mt19937_64 rng{123}; //{std::random_device{}()};
for (size_t i = 0; i < (1 << 11); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
auto const [br, bs] = BarrettRS64(n);
for (size_t j = 0; j < (1 << 6); ++j) {
u64 const hi = rng() % n, lo = rng();
auto const [ref_q, ref_r] = UDiv128(hi, lo, n);
u64 bar_q = 0, bar_r = 0;
for (size_t k = 0; k < 2; ++k) {
bar_q = 0; bar_r = 0;
if (k == 0)
std::tie(bar_q, bar_r) = BarrettDivMod64<true>(hi, lo, n, br, bs);
else
std::tie(bar_q, bar_r) = BarrettDivMod64<false>(hi, lo, n, br, bs);
ASSERT_MSG(bar_q == ref_q && bar_r == ref_r, "i " + std::to_string(i) + ", j " + std::to_string(j) + ", k " + std::to_string(k) +
", nbits " + std::to_string(nbits) + ", n " + std::to_string(n) + ", bar_q " + std::to_string(bar_q) +
", ref_q " + std::to_string(ref_q) + ", bar_r " + std::to_string(bar_r) + ", ref_r " + std::to_string(ref_r));
}
}
}
}
class bi_uint
{
public:
using u64_t = std::uint64_t;
constexpr static std::size_t u64_bits = 8 * sizeof(u64_t);
//little-endian
std::vector<u64_t> data;
static auto constexpr DefPrep = [](auto n){
return std::make_pair(false, false);
};
static auto constexpr DefDivMod = [](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return UDiv128(dnd_hi, dnd_lo, dsr);
};
//User should guarantee data.size()>0 and val>0
template <typename PrepT = decltype(DefPrep), typename DivModT = decltype(DefDivMod)>
void self_div(const u64_t val, PrepT const & Prep = DefPrep, DivModT const & DivMod = DefDivMod)
{
auto it = data.rbegin();
if(data.size() == 1) {
*it /= val;
return;
}
u64_t rem_hi = 0, rem_lo = 0;
if(*it < val) {
rem_lo = *it++;
//data.pop_back();
}
auto const prep = Prep(val);
u64_t r = rem_lo % val;
u64_t q = 0;
while(it != data.rend()) {
rem_hi = r;
rem_lo = *it;
std::tie(q, r) = DivMod(rem_hi, rem_lo, val, prep);
//*it++ = static_cast<u64_t>(q);
it++;
auto volatile out = static_cast<u64_t>(q);
}
}
};
void TestSpeed() {
auto Time = []{
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
};
std::mt19937_64 rng{123};
std::vector<u64> limbs, divisors;
for (size_t i = 0; i < (1 << 17); ++i)
limbs.push_back(rng());
for (size_t i = 0; i < (1 << 8); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
divisors.push_back(n);
}
std::cout << std::fixed << std::setprecision(3);
double div_time = 0;
{
bi_uint x;
x.data = limbs;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr);
div_time = Time() - tim;
std::cout << "Divide time " << div_time << " sec" << std::endl;
}
{
bi_uint x;
x.data = limbs;
for (size_t i = 0; i < 2; ++i) {
if (IS_MSVC && i == 0)
continue;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr, [](auto n){ return BarrettRS64(n); },
[i](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return i == 0 ? BarrettDivMod64<true>(dnd_hi, dnd_lo, dsr, prep.first, prep.second) :
BarrettDivMod64<false>(dnd_hi, dnd_lo, dsr, prep.first, prep.second);
});
double const bar_time = Time() - tim;
std::cout << "Barrett" << (i == 0 ? "128" : "64 ") << " time " << bar_time << " sec, boost " << div_time / bar_time << std::endl;
}
}
}
int main() {
try {
TestBarrett();
TestSpeed();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
Divide time 3.171 sec
Barrett128 time 0.675 sec, boost 4.695
Barrett64 time 0.642 sec, boost 4.937
Part 2
As you have a very interesting question, after few days when I first published this post, I decided to implement from scratch all big integer math.
Below code implements math operations +, -, *, /, <<, >> for natural big numbers (positive integers), and +, -, *, / for floating big numbers. Both types of numbers are of arbitrary size (even millions of bits). Besides those as you requested, I fully implemented Newton-Raphson (both square and cubic variants) and Goldschmidt fast division algorithms.
Here is code snippet only for Newton-Raphson/Golschmidt functions, remaining code as it is very large is linked below on external server:
BigFloat & DivNewtonRaphsonSquare(BigFloat b) {
// https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
auto a = *this;
a.exp_ += b.SetScale(0);
if (b.sign_) {
a.sign_ = !a.sign_;
b.sign_ = false;
}
thread_local BigFloat two, c_48_17, c_32_17;
thread_local size_t static_prec = 0;
if (static_prec != BigFloat::prec_) {
two = 2;
c_48_17 = BigFloat(48) / BigFloat(17);
c_32_17 = BigFloat(32) / BigFloat(17);
static_prec = BigFloat::prec_;
}
BigFloat x = c_48_17 - c_32_17 * b;
for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
/ std::log2(17.0))) + 0.1; i < num_iters; ++i)
x = x * (two - b * x);
*this = a * x;
return BitNorm();
}
BigFloat & DivNewtonRaphsonCubic(BigFloat b) {
// https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
auto a = *this;
a.exp_ += b.SetScale(0);
if (b.sign_) {
a.sign_ = !a.sign_;
b.sign_ = false;
}
thread_local BigFloat one, c_140_33, c_m64_11, c_256_99;
thread_local size_t static_prec = 0;
if (static_prec != BigFloat::prec_) {
one = 1;
c_140_33 = BigFloat(140) / BigFloat(33);
c_m64_11 = BigFloat(-64) / BigFloat(11);
c_256_99 = BigFloat(256) / BigFloat(99);
static_prec = BigFloat::prec_;
}
BigFloat e, y, x = c_140_33 + b * (c_m64_11 + b * c_256_99);
for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
/ std::log2(99.0)) / std::log2(3.0)) + 0.1; i < num_iters; ++i) {
e = one - b * x;
y = x * e;
x = x + y + y * e;
}
*this = a * x;
return BitNorm();
}
BigFloat & DivGoldschmidt(BigFloat b) {
// https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division
auto a = *this;
a.exp_ += b.SetScale(0);
if (b.sign_) {
a.sign_ = !a.sign_;
b.sign_ = false;
}
BigFloat one = 1, two = 2, f;
for (size_t i = 0;; ++i) {
f = two - b;
a *= f;
b *= f;
if (i % 3 == 0 && (one - b).GetScale() < -i64(prec_) + i64(bit_sizeof(Word)))
break;
}
*this = a;
return BitNorm();
}
See Output: below, it will show that Newton-Raphson and Goldschmidt methods are actually 10x times slower than regular School-grade (called Reference in output) algorithm. Between each other these 3 advanced algorithms are about same speed. Probably Raphson/Goldschmidt could be faster if to use Fast Fourier Transform for multiplication, because multiplication of two large numbers takes 95% of time of these algorithms. In code below all results of Raphson/Goldschmidt algorithms are not only time-measured but also checked for correctness of results compared to School-grade (Reference) algorithm (see diff 2^... in console output, this shows how large is difference of result compared to school grade).
FULL SOURCE CODE HERE. Full code is so huge that it didn't fit into this StackOverflow due to SO limit of 30 000 characters per post, although I wrote this code from scracth specifically for this post. That's why providing external download link (PasteBin server), also click Try it online! linke below, it is same copy of code that is run live on GodBolt's linux servers:
Try it online!
Output:
========== 1 K bits ==========
Reference 0.000029 sec
Raphson2 0.000066 sec, boost 0.440x, diff 2^-8192
Raphson3 0.000092 sec, boost 0.317x, diff 2^-8192
Goldschmidt 0.000080 sec, boost 0.365x, diff 2^-1022
========== 2 K bits ==========
Reference 0.000071 sec
Raphson2 0.000177 sec, boost 0.400x, diff 2^-16384
Raphson3 0.000283 sec, boost 0.250x, diff 2^-16384
Goldschmidt 0.000388 sec, boost 0.182x, diff 2^-2046
========== 4 K bits ==========
Reference 0.000319 sec
Raphson2 0.000875 sec, boost 0.365x, diff 2^-4094
Raphson3 0.001122 sec, boost 0.285x, diff 2^-32768
Goldschmidt 0.000881 sec, boost 0.362x, diff 2^-32768
========== 8 K bits ==========
Reference 0.000484 sec
Raphson2 0.002281 sec, boost 0.212x, diff 2^-65536
Raphson3 0.002341 sec, boost 0.207x, diff 2^-65536
Goldschmidt 0.002432 sec, boost 0.199x, diff 2^-8189
========== 16 K bits ==========
Reference 0.001199 sec
Raphson2 0.009042 sec, boost 0.133x, diff 2^-16382
Raphson3 0.009519 sec, boost 0.126x, diff 2^-131072
Goldschmidt 0.009047 sec, boost 0.133x, diff 2^-16380
========== 32 K bits ==========
Reference 0.004311 sec
Raphson2 0.039151 sec, boost 0.110x, diff 2^-32766
Raphson3 0.041058 sec, boost 0.105x, diff 2^-262144
Goldschmidt 0.045517 sec, boost 0.095x, diff 2^-32764
========== 64 K bits ==========
Reference 0.016273 sec
Raphson2 0.165656 sec, boost 0.098x, diff 2^-524288
Raphson3 0.210301 sec, boost 0.077x, diff 2^-65535
Goldschmidt 0.208081 sec, boost 0.078x, diff 2^-65534
========== 128 K bits ==========
Reference 0.059469 sec
Raphson2 0.725865 sec, boost 0.082x, diff 2^-1048576
Raphson3 0.735530 sec, boost 0.081x, diff 2^-1048576
Goldschmidt 0.703991 sec, boost 0.084x, diff 2^-131069
========== 256 K bits ==========
Reference 0.326368 sec
Raphson2 3.007454 sec, boost 0.109x, diff 2^-2097152
Raphson3 2.977631 sec, boost 0.110x, diff 2^-2097152
Goldschmidt 3.363632 sec, boost 0.097x, diff 2^-262141
========== 512 K bits ==========
Reference 1.138663 sec
Raphson2 12.827783 sec, boost 0.089x, diff 2^-524287
Raphson3 13.799401 sec, boost 0.083x, diff 2^-524287
Goldschmidt 15.836072 sec, boost 0.072x, diff 2^-524286

On most of the modern CPUs, division is indeed much slower than multiplication.
Referring to
https://agner.org/optimize/instruction_tables.pdf
That on Intel Skylake an MUL/IMUL has a latency of 3-4 cycles; while an DIV/IDIV could take 26-90 cycles; which is 7 - 23 times slower than MUL; so your initial benchmark result isn't really a surprise.
If you happen to be on x86 CPU, as showing in the answer below, if this is indeed the bottleneck you could try to utilize AVX/SSE instructions. Basically you'd need to rely on special instructions than a general one like DIV/IDIV.
How to divide a __m256i vector by an integer variable?

Fast algorithm for computing intersections of N sets

I have n sets A0,A2,...An-1 holding items of a set E.
I define a configuration C as the integer made of n bits, so C has values between 0 and 2^n-1. Now, I define the following:
(C) an item e of E is in configuration C
<=> for each bit b of C, if b==1 then e is in Ab, else e is not in Ab
For instance for n=3, the configuration C=011 corresponds to items of E that are in A0 and A1 but NOT in A2 (the NOT is important)
C[bitmap] is the count of elements that have exactly that presence/absence pattern in the sets. C[001] is the number of elements in A0 that aren't also in any other sets.
Another possible definition is :
(V) an item e of E is in configuration V
<=> for each bit b of V, if b==1 then e is in Ab
For instance for n=3, the (V) configuration V=011 corresponds to items of E that are in A0 and A1
V[bitmap] is the count of the intersection of the selected sets. (i.e. the count of how many elements are in all of the sets where the bitmap is true.) V[001] is the number of elements in A0. V[011] is the number of elements in A0 and A1, regardless of whether or not they're also in A2.
In the following, the first picture shows items of sets A0, A1 and A2, the second picture shows size of (C) configurations and the third picture shows size of (V) configurations.
I can also represent the configurations by either of two vectors:
C[001]= 5 V[001]=14
C[010]=10 V[010]=22
C[100]=11 V[100]=24
C[011]= 2 V[011]= 6
C[101]= 3 V[101]= 7
C[110]= 6 V[110]=10
C[111]= 4 V[111]= 4
What I want is to write a C/C++ function that transforms C into V as efficiently as possible. A naive approach could be the following 'transfo' function that is obviously in O(4^n) :
#include <vector>
#include <cstdio>
using namespace std;
vector<size_t> transfo (const vector<size_t>& C)
{
vector<size_t> V (C.size());
for (size_t i=0; i<C.size(); i++)
{
V[i] = 0;
for (size_t j=0; j<C.size(); j++)
{
if ((j&i)==i) { V[i] += C[j]; }
}
}
return V;
}
int main()
{
vector<size_t> C = {
/* 000 */ 0,
/* 001 */ 5,
/* 010 */ 10,
/* 011 */ 2,
/* 100 */ 11,
/* 101 */ 3,
/* 110 */ 6,
/* 111 */ 4
};
vector<size_t> V = transfo (C);
for (size_t i=1; i<V.size(); i++) { printf ("[%2ld] C=%2ld V=%2ld\n", i, C[i], V[i]); }
}
My question is : is there a more efficient algorithm than the naive one for transforming a vector C into a vector V ? And what would be the complexity of such a "good" algorithm ?
Note that I could be interested by any SIMD solution.

Well, you are trying to compute 2n values, so you cannot do better than O(2n).
The naive approach starts from the observation that V[X] is obtained by fixing all the 1 bits in X and iterating over all the possible values where the 0 bits are. For example,
V[010] = C[010] + C[011] + C[110] + C[111]
But this approach performs O(2n) additions for every element of V, yielding a total complexity of O(4n).
Here is an O(n × 2n) algorithm. I too am curious if an O(2n) algorithm exists.
Let n = 4. Let us consider the full table of V versus C. Each line in the table below corresponds to one value of V and this value is calculated by summing up the columns marked with a *. The layout of * symbols can be easily deduced from the naive approach.
|0000|0001|0010|0011|0100|0101|0110|0111||1000|1001|1010|1011|1100|1101|1110|1111
0000| * | * | * | * | * | * | * | * || * | * | * | * | * | * | * | *
0001| | * | | * | | * | | * || | * | | * | | * | | *
0010| | | * | * | | | * | * || | | * | * | | | * | *
0011| | | | * | | | | * || | | | * | | | | *
0100| | | | | * | * | * | * || | | | | * | * | * | *
0101| | | | | | * | | * || | | | | | * | | *
0110| | | | | | | * | * || | | | | | | * | *
0111| | | | | | | | * || | | | | | | | *
-------------------------------------------------------------------------------------
1000| | | | | | | | || * | * | * | * | * | * | * | *
1001| | | | | | | | || | * | | * | | * | | *
1010| | | | | | | | || | | * | * | | | * | *
1011| | | | | | | | || | | | * | | | | *
1100| | | | | | | | || | | | | * | * | * | *
1101| | | | | | | | || | | | | | * | | *
1110| | | | | | | | || | | | | | | * | *
1111| | | | | | | | || | | | | | | | *
Notice that the top-left, top-right and bottom-right corners contain identical layouts. Therefore, we can perform some calculations in bulk as follows:
Compute the bottom half of the table (the bottom-right corner).
Add the values to the top half.
Compute the top-left corner.
If we let q = 2n, Thus the recurrent complexity is
T(q) = 2T(q/2) + O(q)
which solves using the Master Theorem to
T(q) = O(q log q)
or, in terms of n,
T(n) = O(n × 2n)

According to the great observation of #CătălinFrâncu, I wrote two recursive implementations of the transformation (see code below) :
transfo_recursive: very straightforward recursive implementation
transfo_avx2 : still recursive but use AVX2 for last step of the recursion for n=3
I propose here that the sizes of the counters are coded on 32 bits and that the n value can grow up to 28.
I also wrote an iterative implementation (transfo_iterative) based on my own observation on the recursion behaviour. Actually, I guess it is close to the non recursive implementation proposed by #chtz.
Here is the benchmark code:
// compiled with: g++ -O3 intersect.cpp -march=native -mavx2 -lpthread -DNDEBUG
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <thread>
#include <algorithm>
#include <sys/times.h>
#include <immintrin.h>
#include <boost/align/aligned_allocator.hpp>
using namespace std;
////////////////////////////////////////////////////////////////////////////////
typedef u_int32_t Count;
// Note: alignment is important for AVX2
typedef std::vector<Count,boost::alignment::aligned_allocator<Count, 8*sizeof(Count)>> CountVector;
typedef void (*callback) (CountVector::pointer C, size_t q);
typedef vector<pair<const char*, callback>> FunctionsVector;
unsigned int randomSeed = 0;
////////////////////////////////////////////////////////////////////////////////
double timestamp()
{
struct timespec timet;
clock_gettime(CLOCK_MONOTONIC, &timet);
return timet.tv_sec + (timet.tv_nsec/ 1000000000.0);
}
////////////////////////////////////////////////////////////////////////////////
CountVector getRandomVector (size_t n)
{
// We use the same seed, so we'll get the same random values
srand (randomSeed);
// We fill a vector of size q=2^n with random values
CountVector C(1ULL<<n);
for (size_t i=0; i<C.size(); i++) { C[i] = rand() % (1ULL<<(8*sizeof(Count))); }
return C;
}
////////////////////////////////////////////////////////////////////////////////
void copy_add_block (CountVector::pointer C, size_t q)
{
for (size_t i=0; i<q/2; i++) { C[i] += C[i+q/2]; }
}
////////////////////////////////////////////////////////////////////////////////
void copy_add_block_avx2 (CountVector::pointer C, size_t q)
{
__m256i* target = (__m256i*) (C);
__m256i* source = (__m256i*) (C+q/2);
size_t imax = q/(2*8);
for (size_t i=0; i<imax; i++)
{
target[i] = _mm256_add_epi32 (source[i], target[i]);
}
}
////////////////////////////////////////////////////////////////////////////////
// Naive approach : O(4^n)
////////////////////////////////////////////////////////////////////////////////
CountVector transfo_naive (const CountVector& C)
{
CountVector V (C.size());
for (size_t i=0; i<C.size(); i++)
{
V[i] = 0;
for (size_t j=0; j<C.size(); j++)
{
if ((j&i)==i) { V[i] += C[j]; }
}
}
return V;
}
////////////////////////////////////////////////////////////////////////////////
// Recursive approach : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
void transfo_recursive (CountVector::pointer C, size_t q)
{
if (q>1)
{
transfo_recursive (C+q/2, q/2);
transfo_recursive (C, q/2);
copy_add_block (C, q);
}
}
////////////////////////////////////////////////////////////////////////////////
// Iterative approach : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
void transfo_iterative (CountVector::pointer C, size_t q)
{
size_t i = 0;
for (size_t n=q; n>1; n>>=1, i++)
{
size_t d = 1<<i;
for (ssize_t j=q-1-d; j>=0; j--)
{
if ( ((j>>i)&1)==0) { C[j] += C[j+d]; }
}
}
}
////////////////////////////////////////////////////////////////////////////////
// Recursive AVX2 approach : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
#define ROTATE1(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,7,6,5,4,3,2,1))
#define ROTATE2(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,0,7,6,5,4,3,2))
#define ROTATE4(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,0,0,0,7,6,5,4))
void transfo_avx2 (CountVector::pointer V, size_t N)
{
__m256i k1 = _mm256_set_epi32 (0,0xFFFFFFFF,0,0xFFFFFFFF,0,0xFFFFFFFF,0,0xFFFFFFFF);
__m256i k2 = _mm256_set_epi32 (0,0,0xFFFFFFFF,0xFFFFFFFF,0,0,0xFFFFFFFF,0xFFFFFFFF);
__m256i k4 = _mm256_set_epi32 (0,0,0,0,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF);
if (N==8)
{
__m256i* source = (__m256i*) (V);
*source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE1(*source),k1));
*source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE2(*source),k2));
*source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE4(*source),k4));
}
else // if (N>8)
{
transfo_avx2 (V+N/2, N/2);
transfo_avx2 (V, N/2);
copy_add_block_avx2 (V, N);
}
}
#define ROTATE1_AND(s) _mm256_srli_epi64 ((s), 32) // odd 32bit elements
#define ROTATE2_AND(s) _mm256_bsrli_epi128 ((s), 8) // high 64bit halves
// gcc doesn't have _mm256_zextsi128_si256
// and _mm256_castsi128_si256 doesn't guarantee zero extension
// vperm2i118 can do the same job as vextracti128, but is slower on Ryzen
#ifdef __clang__ // high 128bit lane
#define ROTATE4_AND(s) _mm256_zextsi128_si256(_mm256_extracti128_si256((s),1))
#else
//#define ROTATE4_AND(s) _mm256_castsi128_si256(_mm256_extracti128_si256((s),1))
#define ROTATE4_AND(s) _mm256_permute2x128_si256((s),(s),0x81) // high bit set = zero that lane
#endif
void transfo_avx2_pcordes (CountVector::pointer C, size_t q)
{
if (q==8)
{
__m256i* source = (__m256i*) (C);
__m256i tmp = *source;
tmp = _mm256_add_epi32 (tmp, ROTATE1_AND(tmp));
tmp = _mm256_add_epi32 (tmp, ROTATE2_AND(tmp));
tmp = _mm256_add_epi32 (tmp, ROTATE4_AND(tmp));
*source = tmp;
}
else //if (N>8)
{
transfo_avx2_pcordes (C+q/2, q/2);
transfo_avx2_pcordes (C, q/2);
copy_add_block_avx2 (C, q);
}
}
////////////////////////////////////////////////////////////////////////////////
// Template specialization (same as transfo_avx2_pcordes)
////////////////////////////////////////////////////////////////////////////////
template <int n>
void transfo_template (__m256i* C)
{
const size_t q = 1ULL << n;
transfo_template<n-1> (C);
transfo_template<n-1> (C + q/2);
__m256i* target = (__m256i*) (C);
__m256i* source = (__m256i*) (C+q/2);
for (size_t i=0; i<q/2; i++)
{
target[i] = _mm256_add_epi32 (source[i], target[i]);
}
}
template <>
void transfo_template<0> (__m256i* C)
{
__m256i* source = (__m256i*) (C);
__m256i tmp = *source;
tmp = _mm256_add_epi32 (tmp, ROTATE1_AND(tmp));
tmp = _mm256_add_epi32 (tmp, ROTATE2_AND(tmp));
tmp = _mm256_add_epi32 (tmp, ROTATE4_AND(tmp));
*source = tmp;
}
void transfo_recur_template (CountVector::pointer C, size_t q)
{
#define CASE(n) case 1ULL<<n: transfo_template<n> ((__m256i*)C); break;
q = q / 8; // 8 is the number of 32 bits items in the AVX2 registers
// We have to 'link' the dynamic value of q with a static template specialization
switch (q)
{
CASE( 1); CASE( 2); CASE( 3); CASE( 4); CASE( 5); CASE( 6); CASE( 7); CASE( 8); CASE( 9);
CASE(10); CASE(11); CASE(12); CASE(13); CASE(14); CASE(15); CASE(16); CASE(17); CASE(18); CASE(19);
CASE(20); CASE(21); CASE(22); CASE(23); CASE(24); CASE(25); CASE(26); CASE(27); CASE(28); CASE(29);
default: printf ("transfo_template undefined for q=%ld\n", q); break;
}
}
////////////////////////////////////////////////////////////////////////////////
// Recursive approach multithread : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
void transfo_recur_thread (CountVector::pointer C, size_t q)
{
std::thread t1 (transfo_recur_template, C+q/2, q/2);
std::thread t2 (transfo_recur_template, C, q/2);
t1.join();
t2.join();
copy_add_block_avx2 (C, q);
}
////////////////////////////////////////////////////////////////////////////////
void header (const char* title, const FunctionsVector& functions)
{
printf ("\n");
for (size_t i=0; i<functions.size(); i++) { printf ("------------------"); } printf ("\n");
printf ("%s\n", title);
for (size_t i=0; i<functions.size(); i++) { printf ("------------------"); } printf ("\n");
printf ("%3s\t", "# n");
for (auto fct : functions) { printf ("%20s\t", fct.first); }
printf ("\n");
}
////////////////////////////////////////////////////////////////////////////////
// Check that alternative implementations provide the same result as the naive one
////////////////////////////////////////////////////////////////////////////////
void check (const FunctionsVector& functions, size_t nmin, size_t nmax)
{
header ("CHECK (0 values means similar to naive approach)", functions);
for (size_t n=nmin; n<=nmax; n++)
{
printf ("%3ld\t", n);
CountVector reference = transfo_naive (getRandomVector(n));
for (auto fct : functions)
{
// We call the (in place) transformation
CountVector C = getRandomVector(n);
(*fct.second) (C.data(), C.size());
int nbDiffs= 0;
for (size_t i=0; i<C.size(); i++)
{
if (reference[i]!=C[i]) { nbDiffs++; }
}
printf ("%20ld\t", nbDiffs);
}
printf ("\n");
}
}
////////////////////////////////////////////////////////////////////////////////
// Performance test
////////////////////////////////////////////////////////////////////////////////
void performance (const FunctionsVector& functions, size_t nmin, size_t nmax)
{
header ("PERFORMANCE", functions);
for (size_t n=nmin; n<=nmax; n++)
{
printf ("%3ld\t", n);
for (auto fct : functions)
{
// We compute the average time for several executions
// We use more executions for small n values in order
// to have more accurate results
size_t nbRuns = 1ULL<<(2+nmax-n);
vector<double> timeValues;
// We run the test several times
for (size_t r=0; r<nbRuns; r++)
{
// We don't want to measure time for vector fill
CountVector C = getRandomVector(n);
double t0 = timestamp();
(*fct.second) (C.data(), C.size());
double t1 = timestamp();
timeValues.push_back (t1-t0);
}
// We sort the vector of times in order to get the median value
std::sort (timeValues.begin(), timeValues.end());
double median = timeValues[timeValues.size()/2];
printf ("%20lf\t", log(1000.0*1000.0*median)/log(2));
}
printf ("\n");
}
}
////////////////////////////////////////////////////////////////////////////////
//
////////////////////////////////////////////////////////////////////////////////
int main (int argc, char* argv[])
{
size_t nmin = argc>=2 ? atoi(argv[1]) : 14;
size_t nmax = argc>=3 ? atoi(argv[2]) : 28;
// We get a common random seed
randomSeed = time(NULL);
FunctionsVector functions = {
make_pair ("transfo_recursive", transfo_recursive),
make_pair ("transfo_iterative", transfo_iterative),
make_pair ("transfo_avx2", transfo_avx2),
make_pair ("transfo_avx2_pcordes", transfo_avx2_pcordes),
make_pair ("transfo_recur_template", transfo_recur_template),
make_pair ("transfo_recur_thread", transfo_recur_thread)
};
// We check for some n that alternative implementations
// provide the same result as the naive approach
check (functions, 5, 15);
// We run the performance test
performance (functions, nmin, nmax);
}
And here is the performance graph:
One can observe that the simple recursive implementation is pretty good, even compared to the AVX2 version. The iterative implementation is a little bit disappointing but I made no big effort to optimize it.
Finally, for my own use case with 32 bits counters and for n values up to 28, these implementations are obviously ok for me compared to the initial "naive" approach in O(4^n).
Update
Following some remarks from #PeterCordes and #chtz, I added the following implementations:
transfo-avx2-pcordes : the same as transfo-avx2 with some AVX2 optimizations
transfo-recur-template : the same as transfo-avx2-pcordes but using C++ template specialization for implementing recursion
transfo-recur-thread : usage of multithreading for the two initial recursive calls of transfo-recur-template
Here is the updated benchmark result:
A few remarks about this result:
the AVX2 implementations are logically the best options but maybe not with the maximum potential x8 speedup with counters of 32 bits
among the AVX2 implementations, the template specialization brings a little speedup but it almost fades for bigger values for n
the simple two-threads version has bad results for n<20; for n>=20, there is always a little speedup but far from a potential 2x.

SIMD Program slow runtime

I'm starting with SIMD programming but i don't know what to do at this moment. I'm trying to diminish runtime but its doing it the other way.
This is my basic code:
https://codepaste.net/a8ut89
void blurr2(double * u, double * r) {
int i;
double dos[2] = { 2.0, 2.0 };
for (i = 0; i < SIZE - 1; i++) {
r[i] = u[i] + u[i + 1];
}
}
blurr2: 0.43s
int contarNegativos(double * u) {
int i;
int contador = 0;
for (i = 0; i < SIZE; i++) {
if (u[i] < 0) {
contador++;
}
}
return contador;
}
negativeCount: 1.38s
void ord(double * v, double * u, double * r) {
int i;
for (i = 0; i < SIZE; i += 2) {
r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
}
}
ord: 0.33
And this is my SIMD code:
https://codepaste.net/fbg1g5
void blurr2(double * u, double * r) {
__m128d rp2;
__m128d rdos;
__m128d rr;
int i;
int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
double dos[2] = { 2.0, 2.0 };
rdos = *(__m128d*)dos;
for (i = 0; i < sizeAux; i += 2) {
rp2 = *(__m128d*)&u[i + 1];
rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
*((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
}
}
blurr2: 0.42s
int contarNegativos(double * u) {
__m128d rcero;
__m128d rr;
int i;
double cero[2] = { 0.0, 0.0 };
int contador = 0;
rcero = *(__m128d*)cero;
for (i = 0; i < SIZE; i += 2) {
rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
if (((__int64 *)&rr)[0]) {
contador++;
};
if (((__int64 *)&rr)[1]) {
contador++;
};
}
return contador;
}
negativeCount: 1.42s
void ord(double * v, double * u, double * r) {
__m128d rr;
int i;
for (i = 0; i < SIZE; i += 2) {
*((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
}
}
ord: 0.35s
**Differents solutions.
Can you explain me what i'm doing wrong? I'm a bit lost...

Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. Your code is guaranteed to segfault on gcc/clang where __m128d is assumed to be aligned.
blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).
#include <immintrin.h>
#include <stdint.h>
static const size_t SIZE = 1024;
uint64_t countNegative(double * u) {
__m128i counts = _mm_setzero_si128();
for (size_t i = 0; i < SIZE; i += 2) {
__m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
}
//return counts[0] + counts[1]; // GNU C only, and less efficient
// horizontal sum
__m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
counts = _mm_add_epi64(counts, hi64);
uint64_t scalarcount = _mm_cvtsi128_si64(counts);
return scalarcount;
}
To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.
(source + asm on the Godbolt compiler explorer)
From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:
$LL4#countNegat:
movups xmm0, XMMWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+16]
cmpltpd xmm0, xmm2
psubq xmm1, xmm0
sub rax, 1
jne SHORT $LL4#countNegat
It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was actually branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.
Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointers onto SIMD vectors and expect it to work. If it does, you got lucky. Compilers usually generate similar code for that or intrinsics, and usually not worse for proper intrinsics.

Do you know that ord function with SIMD is not 1:1 to ord function without using SIMD instructions ?
In ord function without using SIMD, result of OR operation is calculated for even indexes
r[0] = v[0] | u[0],
r[2] = v[2] | u[2],
r[4] = v[4] | u[4]
what with odd indexes? maybe, if OR operations are calculated for all indexes, it will take more time than now.

Efficient float to int without overflow

using int_type = int;
int_type min = std::numeric_limits<Depth>::min();
int_type max = std::numeric_limits<Depth>::max();
int_type convert(float f) {
if(f < static_cast<float>(min)) return min; // overflow
else if(f > static_cast<float>(max)) return max; // overflow
else return static_cast<int_type>(f);
}
Is there a more efficient way to convert float f to int_type, while clamping it to the minimal and maximal values of the integer type?
For example, without casting min and max to float for the comparisons.

Sometimes Almost always, trusting the compiler is the best thing to do.
This code:
template<class Integral>
__attribute__((noinline))
int convert(float f)
{
using int_type = Integral;
constexpr int_type min = std::numeric_limits<int_type>::min();
constexpr int_type max = std::numeric_limits<int_type>::max();
constexpr float fmin = static_cast<float>(min);
constexpr float fmax = static_cast<float>(max);
if(f < fmin) return min; // overflow
if(f > fmax) return max; // overflow
return static_cast<int_type>(f);
}
compiled with -O2 and -fomit-frame-pointer, yields:
__Z7convertIiEif: ## #_Z7convertIiEif
.cfi_startproc
movl $-2147483648, %eax ## imm = 0xFFFFFFFF80000000
movss LCPI1_0(%rip), %xmm1 ## xmm1 = mem[0],zero,zero,zero
ucomiss %xmm0, %xmm1
ja LBB1_3
movl $2147483647, %eax ## imm = 0x7FFFFFFF
ucomiss LCPI1_1(%rip), %xmm0
ja LBB1_3
cvttss2si %xmm0, %eax
LBB1_3:
retq
I'm not sure it could be any more efficient.
Note LCPI_x defined here:
.section __TEXT,__literal4,4byte_literals
.align 2
LCPI1_0:
.long 3472883712 ## float -2.14748365E+9
LCPI1_1:
.long 1325400064 ## float 2.14748365E+9
How about clamping using fmin(), fmax()... [thanks to njuffa for the question]
The code does become more efficient, because the conditional jumps are removed. However, it starts to behave incorrectly at the clamping limits.
Consider:
template<class Integral>
__attribute__((noinline))
int convert2(float f)
{
using int_type = Integral;
constexpr int_type min = std::numeric_limits<int_type>::min();
constexpr int_type max = std::numeric_limits<int_type>::max();
constexpr float fmin = static_cast<float>(min);
constexpr float fmax = static_cast<float>(max);
f = std::min(f, fmax);
f = std::max(f, fmin);
return static_cast<int_type>(f);
}
call with
auto i = convert2<int>(float(std::numeric_limits<int>::max()));
results in:
-2147483648
Clearly we need to reduce the limits by epsilon because of a float's inability to accurately represent the full range of an int, so...
template<class Integral>
__attribute__((noinline))
int convert2(float f)
{
using int_type = Integral;
constexpr int_type min = std::numeric_limits<int_type>::min();
constexpr int_type max = std::numeric_limits<int_type>::max();
constexpr float fmin = static_cast<float>(min) - (std::numeric_limits<float>::epsilon() * static_cast<float>(min));
constexpr float fmax = static_cast<float>(max) - (std::numeric_limits<float>::epsilon() * static_cast<float>(max));
f = std::min(f, fmax);
f = std::max(f, fmin);
return static_cast<int_type>(f);
}
Should be better...
except that now the same function call yields:
2147483392
Incidentally, working on this actually led me to a bug in the original code. Because of the same rounding error issue, the > and < operators need to be replaced with >= and <=.
like so:
template<class Integral>
__attribute__((noinline))
int convert(float f)
{
using int_type = Integral;
constexpr int_type min = std::numeric_limits<int_type>::min();
constexpr int_type max = std::numeric_limits<int_type>::max();
constexpr float fmin = static_cast<float>(min);
constexpr float fmax = static_cast<float>(max);
if(f <= fmin) return min; // overflow
if(f >= fmax) return max; // overflow
return static_cast<int_type>(f);
}

For 32-bit integers, you can let the CPU do some of the clamping work for you.
The cvtss2si instruction will actually return 0x80000000 in the case of an out of range floating point number. This lets you eliminate one test most of the time:
int convert(float value)
{
int result = _mm_cvtss_si32(_mm_load_ss(&value));
if (result == 0x80000000 && value > 0.0f)
result = 0x7fffffff;
return result;
}
If you have lots of them to convert, then _mm_cvtps_epi32 lets you process four at once (with the same behaviour on overflow). That should be much faster than processing them one at a time, but you'd need to structure the code differently to make use of it.

If you are want to truncate, you can take advantage of avx2 and avx instructions 512:
#include <float.h>
int main() {
__m256 a = {5.423423, -4.243423, 423.4234234, FLT_MAX, 79.4234876, 19.7, 8.5454, 7675675.6};
__m256i b = _mm256_cvttps_epi32(a);
void p256_hex_u32(__m256i in) {
alignas(32) uint32_t v[8];
_mm256_store_si256((__m256i*)v, in);
printf("v4_u32: %d %d %d %d %d %d %d %d\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
Compile with:
g++ -std=c++17 -mavx2 a.cpp && ./a.out
and for mavx512 (my cpu does not support so I will not provide a working test, feel free to edit):
_mm512_maskz_cvtt_roundpd_epi64(k, value, _MM_FROUND_NO_EXC);

Any way faster than pow() to compute an integer power of 10 in C++?

I know power of 2 can be implemented using << operator.
What about power of 10? Like 10^5? Is there any way faster than pow(10,5) in C++? It is a pretty straight-forward computation by hand. But seems not easy for computers due to binary representation of the numbers... Let us assume I am only interested in integer powers, 10^n, where n is an integer.

Something like this:
int quick_pow10(int n)
{
static int pow10[10] = {
1, 10, 100, 1000, 10000,
100000, 1000000, 10000000, 100000000, 1000000000
};
return pow10[n];
}
Obviously, can do the same thing for long long.
This should be several times faster than any competing method. However, it is quite limited if you have lots of bases (although the number of values goes down quite dramatically with larger bases), so if there isn't a huge number of combinations, it's still doable.
As a comparison:
#include <iostream>
#include <cstdlib>
#include <cmath>
static int quick_pow10(int n)
{
static int pow10[10] = {
1, 10, 100, 1000, 10000,
100000, 1000000, 10000000, 100000000, 1000000000
};
return pow10[n];
}
static int integer_pow(int x, int n)
{
int r = 1;
while (n--)
r *= x;
return r;
}
static int opt_int_pow(int n)
{
int r = 1;
const int x = 10;
while (n)
{
if (n & 1)
{
r *= x;
n--;
}
else
{
r *= x * x;
n -= 2;
}
}
return r;
}
int main(int argc, char **argv)
{
long long sum = 0;
int n = strtol(argv[1], 0, 0);
const long outer_loops = 1000000000;
if (argv[2][0] == 'a')
{
for(long i = 0; i < outer_loops / n; i++)
{
for(int j = 1; j < n+1; j++)
{
sum += quick_pow10(n);
}
}
}
if (argv[2][0] == 'b')
{
for(long i = 0; i < outer_loops / n; i++)
{
for(int j = 1; j < n+1; j++)
{
sum += integer_pow(10,n);
}
}
}
if (argv[2][0] == 'c')
{
for(long i = 0; i < outer_loops / n; i++)
{
for(int j = 1; j < n+1; j++)
{
sum += opt_int_pow(n);
}
}
}
std::cout << "sum=" << sum << std::endl;
return 0;
}
Compiled with g++ 4.6.3, using -Wall -O2 -std=c++0x, gives the following results:
$ g++ -Wall -O2 -std=c++0x pow.cpp
$ time ./a.out 8 a
sum=100000000000000000
real 0m0.124s
user 0m0.119s
sys 0m0.004s
$ time ./a.out 8 b
sum=100000000000000000
real 0m7.502s
user 0m7.482s
sys 0m0.003s
$ time ./a.out 8 c
sum=100000000000000000
real 0m6.098s
user 0m6.077s
sys 0m0.002s
(I did have an option for using pow as well, but it took 1m22.56s when I first tried it, so I removed it when I decided to have optimised loop variant)

There are certainly ways to compute integral powers of 10 faster than using std::pow()! The first realization is that pow(x, n) can be implemented in O(log n) time. The next realization is that pow(x, 10) is the same as (x << 3) * (x << 1). Of course, the compiler knows the latter, i.e., when you are multiplying an integer by the integer constant 10, the compiler will do whatever is fastest to multiply by 10. Based on these two rules it is easy to create fast computations, even if x is a big integer type.
In case you are interested in games like this:
A generic O(log n) version of power is discussed in Elements of Programming.
Lots of interesting "tricks" with integers are discussed in Hacker's Delight.

A solution for any base using template meta-programming :
template<int E, int N>
struct pow {
enum { value = E * pow<E, N - 1>::value };
};
template <int E>
struct pow<E, 0> {
enum { value = 1 };
};
Then it can be used to generate a lookup-table that can be used at runtime :
template<int E>
long long quick_pow(unsigned int n) {
static long long lookupTable[] = {
pow<E, 0>::value, pow<E, 1>::value, pow<E, 2>::value,
pow<E, 3>::value, pow<E, 4>::value, pow<E, 5>::value,
pow<E, 6>::value, pow<E, 7>::value, pow<E, 8>::value,
pow<E, 9>::value
};
return lookupTable[n];
}
This must be used with correct compiler flags in order to detect the possible overflows.
Usage example :
for(unsigned int n = 0; n < 10; ++n) {
std::cout << quick_pow<10>(n) << std::endl;
}

An integer power function (which doesn't involve floating-point conversions and computations) may very well be faster than pow():
int integer_pow(int x, int n)
{
int r = 1;
while (n--)
r *= x;
return r;
}
Edit: benchmarked - the naive integer exponentiation method seems to outperform the floating-point one by about a factor of two:
h2co3-macbook:~ h2co3$ cat quirk.c
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <errno.h>
#include <string.h>
#include <math.h>
int integer_pow(int x, int n)
{
int r = 1;
while (n--)
r *= x;
return r;
}
int main(int argc, char *argv[])
{
int x = 0;
for (int i = 0; i < 100000000; i++) {
x += powerfunc(i, 5);
}
printf("x = %d\n", x);
return 0;
}
h2co3-macbook:~ h2co3$ clang -Wall -o quirk quirk.c -Dpowerfunc=integer_pow
h2co3-macbook:~ h2co3$ time ./quirk
x = -1945812992
real 0m1.169s
user 0m1.164s
sys 0m0.003s
h2co3-macbook:~ h2co3$ clang -Wall -o quirk quirk.c -Dpowerfunc=pow
h2co3-macbook:~ h2co3$ time ./quirk
x = -2147483648
real 0m2.898s
user 0m2.891s
sys 0m0.004s
h2co3-macbook:~ h2co3$

No multiplication and no table version:
//Nx10^n
int Npow10(int N, int n){
N <<= n;
while(n--) N += N << 2;
return N;
}

Here is a stab at it:
// specialize if you have a bignum integer like type you want to work with:
template<typename T> struct is_integer_like:std::is_integral<T> {};
template<typename T> struct make_unsigned_like:std::make_unsigned<T> {};
template<typename T, typename U>
T powT( T base, U exponent ) {
static_assert( is_integer_like<U>::value, "exponent must be integer-like" );
static_assert( std::is_same< U, typename make_unsigned_like<U>::type >::value, "exponent must be unsigned" );
T retval = 1;
T& multiplicand = base;
if (exponent) {
while (true) {
// branch prediction will be awful here, you may have to micro-optimize:
retval *= (exponent&1)?multiplicand:1;
// or /2, whatever -- `>>1` is probably faster, esp for bignums:
exponent = exponent>>1;
if (!exponent)
break;
multiplicand *= multiplicand;
}
}
return retval;
}
What is going on above is a few things.
First, so BigNum support is cheap, it is templateized. Out of the box, it supports any base type that supports *= own_type and either can be implicitly converted to int, or int can be implicitly converted to it (if both is true, problems will occur), and you need to specialize some templates to indicate that the exponent type involved is both unsigned and integer-like.
In this case, integer-like and unsigned means that it supports &1 returning bool and >>1 returning something it can be constructed from and eventually (after repeated >>1s) reaches a point where evaluating it in a bool context returns false. I used traits classes to express the restriction, because naive use by a value like -1 would compile and (on some platforms) loop forever, while (on others) would not.
Execution time for this algorithm, assuming multiplication is O(1), is O(lg(exponent)), where lg(exponent) is the number of times it takes to <<1 the exponent before it evaluates as false in a boolean context. For traditional integer types, this would be the binary log of the exponents value: so no more than 32.
I also eliminated all branches within the loop (or, made it obvious to existing compilers that no branch is needed, more precisely), with just the control branch (which is true uniformly until it is false once). Possibly eliminating even that branch might be worth it for high bases and low exponents...

Now, with constexpr, you can do like so:
constexpr int pow10(int n) {
int result = 1;
for (int i = 1; i<=n; ++i)
result *= 10;
return result;
}
int main () {
int i = pow10(5);
}
i will be calculated at compile time. ASM generated for x86-64 gcc 9.2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 100000
mov eax, 0
pop rbp
ret

You can use the lookup table which will be by far the fastest
You can also consider using this:-
template <typename T>
T expt(T p, unsigned q)
{
T r(1);
while (q != 0) {
if (q % 2 == 1) { // q is odd
r *= p;
q--;
}
p *= p;
q /= 2;
}
return r;
}

This function will calculate x ^ y much faster then pow. In case of integer values.
int pot(int x, int y){
int solution = 1;
while(y){
if(y&1)
solution*= x;
x *= x;
y >>= 1;
}
return solution;
}

A generic table builder based on constexpr functions. The floating point part requires c++20 and gcc, but the non-floating point part works for c++17. If you change the "auto" type param to "long" you can use c++14. Not properly tested.
#include <cstdio>
#include <cassert>
#include <cmath>
// Precomputes x^N
// Inspired by https://stackoverflow.com/a/34465458
template<auto x, unsigned char N, typename AccumulatorType>
struct PowTable {
constexpr PowTable() : mTable() {
AccumulatorType p{ 1 };
for (unsigned char i = 0; i < N; ++i) {
p *= x;
mTable[i] = p;
}
}
AccumulatorType operator[](unsigned char n) const {
assert(n < N);
return mTable[n];
}
AccumulatorType mTable[N];
};
long pow10(unsigned char n) {
static constexpr PowTable<10l, 10, long> powTable;
return powTable[n-1];
}
double powe(unsigned char n) {
static constexpr PowTable<2.71828182845904523536, 10, double> powTable;
return powTable[n-1];
}
int main() {
printf("10^3=%ld\n", pow10(3));
printf("e^2=%f", powe(2));
assert(pow10(3) == 1000);
assert(powe(2) - 7.389056 < 0.001);
}

Based on Mats Petersson approach, but compile time generation of cache.
#include <iostream>
#include <limits>
#include <array>
// digits
template <typename T>
constexpr T digits(T number) {
return number == 0 ? 0
: 1 + digits<T>(number / 10);
}
// pow
// https://stackoverflow.com/questions/24656212/why-does-gcc-complain-error-type-intt-of-template-argument-0-depends-on-a
// unfortunatly we can't write `template <typename T, T N>` because of partial specialization `PowerOfTen<T, 1>`
template <typename T, uintmax_t N>
struct PowerOfTen {
enum { value = 10 * PowerOfTen<T, N - 1>::value };
};
template <typename T>
struct PowerOfTen<T, 1> {
enum { value = 1 };
};
// sequence
template<typename T, T...>
struct pow10_sequence { };
template<typename T, T From, T N, T... Is>
struct make_pow10_sequence_from
: make_pow10_sequence_from<T, From, N - 1, N - 1, Is...> {
//
};
template<typename T, T From, T... Is>
struct make_pow10_sequence_from<T, From, From, Is...>
: pow10_sequence<T, Is...> {
//
};
// base10list
template <typename T, T N, T... Is>
constexpr std::array<T, N> base10list(pow10_sequence<T, Is...>) {
return {{ PowerOfTen<T, Is>::value... }};
}
template <typename T, T N>
constexpr std::array<T, N> base10list() {
return base10list<T, N>(make_pow10_sequence_from<T, 1, N+1>());
}
template <typename T>
constexpr std::array<T, digits(std::numeric_limits<T>::max())> base10list() {
return base10list<T, digits(std::numeric_limits<T>::max())>();
};
// main pow function
template <typename T>
static T template_quick_pow10(T n) {
static auto values = base10list<T>();
return values[n];
}
// client code
int main(int argc, char **argv) {
long long sum = 0;
int n = strtol(argv[1], 0, 0);
const long outer_loops = 1000000000;
if (argv[2][0] == 't') {
for(long i = 0; i < outer_loops / n; i++) {
for(int j = 1; j < n+1; j++) {
sum += template_quick_pow10(n);
}
}
}
std::cout << "sum=" << sum << std::endl;
return 0;
}
Code does not contain quick_pow10, integer_pow, opt_int_pow for better readability, but tests done with them in the code.
Compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5), using -Wall -O2 -std=c++0x, gives the following results:
$ g++ -Wall -O2 -std=c++0x main.cpp
$ time ./a.out 8 a
sum=100000000000000000
real 0m0.438s
user 0m0.432s
sys 0m0.008s
$ time ./a.out 8 b
sum=100000000000000000
real 0m8.783s
user 0m8.777s
sys 0m0.004s
$ time ./a.out 8 c
sum=100000000000000000
real 0m6.708s
user 0m6.700s
sys 0m0.004s
$ time ./a.out 8 t
sum=100000000000000000
real 0m0.439s
user 0m0.436s
sys 0m0.000s

if you want to calculate, e.g.,10^5, then you can:
int main() {
cout << (int)1e5 << endl; // will print 100000
cout << (int)1e3 << endl; // will print 1000
return 0;
}

result *= 10 can also be written as result = (result << 3) + (result << 1)
constexpr int pow10(int n) {
int result = 1;
for (int i = 0; i < n; i++) {
result = (result << 3) + (result << 1);
}
return result;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Modern C++: initialize constexpr tables - c++

Related

How to quickly divide a big uinteger into a word?

Fast algorithm for computing intersections of N sets

SIMD Program slow runtime

Efficient float to int without overflow

Any way faster than pow() to compute an integer power of 10 in C++?

Categories

Resources