INTEL SIMD: why is inplace multiplication so slow?

I have written some vector-methods that do simple math inplace or copying and that share the same penalty for the inplace variant.
The simplest can be boiled down to something like these:
void scale(float* dst, const float* src, int count, float factor)
{
__m128 factorV = _mm_set1_ps(factor); // broadcast the scalar factor to all 4 lanes
for(int i = 0; i < count; i+= 4)
{
__m128 in = _mm_load_ps(src);
in = _mm_mul_ps(in, factorV);
_mm_store_ps(dst, in);
dst += 4;
src += 4;
}
}
testing code:
for(int i = 0; i < 1000000; i++)
{
scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}
When testing, i.e. repeatedly running this function on the SAME buffers, I found that the in-place case (dst and src the same) is the slow one: if they are different, it's about a factor of 70 faster. Most of the cycles are burned on writing (i.e. _mm_store_ps).
Interestingly, the same behaviour does not hold for addition, i.e. += works nicely, only *= is a problem.
--
This has been answered in the comments. It's denormals during artificial testing.

Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.
(Turns out, yes that was the problem for the OP).
Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.
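To get a feel for how quickly this happens, here is a minimal sketch that counts how many in-place scaling passes a value survives before it stops being a normal float. The function name passesUntilNotNormal and the maxPasses cap are made up for illustration, and it assumes ordinary IEEE-754 float arithmetic (no -ffast-math).

```cpp
#include <cmath>

// Repeatedly scale `value` by `factor`, as the in-place benchmark does, and
// count the passes until the value is no longer a normal float (it became
// subnormal, zero, infinite or NaN). `maxPasses` only guards against factors
// that never leave the normal range.
int passesUntilNotNormal(float value, float factor, int maxPasses)
{
    int passes = 0;
    while (passes < maxPasses && std::fpclassify(value) == FP_NORMAL)
    {
        value *= factor;
        ++passes;
    }
    return passes;
}
```

With a start value of 1.0f and a factor of 0.5f the data leaves the normal range after 127 passes, which is a tiny fraction of the 1000000 iterations in the test loop from the question.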
It doesn't take extra time to produce a +-Inf or NaN result, but it does for gradual underflow to subnormal on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.
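If flushing subnormals to zero is acceptable for your data, the FTZ and DAZ bits can also be set by hand instead of relying on -ffast-math. The following is only a sketch: it assumes an x86-64 build where scalar float math uses SSE, the function name multiplyFlushingSubnormals is hypothetical, and the volatile variables are there purely to keep the compiler from folding the multiply at compile time.

```cpp
#include <immintrin.h>

// Multiply with FTZ (flush results to zero) and DAZ (treat subnormal inputs
// as zero) enabled, then restore the caller's MXCSR settings.
float multiplyFlushingSubnormals(float x, float factor)
{
    const unsigned int savedCsr = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    volatile float vx = x;           // volatile: force a run-time multiply
    volatile float vf = factor;
    volatile float result = vx * vf; // a subnormal result is flushed to 0

    _mm_setcsr(savedCsr);            // restore the previous FP environment
    return result;
}
```

In a real program you would normally set the mode once per thread before the hot loop rather than around every multiply.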
I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.
There's a performance counter on Intel CPUs for fp_assist.any which counts when a sub-normal result requires extra microcode uops to handle the special case. (I think it's as intrusive as that for the front-end and OoO exec. It's definitely slow, though.)
Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?
Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)

Related

Efficient equality test for bitstrings with arbitrary offsets

I have more than 1e7 sequences of tokens, where each token can only take one of four possible values.
In order to make this dataset fit into memory, I decided to encode each token in 2 bits, which allows storing 4 tokens in a byte instead of just one (when using a char for each token / std::string for a sequence). I store each sequence in a char array.
For some algorithm, I need to test arbitrary subsequences of two token sequences for exact equality. Each subsequence can have an arbitrary offset. The length is typically between 10 and 30 tokens (random) and is the same for the two subsequences.
My current method is to operate in chunks:
Copy up to 32 tokens (each having 2 bits) from each subsequence into a uint64_t. This is realized in a loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t.
Compare the two uint64_t. If they are not equal, return.
Repeat until all tokens in the subsequences have been processed.
#include <climits>
#include <cstdint>
using Block = char;
constexpr int BitsPerToken = 2;
constexpr int TokenPerBlock = sizeof(Block) * CHAR_BIT / BitsPerToken;
Block getTokenFromBlock(Block b, int nt) noexcept
{
return (b >> (nt * BitsPerToken)) & ((1UL << (BitsPerToken)) - 1);
}
bool seqEqual(Block const* seqA, int startA, int endA, Block const* seqB, int startB, int endB) noexcept
{
using CompareBlock = uint64_t;
constexpr int TokenPerCompareBlock = sizeof(CompareBlock) * CHAR_BIT / BitsPerToken;
const int len = endA - startA;
int posA = startA;
int posB = startB;
CompareBlock curA = 0;
CompareBlock curB = 0;
for (int i = 0; i < len; ++i, ++posA, ++posB)
{
const int cmpIdx = i % TokenPerCompareBlock; // token position inside the 64-bit compare block
const int blockA = posA / TokenPerBlock;
const int idxA = posA % TokenPerBlock;
const int blockB = posB / TokenPerBlock;
const int idxB = posB % TokenPerBlock;
if ((i % TokenPerCompareBlock) == 0)
{
if (curA != curB)
return false;
curA = 0;
curB = 0;
}
// widen to 64 bits before shifting so that shifts of up to 62 bits are valid
curA += static_cast<CompareBlock>(getTokenFromBlock(seqA[blockA], idxA)) << (BitsPerToken * cmpIdx);
curB += static_cast<CompareBlock>(getTokenFromBlock(seqB[blockB], idxB)) << (BitsPerToken * cmpIdx);
}
if (curA != curB)
return false;
return true;
}
I figured that this should be quite fast (comparing 32 tokens simultaneously), but it is more than two times slower than using an std::string (with each token stored in a char) and its operator==.
I have looked into std::memcmp, but cannot use it because the subsequence might start somewhere within a byte (at a multiple of 2 bits, though).
Another candidate would be boost::dynamic_bitset, which basically implements the same storage format. However, it does not include equality tests.
How can I achieve fast equality tests using this compressed format?

First of all, this is the kind of computation where the target processor, RAM, compiler and compiler flags can drastically change the results. Unfortunately, this critical information is not provided. Let's assume you use a fairly recent mainstream x86-64 processor, common DDR4 SDRAM, a relatively up-to-date compiler like Clang/GCC, and that optimizations are enabled (i.e. -O3 and possibly -march=native).
Clang and GCC use fast comparison functions for comparing strings: memcmp for GCC 12 and bcmp for Clang 15, respectively. The two functions are highly optimized on most platforms: they typically compare short strings by blocks of 8 bytes (uint64_t) and large strings using SIMD instructions.
Your optimization is good for reducing the memory footprint, but it introduces more computation, and there is a high chance the operation is already compute-bound if the input buffer is already in the CPU cache. In addition, the computation is not SIMD-friendly due to the inner loop: the compiler will almost certainly not generate efficient code because of the bit-wise operations. The thing is, scalar code is slow. In fact, scalar byte-per-byte computations are generally so slow that they are usually far from being able to saturate the RAM bandwidth (at least the bandwidth achievable using only 1 core), as opposed to memcmp. For example, a Skylake/Coffee Lake processor at 4 GHz can only read 8 GiB/s from the L1 cache using scalar byte-per-byte code, while AVX2 SIMD code can read 256 GiB/s. For writes it is half that: 4 GiB/s vs 128 GiB/s. A single channel of DDR4 SDRAM @ 3200 MHz can theoretically reach ~24 GiB/s, that is, far more than byte-per-byte scalar sequential code. The L3 cache has a much higher bandwidth.
If you want fast code for large sequences, then you need to either help your compiler so it can use SIMD instructions (not so easy in this case), use non-portable SIMD intrinsics, or possibly use a relatively portable SIMD library to generate fairly good SIMD code (though low-level platform-dependent intrinsics are more flexible/featureful).
I expect the main bottleneck to come from the "loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t". Indeed, this loop will likely generate a dependency chain of instructions (operating on the same uint64_t variable) that can neither be executed efficiently by the processor nor easily optimized by the compiler.
A typical solution would be to read blocks of 8 bytes (using memcpy to do it correctly, and hoping the compiler optimizes it properly). The bytes can be reordered using a bswap instruction on x86-64 processors; this is not needed on big-endian processors. A shift+mask can then be applied so as to compare only the useful part. Here is an (untested) example to show the idea:
if(length >= 16)
{
uint64_t block1, block2;
uint64_t prev_block1 = 0, prev_block2 = 0;
unsigned int shift1 = (start1 % 4) * 2;
unsigned int shift2 = (start2 % 4) * 2;
uint64_t mask = 0xFFFFFFFFFFFFFF00ull;
// Read blocks 7 bytes at a time for the sake of simplicity
for(size_t i=0; i<length-7 ; i+=7)
{
// Safe and cheap on GCC/Clang
memcpy(&block1, &charArray1[i], 8);
memcpy(&block2, &charArray2[i], 8);
// Architecture-dependent: reorder bytes on little-endian processors.
// There is a fast instruction for that on x86-64 processors: bswap.
// See: https://stackoverflow.com/questions/36497605
block1 = reorder_bytes(block1);
block2 = reorder_bytes(block2);
block1 = (block1 << shift1) & mask;
block2 = (block2 << shift2) & mask;
if(block1 != block2)
return false;
}
}
// TODO: compute the remainder part for the last block
This operation can be done using the SSE/AVX instruction sets so as to be faster for large sequences. Note that you can perform a special optimization when shift1 == shift2 (especially when both are equal to 0).
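For reference, here is a more complete scalar sketch of the same block-wise idea. It is a variation rather than the exact code above: it keeps the bit layout from the question (token t in bits (t % 4) * 2 of byte t / 4), assembles each 64-bit window in little-endian order so that no byte swap is needed, and compares 28 tokens per step. The names packTokens, loadLE64, windowAt and seqEqualPacked are made up for this example, and the caller is assumed to guarantee that both subsequences lie inside their buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 2-bit tokens, 4 per byte: token t goes to bits (t % 4) * 2 of byte t / 4.
std::vector<unsigned char> packTokens(const std::vector<int>& tokens)
{
    std::vector<unsigned char> packed((tokens.size() + 3) / 4, 0);
    for (std::size_t t = 0; t < tokens.size(); ++t)
        packed[t / 4] |= static_cast<unsigned char>((tokens[t] & 3) << ((t % 4) * 2));
    return packed;
}

// Assemble up to 8 bytes into a little-endian 64-bit value without reading
// past the end of the buffer (missing bytes are treated as zero).
static std::uint64_t loadLE64(const unsigned char* p, std::size_t avail)
{
    std::uint64_t v = 0;
    const std::size_t n = avail < 8 ? avail : 8;
    for (std::size_t i = 0; i < n; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}

// 64-bit window whose bit 0 is the first bit of token `pos`.
// After the shift at least 58 bits (29 tokens) are meaningful.
static std::uint64_t windowAt(const unsigned char* seq, std::size_t sizeBytes, std::size_t pos)
{
    const std::size_t byte = pos / 4;
    return loadLE64(seq + byte, sizeBytes - byte) >> ((pos % 4) * 2);
}

// Compare `len` tokens of a (from token startA) with b (from token startB).
bool seqEqualPacked(const unsigned char* a, std::size_t sizeA, std::size_t startA,
                    const unsigned char* b, std::size_t sizeB, std::size_t startB,
                    std::size_t len)
{
    constexpr std::size_t TokensPerStep = 28; // 7 bytes per step
    while (len > 0)
    {
        const std::size_t n = len < TokensPerStep ? len : TokensPerStep;
        const std::uint64_t mask = (std::uint64_t{1} << (2 * n)) - 1;
        if ((windowAt(a, sizeA, startA) ^ windowAt(b, sizeB, startB)) & mask)
            return false;
        startA += n;
        startB += n;
        len -= n;
    }
    return true;
}
```

If the buffers are padded with at least 7 spare bytes, loadLE64 could be replaced by a plain 8-byte memcpy on little-endian machines, which is closer to what the snippet above does.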
One should keep in mind that the bit-packing computation is pretty expensive, even using SIMD code. It will certainly not be faster than a memcmp unless the operation is memory-bound, which is unlikely to be the case. For example, a Skylake/Coffee Lake processor can load and compare 2 blocks of 32 bytes (i.e. 32 tokens per block) in only 1 cycle (reciprocal throughput) using the AVX2 SIMD instruction set, while there is no chance each iteration of the above bit-packing loop can take less than 2 cycles to process 7 bytes (i.e. 28 tokens). Using AVX2 to optimize the above code is possible, but the AVX lanes and the byte reordering require several additional instructions, so it will certainly still be slightly slower than a basic, very fast comparison (a few cycles to process ~120 tokens).
The only use case where packing can help is when multiple cores are used to do the computation. Indeed, in that case, the bit-packing code can scale well because it is likely compute-bound, while the string-based version will quickly be limited by the speed of the RAM since it is likely memory-bound.

If there are only 10 million tokens total, that's 20 Mbit, or 2-3 MB. If you keep shifted versions of them in different arrays, from 2-bit shifted to 30-bit shifted (assuming a 4-byte comparison at once; ignore the 32-bit shift, as it just means a different starting position), you can do a direct comparison (std::memcmp) with no shifting involved (fast) after selecting the right array with the modulo of the arbitrary offset. But this requires the token sequence to be constant through many function calls (if not for the lifetime of the program).
If these tokens are part of a much bigger data set, you can put a caching layer (one that caches fixed-length chunks and joins them to get the requested sub-sequences for A and B) just before the shifted initialization. Maybe LRU/LFU works fast enough if the token access pattern is cache-friendly. If it's not cache-friendly, then perhaps just reaching the arrays could be the bottleneck, with or without shifting.
If you do the checking per byte instead of per 4 bytes, it requires only 4 arrays instead of 16, and it shouldn't add too big a requirement with caching.
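A minimal sketch of the per-byte variant with 4 pre-shifted copies might look like this. The type ShiftedCopies and the functions buildShiftedCopies and equalShifted are hypothetical names, the tokens are assumed to be given as ints in the range 0..3, and the caller is assumed to pass ranges that lie inside both sequences.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <vector>

// shifted[s] holds the tokens s, s+1, s+2, ... packed 4 per byte, so any
// subsequence starting at token `start` is byte-aligned in shifted[start % 4].
struct ShiftedCopies
{
    std::array<std::vector<unsigned char>, 4> shifted;
};

ShiftedCopies buildShiftedCopies(const std::vector<int>& tokens)
{
    ShiftedCopies c;
    for (std::size_t s = 0; s < 4; ++s)
    {
        const std::size_t n = tokens.size() > s ? tokens.size() - s : 0;
        c.shifted[s].assign((n + 3) / 4, 0);
        for (std::size_t i = 0; i < n; ++i)
            c.shifted[s][i / 4] |=
                static_cast<unsigned char>((tokens[s + i] & 3) << ((i % 4) * 2));
    }
    return c;
}

// Compare `len` tokens of a (from token startA) with b (from token startB).
bool equalShifted(const ShiftedCopies& a, std::size_t startA,
                  const ShiftedCopies& b, std::size_t startB, std::size_t len)
{
    const unsigned char* pa = a.shifted[startA % 4].data() + startA / 4;
    const unsigned char* pb = b.shifted[startB % 4].data() + startB / 4;
    const std::size_t fullBytes = len / 4;
    if (std::memcmp(pa, pb, fullBytes) != 0) // whole bytes, no shifting needed
        return false;
    const std::size_t rest = len % 4;        // 0..3 tokens left in the last byte
    if (rest == 0)
        return true;
    const unsigned mask = (1u << (2 * rest)) - 1;
    return ((pa[fullBytes] ^ pb[fullBytes]) & mask) == 0;
}
```

The price is the 4x memory footprint mentioned above, so this only pays off if the sequences stay constant across many comparisons.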
You can also add an XOR result of fixed-length (like 50-100) sub-sequences for every offset as a way of exiting more quickly. Again, this requires 4x more memory space. If the XOR results of the first tokens (+fixed length) are not equal, then the sub-sequences are not equal. This would at least reduce the number of comparisons.
Another way is directly caching f(x,y)->bool, like Python does with its own caching. But this would be much worse than "fixed-length-chunked-caching & joining them" due to non-reusable parts and a lot of duplication.

Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?

I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4 float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running on an i7-8565U processor (whiskey lake) and compiling with: g++ main.cpp -mavx -o test.exe to enable the AVX extension instructions in GCC.
The weird thing is that the SIMD version takes about 1.4 seconds, and the non-SIMD version takes only 1 second. I feel as though I'm doing something wrong, as I thought the SIMD version should run 4 times faster. Any help is appreciated, the code is below. I've placed the non-SIMD code in comments; the code in its current form is the SIMD version.
#include "immintrin.h" // for AVX
#include <iostream>
struct NonSIMDVec {
float x, y, z, w;
};
NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b);
int main() {
union { __m128 result; float res[4]; };
// union { NonSIMDVec result; float res[4]; };
float total = 0;
for(unsigned i = 0; i < 100000000; ++i) {
__m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
__m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
// NonSIMDVec a4 = {0.0000002f, 1.23f, 2.0f, (float)i};
// NonSIMDVec b4 = {(float)i, 1.3f, 2.0f, 0.000001f};
result = _mm_mul_ps(a4, b4);
// result = multiplyTwo(a4, b4);
total += res[0];
total += res[1];
total += res[2];
total += res[3];
}
std::cout << total << '\n';
}
NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b)
{ return {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w}; } // element-wise product, as described above

With optimization disabled (the gcc default is -O0), intrinsics are often terrible. Anti-optimized -O0 code-gen for intrinsics usually hurts a lot (even more than for scalar), and some of the function-like intrinsics introduce extra store/reload overhead. Plus the extra store-forwarding latency of -O0 tends to hurt more because there's less ILP when you do things with 1 vector instead of 4 scalars.
Use gcc -march=native -O3
But even with optimization enabled, your code is still written to destroy the performance of SIMD by doing a horizontal add of each vector inside the loop. See How to Calculate Vector Dot Product Using SSE Intrinsic Functions in C for how to not do that: use _mm_add_ps to accumulate a __m128 total vector, and only horizontal sum it outside the loop.
You bottleneck your loop on FP-add latency by doing scalar total += inside the loop. That loop-carried dependency chain means your loop can't run any faster than 1 float per 4 cycles on your Skylake-derived microarchitecture where addss latency is 4 cycles. (https://agner.org/optimize/)
Even better than __m128 total, use 4 or 8 vectors to hide FP add latency, so your SIMD loop can bottleneck on mul/add (or FMA) throughput instead of latency.
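As a rough sketch of that advice (two vector accumulators, one horizontal sum after the loop), assuming the inputs live in memory and that n is a multiple of 8; the function name sumOfProducts is made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Sum of element-wise products a[i] * b[i]. Two independent accumulators
// give the CPU two add dependency chains to overlap; the horizontal sum is
// done only once, after the loop.
float sumOfProducts(const float* a, const float* b, std::size_t n)
{
    assert(n % 8 == 0); // sketch only: no tail handling
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8)
    {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4), _mm_loadu_ps(b + i + 4)));
    }
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, _mm_add_ps(acc0, acc1));
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Build it with optimization enabled (e.g. -O3); with more accumulators (4 or 8) the loop gets closer to being throughput-bound rather than latency-bound.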
Once you fix that, then as @harold points out, the way you're using _mm_set_ps inside the loop will result in pretty bad asm from the compiler. It's not a good choice inside a loop when the operands aren't constants, or at least loop-invariant.
Your example here is clearly artificial; normally you'd be loading SIMD vectors from memory. But if you did need to update a loop counter in a __m128 vector, you might use tmp = _mm_add_ps(tmp, _mm_set_ps(1.0, 0, 0, 0)). Or unroll with adding 1.0, 2.0, 3.0, and 4.0 so the loop-carried dependency is only the += 4.0 in the one element.
x + 0.0 is the identity operation even for FP (except maybe with signed zero) so you can do it to the other elements without changing them.
Or for the low element of a vector, you can use _mm_add_ss (scalar) to only modify it.

Fastest way to downcast an array short to char

I have to process roughly 2000, 100 element arrays every second. The arrays come to me as shorts, w/ the data in the upper bits and need to be shifted and cast to chars. Is this as efficient as I can get, or is there a faster way to perform this operation? (I have to skip 2 of the values)
for(int i = 0; i < 48; i++)
{
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}

Even if shift and bitwise operations are fast, you can try to process the short array through a char pointer, as others advised in the comments. It is allowed by the standard and, on common architectures, does what is expected, apart from the endianness problem.
So you could try to first determine your endianness:
bool isBigEndian() {
short i = 1; // sets only lowest order bit
char *ix = reinterpret_cast<char *>(&i);
return (*ix == 0); // will be 1 if little endian
}
Your loop now becomes:
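int shft = isBigEndian()? 0 : 1; // byte offset of the high-order byte within each short
char * pb = reinterpret_cast<char *>(b);
for(int i = 0; i < 48; i++)
{
a[i] = pb[2 * i + shft];
a[i+48] = pb[2 * (i + 50) + shft]; // b[i+50] starts at byte offset 2*(i+50)
}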
But as always for low level optimisation, this has to be benchmarked with the compiler and compiler options that will be used in production code.

You could put a wrapper class around these arrays, so code that accesses elements of the wrapper in order actually accesses every other byte of the underlying memory.
This will probably defeat auto-vectorization, though. Other than that, having all the code that would read a actually read b and increment its pointers by two instead of one shouldn't change the cost at all.
The two skipped elements are a problem, though. Having your operator[] do if (i>=48) i+=2 might kill this idea. memmove will often be much faster than storing one byte at a time, so you could consider using memmove to make a contiguous array of shorts that you can index even though it seems silly to copy without storing in a better format.
The trick will be to write a wrapper that completely optimizes away to no extra instructions in loops over your arrays. This is possible on x86, where scaled indexing is available in normal effective-addresses in asm instructions, so if the compiler understands what's going on, it can make code that's just as efficient.
Having arrays of shorts does take twice as much memory, so cache effects could matter.
It all depends on what you need to do with the byte arrays.
If you do need to convert, use SIMD
For x86 targets, you can get a big speedup with SIMD vectors instead of looping one char at a time. For other compile targets you care about, you can write similar special versions. I assume ARM NEON has similar shuffling capability, for example.
When writing a platform-specific version, you also get to make all the endian and unaligned-access assumptions that are true on that platform.
#ifdef __SSE2__ // will be true for all x86-64 builds and most i386 builds
#include <immintrin.h>
static __m128i pack2(const short *p) {
__m128i lo = _mm_loadu_si128((__m128i*)p);
__m128i hi = _mm_loadu_si128((__m128i*)(p + 8));
lo = _mm_srli_epi16(lo, 8); // logical shift, not arithmetic, because we need the high byte to be zero
hi = _mm_srli_epi16(hi, 8);
return _mm_packus_epi16(lo, hi); // treats input as signed, saturates to unsigned 0x0 .. 0xff range
}
#endif // SSE2
void conv(char *a, const short *b) {
#ifdef __SSE2__
for(int i = 0; i < 48; i+=16) {
__m128i low = pack2(b+i);
_mm_storeu_si128((__m128i *)(a+i), low);
__m128i high = pack2(b+i + 50);
_mm_storeu_si128((__m128i *)(a+i + 48), high);
}
#else
/******* Fallback C version *******/
for(int i = 0; i < 48; i++) {
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}
#endif
}
As you can see on the Godbolt Compiler Explorer, gcc fully unrolls the loop since it's only a few iterations when storing 16B at a time.
This should perform ok, but on pre-Skylake will bottleneck on shifting both vectors of shorts before the store. Haswell can only sustain one psrli per clock. (Skylake can sustain one per 0.5c when the shift-count is an immediate. See Agner Fog's guide and insn tables, links at the x86 tag wiki.)
You might get better results from loading from (__m128i*)(1 + (char*)p) so the bytes we want are already in the low half of each 16bit element. We'd still have to mask off the high half of each element with _mm_and_si128 instead of shifting, but PAND can run on any vector execution port, so it has three per clock throughput.
More importantly, with AVX it can be combined with an unaligned load, e.g. vpand xmm0, xmm5, [rsi], where xmm5 is a mask of _mm_set1_epi16(0x00ff), and rsi holds 2*i + 1 + (char*)b. Fused-domain uop throughput is probably going to be an issue, as is common for code with a lot of loads/stores as well as computation.
Unaligned accesses are slightly slower than aligned accesses, but at least half your vector accesses will be unaligned anyway (since skipping two shorts means skipping 4B). On Intel SnB-family CPUs, I don't think it's slower to have loads that are split across a cache-line boundary in a 15:1 split compared to a 12:4 split. (The no-split case is definitely faster, though.) If b is 16B-aligned, then it'll be worth testing the mask version against the shift version.
I didn't write up complete code for this version, because you'll end up reading one byte past the end of b unless you take special precautions. This is fine if you make sure b has padding of some sort so it doesn't go right to the end of a memory page.
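For illustration only, the masked variant of pack2 could be sketched as below. The name pack2Masked is made up, the code assumes a little-endian x86 target, and it relies on exactly the caveat just mentioned: the second load touches one byte past p[15], so that byte must be readable.

```cpp
#include <immintrin.h>

// Variant of pack2 that loads from one byte further on, so the wanted high
// byte of each short already sits in the low half of each 16-bit element.
// ASSUMPTION: the byte right after p[15] is readable (padding or more data).
static inline __m128i pack2Masked(const short* p)
{
    const char* bytes = reinterpret_cast<const char*>(p) + 1;
    const __m128i mask = _mm_set1_epi16(0x00ff);
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes));
    __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes + 16));
    lo = _mm_and_si128(lo, mask); // clear the unwanted neighbouring byte
    hi = _mm_and_si128(hi, mask);
    return _mm_packus_epi16(lo, hi); // all values are 0..255, so no saturation occurs
}
```

It is a drop-in replacement for pack2 in conv only if b has at least one extra byte of readable storage after the last element that is converted.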
AVX2
With AVX2, vpackuswb ymm operates in two separate lanes. IDK if there's anything to gain from doing the load and mask (or shift) on 256b vectors and then using a vextracti128 and 128b pack on the two halves of the 256b vector.
Or maybe do a 256b pack between two vectors and then a vpermq (_mm256_permute4x64_epi64) to sort things out:
lo = _mm256_loadu(b..); // { b[15..8] | b[7..0] }
hi = // { b[31..24] | b[23..16] }
// mask or shift
__m256i packed = _mm256_packus_epi16(lo, hi); // [ a31..24 a15..8 | a23..16 a7..0 ]
packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
Of course, use any portable optimizations you can in the C version, e.g. Serge Ballesta's suggestion of just copying the desired bytes after figuring out their location from the endianness of the machine. (Preferably at compile time by checking GNU C's __BYTE_ORDER__ macro.)
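A sketch of that portable byte-copy fallback with the endianness resolved at compile time could look like this. The names kHighByte and convBytes are made up; the code assumes 16-bit short, and it falls back to little-endian when the GNU C macro is not available, which is an assumption you would want to verify for your compiler.

```cpp
#include <cstddef>

// Byte offset of the high-order byte inside each short.
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
constexpr std::size_t kHighByte = 0;
#else
constexpr std::size_t kHighByte = 1; // ASSUMPTION: little-endian otherwise
#endif

// Copy the high byte of b[0..47] and b[50..97] into a[0..95].
void convBytes(char* a, const short* b)
{
    const char* pb = reinterpret_cast<const char*>(b);
    for (std::size_t i = 0; i < 48; ++i)
    {
        a[i]      = pb[2 * i + kHighByte];
        a[i + 48] = pb[2 * (i + 50) + kHighByte];
    }
}
```

As with the other versions, whether this beats the shift-and-cast loop has to be measured with your compiler and flags.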

Fast implementation of a large integer counter (in C/C++)

My goal is the following:
Generate successive values, such that each new one was never generated before, until all possible values are generated. At this point, the counter starts the same sequence again. The main point here is that all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simply 0, 1, 2, 3,..., or in some other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64 bits. For example, in one place, I need to generate a 256-bit sequence. A simple implementation is like the following, just to illustrate what I am trying to do,
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr starts with all zeros, then first ctr[0] is increased one by one until it reaches max, and then ctr[1], and so on. If all 256 bits are set, then we reset it to all zero, and start again.
The problem is that such an implementation is surprisingly slow. My current improved version is roughly equivalent to the following,
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always starts at zero, then ctr[k] == 0 if ctr[k - 1] == 0. Thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since a branch misprediction should happen only once in every 2^64 iterations. For the second, although a misprediction happens only once every 2^256 iterations, that should not make a difference. And apart from the branching, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.
However, clang, gcc, and Intel icpc all generate binaries in which the second is much faster.
My main question is: does anyone know of a faster way to implement such a counter? It does not matter if the counter starts by increasing the first integers, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of 256 bits.
To make things more complicated, I also need non-uniform increments. For example, each time the counter is incremented by K where K > 1, but K almost always remains constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is as input to AES-NI aesenc instructions. A distinct 128-bit integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instructions, produces a distinct 128-bit integer. If I generate one __m128i integer at a time, then the cost of the increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalently to the following,
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment each block[i] by 4, and generate 4 __m128i outputs at once. By interleaving instructions, I was able to increase the overall throughput and reduce the cycles per byte of output (cpB) from 6 to 0.9 when using two 64-bit integers as the counter and 8 blocks. However, if I instead use four 32-bit integers as the counter, the throughput, measured in bytes per second, is reduced to half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit ones in some situations. But I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and the increment is indeed what slows down the program. Since loading into __m128i and storing the __m128i output into usable 32-bit or 64-bit integers are done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected that the AES-NI instructions, after loading the integers into __m128i, would dominate the performance. But when using 4 or 8 blocks, that was clearly not the case.
So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.

It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes. And that would require a Gray-code counter.
The next thing to do before optimizing is to fix the original implementation:
void increment (ctr_type &ctr)
{
if (++ctr[0] != 0) return;
if (++ctr[1] != 0) return;
if (++ctr[2] != 0) return;
++ctr[3];
}
If each ctr[i] was not allowed to overflow to zero, the period would be just 4*(2^64), as in 0-9, 19,29,39,49,...99, 199,299,... and 1999,2999,3999,..., 9999.
As a reply to the comment: it takes 2^64 iterations to have the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to see the first carry out. That's about 136 years.
EDIT
If the original implementation with 2^66 states is really what is wanted, then I'd suggest to change the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)
{
counter++; // Move to next word
if (counter > tail_of_array) {
counter = head_of_array;
memset(counter,0, 16);
break;
}
}
The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.

If you're using GCC or compilers with __int128 like Clang or ICC
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available
std::array<uint64_t, 4> c{}; // four 64-bit words, c[0] is the least significant
c[0]++;
if (c[0] == 0)
{
c[1]++;
if (c[1] == 0)
{
c[2]++;
if (c[2] == 0)
{
c[3]++;
}
}
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately, most high-level languages don't have a means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80 and you cannot even count through a 64-bit counter in your lifetime.
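For the non-uniform increment by K mentioned in the question, the carry can also be detected portably by checking for unsigned wrap-around instead of using a compiler builtin. This is a minimal sketch; ctr_type is taken from the question, while the function name add is made up here.

```cpp
#include <array>
#include <cstdint>

using ctr_type = std::array<std::uint64_t, 4>; // ctr[0] is the least significant word

// Add k to the 256-bit counter, propagating the carry upwards.
// Unsigned wrap-around is well defined, so "result < addend" detects a carry.
void add(ctr_type& ctr, std::uint64_t k)
{
    std::uint64_t carry = k;
    for (std::uint64_t& word : ctr)
    {
        word += carry;
        if (word >= carry) // no wrap-around: nothing left to propagate
            return;
        carry = 1;         // wrapped: carry one into the next word
    }
    // A carry out of the top word is dropped, so the counter wraps modulo 2^256.
}
```

In almost every call only the first word is touched and the early return is taken, so the branch is very predictable.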

Neither of your counter versions increments correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the words that have reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.

You mention "Generate successive values, such that each new one was never generated before"
To generate a set of such values, look at linear congruential generators
the sequence x = (x*1 + 1) % (power_of_2): as you have already figured out, these are simply sequential numbers.
the sequence x = (x*13 + 137) % (power_of_2): this generates unique numbers with a predictable period (the full power_of_2, since these constants satisfy the Hull-Dobell conditions), and the unique numbers look more "random", kind of pseudo-random. You need to resort to arbitrary-precision arithmetic to get it working, along with all the tricks of multiplication by constants. This will give you a nice way to start.
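To make the period claim easy to check, here is a small sketch of the same generator reduced to 16 bits, where the whole cycle can be walked in an instant. The names kModulusBits, kMask, lcgNext and lcgPeriod are made up for this example; a 256-bit version would need multi-word arithmetic for the multiply and add.

```cpp
#include <cstdint>

constexpr std::uint32_t kModulusBits = 16;                  // modulus is 2^16
constexpr std::uint32_t kMask = (1u << kModulusBits) - 1;

// One step of x = (x*13 + 137) % 2^16.
std::uint32_t lcgNext(std::uint32_t x)
{
    return (x * 13u + 137u) & kMask;
}

// Number of steps until the sequence returns to its starting value.
// The step function is a bijection (odd multiplier, power-of-two modulus),
// so the loop always terminates.
std::uint32_t lcgPeriod(std::uint32_t seed)
{
    const std::uint32_t start = seed & kMask;
    std::uint32_t x = lcgNext(start);
    std::uint32_t steps = 1;
    while (x != start)
    {
        x = lcgNext(x);
        ++steps;
    }
    return steps;
}
```

For any seed the walk takes 65536 steps, i.e. every 16-bit value is visited exactly once before the sequence repeats.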
You also complain that your simple code is "slow"
At a 4.2 GHz frequency, running 4 instructions per cycle and using AVX-512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing other than increments, you get only 64x8x4x2^32 = 8796093022208 increments per second, that is, 2^64 increments reached in about 25 days. This post is old, you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128 or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
running out of money
end of humankind (nobody will get the outcome of your software)
burning the energy from the last particles of the universe

Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
ADDcc adds two words, and sets the carry if there was unsigned overflow
ADDC adds two words plus carry (from a previous addition)
ADDCcc adds two words plus carry, and sets the carry if there was unsigned overflow
A multi-word addition with two words uses ADDcc of the least significant words followed by ADDC of the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use a struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef uint16_t T;
/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
T cy, t0, t1;
counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
for (int i = 1; i < (n - 1); i++) {
counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
}
counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}
#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)
int main (void)
{
uint32_t count32 = 0, incr32 = INCREMENT;
T count_arr2 [UINT32_ARRAY_LEN] = {0};
T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
do {
count32 = count32 + incr32;
inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
} while (count32 < (0U - INCREMENT - 1));
printf ("count32 = %08x arr_count = %08x\n",
count32, (((uint32_t)count_arr2 [1] << 16) +
((uint32_t)count_arr2 [0] << 0)));
uint64_t count64 = 0, incr64 = INCREMENT;
T count_arr4 [UINT64_ARRAY_LEN] = {0};
T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
do {
count64 = count64 + incr64;
inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
} while (count64 < 0xa987654321ULL);
printf ("count64 = %016llx arr_count = %016llx\n",
count64, (((uint64_t)count_arr4 [3] << 48) +
((uint64_t)count_arr4 [2] << 32) +
((uint64_t)count_arr4 [1] << 16) +
((uint64_t)count_arr4 [0] << 0)));
return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
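For the 256-bit counter made of four uint64_t components mentioned above, a minimal C++ sketch could look like the following. `Counter256`, `add256` and `add256_example` are hypothetical names. Instead of the three macros it uses a single loop that detects the carry by comparing the wrapped sum with an operand, which is the same idea.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 256-bit counter: four 64-bit words, least significant first.
using Counter256 = std::array<std::uint64_t, 4>;

// counter += increment (mod 2^256), propagating the carry word by word.
void add256(Counter256& counter, const Counter256& increment)
{
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < counter.size(); ++i) {
        const std::uint64_t sum = counter[i] + increment[i]; // may wrap
        const std::uint64_t c1 = sum < counter[i];           // carry from word add
        const std::uint64_t res = sum + carry;               // may wrap
        const std::uint64_t c2 = res < sum;                  // carry from carry-in
        counter[i] = res;
        carry = c1 | c2;                                     // never both set
    }
}

// Usage: (2^128 - 1) + 1 must carry into the third word.
void add256_example()
{
    Counter256 c = {{UINT64_MAX, UINT64_MAX, 0, 0}};
    add256(c, Counter256{{1, 0, 0, 0}});
    assert(c[0] == 0 && c[1] == 0 && c[2] == 1 && c[3] == 0);
}
```

The carry out of the most significant word is simply dropped, so the counter wraps around modulo 2^256.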

How to improve precision in multiplication?

I am using C++ to implement the transfinite interpolation algorithm (http://en.wikipedia.org/wiki/Transfinite_interpolation). Everything looked good until I tested some small numbers; then the results looked weird and incorrect. I guess it must have something to do with loss of precision. The code is
for (int j = 0; j < Ny+1; ++j)
{
for (int i = 0; i < Nx+1; ++i)
{
int imax = Nx;
int jmax = Ny;
double CA1 = (double)i/imax; // s
double CB1 = (double)j/jmax; // t
double CA2 = 1.0-CA1; // 1-s
double CB2 = 1.0-CB1; // 1-t
point U = BD41[j]*CA2 + BD23[j]*CA1;
point V = BD12[i]*CB2 + BD34[i]*CB1;
point UV =
(BD12[0]*CB2 + BD41[jmax]*CB1)*CA2
+ (BD12[imax]*CB2 + BD23[jmax]*CB1)*CA1;
tfiGrid[k][j][i] = U + V - UV;
}
}
I guess when BD12[i] (or BD34[i] or BD41[j] or BD23[j]) is very small, the rounding error or something similar accumulates and becomes non-negligible. Any ideas how to handle this sort of situation?
PS: Even though similar questions have been asked millions of times, I still cannot figure out whether it is related to my multiplication, division, subtraction, or something else.

In addition to the points that Antoine made (all very good):
it's probably worth remembering that adding two values with very
different orders of magnitude will cause a very large loss of
precision. For example, if CA1 is less than about 1E-16,
1.0 - CA1 is probably still 1.0, and even if it is just
a little larger, you'll lose quite a few digits of precision.
If this is the problem, you should be able to isolate it just by
putting a few print statements in the inner loop, and looking at
the values you are adding (or perhaps even with a debugger).
What to do about it is another question. There may be some
numerically stable algorithms for what you are trying to do;
I don't know. Otherwise, you'll probably have to detect the
problem dynamically, and rewrite the equations to avoid it if it
occurs. (For example, to detect if CA1 is too small for the
addition, you might check whether 1.0 / CA1 is more than
a couple of thousand, or million, or however much precision you
can afford to lose.)
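As a rough sketch of both points (assuming IEEE-754 doubles evaluated in double precision, as on x86-64 with SSE2), the helpers below show the absorption and the suggested ratio check. The function names and the `max_ratio` parameter are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// True when x is so small relative to 1.0 that `1.0 - x` rounds back to 1.0.
bool absorbed_by_one(double x)
{
    return 1.0 - x == 1.0;
}

// Heuristic from the text: flag x when 1/|x| exceeds the ratio you can afford.
// Zero is not flagged, because 1.0 - 0.0 is exact.
bool too_small_for_sum(double x, double max_ratio)
{
    assert(max_ratio > 0.0);
    return x != 0.0 && 1.0 / std::fabs(x) > max_ratio;
}
```

For instance, `absorbed_by_one(1e-17)` is true while `absorbed_by_one(1e-15)` is false, and `too_small_for_sum(CA1, 1e6)` would flag any nonzero CA1 below one millionth.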

The accuracy of the arithmetic built into C/C++ is limited. Of course errors will accumulate in your case.
Have you considered using a library that provides higher precision? Maybe have a look at https://gmplib.org/
A short example that clarifies the higher accuracy:
double d, e, f;
d = 1000000000000000000000000.0;
e = d + 1.0;
f = e - d;
printf( "the difference of %f and %f is %f\n", e, d, f);
This will not print 1 but 0. With gmplib the code would look like:
#include "gmp.h"
mpz_t m, n, r;
mpz_init( m);
mpz_init( n);
mpz_init( r);
mpz_set_str( m, "1 000 000 000 000 000 000 000 000", 10);
mpz_add_ui( n, m, 1);
mpz_sub( r, n, m);
printf( "the difference of %s and %s is %s (using gmp)\n",
mpz_get_str( NULL, 10, n),
mpz_get_str( NULL, 10, m),
mpz_get_str( NULL, 10, r));
mpz_clear( m);
mpz_clear( n);
mpz_clear( r);
This will print 1.
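Why the plain double version prints 0 can be seen from the spacing between neighbouring doubles near 1e24. The following is a small sketch using only the standard library (no GMP), assuming IEEE-754 doubles; `ulp_above` and `one_is_absorbed` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between x and the next larger representable double.
double ulp_above(double x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// True when adding 1.0 to x does not change x.
bool one_is_absorbed(double x)
{
    return (x + 1.0) - x == 0.0;
}
```

Near 1e24 the gap is 2^27 = 134217728, so adding 1.0 cannot change the value. Near 1e15 the gap is well below 1 and the addition is exact.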

Your algorithm doesn't appear to be accumulating errors by re-using previous computations at each iteration, so it's difficult to answer your question without looking at your data.
Basically, you have 3 options:
Increase the precision of your numbers: x86 CPUs can handle float (32-bit), double (64-bit) and often long double (80-bit). Beyond that you have to use "soft" floating point, where all operations are implemented in software instead of hardware. There is a good C library that does just that: MPFR, which is based on GMP; GNU recommends using MPFR. I strongly recommend easier-to-use C++ wrappers such as Boost.Multiprecision. Expect your computations to be orders of magnitude slower.
Analyze where your precision loss comes from by using something more informative than a single scalar number for your computations. Have a look at interval arithmetic and the MPFI library, based on MPFR. CADNA is another very promising solution, based on randomly changing the rounding modes of the hardware, and it comes with a low run-time cost.
Perform static analysis, which doesn't even require running your code and works by analyzing your algorithms. Unfortunately, I don't have experience with such tools, so I can't recommend anything beyond googling.
I think the best course is to run static or dynamic analysis while developing your algorithm, identify your precision problems, address them by either changing the algorithm or using higher precision for the most unstable variables - and not others to avoid too much performance impact at run-time.
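A very crude form of dynamic analysis, short of the tools named above, is to evaluate the same expression in double and in long double and compare the two. The sketch below does this for a blend of the form a*(1-s) + b*s as it appears in the question; `blend` and `blend_discrepancy` are hypothetical names. It assumes long double is wider than double (true for GCC/Clang on x86, not for every compiler); where the two types are identical the discrepancy is always zero and the check tells you nothing.

```cpp
#include <cassert>
#include <cmath>

// a*(1-s) + b*s evaluated entirely in type T.
template <typename T>
T blend(T a, T b, T s)
{
    return a * (T(1) - s) + b * s;
}

// Absolute difference between the double and long double evaluations.
long double blend_discrepancy(double a, double b, double s)
{
    assert(std::isfinite(a) && std::isfinite(b) && std::isfinite(s));
    const long double ref = blend<long double>(a, b, s);
    const double val = blend<double>(a, b, s);
    return std::fabs(static_cast<long double>(val) - ref);
}
```

A discrepancy that is large relative to the result points at the inputs where higher precision, or a rewritten expression, is worth the cost.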

Numerical (i.e., floating point) computation is hard to do precisely. You have to be particularly wary of subtractions; that is mostly where you lose precision. In this case 1.0 - CA1 and the like are suspect (if CA1 is very small, you'll get 1). Reorganize your expressions; the Wikipedia article (a stub!) probably has them written for understanding (showing symmetries, ...) and aesthetics, not numerical robustness.
Search for courses/lecture notes on numerical computation; they should include an introductory chapter on this. And check out Goldberg's classic What Every Computer Scientist Should Know About Floating-Point Arithmetic.
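As a generic illustration of what reorganizing an expression can buy (this is a textbook example, not the interpolation formula from the question), compare two mathematically equal ways of computing 1 - cos(x) for small x. The function names are made up, and IEEE-754 doubles are assumed.

```cpp
#include <cassert>
#include <cmath>

// Direct form: cos(x) rounds to 1.0 for tiny x, so the subtraction cancels
// every significant digit.
double one_minus_cos_naive(double x)
{
    assert(std::isfinite(x));
    return 1.0 - std::cos(x);
}

// Rewritten with the identity 1 - cos(x) = 2*sin^2(x/2): no subtraction.
double one_minus_cos_stable(double x)
{
    assert(std::isfinite(x));
    const double s = std::sin(0.5 * x);
    return 2.0 * s * s;
}
```

For x = 1e-9 the direct form returns exactly 0, while the rewritten form returns roughly 5e-19, close to the true value x*x/2.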
