High performance comparison of signed int arrays (using Intel IPP library) - c++

We're trying to compare two equally sized native arrays of signed int values using the inequality operations <, <=, > and >= in a high-performance way. Since many values are compared, the true/false results would be stored in a char array of the same size as the input, where 0x00 means false and 0xff means true.
To accomplish this, we're using the Intel IPP library. The problem is that the function we found that does this operation, named ippiCompare_*, from the image and video processing library, only supports the types unsigned char (Ipp8u), signed/unsigned short (Ipp16s/Ipp16u) and float (Ipp32f). It does not directly support signed int (Ipp32s).
I (only) envision two possible ways of solving this:
Casting the array to one of the directly supported types and executing the comparison in more steps (it would become a short array of twice the size or a char array of four times the size), then merging the intermediate results.
Using another function directly supporting signed int arrays from IPP or from another library that could do something equivalent in terms of performance.
But there may be other creative ways... so I'm asking for your help with that! :)
PS: The advantage of using Intel IPP is the performance gain for large arrays: it uses SIMD (multi-value) processor instructions and many cores simultaneously (and maybe more tricks). So simple looped solutions wouldn't be as fast, AFAIK.
PS2: link for the ippiCompare_* doc

You could do the comparison with PCMPEQD followed by a PACKSSDW and a PACKSSWB (the signed-saturating packs keep the all-ones compare results intact). This would be something along these lines:
#include <emmintrin.h>
void cmp(const __m128i* a, const __m128i* b, __m128i* result, unsigned count) {
    for (unsigned i = 0; i < count / 16; ++i) {
        __m128i result0 = _mm_cmpeq_epi32(a[0], b[0]); // each line compares 4 integers
        __m128i result1 = _mm_cmpeq_epi32(a[1], b[1]);
        __m128i result2 = _mm_cmpeq_epi32(a[2], b[2]);
        __m128i result3 = _mm_cmpeq_epi32(a[3], b[3]);
        a += 4; b += 4;
        __m128i wresult0 = _mm_packs_epi32(result0, result1); // pack 2*4 integer results into 8 words
        __m128i wresult1 = _mm_packs_epi32(result2, result3);
        *result = _mm_packs_epi16(wresult0, wresult1);        // pack 2*8 word results into 16 bytes
        result++;
    }
}
Needs aligned pointers, a count divisible by 16, and probably a lot of debugging, of course. (The pack steps are the packssdw / packsswb instructions, available as the _mm_packs_epi32 / _mm_packs_epi16 intrinsics.)
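The same pack-down pattern also extends to the ordered comparisons the question actually asks about. Here is a hedged sketch (my addition, not part of the answer above) using the SSE2 signed-greater-than compare pcmpgtd (_mm_cmpgt_epi32); a >= b is built as NOT(b > a) by XORing the compare result with all-ones, and the function name cmp_ge is just for illustration.

#include <emmintrin.h>

// Sketch only: same alignment / count-divisible-by-16 caveats as above.
void cmp_ge(const __m128i* a, const __m128i* b, __m128i* result, unsigned count) {
    const __m128i ones = _mm_set1_epi32(-1);
    for (unsigned i = 0; i < count / 16; ++i) {
        // a >= b  <=>  !(b > a); each compare handles 4 signed ints
        __m128i r0 = _mm_xor_si128(_mm_cmpgt_epi32(b[0], a[0]), ones);
        __m128i r1 = _mm_xor_si128(_mm_cmpgt_epi32(b[1], a[1]), ones);
        __m128i r2 = _mm_xor_si128(_mm_cmpgt_epi32(b[2], a[2]), ones);
        __m128i r3 = _mm_xor_si128(_mm_cmpgt_epi32(b[3], a[3]), ones);
        a += 4; b += 4;
        __m128i w0 = _mm_packs_epi32(r0, r1);   // 8 dword masks -> 8 word masks
        __m128i w1 = _mm_packs_epi32(r2, r3);
        *result++ = _mm_packs_epi16(w0, w1);    // 16 word masks -> 16 byte masks (0x00 / 0xff)
    }
}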

I thought there was an SSE instruction that would compare integers. Have you looked into the intrinsics that can do that?

Stepping back for a bit: are you sure this is a performance problem? Unless your data set fits in L1 cache, you will be cache-fill limited, and the actual cycles you're spending on your comparison operations (which are hardly slow even when done in the most naive way possible) can't possibly be the limiting factor.


Why does this piece of code written using uint8_t run faster than analogous code written with uint32_t or uint64_t on a 64bit machine?

Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit datatypes than on smaller datatypes like short, due to implicit promotion? Yet while testing my bitset implementation (where the majority of the time is spent on bitwise operations), I found I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.
8bit:
#define mod8(x) x&7
#define div8(x) x>>3
template<unsigned long bits>
struct bitset{
private:
    uint8_t fill[8] = {};
    uint8_t clear[8];
    uint8_t band[(bits/8)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div8(ind)]&fill[mod8(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div8(ind)] |= fill[mod8(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div8(ind)] &= clear[mod8(ind)];
    }
    bitset(){
        for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val*=2;
        }
    }
};
32bit:
#define mod32(x) x&31
#define div32(x) x>>5
template<unsigned long bits>
struct bitset{
private:
    uint32_t fill[32] = {};
    uint32_t clear[32];
    uint32_t band[(bits/32)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div32(ind)]&fill[mod32(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div32(ind)] |= fill[mod32(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div32(ind)] &= clear[mod32(ind)];
    }
    bitset(){
        for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val*=2;
        }
    }
};
And here is the benchmark I used (it just moves a single 1 from position 0 to the end iteratively):
const int len = 1000000;
bitset<len> bs;
{
    auto start = std::chrono::high_resolution_clock::now();
    bs.store_high(0);
    for (int ii = 1; ii < len; ++ii) {
        bs.store_high(ii);
        bs.store_low(ii-1);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count() << std::endl;
}
TL:DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.
Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.
One possible reason is that you iterate linearly over bits, so all the operations within the same band[] element form one long dependency chain of &= and |= operations, plus store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).
For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.
Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction-count is too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka Scheduler). With 64 operations probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism for loading (or better calculating) 1U<<(n&31) or rotl(-1U, n&31) masks that can "use up" some of the wasted execution slots in the pipeline.
But for uint8_t band[], you're moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)
See also Modern Microprocessors: A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.
Small vs. large buckets
Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)
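As a small illustration of that SIMD scan (my sketch, not code from this answer): check 16 bytes of the bitset at a time for any non-zero byte with SSE2 intrinsics, and only drop to scalar once a non-zero chunk is found. It assumes GCC/clang for __builtin_ctz and a length that is a multiple of 16.

#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

// Returns the byte index of the first non-zero byte, or len if all bytes are zero.
std::size_t first_nonzero_byte(const uint8_t* p, std::size_t len) {
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        __m128i eq = _mm_cmpeq_epi8(v, zero);            // 0xFF where byte == 0
        unsigned nz = ~_mm_movemask_epi8(eq) & 0xFFFFu;  // bits set where byte != 0
        if (nz) return i + __builtin_ctz(nz);            // first non-zero byte in this chunk
    }
    return len;
}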
C++ unfortunately doesn't make it easy in general to use different sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.
Although unsigned char can always alias anything else, so you could use that for your single-bit accesses, using uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64-bit ISAs) only for init or whatever. Or in this case, uint_fast32_t being a 64-bit type on many x86-64 C++ implementations would make it useful for this, unlike the usual case where that sucks, wasting cache footprint when you're only using the value-range of a 32-bit number and being slower for non-constant division on some CPUs.
On x86 CPUs, a byte store is naturally fully efficient, but even on ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if narrower. This is especially important on lower-end CPUs with smaller out-of-order scheduler buffers.
(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it which can occasionally hurt.)
(Different sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)
Code review:
Storing an array of masks is probably worse than just computing 1u << (n & 31) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask instead of indexing memory in a loop that already does other memory operations.
(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart. xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3-uops on Intel CPUs, vs. 1 on AMD.)
Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn. (gcc -march=haswell).
Also, your macros are unsafe: wrap the expression in () so operator precedence doesn't bite you if you use foo[div8(x) - 1].
As in #define div8(x) ((x)>>3)
But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define static const int shift = 3; constants for shift counts and masks. In C++, do that inside the struct/class scope, and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
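Putting those code-review points together, here is a sketch (my rewrite under the suggestions above, not the answerer's code) of the 32-bit version with the masks computed on the fly and the macros replaced by in-class constants; the name bitset32 is hypothetical, just to avoid clashing with std::bitset:

#include <cstdint>

template <unsigned long bits>
struct bitset32 {
    static const uint32_t shift = 5;   // log2(32)
    static const uint32_t mask  = 31;
    uint32_t band[(bits / 32) + 1] = {};

    void store_high(uint32_t idx) { band[idx >> shift] |=  (uint32_t(1) << (idx & mask)); }
    void store_low (uint32_t idx) { band[idx >> shift] &= ~(uint32_t(1) << (idx & mask)); }
    bool operator[](uint32_t idx) const { return (band[idx >> shift] >> (idx & mask)) & 1u; }
};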
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit datatypes than on smaller datatypes like short due to implicit promotion?
This isn't a universal truth. As always, it depends on the details.
Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?
The title doesn't match the question. There are no such types as uint_X and you aren't using uintX_t. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bits wide and that is deemed by the language implementers to provide the fastest operations.
If we were to take your earlier-mentioned assumption for granted, then it should logically follow that the language implementers would have chosen a 32/64-bit type as uint_fast8_t. That said, you cannot assume that they have done so, and whatever generic measurement (if any) was used to make that choice doesn't necessarily apply to your case.
That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative speeds of calculation of potentially different integer types:
uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};
uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};
Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.

What is the "correct" way to go from avx/sse masks to avx512 masks?

I have some existing avx/sse masks that I got the old way:
auto mask_sse = _mm_cmplt_ps(a, b);
auto mask_avx = _mm_cmp_ps(a, b, 17);
In some circumstances when mixing old avx code with new avx512 code, I want to convert these old style masks into the new avx512 __mmask4 or __mmask8 types.
I tried this:
auto mask_avx512 = _mm_cmp_ps_mask(sse_mask, _mm_setzero_ps(), 25/*nge unordered quiet*/);
and it seems to work for plain old outputs of comparisons, but I don't think it would correctly handle positive NaNs, which could have been used as a mask with SSE4.1 _mm_blendv_ps.
There also is good old _mm_movemask_ps but that looks like it puts the mask all the way out in a general purpose register, and I would need to chain it with a _cvtu32_mask8 to pull it back into one of the dedicated mask registers.
Is there a cleaner way to just directly pull the sign bit out of an old style mask into one of the k registers?
Example Code:
Here's an example program doing the sort of mask conversion the first way I mentioned above
#include "x86intrin.h"
#include <cassert>
#include <cstdio>
int main()
{
auto a = _mm_set_ps(-1, 0, 1, 2);
auto c = _mm_set_ps(3, 4, 5, 6);
auto sse_mask = _mm_cmplt_ps(a, _mm_setzero_ps());
auto avx512_mask = _mm_cmp_ps_mask(sse_mask, _mm_setzero_ps(), 25);
alignas(16) float v1[4];
alignas(16) float v2[4];
_mm_store_ps(v1, _mm_blendv_ps(a, c, sse_mask));
_mm_store_ps(v2, _mm_mask_blend_ps(avx512_mask, a, c));
assert(v1[0] == v2[0]);
assert(v1[1] == v2[1]);
assert(v1[2] == v2[2]);
assert(v1[3] == v2[3]);
return 0;
}
Use an AVX-512 compare intrinsic to get an AVX-512 mask in the first place (like _mm_cmp_ps_mask); that's going to be significantly more efficient than comparing into a vector and then converting it, unless the compiler optimizes away this inefficiency for you. (Consider using a wrapper library like Agner Fog's VCL to try to abstract away the difference. The VCL licence changed recently from GPL to Apache.)
But if you really need this (e.g. as a stop-gap before you finish optimizing), you don't need an FP compare. _mm_cmp_ps in C produces a __m128 result, but it's not really a vector of floats (see footnote 1). It's all-one-bits / all-zero-bits. You just want the bits, so you're looking for the AVX-512 equivalent of vmovmskps, but into a k register instead of GP integer. i.e. VPMOVD2M k, x/y/zmm for 32-bit source elements.
__m128 cmpvec = _mm_cmplt_ps(v, _mm_setzero_ps() );
__mmask8 cmpmask = _mm_movepi32_mask( _mm_castps_si128(cmpvec) ); // <----
// equivalent to comparing into a mask in the first place:
__mmask8 cmpmask = _mm_cmp_ps_mask(v, _mm_setzero_ps(), _CMP_LT_OQ);
// equivalent to (if I got this right)
__mmask8 cmpmask = _mm_fpclass_ps_mask(v, 0x40 | 0x10); // negative | negative_inf
https://uops.info/ is down right now, otherwise I'd check latency and execution ports of VPMOVD2M vs. VCMPPS into mask (for an UNORD predicate) vs. VFPCLASSPS.
Footnote 1: You could use AVX-512 vfpclassps into a mask, or even compare against itself with a vcmpps predicate like UNORD to detect NAN or not. But those are I think slower.
I would need to chain it with a _cvtu32_mask8 to pull it back into one of the dedicated mask registers.
The way compilers currently do things, __mmask8 is just a typedef for unsigned char, and __mmask16 is unsigned short. They're freely convertible without intrinsics, for good or ill. But in asm, it takes a kmovb k1, eax instruction to get the data from a GP reg to a k mask reg, and that instruction can only run on port 5 in current CPUs.
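A minimal illustration of that point (my addition): since __mmask8 is currently an integer typedef, a plain cast compiles, and the _cvtu32_mask8 intrinsic the question mentions (available with AVX-512DQ) expresses the same conversion explicitly, typically compiling to a kmovb.

#include <immintrin.h>

__mmask8 from_int_cast(unsigned bits)      { return (__mmask8)bits; }      // plain integer conversion
__mmask8 from_int_intrinsic(unsigned bits) { return _cvtu32_mask8(bits); } // explicit form; compiles to kmovb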

What is the fastest way to check the leading characters in a char array?

I reached a bottleneck in my code, so the main issue of this question is performance.
I have a hexadecimal checksum and I want to check the leading zeros of an array of chars. This is what I am doing:
bool starts_with (char* cksum_hex, int n_zero) {
    bool flag {true};
    for (int i=0; i<n_zero; ++i)
        flag &= (cksum_hex[i]=='0');
    return flag;
}
The above function returns true if the cksum_hex has n_zero leading zeros. However, for my application, this function is very expensive (60% of total time). In other words, it is the bottleneck of my code. So I need to improve it.
I also checked std::string::starts_with which is available in C++20 and I observed no difference in performance:
// I have to convert cksum to string
std::string cksum_hex_s (cksum_hex);
cksum_hex_s.starts_with("000"); // checking for 3 leading zeros
For more information I am using g++ -O3 -std=c++2a and my gcc version is 9.3.1.
Questions
What is the fastest way of checking the leading characters in a char array?
Is there a more efficient way of doing it with std::string::starts_with?
Do bitwise operations help here?
If you modify your function to return early
bool starts_with (char* cksum_hex, int n_zero) {
    for (int i=0; i<n_zero; ++i)
    {
        if (cksum_hex[i] != '0') return false;
    }
    return true;
}
It will be faster in the case of a big n_zero and a false result. Otherwise, maybe you can try to allocate a global array of '0' characters and use std::memcmp:
// make it as big as you need
constexpr char cmp_array[4] = {'0', '0', '0', '0'};
bool starts_with (char* cksum_hex, int n_zero) {
    return std::memcmp(cksum_hex, cmp_array, n_zero) == 0;
}
The problem here is that you need to assume some max possible value of n_zero.
Live example
=== EDIT ===
Considering the complaints about the lack of profiling data to justify the suggested approaches, here you go:
Benchmark results comparing early return implementation with memcmp implementation
Benchmark results comparing memcmp implementation with OP original implementation
Data used:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
memcmp is the fastest in all cases except cs2, where the early-return implementation wins.
Presumably you also have the binary checksum? Instead of converting it to ASCII text first, look at the 4*n high bits to check n nibbles directly for 0 rather than checking n bytes for equality to '0'.
e.g. if you have the hash (or the high 8 bytes of it) as a uint64_t or unsigned __int128, right-shift it to keep only the high n nibbles.
I showed some examples of how they compile for x86-64 when both inputs are runtime variables, but these also compile nicely to other ISAs like AArch64. This code is all portable ISO C++.
bool high_zero_nibbles (uint64_t cksum_high8, int n_zero)
{
    int shift = 64 - n_zero * 4;        // A hex digit represents a 4-bit nibble
    return (cksum_high8 >> shift) == 0;
}
clang does a nice job for x86-64 with -O3 -march=haswell to enable BMI1/BMI2
high_zero_nibbles(unsigned long, int):
shl esi, 2
neg sil # x86 shifts wrap the count so 64 - c is the same as -c
shrx rax, rdi, rsi # BMI2 variable-count shifts save some uops.
test rax, rax
sete al
ret
This even works for n=16 (shift=0) to test all 64 bits. It fails for n_zero = 0 (testing none of the bits); it would encounter UB by shifting a uint64_t by a shift count >= its width. (On ISAs like x86 that wrap out-of-bounds shift counts, code-gen that worked for other shift counts would end up checking all 16 nibbles. As long as the UB wasn't visible at compile time...) Hopefully you're not planning to call this with n_zero=0 anyway.
Other options: create a mask that keeps only the high n*4 bits, perhaps shortening the critical path through cksum_high8 if that's ready later than n_zero. Especially if n_zero is a compile-time constant after inlining, this can be as fast as checking cksum_high8 == 0. (e.g. x86-64 test reg, immediate.)
bool high_zero_nibbles_v2 (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;        // A hex digit represents a 4-bit nibble
    uint64_t low4n_mask = (1ULL << shift) - 1;
    return (cksum_high8 & ~low4n_mask) == 0;
}
Or use a bit-scan function to count leading zero bits and compare for >= 4*n. Unfortunately it took ISO C++ until C++20 <bit>'s countl_zero to finally portably expose this common CPU feature that's been around for decades (e.g. 386 bsf / bsr); before that only as compiler extensions like GNU C __builtin_clz.
This is great if you want to know how many and don't have one specific cutoff threshold.
bool high_zero_nibbles_lzcnt (uint64_t cksum_high8, int n_zero) {
    // UB on cksum_high8 == 0. Use x86-64 BMI1 _lzcnt_u64 to avoid that, guaranteeing 64 on input=0
    return __builtin_clzll(cksum_high8) >= 4*n_zero;
}

#include <bit>
bool high_zero_nibbles_stdlzcnt (uint64_t cksum_high8, int n_zero) {
    return std::countl_zero(cksum_high8) >= 4*n_zero;
}
compile to (clang for Haswell):
high_zero_nibbles_lzcnt(unsigned long, int):
    lzcnt   rax, rdi
    shl     esi, 2
    cmp     esi, eax
    setle   al              # FLAGS -> boolean integer return value
    ret
All these instructions are cheap on Intel and AMD, and there's even some instruction-level parallelism between lzcnt and shl.
See asm output for all 4 of these on the Godbolt compiler explorer. Clang compiles 1 and 2 to identical asm. Same for both lzcnt ways with -march=haswell. Otherwise it needs to go out of its way to handle the bsr corner case for input=0, for the C++20 version where that's not UB.
To extend these to wider hashes, you can check the high uint64_t for being all-zero, then proceed to the next uint64_t chunk.
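A sketch of that extension (my addition, assuming the hash is laid out as an array of uint64_t chunks with the most-significant chunk first):

#include <cstdint>
#include <cstddef>

bool starts_with_zero_nibbles(const uint64_t* hash_chunks, std::size_t n_chunks, int n_zero) {
    std::size_t full = std::size_t(n_zero) / 16;          // 16 nibbles per uint64_t
    int rem = n_zero % 16;
    if (full + (rem != 0) > n_chunks) return false;       // asking for more nibbles than we have
    for (std::size_t i = 0; i < full; ++i)
        if (hash_chunks[i] != 0) return false;            // whole chunk must be zero
    if (rem == 0) return true;
    return (hash_chunks[full] >> (64 - rem * 4)) == 0;    // leading 'rem' nibbles of the next chunk
}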
Using an SSE2 compare with pcmpeqb against '0' on the string, then pmovmskb, invert, and bsf/tzcnt could find the position of the first non-'0' character, and thus how many leading '0' characters there were in the string representation, if you have that to start with. So x86 SIMD can do this very efficiently, and you can use that from C++ via intrinsics.
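For illustration, a minimal sketch of that SIMD idea (my addition, not from the answer): compare 16 bytes against '0' at once, invert the resulting mask, and bit-scan for the first mismatch. It assumes GCC/clang for __builtin_ctz and at least 16 readable bytes in the buffer; counts within the first 16 bytes only.

#include <emmintrin.h>

static inline int leading_zero_chars_16(const char* p) {
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    __m128i eq    = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('0')); // 0xFF where byte == '0'
    unsigned mask = _mm_movemask_epi8(eq);                     // bit i set if p[i] == '0'
    unsigned mismatch = ~mask & 0xFFFFu;                       // bits where byte != '0'
    if (mismatch == 0) return 16;                              // all 16 bytes are '0'
    return __builtin_ctz(mismatch);                            // index of first non-'0' byte
}

// usage sketch, for n_zero <= 16:
// bool starts_with_n_zeros(const char* s, int n) { return leading_zero_chars_16(s) >= n; }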
You can make a buffer of zeros large enough for your needs, then compare with memcmp.
const char *zeroBuffer = "000000000000000000000000000000000000000000000000000";

if (memcmp(zeroBuffer, cksum_hex, n_zero) == 0) {
    // ...
}
Things you want to check to make your application faster:
1. Can the compiler inline this function in places where it is called?
Either declare the function as inline in a header or put the definition in the compile unit where it is used.
2. Not computing something is faster than computing something more efficiently
Are all calls to this function necessary? High cost is generally the sign of a function called inside a high-frequency loop or in an expensive algorithm. You can often reduce the call count, and hence the time spent in the function, by optimizing the outer algorithm.
3. Is n_zero small or, even better, a constant?
Compilers are pretty good at optimizing algorithms for typically small constant values. If the constant is known to the compiler, it will most likely remove the loop entirely.
4. Does the bitwise operation help here?
It definitely has an effect and allows Clang (but not GCC, as far as I can tell) to do some vectorization. Vectorization tends to be faster, but that's not always the case, depending on your hardware and the actual data processed.
Whether it is an optimization or not might depend on how big n_zero is. Considering you are processing checksums, it should be pretty small so it sounds like a potential optimization.
For a known n_zero, using bitwise operations allows the compiler to remove all branching. I expect, though I did not measure, this to be faster.
std::all_of and std::string::starts_with should compile to essentially the same code as your implementation, except that they will use && instead of &.
Unless n_zero is quite high I would agree with others that you may be misinterpreting profiler results. But anyway:
Could the data be swapped to disk? If your system is under RAM pressure, data could be swapped out to disk and need to be loaded back to RAM when you perform the first operation on it. (Assuming this checksum check is the first access to the data in a while.)
Chances are you could use multiple threads/processes to take advantage of a multicore processor.
Maybe you could use statistics/correlation of your input data, or other structural features of your problem.
For instance, if you have a large number of digits (e.g. 50) and you know that the later digits have a higher probability of being nonzero, you can check the last one first.
If nearly all of your checksums should match, you can use [[likely]] to give a compiler hint that this is the case. (Probably won't make a difference but worth a try.)
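A tiny illustration of that hint (my addition), applied to the early-return loop from earlier in this thread; marking the mismatch branch [[unlikely]] expresses the same expectation that nearly all checksums match:

bool starts_with(const char* cksum_hex, int n_zero) {
    for (int i = 0; i < n_zero; ++i)
        if (cksum_hex[i] != '0') [[unlikely]] {  // C++20 attribute: mismatches are expected to be rare
            return false;
        }
    return true;
}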
Adding my two cents to this interesting discussion, though a little late to the game: you could use std::equal. It's a fast method with a slightly different approach, using a hardcoded string with the maximum number of zeros instead of the exact number of zeros.
It works by passing the function pointers to the beginning and end of the string to be searched and of the string of zeros, with the zero-string's end pointing one past the wanted number of zeros; these are used as iterators by std::equal:
Sample
#include <algorithm>
#include <iostream>

bool startsWith(const char* str, const char* end, const char* substr, const char* subend) {
    return std::equal(str, end, substr, subend);
}

int main() {
    const char* str = "000x1234567";
    const char* substr = "0000000000000000000000000000";
    std::cout << startsWith(&str[0], &str[3], &substr[0], &substr[3]);
}
Using the test cases in @pptaszni's good answer and the same testing conditions:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
The results were as follows:
It is slower than using memcmp, but still faster (except for false results with a low number of zeros) and more consistent than your original code.
Use std::all_of
return std::all_of(cksum_hex, cksum_hex + n_zero, [](char c){ return c == '0'; });

How to design an INT of 16, 32, 64 bytes or even bigger in C++

As a beginner, I know we can use an array to store larger numbers if required, but I want to have a 16-byte integer data type in C++ on which I can perform all the arithmetic operations available for basic data types like int or float.
So can we, in effect, increase the size of the default data types as desired, like an int of 64 bytes or a double of 120 bytes - not directly on the basic data type, but something that in effect is the same as increasing the capacity of the datatype?
Is this even possible? If yes, then how, and if not, what are completely different ways to achieve the same thing?
Yes, it's possible, but no, it's not trivial.
First, I feel obliged to point out that this is one area where C and C++ really don't provide as much access to the hardware at the lowest level as you'd really like. In assembly language, you normally get a couple of features that make multiple-precision arithmetic quite a bit easier to implement. One is a carry flag. This tracks whether a previous addition generated a carry (or a previous subtraction a borrow). So to add two 128-bit numbers on a machine with 64-bit registers, you'd typically write code on this general order:
; r0 contains the bottom 64-bits of the first operand
; r1 contains the upper 64 bits of the first operand
; r2 contains the lower 64 bits of the second operand
; r3 contains the upper 64 bits of the second operand
add r0, r2
adc r1, r3
Likewise, when you multiply two numbers, most processors generate the full answer in two separate registers, so when (for example) you multiply two 64-bit numbers, you get a 128-bit result.
In C and C++, however, we don't get that. One easy way to get around it is to work in smaller chunks. For example, if we want a 128-bit type on an implementation that provides 64-bit long long as its largest integer type, we can work in 32-bit chunks. When we're going to do an operation, we widen those to a long long, and do the operation on the long long. This way, when we add or multiply two 32-bit chunks, if the result is larger than 32 bits, we can still store it all in our 64-bit long long.
So, for addition, life is pretty easy. We add the two lowest-order chunks. We use a bitmask to get the bottom 32 bits and store them into the bottom 32 bits of the result. Then we take the upper 32 bits and use them as a "carry" when we add the next 32 bits of the operands. Continue until we've added all 128 (or however many) bits of the operands and gotten our overall result.
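A minimal sketch of that chunked addition (my illustration, not a full library): a 128-bit unsigned value stored as four 32-bit "digits", added with the carry propagated through a 64-bit temporary.

#include <array>
#include <cstdint>
#include <cstddef>

using u128_chunks = std::array<uint32_t, 4>;   // chunks[0] = least significant 32 bits

u128_chunks add_u128(const u128_chunks& a, const u128_chunks& b) {
    u128_chunks sum{};
    uint64_t carry = 0;
    for (std::size_t i = 0; i < sum.size(); ++i) {
        uint64_t t = uint64_t(a[i]) + b[i] + carry;  // widen so the carry isn't lost
        sum[i] = uint32_t(t & 0xFFFFFFFFu);          // low 32 bits go into the result
        carry  = t >> 32;                            // high bits become the next carry
    }
    return sum;                                      // overflow out of the top chunk is discarded
}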
Subtraction is pretty similar. In fact, we can take the 2's complement of the second operand, then add to get our result.
Multiplication gets a little trickier. It's not always immediately obvious how we can carry out multiplication in smaller pieces. The usual approach is based on the distributive property. That is, we can take some large numbers A and B and break them up as A = a1*2^32 + a0 and B = b1*2^32 + b0, where each a_n and b_n is a 32-bit chunk of its operand. Then we use the distributive property to turn the product into:
a1*b1*2^64 + (a0*b1 + a1*b0)*2^32 + a0*b0
This can be extended to an arbitrary number of "chunks", though if you're dealing with really large numbers there are much better ways (e.g., Karatsuba multiplication).
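As a concrete sketch of the distributive-property step (my addition, using 32-bit halves so every partial product fits in a uint64_t), here is one 64x64 -> 128-bit multiply; a full big-integer multiply repeats this chunk by chunk:

#include <cstdint>

struct u128 { uint64_t lo, hi; };

u128 mul_64x64_to_128(uint64_t a, uint64_t b) {
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;

    uint64_t p00 = a0 * b0;            // contributes to bits 0..63
    uint64_t p01 = a0 * b1;            // contributes to bits 32..95
    uint64_t p10 = a1 * b0;            // contributes to bits 32..95
    uint64_t p11 = a1 * b1;            // contributes to bits 64..127

    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    u128 r;
    r.lo = (p00 & 0xFFFFFFFFu) | (mid << 32);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}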
If you want to define non-atomic big integers, you can use plain structs.
#include <array>
#include <cstddef>
#include <cstdint>

template <std::size_t size>
struct big_int {
    std::array<std::int8_t, size> bytes;
};

using int128_t = big_int<16>;
using int256_t = big_int<32>;
using int512_t = big_int<64>;

int main() {
    int128_t i128 = { 0 };
}

Getting the high half and low half of a full integer multiply

I start with three values A, B, C (unsigned 32-bit integers), and I have to obtain two values D, E (also unsigned 32-bit integers), where
D = high(A*C);
E = low(A*C) + high(B*C);
I expect that multiplying two 32-bit uints produces a 64-bit result. "high" and "low" are just my convention for marking the first 32 bits and the last 32 bits of the 64-bit result of the multiply.
I'm trying to optimize some already-functional code. I have a short part of the code inside a huge loop which is just a few lines, yet it consumes almost all of the computational time (a physical simulation computing for a couple of hours). That's the reason why I'm trying to optimize this little part, so the rest of the code can remain more "user-well-arranged".
There are some SSE instructions that fit the mentioned routine. The gcc compiler probably does the optimization work. However, I don't reject the option of writing some piece of code directly in SSE instructions, if it is necessary.
Please be patient with my low experience with SSE. I will try to write the algorithm for SSE just symbolically. There will probably be some mistakes with ordering, masks, or understanding of the structure.
1. Store four 32-bit integers into one 128-bit register in the order A, B, C, C.
2. Apply an instruction (probably pmuludq) to the mentioned 128-bit register which multiplies pairs of 32-bit integers and returns pairs of 64-bit integers as the result. So it should calculate the products A*C and B*C simultaneously and return two 64-bit values.
3. I expect that I now have new 128-bit register values P,Q,R,S (four 32-bit blocks), where P,Q is the 64-bit result of A*C and R,S is the 64-bit result of B*C. Then I continue by rearranging the values in the register into the order P,Q,0,R.
4. Take the first 64 bits P,Q and add the second 64 bits 0,R. The result is a new 64-bit value.
5. Read the first 32 bits of the result as D and the last 32 bits as E.
This algorithm should return the correct values for D and E.
My question:
Is there static C++ code which generates an SSE routine similar to the 1-5 algorithm above? I prefer solutions with higher performance. If the algorithm is problematic for standard C++ commands, is there a way to write the algorithm in SSE directly?
I use TDM-GCC 4.9.2 64-bit compiler.
(note: Question was modified after advice)
(note2: I took inspiration from http://sci.tuomastonteri.fi/programming/sse for using SSE to obtain better performance)
You don't need vectors for this unless you have multiple inputs to process in parallel. clang and gcc already do a good job of optimizing the "normal" way to write your code: cast to twice the size, multiply, then shift to get the high half. Compilers recognize this pattern.
They notice that the operands started out as 32bit, so the upper halves are all zero after casting to 64b. Thus, they can use x86's mul insn to do a 32b*32b->64b multiply, instead of doing a full extended-precision 64b multiply. In 64bit mode, they do the same thing with a __uint128_t version of your code.
Both of these functions compile to fairly good code (one mul or imul per multiply). gcc -m32 doesn't support 128b types, but I won't get into that because 1. you only asked about full multiplies of 32bit values, and 2. you should always use 64bit code when you want something to run fast. If you are doing full-multiplies where the result doesn't fit in a register, clang will avoid a lot of extra mov instructions, because gcc is silly about this. This little test function made a good test-case for that gcc bug report.
That godbolt link includes a function that calls this in a loop, storing the result in an array. It auto-vectorizes with a bunch of shuffling, but still looks like a speedup if you have multiple inputs to process in parallel. A different output format might take less shuffling after the multiply, like maybe storing separate arrays for D and E.
I'm including the 128b version to show that compilers can handle this even when it's not trivial (e.g. just do a 64bit imul instruction to do a 64*64->64b multiply on the 32bit inputs, after zeroing any upper bits that might be sitting in the input registers on function entry.)
When targeting Haswell CPUs and newer, gcc and clang can use the mulx BMI2 instruction. (I used -mno-bmi2 -mno-avx2 in the godbolt link to keep the asm simpler. If you do have a Haswell CPU, just use -O3 -march=haswell.) mulx dest1, dest2, src1 does dest1:dest2 = rdx * src1 while mul src1 does rdx:rax = rax * src1. So mulx has two read-only inputs (one implicit: edx/rdx), and two write-only outputs. This lets compilers do full-multiplies with fewer mov instructions to get data into and out of the implicit registers for mul. This is only a small speedup, esp. since 64bit mulx has 4 cycle latency instead of 3, on Haswell. (Strangely, 64bit mul and mulx are slightly cheaper than 32bit mul and mulx.)
// compiles to good code: you can and should do this sort of thing:
#include <stdint.h>

struct DE { uint32_t D,E; };

struct DE f_structret(uint32_t A, uint32_t B, uint32_t C) {
    uint64_t AC = A * (uint64_t)C;
    uint64_t BC = B * (uint64_t)C;
    uint32_t D = AC >> 32;          // high half
    uint32_t E = AC + (BC >> 32);   // We could cast to uint32_t before adding, but don't need to
    struct DE retval = { D, E };
    return retval;
}

#ifdef __SIZEOF_INT128__  // IDK the "correct" way to detect __int128_t support
struct DE64 { uint64_t D,E; };

struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
    __uint128_t AC = A * (__uint128_t)C;
    __uint128_t BC = B * (__uint128_t)C;
    uint64_t D = AC >> 64;          // high half
    uint64_t E = AC + (BC >> 64);
    struct DE64 retval = { D, E };
    return retval;
}
#endif
If I understand it correctly, you want to compute the number of potential overflows in A*B. If so, then you have two good options: "use a twice-as-big variable" (write 128-bit math functions for uint64 - it's not that hard (or wait for me to post it tomorrow)), or "use a floating point type":
(float(A)*float(B))/float(C)
as the loss of precision is minimal (assuming float is 4 bytes, double 8 bytes, and long double 16 bytes long), and both float and uint32 require 4 bytes of memory (use double for uint64_t as it should be 8 bytes long):
#include <iostream>
#include <conio.h>
#include <stdint.h>
using namespace std;

int main()
{
    uint32_t a(-1), b(-1);
    uint64_t result1;
    float result2;
    result1 = uint64_t(a)*uint64_t(b)/4294967296ull; // >>32 would be faster and less memory consuming
    result2 = float(a)*float(b)/4294967296.0f;
    cout.precision(20);
    cout<<result1<<'\n'<<result2;
    getch();
    return 0;
}
Produces:
4294967294
4294967296
But if you want a really precise and correct answer, I'd suggest using a twice-as-big type for the computation.
Now that I think of it - you could use long double for uint64 and double for uint32 instead of writing a function for uint64, but I don't think it's guaranteed that long double will be 128-bit, and you'd have to check that. I'd go for the more universal option.
EDIT:
You can write a function to calculate this without using anything more than A, B and a result variable of the same type as A. In each of X passes (where X is the bit width), let Z = B*((A>>pass_number)&1); add Z to the result, then right-shift the result by 1. The just-computed low bit becomes irrelevant after that point, as it won't participate in the calculation anymore and would have been erased anyway by the final division by 2^X (doing >>X).
It's just a quick idea. You'll have to check its correctness (sorry, but I'm really tired right now) - but the result shouldn't overflow at any point of the calculation, as the maximum carry would have a value of 2X if I'm correct, and the algorithm itself seems to be good.
I will write code for that tomorrow if you'll still be in need of help.
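For reference, here is a concrete sketch of that shift-and-add idea as I understand it (my interpretation, not the answerer's promised code): it computes the high 32 bits of a 32x32 multiply using only 32-bit variables, handling the single possible carry of each addition manually. For 64-bit inputs the same loop works with uint64_t and 64 passes.

#include <cstdint>

uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int pass = 0; pass < 32; ++pass) {
        uint32_t z = (a >> pass) & 1 ? b : 0;   // Z = B * (current bit of A)
        uint32_t sum = result + z;
        uint32_t carry = (sum < result);        // the addition can carry out by one bit
        result = (sum >> 1) | (carry << 31);    // shift right, re-inserting the carry at the top
    }
    return result;                              // == uint32_t((uint64_t(a) * b) >> 32)
}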