I want to add 2 unsigned vectors using AVX2
__m256i i1 = _mm256_loadu_si256((__m256i *) si1);
__m256i i2 = _mm256_loadu_si256((__m256i *) si2);
__m256i result = _mm256_adds_epu16(i2, i1);
However, I need wrapping (overflow) instead of the saturation that _mm256_adds_epu16 does, so that the results are identical to the non-vectorized code. Is there any solution for that?
Use normal binary wrapping _mm256_add_epi16 instead of saturating adds.
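For example, a minimal sketch of the drop-in fix (the wrapper function and the uint16_t element type are my assumptions, matching the epu16 intrinsic in the question):

#include <immintrin.h>
#include <stdint.h>

// _mm256_add_epi16 wraps modulo 2^16, matching scalar uint16_t addition,
// unlike the saturating _mm256_adds_epu16.
void add_wrap16(const uint16_t *si1, const uint16_t *si2, uint16_t *dst)
{
    __m256i i1 = _mm256_loadu_si256((const __m256i *) si1);
    __m256i i2 = _mm256_loadu_si256((const __m256i *) si2);
    __m256i result = _mm256_add_epi16(i2, i1);  // wraps instead of saturating
    _mm256_storeu_si256((__m256i *) dst, result);
}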
Two's complement and unsigned addition/subtraction are the same binary operation, that's one of the reasons modern computers use two's complement. As the asm manual entry for vpaddw mentions, the instructions can be used on signed or unsigned integers. (The intrinsics guide entry doesn't mention signedness at all, so is less helpful at clearing up this confusion.)
Compares like _mm_cmpgt_epi32 are sensitive to signedness, but math operations (and cmpeq) aren't.
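As a sketch of how to work around that signedness sensitivity for compares (assuming SSE4.1 for _mm_max_epu32; the helper name is mine), an unsigned a >= b can be built from the sign-agnostic cmpeq plus an unsigned max:

#include <immintrin.h>

// a >= b (unsigned) exactly when max_epu32(a, b) == a.
__m128i cmpge_epu32(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi32(_mm_max_epu32(a, b), a);
}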
The intrinsic names Intel chose might look like they're for signed integers specifically, but they always use epi or si for things that work equally on signed and unsigned elements. epu implies a specifically unsigned thing, while epi can be a specifically signed operation, or something that works equally on signed or unsigned, or something where signedness is irrelevant.
For example, _mm_and_si128 is pure bitwise. _mm_srli_epi32 is a logical right shift, shifting in zeros, like an unsigned C shift. Not copies of the sign bit, that's _mm_srai_epi32 (shift right arithmetic by immediate). Shuffles like _mm_shuffle_epi32 just move data around in chunks.
Non-widening multiplication like _mm_mullo_epi16 and _mm_mullo_epi32 are also the same for signed or unsigned. Only the high-half _mm_mulhi_epu16 or widening multiplies _mm_mul_epu32 have unsigned forms as counterparts to their specifically signed epi16/32 forms.
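A quick scalar sketch of that low-half equivalence (my example, not from the answer):

#include <cassert>
#include <cstdint>

int main() {
    uint16_t a = 0xFFFF, b = 3;                        // as signed: -1 * 3
    uint16_t lo_u = uint16_t(a * b);                   // low half: 0xFFFD
    int16_t  lo_s = int16_t(int16_t(a) * int16_t(b));  // -3, also 0xFFFD
    assert(lo_u == uint16_t(lo_s));                    // same bits in the low half
    // The high halves differ (0x0002 unsigned vs 0xFFFF signed), which is
    // why pmulhuw and pmulhw exist as separate instructions.
}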
That's also why 386 only added a scalar integer imul ecx, esi form, not also a mul ecx, esi, because only the FLAGS setting would differ, not the integer result. And SIMD operations don't even have FLAGS outputs.
The intrinsics guide unhelpfully describes _mm_mullo_epi16 as sign-extending and producing a 32-bit product, then truncating to the low 16 bits. The asm manual for pmullw also describes it as signed that way, seemingly treating it as the companion to the signed pmulhw. (And it has some bugs, like describing the AVX1 VPMULLW xmm1, xmm2, xmm3/m128 form as multiplying 32-bit dword elements, probably a copy/paste error from pmulld.)
And sometimes Intel's naming scheme is limited: _mm_maddubs_epi16 is a u8 x i8 => 16-bit widening multiply, adding pairs horizontally (with signed saturation). I usually have to look up the intrinsic for pmaddubsw to remind myself that they named it after the output element width, not the inputs. The inputs have different signedness, so if they have to pick one side, I guess it makes sense to name it for the output, with the signed saturation that can happen with some inputs, like for pmaddwd.
(I'm a novice, so there may be inaccuracies in what I say.)
In my current mental model, overflow is an arithmetic phenomenon (it occurs when we perform arithmetic operations), and implicit conversion is an assignment phenomenon (initialization or not; it occurs when we make an assignment where the right-hand side's value doesn't fit into the left-hand side's type).
However, I often see the concepts 'overflow' and 'implicit conversion' used interchangeably, differently from what I expect. For example, this quote from the learncpp team, talking about overflow and 'bit insufficiency' for signed int:
Integer overflow (often called overflow for short) occurs when we try to store a value that is outside the range of the type. Essentially, the number we are trying to store requires more bits to represent than the object has available. In such a case, data is lost because the object doesn’t have enough memory to store everything [1].
and this, talking about overflow for unsigned int:
What happens if we try to store the number 280 (which requires 9 bits to represent) in a 1-byte (8-bit) unsigned integer? The answer is overflow [2]*
and especially this one, which uses 'modulo wrapping':
Here’s another way to think about the same thing. Any number bigger than the largest number representable by the type simply “wraps around” (sometimes called “modulo wrapping”). 255 is in range of a 1-byte integer, so 255 is fine. 256, however, is outside the range, so it wraps around to the value 0. 257 wraps around to the value 1. 280 wraps around to the value 24 [2].
In such cases, it is said that assignments that exceed the limits of the left-hand side lead to overflow, but in this context I would expect the term 'implicit conversion'.
I also see the term overflow used for arithmetic expressions whose result exceeds the limits of the left-hand side.
1 Is there any technical difference between implicit conversion and overflow/underflow?
I think so. In reference [3], in the section 'Numeric conversions - Integral conversions', for unsigned integers:
[...] the resulting value is the smallest unsigned value equal to the source value modulo 2^n
where n is the number of bits used to represent the destination type [3].
and for signed (emphasis mine):
If the destination type is signed, the value does not change if the source integer can be represented in the destination type. Otherwise the result is implementation-defined (until C++20) / the unique value of the destination type equal to the source value modulo 2^n, where n is the number of bits used to represent the destination type (since C++20). (Note that this is different from signed integer arithmetic overflow, which is undefined) [3].
If we go to the referenced section (Overflows), we find (emphasis mine):
Unsigned integer arithmetic is always performed modulo 2^n, where n is the number of bits in that particular integer. [...]
When a signed integer arithmetic operation overflows (the result does not fit in the result type), the behavior is undefined [4].
To me, clearly overflow is an arithmetic phenomenon and implicit conversion is an assignment phenomenon where the value does not fit. Is my interpretation accurate?
2 Is there any difference at the bit level (CPU) between implicit conversion and overflow?
I think so, too. I'm far from being good at C++, and even more so at assembly, but as an experiment, if we check the output of the code below with MSVC (flag /std:c++20) and MASM (Macro Assembler), especially checking the flags register, different phenomena occur depending on whether we have an arithmetic operation or an assignment ('implicit conversion').
(I checked the flags register in the Visual Studio 2022 debugger. The assembly below is practically the same as the one from debugging.)
#include <iostream>
#include <limits>
int main(void) {
long long x = std::numeric_limits<long long>::max();
int y = x; // implicit conversion: x truncated to 32 bits

long long k = std::numeric_limits<long long>::max();
++k; // arithmetic operation that overflows (undefined behavior in C++)
}
The output is:
y$ = 32
k$ = 40
x$ = 48
main PROC
$LN3:
sub rsp, 72 ; 00000048H
call static __int64 std::numeric_limits<__int64>::max(void) ; std::numeric_limits<__int64>::max
mov QWORD PTR x$[rsp], rax
mov eax, DWORD PTR x$[rsp]
mov DWORD PTR y$[rsp], eax
call static __int64 std::numeric_limits<__int64>::max(void) ; std::numeric_limits<__int64>::max
mov QWORD PTR k$[rsp], rax
mov rax, QWORD PTR k$[rsp]
inc rax
mov QWORD PTR k$[rsp], rax
xor eax, eax
add rsp, 72 ; 00000048H
ret 0
main ENDP
It can be checked at https://godbolt.org/z/6j6G69bTP
The copy-initialization of y in C++ corresponds to this in MASM:
int y = x;
mov eax, DWORD PTR x$[rsp]
mov DWORD PTR y$[rsp], eax
The mov instructions simply ignore the upper 32 bits of x and capture only its low 32 bits: the DWORD PTR operand selects the low dword of x, which is loaded into the 32-bit eax register and then stored to y.
The mov instruction sets neither the overflow flag nor the carry flag.
The increment of k in C++ corresponds to this in MASM:
++k;
mov rax, QWORD PTR k$[rsp]
inc rax
mov QWORD PTR k$[rsp], rax
When the inc instruction is executed, the overflow flag (signed overflow) is set to 1.
To me, although you can implement (mov) conversions in different ways, there is a clear difference between conversions using mov variants and arithmetic overflow: arithmetic sets the flags. Is my interpretation accurate?
Notes
* Apparently there's some discussion about the term overflow for unsigned types, but that's not what I'm discussing here.
References
[1] https://www.learncpp.com/cpp-tutorial/signed-integers/
[2] https://www.learncpp.com/cpp-tutorial/unsigned-integers-and-why-to-avoid-them/
[3] https://en.cppreference.com/w/cpp/language/implicit_conversion
[4] https://en.cppreference.com/w/cpp/language/operator_arithmetic#Overflows
Let's try to break it down. We have to start with some more terms.
Ideal Arithmetic
Ideal arithmetic refers to arithmetic as it takes place in mathematics where the involved numbers are true integers with no limit to their size. When implementing arithmetic on a computer, integer types are generally limited in their size and can only represent a limited range of numbers. The arithmetic between these is no longer ideal in the sense that some arithmetic operations can result in values that are not representable in the types you use for them.
Carry out
A carry out occurs when an addition produces a carry out of the most significant bit. In architectures with flags, this commonly causes the carry flag to be set. When calculating with unsigned numbers, the presence of a carry out indicates that the result did not fit into the number of bits of the output register and hence does not represent the ideal arithmetic result.
The carry out is also used in multi-word arithmetic to carry the 1 between the words that make up the result.
Overflow
On a two's complement machine, an integer overflows when the carry out of an addition is not equal to the carry into the final bit. In architectures with flags, this commonly causes the overflow flag to be set. When calculating with signed numbers, the presence of overflow indicates that the result did not fit into the output register and hence does not represent the ideal arithmetic result.
With regards to “the result does not fit,” it's like a carry out for signed arithmetic. However, when using multi-word arithmetic of signed numbers you still need to use the normal carry out to carry the one to the next word.
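A sketch of that multi-word case in C++ (the 128-bit helper is my illustration, using unsigned 64-bit words; the high words can afterwards be interpreted as signed or unsigned):

#include <cstdint>

// 128-bit addition from 64-bit halves. The carry-out of the low word is
// propagated into the high word regardless of how the numbers are signed.
void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi)
{
    *r_lo = a_lo + b_lo;              // wraps modulo 2^64
    uint64_t carry = (*r_lo < a_lo);  // 1 if the low addition carried out
    *r_hi = a_hi + b_hi + carry;
}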
Some authors call carry out “unsigned overflow” and overflow “signed overflow.” The idea here is that in such a nomenclature, overflow refers to any condition in which the result of an operation is not representable. Other kinds of overflows include floating-point overflow, handled on IEEE-754 machines by saturating to ±Infinity.
Conversion
Conversion refers to taking a value represented in one data type and representing it in another data type. When the data types involved are integer types, this is usually done by extension, truncation, saturation, or reinterpretation.
extension is used to convert types into types with more bits and refers to just adding more bits past the most significant bit. For unsigned numbers, zeroes are added (zero extension). For signed numbers, copies of the sign bit are added (sign extension). Extension always preserves the value extended.
truncation is used to convert types into types with fewer bits and refers to removing bits from the most significant end until the desired width is reached. If the value is representable in the new type, it is unchanged. Otherwise it is changed as if by modulo reduction.
saturation is used to convert types into types with the same number of bits or fewer, and works like truncation, except that if the value is not representable, it is replaced by the smallest (if less than 0) or largest (if greater than 0) value of the destination type.
reinterpretation is used to convert between types of the same amount of bits and refers to interpreting the bit pattern of the original type as the new type. Values that are representable in the new type are preserved when doing this between signed and unsigned types. (For example, the bit-pattern for a non-negative signed 32 bit integer represents the same number when interpreted as an unsigned 32 bit integer.)
An implicit conversion is just a conversion that happens without being explicitly spelled out by the programmer. Some languages (like C) have these, others don't.
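A minimal C++ sketch of the four kinds (my example; the saturation line is done by hand with std::clamp, since plain integer conversions never saturate, and std::bit_cast needs C++20):

#include <algorithm>
#include <bit>
#include <cstdint>

static_assert(int16_t(int8_t(-3)) == -3);                  // sign extension preserves the value
static_assert(uint16_t(uint8_t(0xFD)) == 0x00FD);          // zero extension
static_assert(uint8_t(0x1FF) == 0xFF);                     // truncation: modulo reduction
static_assert(std::clamp(300, -128, 127) == 127);          // saturation, emulated
static_assert(std::bit_cast<uint8_t>(int8_t(-1)) == 255);  // reinterpretation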
When an attempt is made to convert from one type to another and the result is not representable, some authors also refer to this situation as “overflow,” as with “signed overflow” and “unsigned overflow.” It is however a different phenomenon, caused by a change in bit width rather than by arithmetic. So yes, your interpretation is accurate. These are two separate phenomena related through the common idea of “the resulting value doesn't fit the type.”
To see how the two are interlinked, you may also interpret addition of two n bit numbers as resulting in a temporary n + 1 bit number such that the addition is always ideal. Then, the result is truncated to n bit and stored in the result register. If the result is not representable, then either carry out or overflow occurred, depending on the desired signedness. The carry out bit is then exactly the most significant bit of the temporary result that is then discarded to reach the final result.
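A small C++ sketch of that view for 8-bit addition (my illustration; the overflow test is the standard sign-comparison formulation):

#include <cstdint>

// Widen to get the "ideal" 9-bit sum, then inspect the discarded bit.
bool carry_out(uint8_t a, uint8_t b) {
    uint16_t wide = uint16_t(a) + uint16_t(b);
    return (wide >> 8) != 0;            // bit 8 is the carry-out
}
bool overflow(uint8_t a, uint8_t b) {
    uint8_t sum = uint8_t(a + b);       // truncated result
    // Signed overflow: the inputs share a sign bit, the sum has the other one.
    return (~(a ^ b) & (a ^ sum) & 0x80) != 0;
}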
Question 2
To me, although you can implement (mov) conversions in different ways, there is a clear difference between conversions using mov variants and arithmetic overflow: arithmetic sets the flags. Is my interpretation accurate?
The interpretation is not correct and the presence of flags is a red herring. There are both architectures where data moves set flags (e.g. ARMv6-M) and architectures where arithmetic doesn't set flags (e.g. x86 when using the lea instruction to perform it) or that do not even have flags (e.g. RISC-V).
Note also that a conversion (implicit or not) does not necessarily result in an instruction. Sign extension and saturation usually do, but zero extension is often implemented by just ensuring that the top part of a register is clear, which the CPU may be able to do as a side effect of other operations you want to perform anyway. Truncation may be implemented by just ignoring the top part of the register. And reinterpretation, by its nature, generally does not generate any code at all.
As for carry out and overflow, whether they occur depends on the values you perform arithmetic with. These are things that just happen, and unless you want to detect that they happened, no code is needed for them. It's simply the default behavior.
I need to correctly convert a YMM register with 8 int32_t to an XMM register with 8 unsigned uint8_t at the bottom, using AVX intrinsics. It should be the analogue of static_cast<uint8_t>: the C++ standard rules apply (modular reduction), so we get truncation of the 2's complement bit-pattern.
For example, (int32_t)(-1) -> (uint8_t)(255), and +200 -> (uint8_t)(200), so we can't use signed or unsigned saturation to 8-bit (or even to 16-bit as an intermediate step).
I have this code as an example:
packssdw xmm0, xmm0
packuswb xmm0, xmm0
movd somewhere, xmm0
But these instructions saturate (packssdw with signed saturation, packuswb with unsigned saturation), so we get (int32_t)(-1) -> (uint8_t)(0).
I know vcvttss2si and it works correctly but only for one value. For the best performance I want to use vector registers.
Also I know about shuffling, but it's too slow for me.
So is there another way to convert from an int32_t YMM to uint8_t, behaving like static_cast<uint8_t>?
UPD: The comment by @chtz answers my question.
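For the record, one way that matches the required truncation semantics (a sketch, not necessarily what the comment proposed; the function name is mine): mask each dword down to its low byte first, so the saturating packs can never actually clamp anything.

#include <immintrin.h>

// Truncate 8 x int32_t to 8 x uint8_t like static_cast<uint8_t>,
// keeping only the low byte of each element (AVX2).
__m128i trunc32to8(__m256i v)
{
    v = _mm256_and_si256(v, _mm256_set1_epi32(0xFF));  // low byte only: 0..255
    v = _mm256_packus_epi32(v, v);   // dword -> word, within each 128-bit lane
    v = _mm256_packus_epi16(v, v);   // word -> byte, within each 128-bit lane
    __m128i lo = _mm256_castsi256_si128(v);            // bytes 0..3 from lane 0
    __m128i hi = _mm256_extracti128_si256(v, 1);       // bytes 4..7 from lane 1
    return _mm_unpacklo_epi32(lo, hi);                 // result in the low 8 bytes
}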
#include <iostream>
using namespace std;

int main(){
unsigned int num1 = 0x65764321;
unsigned int num2 = 0x23657432;
unsigned int sum = num1 + num2;
cout << hex << sum;
return 0;
}
If I have two unsigned integers, say num1 and num2, and then I tell the computer to do unsigned int sum = num1 + num2; what method does the computer use to add them? Would it be two's complement? Would the sum variable be printed in two's complement?
2's complement addition is identical to unsigned addition as far as the actual bits are concerned. In actual hardware, the design will be something more complicated, like a carry-lookahead adder (https://en.wikipedia.org/wiki/Carry-lookahead_adder), so it can be low latency: not having to wait for the carry to ripple across 32 or 64 bits, because that's too many gate-delays for add to be single-cycle latency.
One's complement and sign/magnitude are the other signed-integer representations that C++ allows implementations to use, and their wrap-around behaviour is different from unsigned.
For example, one's complement addition has to wrap the carry-out back into the low bit. See this article about optimizing TCP checksum calculation for how to implement one's complement addition on hardware that only provides 2's complement / unsigned addition (specifically x86).
C++ leaves signed overflow as undefined behaviour, but real one's complement and sign/magnitude hardware does have specific documented behaviour. reinterpret_casting an unsigned bit pattern to a signed integer gives a result that depends on what kind of hardware you're running on. (All modern hardware is 2's complement, though.)
Since the bitwise operation is the same for unsigned or 2's complement, it's all about how you interpret the results. On CPU architectures like x86 that set flags based on the results of an instruction, the overflow flag is only relevant for the signed interpretation, and the carry flag is only relevant for the unsigned interpretation. The hardware produces both from a single instruction, instead of having separate signed/unsigned add instructions that do the same thing.
See http://teaching.idallen.com/dat2343/10f/notes/040_overflow.txt for a great write-up about unsigned carry vs. signed overflow, and x86 flags.
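As a C++ sketch of that single-operation duality (using GCC/Clang's __builtin_add_overflow; on x86 each function compiles to one add plus a check of CF or OF, depending on the argument types):

#include <cstdint>

bool unsigned_carry(uint32_t a, uint32_t b) {
    uint32_t s;
    return __builtin_add_overflow(a, b, &s);  // true iff carry-out (CF)
}
bool signed_overflow(int32_t a, int32_t b) {
    int32_t s;
    return __builtin_add_overflow(a, b, &s);  // true iff signed overflow (OF)
}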
On other architectures, like MIPS, there is no FLAGS register. You have to use a compare or test instruction to figure out what happened (carry or zero or whatever). The add instruction doesn't set flags. See this MIPS Q&A about add-with-carry for a 64-bit add on 32-bit MIPS.
But for detecting signed overflow, MIPS add raises an exception on overflow (where x86 would set OF), so you use addu for signed or unsigned addition if you want it not to fault on signed overflow.
now the overflow flag here is 1 (it's an example given by our instructor), meaning there is overflow, but there is no carry. So how can there be overflow here?
You have a C++ program, not an x86 assembly language program! C++ doesn't have a carry or overflow flag.
If you compiled this program for x86 with a non-optimizing compiler, and it used the ADD instruction with your two inputs, you would get OF=1 and CF=0 from that ADD instruction.
But the compiler might use lea edi, [rax+rdx] to do the sum without overwriting either input, and LEA doesn't set flags.
Or if the compiler did the addition at compile time, your code would compile the same as source like this:
cout << hex << 0x88dbb753U;
and no addition of your numbers would take place at run-time. (There will of course be lots of addition in the iostream library functions, and maybe even an add instruction in main() as part of making a stack frame, if your compiler chooses to emit code that sets up a stack frame.)
i have two unsigned integers
What method does the computer use to add them
Whatever method is available on the target CPU architecture. Most have an instruction named ADD.
Would the sum variable be printed in two's complement.
Two's complement is a way to represent an integer type in binary. It is not a way to print numbers.
Consider I want to typecast float 32 bit data to an unsigned integer 32 bit.
float foo = 5.0f;
uint32_t goo;      // std::uint32_t from <cstdint>
goo = (uint32_t)foo;
How does the compiler typecast a variable? Are there any intermediate steps, and if yes, what are they?
This is going to depend on the hardware to a large degree (though things like the OS/compiler can and will change how the hardware is used).
For example, on older Intel processors (and current ones, in most 32-bit code) you use the x87 instruction set for (most) floating point. In this case, there's an fistp (floating point integer store and pop, though there's also a non-popping variety, in case you also need to continue using that floating point value) that supports storing to an integer, so that's what's typically used. There's a bit in the floating point control word that controls how that conversion will be done (rounding versus truncating) that has to be set correctly as well.
On current hardware (with a current compiler producing 64-bit code) you're typically going to be using SSE instructions, in which case conversion to a signed int can be done with cvtss2si.
Neither of these directly supports unsigned operands though (at least I'm sure SSE doesn't, and to the best of my recollection x87 doesn't either). For these, the compiler probably has a small function in the library, so it's done (at least partly) in software.
If your compiler supports generating AVX-512 instructions, it could use vcvttss2usi to convert from 32-bit float to 64-bit unsigned integer, then store the bottom 32 bits of that integer to your destination (and if the result was negative or too large to fit in a 32-bit integer, well... the standard says that gives undefined behavior, so you got what you deserved).
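Without AVX-512, a sketch of the usual x86-64 approach (my illustration, for values that fit): convert through a wider signed integer with the plain cvttss2si, then truncate.

#include <cstdint>

// float -> uint32_t: one cvttss2si with a 64-bit destination covers the
// whole uint32_t range; keeping the low 32 bits gives the final result.
uint32_t f2u32(float f)
{
    return uint32_t(int64_t(f));
}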
Floating point and unsigned numbers are stored in completely different formats, so the compiler must change from one to the other, as opposed to casting from unsigned to signed where the format of the number is the same and the software just changes how the number is interpreted.
For example, in 32 bit floats the number 5 is represented as 0x40a00000 in float, and 0x00000005 as an unsigned.
This conversion ends up being significant when you're working with weaker microcontrollers; on anything with an FPU it is a few extra instructions.
Here is the wikipedia write up on floating point
Here is a floating point converter that shows the bit level data
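A quick C++20 check of the two bit patterns mentioned above (my example):

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    // 5.0f is 0x40a00000 as an IEEE-754 float; 5u is 0x00000005.
    std::printf("%08x\n", (unsigned) std::bit_cast<uint32_t>(5.0f));
    std::printf("%08x\n", 5u);
}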
The compiler will take care of this part; for example, on Intel architectures with SSE FPU support, GCC defines some operations in "emmintrin.h".
A simple way to figure it out is to just compile a small C program with assembly output; you will get something like:
movss -4(%rbp), %xmm0
cvtps2pd %xmm0, %xmm0
movsd .LC1(%rip), %xmm1
addsd %xmm1, %xmm0
cvttsd2si %xmm0, %eax
The FPU-related instructions (cvtps2pd/cvttsd2si) are used here; exactly which ones depends on the target machine.
For an integer that is never expected to take negative values, one could use unsigned int or int.
From a compiler perspective, or purely in terms of CPU cycles, is there any difference on x86_64?
It depends. It might go either way, depending on what you are doing with that int as well as on the properties of the underlying hardware.
An obvious example in unsigned ints' favor is integer division. In C/C++, integer division is supposed to round towards zero, while the cheap "optimized" replacements for division by a power of two (arithmetic shifts) round towards negative infinity. So, in order to satisfy the standard's requirements, the compiler is forced to adjust signed integer division results with additional machine instructions. With unsigned integer division this problem does not arise, which is why division by a power of two is generally much faster for unsigned types than for signed types.
For example, consider this simple expression
rand() / 2
The code generated for this expression by the MSVC compiler will generally look as follows:
call rand
cdq
sub eax,edx
sar eax,1
Note that instead of a single shift instruction (sar) we see a whole bunch of instructions here: our sar is preceded by two extra instructions (cdq and sub). cdq copies the sign bit of eax into every bit of edx (so edx is 0 or -1), and sub eax,edx then adds 1 to negative values, so that the following arithmetic shift rounds toward zero instead of toward negative infinity. These extra instructions are there just to "adjust" the division in order to force it to generate the "correct" (from the C language's point of view) result. The compiler does not know that your value will always be positive, so it has to generate these instructions unconditionally. They will never do anything useful, thus wasting CPU cycles.
Now take a look at the code for
(unsigned) rand() / 2
It is just
call rand
shr eax,1
In this case a single shift did the trick, thus providing us with astronomically faster code (for the division alone).
On the other hand, when you are mixing integer arithmetic and x87 floating-point arithmetic, signed integer types might work faster, since the x87 instruction set has instructions for loading/storing signed integer values but none for unsigned integer values.
To illustrate this one can use the following simple function
double zero() { return rand(); }
The generated code will generally be very simple
call rand
mov dword ptr [esp],eax
fild dword ptr [esp]
But if we change our function to
double zero() { return (unsigned) rand(); }
the generated code will change to
call rand
test eax,eax
mov dword ptr [esp],eax
fild dword ptr [esp]
jge zero+17h
fadd qword ptr [__real#41f0000000000000 (4020F8h)]
This code is noticeably larger because the x87 instruction set cannot load unsigned integer types, so an extra adjustment is needed after loading the value as signed: the jge skips the fadd when the loaded value was non-negative; otherwise the fadd adds 2^32 (the constant 41f0000000000000 is 2^32 as a double) to correct the value.
There are other contexts and examples that can be used to demonstrate that it works either way. So, again, it all depends. But generally, all this will not matter in the big picture of your program's performance. I generally prefer to use unsigned types to represent unsigned quantities. In my code 99% of integer types are unsigned. But I do it for purely conceptual reasons, not for any performance gains.
Signed types are inherently more optimizable in most cases because the compiler can ignore the possibility of overflow and simplify/rearrange arithmetic in whatever ways it sees fit. On the other hand, unsigned types are inherently safer because the result is always well-defined (even if not to what you naively think it should be).
The one case where unsigned types are better optimizable is when you're writing division/remainder by a power of two. For unsigned types this translates directly to a bitshift and a bitwise AND. For signed types, unless the compiler can establish that the value is known to be positive, it must generate extra code to compensate for the off-by-one issue with negative numbers (according to C, -3/2 is -1, whereas an arithmetic shift, i.e. floor division, gives -2).
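A quick illustration of that off-by-one (my example; right-shifting a negative value is only guaranteed to be an arithmetic shift since C++20, though mainstream compilers have always behaved that way):

static_assert(-3 / 2 == -1);  // C and C++ truncate toward zero
// -3 >> 1 yields -2 on two's complement machines (floor division), which is
// why the compiler must insert an adjustment before using sar for signed /2.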
It will almost certainly make no difference, but occasionally the compiler can play games with the signedness of types in order to shave a couple of cycles; honestly, it's probably a negligible change overall.
For example suppose you have an int x and want to write:
if(x >= 10 && x < 200) { /* ... */ }
You (or better yet, the compiler) can transform this a little to do one less comparison:
if((unsigned int)(x - 10) < 190) { /* ... */ }
This makes the assumption that int is represented in 2's complement, so that if (x - 10) is less than 0 it becomes a huge value when viewed as an unsigned int. For example, on a typical x86 system, (unsigned int)-1 == 0xffffffff, which is clearly bigger than the 190 being tested.
This is micro-optimization at best and best left up to the compiler. Instead, you should write code that expresses what you mean, and if it is too slow, profile and decide where it is really necessary to get clever.
I don't imagine it would make much difference in terms of CPU or the compiler. One possible case would be if it enabled the compiler to know that the number would never be negative and optimize away code.
However it IS useful to a human reading your code so they know the domain of the variable in question.
From the ALU's point of view, adding (or whatever) signed or unsigned values doesn't make any difference, since they're both represented by a group of bits. 0100 + 1011 is always 1111, but you choose whether that is 4 + (-5) = -1 or 4 + 11 = 15.
So I agree with @Mark: you should choose the best data type to help others understand your code.