Compiler optimization on marking an int unsigned? - c++

For an integer that is never expected to take negative values, one could use either unsigned int or int.
From a compiler perspective, or a purely CPU-cycle perspective, is there any difference on x86_64?

It depends. It might go either way, depending on what you are doing with that int as well as on the properties of the underlying hardware.
An obvious example in unsigned int's favor would be the integer division operation. In C/C++ integer division is required to round towards zero, while the cheap shift-based replacements for division by a power of two round towards negative infinity. So, in order to satisfy the standard's requirements, the compiler is forced to adjust signed integer division results with additional machine instructions. In the case of unsigned integer division this problem does not arise, which is why division (by powers of two and by constants in general) typically works much faster for unsigned types than for signed types.
For example, consider this simple expression
rand() / 2
The code generated for this expression by the MSVC compiler will generally look as follows
call rand
cdq
sub eax,edx
sar eax,1
Note that instead of a single shift instruction (sar) we are seeing a whole bunch of instructions here, i.e. our sar is preceded by two extra instructions (cdq and sub). These extra instructions are there just to "adjust" the division in order to force it to generate the "correct" (from the C language point of view) result. Note that the compiler does not know that your value will always be positive, so it has to generate these instructions always, unconditionally. They will never do anything useful, thus wasting CPU cycles.
Now take a look at the code for
(unsigned) rand() / 2
It is just
call rand
shr eax,1
In this case a single shift did the trick, thus providing us with astronomically faster code (for the division alone).
On the other hand, when you are mixing integer arithmetic and FPU floating-point arithmetic, signed integer types might work faster, since the FPU instruction set contains instructions for loading/storing signed integer values but has no instructions for unsigned integer values.
To illustrate this one can use the following simple function
double zero() { return rand(); }
The generated code will generally be very simple
call rand
mov dword ptr [esp],eax
fild dword ptr [esp]
But if we change our function to
double zero() { return (unsigned) rand(); }
the generated code will change to
call rand
test eax,eax
mov dword ptr [esp],eax
fild dword ptr [esp]
jge zero+17h
fadd qword ptr [__real#41f0000000000000 (4020F8h)]
This code is noticeably larger because the FPU instruction set does not work with unsigned integer types, so the extra adjustments are necessary after loading an unsigned value (which is what that conditional fadd does).
There are other contexts and examples that can be used to demonstrate that it works either way. So, again, it all depends. But generally, all this will not matter in the big picture of your program's performance. I generally prefer to use unsigned types to represent unsigned quantities. In my code 99% of integer types are unsigned. But I do it for purely conceptual reasons, not for any performance gains.

Signed types are inherently more optimizable in most cases because the compiler can ignore the possibility of overflow and simplify/rearrange arithmetic in whatever ways it sees fit. On the other hand, unsigned types are inherently safer because the result is always well-defined (even if not to what you naively think it should be).
The one case where unsigned types are better optimizable is when you're writing division/remainder by a power of two. For unsigned types this translates directly to bitshift and bitwise and. For signed types, unless the compiler can establish that the value is known to be positive, it must generate extra code to compensate for the off-by-one issue with negative numbers (according to C, -3/2 is -1, whereas algebraically and by bitwise operations it's -2).
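To make that off-by-one fix-up concrete, here is a minimal sketch (assuming x86-64 and a typical optimizing compiler; the exact instruction choices vary):
// Signed division by a power of two needs a fix-up so that it rounds toward zero;
// unsigned division is a single logical shift.
int div8_signed(int x) { return x / 8; }               // typically an add/test/cmov (or cdq/sub) before the sar
unsigned div8_unsigned(unsigned x) { return x / 8; }   // typically just shr eax, 3
// The fix-up exists because shifting rounds down: in C++, -3 / 2 == -1,
// but -3 >> 1 == -2 on a two's complement machine.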

It will almost certainly make no difference. Occasionally the compiler can play games with the signedness of types in order to shave a couple of cycles, but honestly the overall change is negligible.
For example suppose you have an int x and want to write:
if(x >= 10 && x < 200) { /* ... */ }
You (or better yet, the compiler) can transform this a little to do one less comparison:
if((unsigned int)(x - 10) < 190) { /* ... */ }
This is making an assumption that int is represented in 2's complement, so that if (x - 10) is less than 0 it becomes a huge value when viewed as an unsigned int. For example, on a typical x86 system, (unsigned int)-1 == 0xffffffff which is clearly bigger than the 190 being tested.
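If you want to convince yourself that the two forms are equivalent, a small self-contained check (the test window of -1000..1000 is arbitrary) could look like this:
#include <cassert>
bool in_range_plain(int x)  { return x >= 10 && x < 200; }
bool in_range_tricky(int x) { return (unsigned int)(x - 10) < 190; }
int main() {
    for (int x = -1000; x <= 1000; ++x)
        assert(in_range_plain(x) == in_range_tricky(x)); // the two predicates agree everywhere
}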
This is micro-optimization at best and best left up to the compiler. Instead, write code that expresses what you mean, and if it is too slow, profile and decide where it is really necessary to get clever.

I don't imagine it would make much difference in terms of CPU or the compiler. One possible case would be if it enabled the compiler to know that the number would never be negative and optimize away code.
However it IS useful to a human reading your code so they know the domain of the variable in question.

From the ALU's point of view adding (or whatever) signed or unsigned values doesn't make any difference, since they're both represented by a group of bits. 0100 + 1011 is always 1111, but you choose whether that is 4 + (-5) = -1 or 4 + 11 = 15.
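A tiny sketch of that idea with 8-bit values (assuming the usual two's complement representation): the ALU produces one bit pattern, and only the interpretation differs.
#include <cstdint>
#include <cstdio>
int main() {
    std::uint8_t bits = 0x04 + 0xFB;                 // 0000'0100 + 1111'1011 = 1111'1111
    std::printf("%d\n", (int)(std::int8_t)bits);     // -1  : 4 + (-5) in the signed view
    std::printf("%u\n", (unsigned)bits);             // 255 : 4 + 251 in the unsigned view
}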
So I agree with @Mark, you should choose the best data-type to help others understand your code.


Is there any difference between overflow and implicit conversion, whether technical or bit level (cpu-register-level)?

(I'm a novice, so there may be inaccuracies in what I say)
In my current mental model, an overflow is an arithmetical phenomenon (it occurs when we perform arithmetic operations), and an implicit conversion is an assignment (initialization or not) phenomenon (it occurs when we make assignments in which the right-hand value doesn't fit into the left-hand type).
However, I often see the concepts 'overflow' and 'implicit conversion' used interchangeably, differently from what I expect. For example, this quote from the learncpp team, talking about overflow and 'bit insufficiency' for signed int:
Integer overflow (often called overflow for short) occurs when we try to store a value that is outside the range of the type. Essentially, the number we are trying to store requires more bits to represent than the object has available. In such a case, data is lost because the object doesn’t have enough memory to store everything [1].
and this, talking about overflow for unsigned int :
What happens if we try to store the number 280 (which requires 9 bits to represent) in a 1-byte (8-bit) unsigned integer? The answer is overflow [2]*
and especially this one, who uses 'modulo wrapping':
Here’s another way to think about the same thing. Any number bigger than the largest number representable by the type simply “wraps around” (sometimes called “modulo wrapping”). 255 is in range of a 1-byte integer, so 255 is fine. 256, however, is outside the range, so it wraps around to the value 0. 257 wraps around to the value 1. 280 wraps around to the value 24 [2].
In such cases, it is said that assignments that exceed the limits of the left-hand side lead to overflow, but in this context I would expect the term 'implicit conversion'.
I also see the term overflow used for arithmetic expressions whose result exceeds the limits of the left-hand side.
1 Is there any technical difference between implicit conversion and overflow/underflow?
I think so. In the reference [3] in the section 'Numeric conversions - Integral conversions', for unsigned integer:
[...] the resulting value is the smallest unsigned value equal to the source value modulo 2^n, where n is the number of bits used to represent the destination type [3].
and for signed (bold mine):
If the destination type is signed, the value does not change if the source integer can be represented in the destination type. Otherwise the result is implementation-defined (until C++20) / the unique value of the destination type equal to the source value modulo 2^n, where n is the number of bits used to represent the destination type (since C++20). **(Note that this is different from signed integer arithmetic overflow, which is undefined.)** [3]
If we go to the referenced section (Overflow), we find (bold mine):
Unsigned integer arithmetic is always performed modulo 2^n, where n is the number of bits in that particular integer. [...]
When signed integer arithmetic operation overflows (the result does not fit in the result type), the behavior is undefined [4].
To me, it is clear that overflow is an arithmetic phenomenon and implicit conversion is a phenomenon of assignments whose value does not fit. Is my interpretation accurate?
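To make the distinction I have in mind concrete, a small example (the specific values are arbitrary):
#include <climits>
#include <cstdio>
int main() {
    // Implicit conversion: 300 does not fit in unsigned char, so it is reduced modulo 256.
    unsigned char c = 300;       // c == 44, a well-defined conversion
    std::printf("%d\n", c);
    // Unsigned arithmetic "overflow": defined to wrap modulo 2^n.
    unsigned int u = UINT_MAX;
    ++u;                         // u == 0, by definition
    std::printf("%u\n", u);
    // Signed arithmetic overflow: undefined behaviour, so it is left commented out.
    // int s = INT_MAX; ++s;
}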
2 Is there on bit level (cpu) any difference between implicit conversion and overflow?
I think so also. I'm far from being good at C++, and even more so at assembly, but as an experiment, if we check the output of the code below with MSVC (flag /std:c++20) and MASM (Macro Assembler), especially the flags register, different phenomena occur depending on whether it is an arithmetic operation or an assignment ('implicit conversion').
(I checked the flags register in the debugger of Visual Studio 2022. The assembly below is practically the same as the one from debugging.)
#include <iostream>
#include <limits>
int main(void) {
long long x = std::numeric_limits<long long>::max();
int y = x;      // implicit conversion: the 64-bit value is truncated to 32 bits
long long k = std::numeric_limits<long long>::max();
++k;            // signed arithmetic overflow (undefined behaviour in C++)
}
The output is:
y$ = 32
k$ = 40
x$ = 48
main PROC
$LN3:
sub rsp, 72 ; 00000048H
call static __int64 std::numeric_limits<__int64>::max(void) ; std::numeric_limits<__int64>::max
mov QWORD PTR x$[rsp], rax
mov eax, DWORD PTR x$[rsp]
mov DWORD PTR y$[rsp], eax
call static __int64 std::numeric_limits<__int64>::max(void) ; std::numeric_limits<__int64>::max
mov QWORD PTR k$[rsp], rax
mov rax, QWORD PTR k$[rsp]
inc rax
mov QWORD PTR k$[rsp], rax
xor eax, eax
add rsp, 72 ; 00000048H
ret 0
main ENDP
It can be checked at https://godbolt.org/z/6j6G69bTP
The copy-initialization of y in c++ corresponds to that in MASM:
int y = x;
mov eax, DWORD PTR x$[rsp]
mov DWORD PTR y$[rsp], eax
The mov instruction simply ignores the upper 32 bits of 'x': the DWORD PTR operand reads only its low 32 bits into the 32-bit eax register.
The mov instruction sets neither the overflow flag nor the carry flag.
The increment of k in c++ corresponds to that in MASM:
++k;
mov rax, QWORD PTR k$[rsp]
inc rax
mov QWORD PTR k$[rsp], rax
When the inc statement is executed, the overflow flag (signed overflow) is set to 1.
To me, although you can implement (mov) conversions in different ways, there is a clear difference between conversions using mov variants and arithmetic overflow: arithmetic sets the flags. Is my interpretation accurate?
Notes
*Apparently there's a discussion about the term overflow for unsigned, but that's not what I'm discussing
References
[1] https://www.learncpp.com/cpp-tutorial/signed-integers/
[2] https://www.learncpp.com/cpp-tutorial/unsigned-integers-and-why-to-avoid-them/
[3] https://en.cppreference.com/w/cpp/language/implicit_conversion
[4] https://en.cppreference.com/w/cpp/language/operator_arithmetic#Overflows
Let's try to break it down. We have to start with some more terms.
Ideal Arithmetic
Ideal arithmetic refers to arithmetic as it takes place in mathematics where the involved numbers are true integers with no limit to their size. When implementing arithmetic on a computer, integer types are generally limited in their size and can only represent a limited range of numbers. The arithmetic between these is no longer ideal in the sense that some arithmetic operations can result in values that are not representable in the types you use for them.
Carry out
A carry out occurs when an addition produces a carry out of the most significant bit. In architectures with flags, this commonly causes the carry flag to be set. When calculating with unsigned numbers, the presence of a carry out indicates that the result did not fit into the number of bits of the output register and hence does not represent the ideal arithmetic result.
The carry out is also used in multi-word arithmetic to carry the 1 between the words that make up the result.
Overflow
On a two's complement machine, an integer overflows when the carry out of an addition is not equal to the carry into the final bit. In architectures with flags, this commonly causes the overflow flag to be set. When calculating with signed numbers, the presence of overflow indicates that the result did not fit into the output register and hence does not represent the ideal arithmetic result.
With regards to “the result does not fit,” it's like a carry out for signed arithmetic. However, when using multi-word arithmetic of signed numbers you still need to use the normal carry out to carry the one to the next word.
Some authors call carry out “unsigned overflow” and overflow “signed overflow.” The idea here is that in such a nomenclature, overflow refers to any condition in which the result of an operation is not representable. Other kinds of overflows include floating-point overflow, handled on IEEE-754 machines by saturating to +-Infinity.
Conversion
Conversion refers to taking a value represented by one data type and representing it in another data type. When the data types involved are integer types, this is usually done by extension, truncation, saturation, or reinterpretation (sketched in code below).
extension is used to convert to types with more bits and refers to adding more bits past the most significant bit. For unsigned numbers, zeroes are added (zero extension). For signed numbers, copies of the sign bit are added (sign extension). Extension always preserves the value.
truncation is used to convert to types with fewer bits and refers to removing bits from the most significant end until the desired width is reached. If the value is representable in the new type, it is unchanged; otherwise it is changed as if by modulo reduction.
saturation is used to convert to types with the same number of bits or fewer and works like truncation, except that if the value is not representable, it is replaced by the smallest (if less than 0) or largest (if greater than 0) value of the destination type.
reinterpretation is used to convert between types with the same number of bits and refers to interpreting the bit pattern of the original type as the new type. Values that are representable in the new type are preserved when doing this between signed and unsigned types. (For example, the bit pattern of a non-negative signed 32-bit integer represents the same number when interpreted as an unsigned 32-bit integer.)
An implicit conversion is just a conversion that happens without being explicitly spelled out by the programmer. Some languages (like C) have these, others don't.
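A small C++ sketch of those four flavours (the function names are mine and purely illustrative; C++ expresses them through casts or manual clamping):
#include <cstdint>
#include <algorithm>
std::int32_t  sign_extend(std::int16_t v)   { return v; }                          // value-preserving widening
std::uint32_t zero_extend(std::uint16_t v)  { return v; }                          // value-preserving widening
std::int8_t   truncate8(std::int16_t v)     { return static_cast<std::int8_t>(v); }    // keeps the low 8 bits (modulo 2^8 since C++20)
std::int8_t   saturate8(std::int16_t v)     { return static_cast<std::int8_t>(std::clamp<std::int16_t>(v, -128, 127)); }
std::uint16_t reinterpret16(std::int16_t v) { return static_cast<std::uint16_t>(v); }  // same bit pattern, value mod 2^16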
When an attempt is made to convert from one type to another and the result is not representable, some authors also refer to this situation as “overflow,” as with “signed overflow” and “unsigned overflow.” It is however a different phenomenon, caused by a change in bit width rather than by arithmetic. So yes, your interpretation is accurate. These are two separate phenomena related through the common idea of “the resulting value doesn't fit the type.”
To see how the two are interlinked, you may also interpret addition of two n bit numbers as resulting in a temporary n + 1 bit number such that the addition is always ideal. Then, the result is truncated to n bit and stored in the result register. If the result is not representable, then either carry out or overflow occurred, depending on the desired signedness. The carry out bit is then exactly the most significant bit of the temporary result that is then discarded to reach the final result.
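A sketch of that (n+1)-bit view for 32-bit unsigned addition (arbitrary example values); on x86 the same discarded bit is what ends up in the carry flag:
#include <cstdint>
#include <cstdio>
int main() {
    std::uint32_t a = 0xFFFFFFFFu, b = 1u;
    std::uint64_t wide  = std::uint64_t(a) + b;   // the "ideal" result, wide enough to hold the extra bit
    std::uint32_t sum   = std::uint32_t(wide);    // truncated result, what the register holds
    int           carry = int((wide >> 32) & 1u); // the discarded most significant bit
    std::printf("sum = %u, carry out = %d\n", sum, carry); // sum = 0, carry out = 1
}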
Question 2
To me, although you can implement (mov) conversions in different ways, there is a clear difference between conversions using mov variants and arithmetic overflow: arithmetic sets the flags. Is my interpretation accurate?
The interpretation is not correct and the presence of flags is a red herring. There are both architectures where data moves set flags (e.g. ARMv6-M) and architectures where arithmetic doesn't set flags (e.g. x86 when using the lea instruction to perform it) or that do not even have flags (e.g. RISC-V).
Note also that a conversion (implicit or not) does not necessarily result in an instruction. Sign extension and saturation usually do, but zero extension is often implemented by just ensuring that the top part of a register is clear, which the CPU may be able to do as a side effect of other operations you want to perform anyway. Truncation may be implemented by just ignoring the top part of the register. Reinterpretation, by its nature, generally does not generate any code either.
As for carry out and overflow, whether these occur depends on the values you perform arithmetic with. They are things that just happen, and unless you want to detect that they happened, no code is needed for them. It's simply the default behaviour.

overflow instead of saturation on 16bit add AVX2

I want to add 2 unsigned vectors using AVX2
__m256i i1 = _mm256_loadu_si256((__m256i *) si1);
__m256i i2 = _mm256_loadu_si256((__m256i *) si2);
__m256i result = _mm256_adds_epu16(i2, i1);
However, I need wrap-around (overflow) instead of the saturation that _mm256_adds_epu16 does, so that the result is identical to the non-vectorized code. Is there any solution for that?
Use normal binary wrapping _mm256_add_epi16 instead of saturating adds.
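A minimal sketch of the wrapping version, assuming si1 and si2 each point to at least 16 uint16_t elements and dst has room for 16 results:
#include <immintrin.h>
#include <cstdint>
// Wrapping (modulo 2^16) addition of 16 uint16_t lanes; matches scalar a[i] + b[i]
// with unsigned wrap-around, unlike the saturating _mm256_adds_epu16.
void add_u16_wrap(const std::uint16_t *si1, const std::uint16_t *si2, std::uint16_t *dst) {
    __m256i i1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(si1));
    __m256i i2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(si2));
    __m256i result = _mm256_add_epi16(i2, i1); // vpaddw: the same instruction for signed and unsigned
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst), result);
}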
Two's complement and unsigned addition/subtraction are the same binary operation, that's one of the reasons modern computers use two's complement. As the asm manual entry for vpaddw mentions, the instructions can be used on signed or unsigned integers. (The intrinsics guide entry doesn't mention signedness at all, so is less helpful at clearing up this confusion.)
Compares like _mm_cmpgt_epi32 are sensitive to signedness, but math operations (and cmpeq) aren't.
The intrinsic names Intel chose might look like they're for signed integers specifically, but epu implies a specifically unsigned operation, while epi (or si) is used both for specifically signed operations and for things that work equally on signed or unsigned elements, or where signedness is irrelevant.
For example, _mm_and_si128 is pure bitwise. _mm_srli_epi32 is a logical right shift, shifting in zeros, like an unsigned C shift. Not copies of the sign bit, that's _mm_srai_epi32 (shift right arithmetic by immediate). Shuffles like _mm_shuffle_epi32 just move data around in chunks.
Non-widening multiplication like _mm_mullo_epi16 and _mm_mullo_epi32 are also the same for signed or unsigned. Only the high-half _mm_mulhi_epu16 or widening multiplies _mm_mul_epu32 have unsigned forms as counterparts to their specifically signed epi16/32 forms.
That's also why 386 only added a scalar integer imul ecx, esi form, not also a mul ecx, esi, because only the FLAGS setting would differ, not the integer result. And SIMD operations don't even have FLAGS outputs.
The intrinsics guide unhelpfully describes _mm_mullo_epi16 as sign-extending and producing a 32-bit product, then truncating to the low 16 bits. The asm manual for pmullw also describes it as signed in the same way, seemingly treating it as the companion to the signed pmulhw. (And it has some bugs, like describing the AVX1 VPMULLW xmm1, xmm2, xmm3/m128 form as multiplying 32-bit dword elements, probably a copy/paste error from pmulld.)
And sometimes Intel's naming scheme is limited, like _mm_maddubs_epi16 is a u8 x i8 => 16-bit widening multiply, adding pairs horizontally (with signed saturation). I usually have to look up the intrinsic for pmaddubsw to remind myself that they named it after the output element width, not the inputs. The inputs have different signedness, so if they have to pick one side, I guess it makes sense to name it for the output, with the signed saturation that can happen for some inputs, like for pmaddwd.

Computational complexity for casting int to unsigned vs complexity for comparing values

I just wanted to know which operation is faster in C/C++, as well as what the computational complexity for a type cast is.
Typecasting x to an unsigned integer like so:
(unsigned int) x
or
Performing a comparison between x and a constant:
x<0
edit: computational complexity as in which process requires the least amount of bit operations on a low level aspect of the hardware in order to successfully carry out the instruction.
edit #2: So to give some context, what I'm specifically trying to do is to see whether reducing
if( x < 0)
into
if((((unsigned int)x)>>(((sizeof(int))<<3)-1)))
would be more efficient or not if done over 100,000,000 times, with large quantities for x, above/below (+/-)50,000,000
(unsigned int) x is - for the near-universal two's complement representation - a compile-time operation: you're telling the compiler to treat the content of x as an unsigned value. That doesn't require any runtime machine-code instructions in and of itself, but it may change the machine code the compiler emits for later uses of the unsigned value, or even enable dead-code elimination. For example, the following could be eliminated completely after the cast:
if ((unsigned int)my_unsigned_int >= 0)
The relevant C++ Standard quote (my boldfacing):
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two's complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). —end note ]
There could be an actual bitwise change requiring an operation on some bizarre hardware using 1's complement or sign/magnitude representations. (Thanks Yuushi for highlighting this in comments).
That contrasts with x < 0, which - for a signed x about which the compiler has no special knowledge - does require a CPU/machine-code instruction to evaluate (if the result is used) and corresponding runtime. That comparison instruction tends to take 1 "cycle" even on older CPUs, but do keep in mind that modern CPU pipelines can execute many such instructions in parallel during a single cycle.
if( x < 0) vs if((((unsigned int)x)>>(((sizeof(int))<<3)-1))) - faster?
The first will always be at least as fast as the second. A comparison to zero is a bread-and-butter operation for the CPU, and the C++ compiler's certain to use an efficient opcode (machine code instruction) for it: you're wasting your time trying to improve on that.
The monster if((((unsigned int)x)>>(((sizeof(int))<<3)-1))) will be slower than the straightforward if(x < 0): both versions need to compare a value against zero, but the monster adds a shift before the comparison can take place.
To answer your actual edited question, it is unlikely to be faster. In the best case, if x is known at compile time, the compiler will be able to optimize out the branch completely in both cases.
If x is a run-time value, then the first will produce a single test instruction. The second will likely produce a shift-right immediate followed by a test instruction.
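As a sketch (not authoritative compiler output), writing the two forms as separate functions makes them easy to compare on a compiler explorer:
// Sign test written two ways; with optimization a typical x86-64 compiler emits a
// single test (or compare) for the first and a single shift for the second.
bool is_negative_cmp(int x)   { return x < 0; }
bool is_negative_shift(int x) { return (unsigned int)x >> (sizeof(int) * 8 - 1); }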

performance of unsigned vs signed integers

Is there any performance gain/loss by using unsigned integers over signed integers?
If so, does this go for short and long as well?
Division by powers of 2 is faster with unsigned int, because it can be optimized into a single shift instruction. With signed int, it usually requires more machine instructions, because division rounds towards zero, but shifting to the right rounds down. Example:
int foo(int x, unsigned y)
{
x /= 8;
y /= 8;
return x + y;
}
Here is the relevant x part (signed division):
movl 8(%ebp), %eax
leal 7(%eax), %edx
testl %eax, %eax
cmovs %edx, %eax
sarl $3, %eax
And here is the relevant y part (unsigned division):
movl 12(%ebp), %edx
shrl $3, %edx
In C++ (and C), signed integer overflow is undefined, whereas unsigned integer overflow is defined to wrap around. Notice that e.g. in gcc, you can use the -fwrapv flag to make signed overflow defined (to wrap around).
Undefined signed integer overflow allows the compiler to assume that overflows don't happen, which may introduce optimization opportunities. See e.g. this blog post for discussion.
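A classic illustration of what that assumption buys the compiler (the function names are just for the example):
// With signed int the compiler may fold this to "return true": x + 1 can only fail
// to be greater than x if the addition overflows, which it is allowed to assume
// never happens.
bool always_true(int x)          { return x + 1 > x; }
// With unsigned int wrap-around is defined, so the comparison must really be done:
// for x == UINT_MAX the result is false.
bool not_always_true(unsigned x) { return x + 1 > x; }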
unsigned leads to the same or better performance than signed.
Some examples:
Division by a constant which is a power of 2 (see also the answer from FredOverflow)
Division by a constant number (for example, my compiler implements division by 13 using 2 asm instructions for unsigned, and 6 instructions for signed)
Checking whether a number is even (I have no idea why my MS Visual Studio compiler implements it with 4 instructions for signed numbers; gcc does it with 1 instruction, just like in the unsigned case) - see the sketch below
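As a sketch of that last point (compilers differ; this only shows the shape of the source):
// Unsigned: x % 2 is just x & 1.
bool is_even_unsigned(unsigned x) { return x % 2 == 0; }
// Signed: the remainder must carry the sign of x (-3 % 2 == -1), so a plain "and" is
// not enough in general; some compilers emit fix-up instructions here, others still
// reduce the == 0 test to a single "and"/test.
bool is_even_signed(int x)        { return x % 2 == 0; }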
short usually leads to the same or worse performance than int (assuming sizeof(short) < sizeof(int)). Performance degradation happens when you assign a result of an arithmetic operation (which is usually int, never short) to a variable of type short, which is stored in the processor's register (which is also of type int). All the conversions from short to int take time and are annoying.
Note: some DSPs have fast multiplication instructions for the signed short type; in this specific case short is faster than int.
As for the difference between int and long, I can only guess (I am not familiar with 64-bit architectures). Of course, if int and long have the same size (on 32-bit platforms), their performance is also the same.
A very important addition, pointed out by several people:
What really matters for most applications is the memory footprint and utilized bandwidth. You should use the smallest necessary integers (short, maybe even signed/unsigned char) for large arrays.
This will give better performance, but the gain is nonlinear (i.e. not by a factor of 2 or 4) and somewhat unpredictable - it depends on cache size and the relationship between calculations and memory transfers in your application.
This will depend on exact implementation. In most cases there will be no difference however. If you really care you have to try all the variants you consider and measure performance.
This is pretty much dependent on the specific processor.
On most processors, there are instructions for both signed and unsigned arithmetic, so the difference between using signed and unsigned integers comes down to which one the compiler uses.
If either of the two is faster, it's completely processor specific, and most likely the difference is minuscule, if it exists at all.
The performance difference between signed and unsigned integers is actually more general than the accepted answer suggests. Division of an unsigned integer by any constant can be made faster than division of a signed integer by a constant, regardless of whether the constant is a power of two. See http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
At the end of his post, he includes the following section:
A natural question is whether the same optimization could improve signed division; unfortunately it appears that it does not, for two reasons:
The increment of the dividend must become an increase in the magnitude, i.e. increment if n > 0, decrement if n < 0. This introduces an additional expense.
The penalty for an uncooperative divisor is only about half as much in signed division, leaving a smaller window for improvements.
Thus it appears that the round-down algorithm could be made to work in signed division, but will underperform the standard round-up algorithm.
Not only is division by powers of 2 faster with unsigned types; division by other values is also faster with unsigned types. If you look at Agner Fog's instruction tables you'll see that unsigned divisions have similar or better performance than the signed versions.
For example with the AMD K7
Instruction   Operands   Ops   Latency   Reciprocal throughput
DIV           r8/m8       32        24                      23
DIV           r16/m16     47        24                      23
DIV           r32/m32     79        40                      40
IDIV          r8          41        17                      17
IDIV          r16         56        25                      25
IDIV          r32         88        41                      41
IDIV          m8          42        17                      17
IDIV          m16         57        25                      25
IDIV          m32         89        41                      41
The same thing applies to Intel Pentium
Instruction   Operands   Clock cycles
DIV           r8/m8                17
DIV           r16/m16              25
DIV           r32/m32              41
IDIV          r8/m8                22
IDIV          r16/m16              30
IDIV          r32/m32              46
Of course those are quite ancient. Newer architectures with more transistors might close the gap, but the basic point applies: you generally need more micro-ops, more logic, and more latency to do a signed division.
In short, don't bother before the fact. But do bother after.
If you want performance, you have to use the compiler's optimizations, which may work against common sense. One thing to remember is that different compilers compile code differently and have different sorts of optimizations. If we're talking about g++ and maxing out its optimization level with -Ofast, or at least -O3, in my experience it can compile the long type into code with even better performance than any unsigned type, or even plain int.
This is from my own experience, and I recommend you first write your full program and only care about such things afterwards, when you have the actual code in your hands and can compile it with optimizations to try and pick the types that actually perform best. This is also good general advice about optimizing for performance: write quickly first, try compiling with optimizations, and tweak things to see what works best. You should also try compiling your program with different compilers and choosing the one that outputs the most performant machine code.
An optimized multi-threaded linear algebra calculation program can easily have a >10x performance difference finely optimized vs unoptimized. So this does matter.
Optimizer output contradicts logic in plenty of cases. For example, I had a case when a difference between a[x]+=b and a[x]=b changed program execution time almost 2x. And no, a[x]=b wasn't the faster one.
Here's for example NVidia stating that for programming their GPUs:
Note: As was already the recommended best practice, signed arithmetic should be preferred over unsigned arithmetic wherever possible for best throughput on SMM. The C language standard places more restrictions on overflow behavior for unsigned math, limiting compiler optimization opportunities.
Traditionally int is the native integer format of the target hardware platform. Any other integer type may incur performance penalties.
EDIT:
Things are slightly different on modern systems:
int may in fact be 32-bit on 64-bit systems for compatibility reasons. I believe this happens on Windows systems.
Modern compilers may implicitly use int when performing computations for shorter types in some cases.
IIRC, on x86 signed/unsigned shouldn't make any difference. Short/long, on the other hand, is a different story, since the amount of data that has to be moved to/from RAM is bigger for longs (other reasons may include cast operations like extending a short to long).
Signed and unsigned integers will always both operate as single-clock instructions and have the same read/write performance, but according to Dr. Andrei Alexandrescu unsigned is preferred over signed. The reason for this is you can fit twice as many numbers in the same number of bits because you're not wasting the sign bit, and you will use fewer instructions checking for negative numbers, yielding performance increases from the decreased ROM. In my experience with the Kabuki VM, which features an ultra-high-performance script implementation, it is rare that you actually require a signed number when working with memory. I've spent many years doing pointer arithmetic with signed and unsigned numbers and I've found no benefit to signed when no sign bit is needed.
Where signed may be preferred is when using bit shifting to perform multiplication and division by powers of 2, because with signed two's complement integers you can also handle negative values. Please see some more YouTube videos from Andrei for more optimization techniques. You can also find some good info in my article about the world's fastest integer-to-string conversion algorithm.
Unsigned integers are advantageous in that you can store and treat them as a plain bitstream, i.e. just data without a sign, so multiplication and division become easier (faster) with bit-shift operations.

Causing a divide overflow error (x86)

I have a few questions about divide overflow errors on x86 or x86_64 architecture. Lately I've been reading about integer overflows. Usually, when an arithmetic operation results in an integer overflow, the carry bit or overflow bit in the FLAGS register is set. But apparently, according to this article, overflows resulting from division operations don't set the overflow bit, but rather trigger a hardware exception, similar to when you divide by zero.
Now, integer overflows resulting from division are a lot more rare than say, multiplication. There's only a few ways to even trigger a division overflow. One way would be to do something like:
int16_t a = -32768;
int16_t b = -1;
int16_t c = a / b;
In this case, due to the two's complement representation of signed integers, you can't represent positive 32768 in a signed 16-bit integer, so the division operation overflows, resulting in the erroneous value of -32768.
A few questions:
1) Contrary to what this article says, the above did NOT cause a hardware exception. I'm using an x86_64 machine running Linux, and when I divide by zero the program terminates with a Floating point exception. But when I cause a division overflow, the program continues as usual, silently ignoring the erroneous quotient. So why doesn't this cause a hardware exception?
2) Why are division errors treated so severely by the hardware, as opposed to other arithmetic overflows? Why should a multiplication overflow (which is much more likely to accidentally occur) be silently ignored by the hardware, but a division overflow is supposed to trigger a fatal interrupt?
=========== EDIT ==============
Okay, thanks everyone for the responses. I've gotten responses saying basically that the above 16-bit integer division shouldn't cause a hardware fault because the quotient is still less than the register size. I don't understand this. In this case, the register storing the quotient is 16-bit - which is too small to store signed positive 32768. So why isn't a hardware exception raised?
Okay, let's do this directly in GCC inline assembly and see what happens:
int16_t a = -32768;
int16_t b = -1;
__asm__
(
"xorw %%dx, %%dx;" // Clear the DX register (upper-bits of dividend)
"movw %1, %%ax;" // Load lower bits of dividend into AX
"movw %2, %%bx;" // Load the divisor into BX
"idivw %%bx;" // Divide a / b (quotient is stored in AX)
"movw %%ax, %0;" // Copy the quotient into 'b'
: "=rm"(b) // Output list
:"ir"(a), "rm"(b) // Input list
:"%ax", "%dx", "%bx" // Clobbered registers
);
printf("%d\n", b);
This simply outputs an erroneous value: -32768. Still no hardware exception, even though the register storing the quotient (AX) is too small to fit the quotient. So I don't understand why no hardware fault is raised here.
In C language arithmetic operations are never performed within the types smaller than int. Any time you attempt arithmetic on smaller operands, they are first subjected to integral promotions which convert them to int. If on your platform int is, say, 32-bit wide, then there's no way to force a C program to perform 16-bit division. The compiler will generate 32-bit division instead. This is probably why your C experiment does not produce the expected overflow on division. If your platform does indeed have 32-bit int, then your best bet would be to try the same thing with 32-bit operands (i.e. divide INT_MIN by -1). I'm pretty sure that way you'll be able to eventually reproduce the overflow exception even in C code.
In your assembly code you are using 16-bit division, since you specified BX as the operand for idiv. 16-bit division on x86 divides the 32-bit dividend stored in DX:AX pair by the idiv operand. This is what you are doing in your code. The DX:AX pair is interpreted as one composite 32-bit register, meaning that the sign bit in this pair is now actually the highest-order bit of DX. The highest-order bit of AX is not a sign bit anymore.
And what did you do with DX? You simply cleared it. You set it to 0. But with DX set to 0, your dividend is interpreted as positive! From the machine's point of view, such a DX:AX pair actually represents the positive value +32768. I.e. in your assembly-language experiment you are dividing +32768 by -1. And the result is -32768, as it should be. Nothing unusual here.
If you want to represent -32768 in the DX:AX pair, you have to sign-extend it, i.e. you have to fill DX with all-one bit pattern, instead of zeros. Instead of doing xor DX, DX you should have initialized AX with your -32768 and then done cwd. That would have sign-extended AX into DX.
For example, in my experiment (not GCC) this code
__asm {
mov AX, -32768
cwd
mov BX, -1
idiv BX
}
causes the expected exception, because it does indeed attempt to divide -32768 by -1.
When you get an integer overflow with integer 2's complement add/subtract/multiply you still have a valid result - it's just missing some high order bits. This behaviour is often useful, so it would not be appropriate to generate an exception for this.
With integer division however the result of a divide by zero is useless (since, unlike floating point, 2's complement integers have no INF representation).
Contrary to what this article says, the above did NOT cause a hardware exception
The article did not say that. It says
... they generate a division error if the source operand (divisor) is zero or if the quotient is too large for the designated register
Register size is definitely greater than 16 bits (32 || 64)
From the relevant section on integer overflow:
Unlike the add, mul, and imul instructions, the Intel division instructions div and idiv do not set the overflow flag; they generate a division error if the source operand (divisor) is zero or if the quotient is too large for the designated register.
The size of a register on a modern platform is either 32 or 64 bits; 32768 will fit into one of those registers. However, the following code will very likely throw an integer overflow exception (it does on my Core Duo laptop with VC8):
int x= INT_MIN;
int y= -1;
int z= x/y;
The reason your example did not generate a hardware exception is due to C's integer promotion rules. Operands smaller than int get automatically promoted to ints before the operation is performed.
As to why different kinds of overflows are handled differently, consider that at the x86 machine level, there's really no such thing as a multiplication overflow. When you multiply AX by some other register, the result goes in the DX:AX pair, so there is always room for the result, and thus no occasion to signal an overflow exception. However, in C and other languages, the product of two ints is supposed to fit in an int, so there is such a thing as overflow at the C level. The x86 does sometimes set OF (overflow flag) on MUL, but it just means that the high part of the result is non-zero.
On an implementation with 32-bit int, your example does not result in a divide overflow. It results in a perfectly representable int, 32768, which then gets converted to int16_t in an implementation-defined manner when you make the assignment. This is due to the default promotions specified by the C language, and as a result, an implementation which raised an exception here would be non-conformant.
If you want to try to cause an exception (which still may or may not actually happen, it's up to the implementation), try:
int a = INT_MIN, b = -1, c = a/b;
You might have to do some tricks to prevent the compiler from optimizing it out at compile-time.
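One such trick (a sketch, not the only way) is to force the operands through volatile so the division cannot be folded at compile time:
#include <limits.h>
int main(void) {
    volatile int a = INT_MIN;  // volatile keeps the compiler from evaluating a/b at compile time
    volatile int b = -1;
    volatile int c = a / b;    // on x86/x86_64 this typically raises #DE (seen as SIGFPE on Linux)
    (void)c;
    return 0;
}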
I would conjecture that on some old computers, attempting to divide by zero would cause severe problems (e.g. put the hardware into an endless cycle of subtracting the divisor, waiting for the remainder to become smaller than it, until an operator came along to fix things), and this started a tradition of divide overflows being regarded as more severe faults than integer overflows.
From a programming standpoint, there's no reason that an unexpected divide overflow should be any more or less serious than an unexpected integer overflow (signed or unsigned). Given the cost of division, the marginal cost of checking an overflow flag afterward would be pretty slight. Tradition is the only reason I can see for having a hardware trap.