Convert int32_t to unsigned char. AVX - C++

I need to convert a YMM register holding 8 int32_t values into an XMM register with 8 uint8_t values in its low 8 bytes, using AVX intrinsics. It should be the analogue of static_cast<uint8_t>, i.e. the C++ standard rules apply (modular reduction), so we get truncation of the 2's complement bit pattern.
For example, (int32_t)(-1) -> (uint8_t)(255), and +200 -> (uint8_t)(200), so we can't use signed or unsigned saturation to 8-bit (or even to 16-bit as an intermediate step).
I have this code as the example:
packssdw xmm0, xmm0
packuswb xmm0, xmm0
movd somewhere, xmm0
But these instructions saturate (packssdw with signed saturation to 16-bit, then packuswb with unsigned saturation to 8-bit), so we get (int32_t)(-1) -> (uint8_t)(0).
I know vcvttss2si and it works correctly, but only for one value at a time; for best performance I want to use vector registers.
I also know about shuffling, but it's too slow for my case.
So is there another way to convert int32_t in a YMM to uint8_t, behaving like static_cast<uint8_t>?
UPD: The comment by @chtz answers my question; a sketch of the idea follows.
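Here is a minimal C++ sketch of that idea (my reading of the comment; the function name is mine, and it assumes AVX2, which YMM integer operations require): mask each 32-bit element down to its low byte first, so the saturating pack instructions can never actually clamp and become equivalent to static_cast<uint8_t>:
#include <immintrin.h>

// Truncate 8 int32_t in a YMM to 8 uint8_t in the low qword of an XMM.
static inline __m128i truncate_epi32_to_epu8(__m256i v)
{
    // Keep only the low byte of each element: -1 becomes 0x000000FF = 255.
    __m256i masked = _mm256_and_si256(v, _mm256_set1_epi32(0xFF));
    // Values are now 0..255, so the saturating packs can't clamp anything.
    __m256i words = _mm256_packus_epi32(masked, masked); // 32-bit -> 16-bit
    __m256i bytes = _mm256_packus_epi16(words, words);   // 16-bit -> 8-bit
    // The packs work within 128-bit lanes: bytes of elements 0..3 end up in
    // lane 0, bytes of elements 4..7 in lane 1. Merge them into one XMM.
    __m128i lo = _mm256_castsi256_si128(bytes);
    __m128i hi = _mm256_extracti128_si256(bytes, 1);
    return _mm_unpacklo_epi32(lo, hi);
}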

Related

Overflow instead of saturation on 16-bit add with AVX2

I want to add two unsigned vectors using AVX2:
__m256i i1 = _mm256_loadu_si256((__m256i *) si1);
__m256i i2 = _mm256_loadu_si256((__m256i *) si2);
__m256i result = _mm256_adds_epu16(i2, i1);
However, I need overflow (wrapping) instead of the saturation that _mm256_adds_epu16 does, to be identical to the non-vectorized code. Is there any solution for that?
Use normal binary wrapping _mm256_add_epi16 instead of saturating adds.
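As a quick demo (toy values of my choosing), the wrapping add matches plain uint16_t arithmetic while the saturating add clamps:
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    // 0xFFFF + 2 wraps to 1 as uint16_t; the saturating add clamps to 0xFFFF.
    __m256i a = _mm256_set1_epi16((short)0xFFFF);
    __m256i b = _mm256_set1_epi16(2);
    uint16_t wrap = (uint16_t)_mm256_extract_epi16(_mm256_add_epi16(a, b), 0);
    uint16_t sat  = (uint16_t)_mm256_extract_epi16(_mm256_adds_epu16(a, b), 0);
    std::printf("wrapping: %#x  saturating: %#x\n", wrap, sat); // 0x1  0xffff
}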
Two's complement and unsigned addition/subtraction are the same binary operation, that's one of the reasons modern computers use two's complement. As the asm manual entry for vpaddw mentions, the instructions can be used on signed or unsigned integers. (The intrinsics guide entry doesn't mention signedness at all, so is less helpful at clearing up this confusion.)
Compares like _mm_cmpgt_epi32 are sensitive to signedness, but math operations (and cmpeq) aren't.
The intrinsic names Intel chose might look like they're for signed integers specifically, but they always use epi or si for things that work equally on signed and unsigned elements. epu implies a specifically unsigned operation, while epi covers specifically signed operations as well as things that work equally on signed or unsigned elements, or where signedness is irrelevant.
For example, _mm_and_si128 is pure bitwise. _mm_srli_epi32 is a logical right shift, shifting in zeros, like an unsigned C shift. Not copies of the sign bit, that's _mm_srai_epi32 (shift right arithmetic by immediate). Shuffles like _mm_shuffle_epi32 just move data around in chunks.
Non-widening multiplies like _mm_mullo_epi16 and _mm_mullo_epi32 are also the same for signed or unsigned. Only the high-half _mm_mulhi_epu16 or widening _mm_mul_epu32 multiplies have unsigned forms as counterparts to their specifically signed epi16/32 forms.
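A quick check (again with my own toy values) that the low-half multiply produces the bit pattern both interpretations expect:
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    // Signed view: -1 * 3 = -3. Unsigned view: 0xFFFF * 3 = 0x2FFFD, whose
    // low 16 bits are 0xFFFD. Same bit pattern, so one pmullw serves both.
    __m128i a = _mm_set1_epi16(-1);
    __m128i b = _mm_set1_epi16(3);
    uint16_t lo = (uint16_t)_mm_extract_epi16(_mm_mullo_epi16(a, b), 0);
    std::printf("%#x\n", lo); // 0xfffd, i.e. (uint16_t)-3
}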
That's also why 386 only added a scalar integer imul ecx, esi form, not also a mul ecx, esi, because only the FLAGS setting would differ, not the integer result. And SIMD operations don't even have FLAGS outputs.
The intrinsics guide unhelpfully describes _mm_mullo_epi16 as sign-extending and producing a 32-bit product, then truncating to the low 16 bits. The asm manual for pmullw also describes it as signed that way, seemingly treating it as the companion to signed pmulhw. (And it has some bugs, like describing the AVX1 VPMULLW xmm1, xmm2, xmm3/m128 form as multiplying 32-bit dword elements, probably a copy/paste error from pmulld.)
And sometimes Intel's naming scheme is limited: _mm_maddubs_epi16 is a u8 x i8 => 16-bit widening multiply, adding pairs horizontally (with signed saturation). I usually have to look up the intrinsic for pmaddubsw to remind myself that they named it after the output element width, not the inputs. The inputs have different signedness, so if they have to pick one side, I guess it makes sense to name it for the output, with the signed saturation that can happen for some inputs, as with pmaddwd.

C/C++ intrinsic for assembly VMOVD

How (in C/C++) do I load a 32-bit integer into the low 32 bits of an SSE register, leaving the rest undefined? I mean something like vmovd xmm0, eax with the same efficiency.
You are probably looking for the intrinsic _mm_cvtsi32_si128(int a). It copies the 32-bit integer into the low 32 bits; the upper bits are set to zero (which is what vmovd does anyway, so the zeroing costs nothing extra).
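A short usage sketch (the wrapper name is mine); compilers typically emit a single (v)movd for this:
#include <immintrin.h>
#include <cstdint>

// Load a 32-bit integer into the low lane of an XMM register.
static inline __m128i load_low32(int32_t x)
{
    return _mm_cvtsi32_si128(x); // compiles to (v)movd xmm, r32
}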

What bit-level changes are made when typecasting a 32-bit float variable to a 32-bit unsigned integer?

Suppose I want to typecast 32-bit float data to a 32-bit unsigned integer.
float foo = 5.0f;
uint32 goo;
goo = (uint32)foo;
How does the compiler typecast the variable? Are there any intermediate steps, and if so, what are they?
This is going to depend on the hardware to a large degree (though things like the OS/compiler can and will change how the hardware is used).
For example, on older Intel processors (and current ones, in most 32-bit code) you use the x87 instruction set for (most) floating point. In this case, there's an fistp (floating point integer store and pop, though there's also a non-popping variety, in case you also need to continue using that floating point value) that supports storing to an integer, so that's what's typically used. There's a bit in the floating point control word that controls how that conversion will be done (rounding versus truncating) that has to be set correctly as well.
On current hardware (with a current compiler producing 64-bit code) you're typically going to be using SSE instructions, in which case conversion to a signed int can be done with cvtss2si.
Neither of these directly supports unsigned operands though (at least I'm sure SSE doesn't, and to the best of my recollection x87 doesn't either). For these, the compiler probably has a small function in the library, so it's done (at least partly) in software.
If your compiler supports generating AVX-512 instructions, it could use VCVTTSS2USI to convert from 32-bit float to a 64-bit unsigned integer, then store the bottom 32 bits of that integer to your destination (and if the result was negative or too large to fit in a 32-bit integer, well... the standard says that gives undefined behavior, so you got what you deserved).
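As an illustration of what such a library helper might look like when only a signed 32-bit convert is available (this is my own sketch, not any particular compiler's runtime code):
#include <cstdint>

uint32_t float_to_u32(float f)
{
    if (f < 2147483648.0f)           // value fits in the int32 range
        return (uint32_t)(int32_t)f; // a single cvttss2si does the job
    // Rebase by 2^31 so the signed convert can handle it, then add it back
    // in the integer domain. (Negative or out-of-range inputs are UB in the
    // source program anyway, so we don't try to define them here.)
    return (uint32_t)(int32_t)(f - 2147483648.0f) + 0x80000000u;
}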
Floating point and unsigned numbers are stored in completely different formats, so the compiler must change from one to the other, as opposed to casting from unsigned to signed where the format of the number is the same and the software just changes how the number is interpreted.
For example, the number 5 is represented as 0x40a00000 as a 32-bit float, and as 0x00000005 as a 32-bit unsigned integer.
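You can see both representations side by side with a few lines (a throwaway example of mine):
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = 5.0f;
    uint32_t converted = (uint32_t)f;    // value conversion: 5
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // reinterpret the bytes: 0x40a00000
    std::printf("%#010x vs %#010x\n", bits, converted);
}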
This conversion ends up being significant when you're working with weaker microcontrollers; on anything with an FPU it's just a few extra instructions.
Here is the wikipedia write up on floating point
Here is a floating point converter that shows the bit level data
The compiler will take care of this part; for example, on Intel architectures with SSE FPU support, GCC defines some operations in "emmintrin.h".
A simple way to figure it out is to compile a small C program with assembly output; you will get something like:
movss -4(%rbp), %xmm0
cvtps2pd %xmm0, %xmm0
movsd .LC1(%rip), %xmm1
addsd %xmm1, %xmm0
cvttsd2si %xmm0, %eax
The FPU-related instructions (cvtps2pd/cvttsd2si) are used here; exactly which ones depends on the target machine.
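For reference, a tiny program along these lines (my reconstruction; any expression mixing a float with a double constant will do) produces that shape of output when compiled with -S:
#include <cstdio>

int main()
{
    float foo = 5.0f;
    // foo is widened to double (cvtps2pd), the double constant 0.5 is added
    // (addsd), and the sum is truncated to int (cvttsd2si).
    int goo = (int)(foo + 0.5);
    std::printf("%d\n", goo);
}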

Manage float arrays in x86 assembly

I'm trying to find the two smallest numbers in an array of floats in x86, but I don't know how to access the elements of the array.
To access an integer I use:
mov eax, 0 ; eax is a counter
mov ecx, vector ; Put the address of the vector in ecx
mov esi, [ecx+4*eax] ; Load element eax of the vector into esi
This works with an array of integers, but not with floats; I don't know how to do it. I tried using 8 instead of 4, but that doesn't work either.
EDIT: When I say that it doesn't work, I mean that the values are not read correctly; the number stored in ESI is 1099956224 (0x41900000), which is not correct.
Thanks!
You can only compare floating point numbers if they are in floating point registers. Trying to interpret a floating point number in memory as an integer is utterly pointless.
Read up on floating point support in x86. It's a whole different instruction set; historically, it was handled by a separate chip, one with an '87 model number. Specifically, to load floats from memory use the FLD instruction; to compare, FCOM*; to conditionally assign, FCMOV. Also, keep in mind the stack-based nature of the floating point subsystem.
EDIT: alternatively, you can use SSE scalar instructions. On a modern CPU, it may perform better than the legacy x87 instructions. Use MOVSS to load floats from memory and to copy between XMM registers, use COMISS to compare and set EFLAGS.
Finally, as Aki points out, floats in IEEE-754 format sort lexicographically: among two valid non-negative floats, the bit pattern of the larger one represents a larger integer, too. In real life, that's something you can leverage.
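A small sketch of that trick (only valid for non-negative, non-NaN floats as written; the names are mine):
#include <cstdint>
#include <cstring>

// Compare two non-negative IEEE-754 floats by their raw bit patterns.
bool less_by_bits(float a, float b)
{
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua < ub; // same order as (a < b) when a, b >= 0
}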

Equivalent of De Bruijn LSB, but for MSB

Does anyone know of an algorithm similar to De Bruijn's LSB, but for MSB? Or alternatively, the most efficient way of determining the MSB?
I know Log_2(Val) will do this, but I don't know if it's the most efficient method.
The reason I need it is I need to convert little-endian to big-endian. I know the standard algorithm for this. However, the input is 64 bit, but typically the numbers will be 16 or 24 bit, so swapping the whole 8 bytes around is unneeded 99.9% of the time.
Isn't this exactly http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn ?
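That page's multiply-and-lookup version looks roughly like this (reproduced from memory, so verify the table and constant against the linked page before relying on them):
#include <cstdint>

// MSB index (integer log2) of a nonzero 32-bit value, branch-free.
unsigned msb_debruijn(uint32_t v)
{
    static const unsigned table[32] = {
        0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
        8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
    };
    // Smear the MSB downward so v becomes 2^(msb+1) - 1.
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
    return table[(v * 0x07C4ACDDu) >> 27];
}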
If you want a fast method and are able/willing to use hardware-specific instructions, you should take a look at the x86 BSR (Bit Scan Reverse) instruction or similar.
Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand).
xor edx, edx
mov eax, 0x12345678
bsr edx, eax
It should store 28 in edx (bit 28 is the MSB). If the source is zero, ZF is set and the destination is formally undefined on Intel (AMD documents it as unchanged, and Intel hardware behaves the same way in practice), which is why edx is zeroed first.
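In C++ you would typically reach the same instruction through a compiler builtin rather than inline asm, e.g. (a sketch; __builtin_clz is GCC/Clang-specific):
#include <cstdint>

// MSB index of a nonzero value; compiles to bsr (or lzcnt) on x86.
unsigned msb_index(uint32_t v)
{
    // Like bsr, __builtin_clz is undefined for v == 0.
    return 31u - static_cast<unsigned>(__builtin_clz(v));
}
For example, msb_index(0x12345678) returns 28, matching the bsr example above.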