How do I (in C/C++) load a 32-bit integer into the low 32 bits of an SSE register, leaving the rest undefined? I mean something like vmovd xmm0, eax, with the same efficiency.
You are probably looking for the intrinsic _mm_cvtsi32_si128(int a). It copies the argument into the lower 32 bits; the upper bits are set to zero.
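A minimal usage sketch (the wrapper name is mine); with optimization this compiles to a single (v)movd xmm, r32:

#include <immintrin.h>

__m128i load_low32(int x)
{
    return _mm_cvtsi32_si128(x);  // low 32 bits = x, upper bits zeroed
}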
Related
I need to correctly convert a YMM register holding 8 int32_t values to an XMM register holding 8 uint8_t values at the bottom, using AVX intrinsics. It should be the analogue of static_cast<uint8_t>, meaning the C++ standard rules apply (modular reduction), so we get truncation of the 2's complement bit pattern.
For example, (int32_t)(-1) -> (uint8_t)(255), and +200 -> (uint8_t)(200), so we can't use signed or unsigned saturation to 8 bits (or even to 16 bits as an intermediate step).
I have this code as an example:
packssdw xmm0, xmm0
packuswb xmm0, xmm0
movd somewhere, xmm0
But these instructions saturate (signed to 16 bits, then unsigned to 8 bits), so we get (int32_t)(-1) -> (uint8_t)(0).
I know vcvttss2si, and it works correctly, but only for one value at a time. For the best performance I want to use vector registers.
I also know about shuffling, but it is too slow for me.
So, is there another way to convert from an int32_t YMM to uint8_t, as static_cast<uint8_t> does?
UPD: The comment by @chtz is the answer to my question.
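For the record, I'm assuming the suggested solution is the usual vpshufb-based truncation (AVX2 required). A minimal sketch, with the function name and byte layout being my own choices; it keeps the low byte of each dword within each 128-bit lane, then interleaves the two lanes so the 8 result bytes land in the low 8 bytes of the XMM register:

#include <immintrin.h>

static inline __m128i trunc_epi32_to_epi8(__m256i v)
{
    // In each 128-bit lane, gather the low byte of each dword into bytes 0-3;
    // the -1 entries (high bit set) zero the remaining bytes.
    const __m256i ctrl = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i t  = _mm256_shuffle_epi8(v, ctrl);    // vpshufb, per-lane
    __m128i lo = _mm256_castsi256_si128(t);       // bytes of elements 0..3
    __m128i hi = _mm256_extracti128_si256(t, 1);  // bytes of elements 4..7
    return _mm_unpacklo_epi32(lo, hi);            // result in the low 8 bytes
}

Because only the low byte of each element survives, this is exactly the modular static_cast<uint8_t> behavior: -1 becomes 255 and +200 stays 200.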
Suppose I want to typecast 32-bit float data to a 32-bit unsigned integer.
#include <stdint.h>

float foo = 5.0f;
uint32_t goo;
goo = (uint32_t)foo;
How does the compiler typecast a variable? Are there any intermediate steps, and if so, what are they?
This is going to depend on the hardware to a large degree (though things like the OS/compiler can and will change how the hardware is used).
For example, on older Intel processors (and current ones, in most 32-bit code) you use the x87 instruction set for most floating point. In this case, there's an fistp (floating-point integer store and pop; there's also a non-popping variant, in case you need to keep using the floating-point value) that stores to an integer, so that's what's typically used. There's also a bit in the floating-point control word that controls how that conversion is done (rounding versus truncating), which has to be set correctly as well.
On current hardware (with a current compiler producing 64-bit code) you're typically going to be using SSE instructions, in which case conversion to a signed int can be done with cvtss2si.
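From C/C++ that instruction is reachable directly through an intrinsic; a minimal sketch. Note that _mm_cvtss_si32 rounds according to the current MXCSR rounding mode, whereas a plain C cast truncates and therefore maps to cvttss2si instead:

#include <xmmintrin.h>

int float_to_int(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));  // cvtss2si
}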
Neither of these directly supports unsigned operands, though (at least I'm sure SSE doesn't, and to the best of my recollection x87 doesn't either). For those, the compiler probably has a small helper function in the library, so the conversion is done (at least partly) in software.
If your compiler supports generating AVX-512 instructions, it could use VCVTTSS2USI to convert from a 32-bit float to a 64-bit unsigned integer, then store the bottom 32 bits of that integer to your destination (and if the result was negative or too large to fit in a 32-bit integer, well... the standard says that gives undefined behavior, so you got what you deserved).
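For what it's worth, without AVX-512 a 64-bit compiler typically gets the same effect from the plain signed conversion, since cvttss2si also exists with a 64-bit destination. A sketch of the idea:

#include <stdint.h>

uint32_t float_to_u32(float f)
{
    // One cvttss2si r64, xmm plus keeping the low 32 bits; well defined
    // for values that actually fit in a uint32_t.
    return (uint32_t)(int64_t)f;
}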
Floating-point and unsigned numbers are stored in completely different formats, so the compiler must actively convert from one to the other, as opposed to casting from unsigned to signed, where the representation is the same and only the interpretation of the bits changes.
For example, the 32-bit number 5 is represented as 0x40a00000 as a float and as 0x00000005 as an unsigned integer.
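You can verify those representations yourself; a small self-contained sketch using memcpy for well-defined bit inspection:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // read the float's bit pattern
    printf("float 5.0 = 0x%08x, unsigned 5 = 0x%08x\n", bits, 5u);
    return 0;  // prints 0x40a00000 and 0x00000005
}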
This conversion ends up being significant when you're working with weaker microcontrollers; on anything with an FPU it's only a few extra instructions.
Here is the Wikipedia write-up on floating point.
Here is a floating-point converter that shows the bit-level data.
The compiler will take care of this; for example, on Intel architectures with SSE support, GCC defines the relevant operations in "emmintrin.h".
A simple way to find out is to compile a small C program with assembly output (e.g. gcc -S); you will get something like:
movss   -4(%rbp), %xmm0     # load the float from the stack
cvtps2pd %xmm0, %xmm0       # widen float to double
movsd   .LC1(%rip), %xmm1   # load a double constant
addsd   %xmm1, %xmm0        # add it
cvttsd2si %xmm0, %eax       # truncate the double to a signed 32-bit int
The conversion instructions (cvtps2pd/cvttsd2si) are used here; exactly which ones you get depends on the target machine.
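For instance, a small test program along these lines (my own example, not necessarily the source of the listing above) shows the conversion sequence when compiled without optimization for x86-64:

// Compile with: gcc -S -O0 test.c and inspect test.s
int main(void)
{
    float f = 5.3f;
    int i = (int)(f + 0.5);  // float widened to double, added, then
                             // truncated with cvttsd2si, as above
    return i;
}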
I am currently trying to multiply two floats, one that comes from a float vector (whose address is stored in ebx) against the value I stored in ecx.
I have confirmed that the input values are correct. However, if I multiply 32 and 1, for example, the value in EAX changes to 00000000 and the one in EDX to 105F0000. From my understanding of MUL, this happens because it stores the high-order bits of the result in EDX and the low-order ones in EAX. The question is, how do I move the result into an output variable (returnValue)? Here is the code snippet in question:
AddColumnsIteration:
cmp esi, 4 // If we finished storing the data
jge NextColumn // Move to the next column
mov eax, [ebx][esi * SIZEOF_INT] // Get the current column
mul ecx // Multiply it by tx
add [returnValue][esi * SIZEOF_INT], eax // Add the data pointed at by eax to the running total
inc esi // Element x, y, z, or w of a Vec4
jmp AddColumnsIteration // Go back to check our loop condition
I am aware that if I used x87 commands or SSE instructions, this would be orders of magnitude easier, but the constraints of the problem require pure x86 assembly code. Sorry if this seems kinda basic, but I am still learning the idiosyncrasies of assembly.
Thank you in advance for the help and have a nice day
You’re multiplying the representations of the floating-point numbers as integers, rather than the floating-point numbers themselves:
1.0 = 0x3f800000
32.0 = 0x42000000
0x3f800000 * 0x42000000 = 0x105f000000000000
To actually do floating-point arithmetic, you need to do one of the following:
Use x87.
Use SSE.
Write your own software multiplication, which separates the encodings into a sign bit, exponent, and significand; XORs the sign bits to get the sign of the product; adds the exponents and adjusts the bias to get the exponent of the product; multiplies the significands and rounds to get the significand of the product; and then assembles the pieces to produce the result (a rough sketch follows below).
Obviously, the first two options are much simpler, but it sounds like they aren't an option for some reason or another (though I can't really imagine why not; x87 and SSE are "pure x86 assembly code", as they've been part of the ISA for a very long time now).
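For illustration only, here is a heavily simplified sketch of the third option in C, operating directly on the bit patterns. It assumes normal, finite inputs (no zeros, subnormals, infinities, NaNs, or exponent overflow) and truncates instead of rounding to nearest even; a real implementation must handle all of those cases:

#include <stdint.h>

uint32_t fmul_bits(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;        // XOR the sign bits
    int exp = (int)((a >> 23) & 0xFF)             // add the exponents,
            + (int)((b >> 23) & 0xFF) - 127;      // re-bias by 127
    uint64_t sa = (a & 0x7FFFFFu) | 0x800000u;    // significand + hidden bit
    uint64_t sb = (b & 0x7FFFFFu) | 0x800000u;
    uint64_t prod = sa * sb;                      // 48-bit product
    if (prod & (1ULL << 47)) {                    // product >= 2.0:
        prod >>= 1;                               // renormalize and
        exp += 1;                                 // bump the exponent
    }
    return sign | ((uint32_t)exp << 23) | ((uint32_t)(prod >> 23) & 0x7FFFFFu);
}

With the example above, fmul_bits(0x3f800000, 0x42000000) returns 0x42000000, i.e. 1.0 * 32.0 = 32.0.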
I'm trying to find the two smallest numbers in an array of floats in x86, but I don't know how to access the elements of the array.
To access an integer I use:
mov eax, 0           ; eax is a counter
mov ecx, vector      ; put the address of the array in ecx
mov esi, [ecx+4*eax] ; load element eax of the array into esi
This works with an array of integers, but not with floats, and I don't know how to do it. I tried using 8 instead of 4, but that doesn't work either.
EDIT: When I say that it doesn't work, I mean that the values are not read correctly; the number stored in ESI is 1099956224, which is not correct.
Thanks!
You can only compare floating-point numbers if they are in floating-point registers. Trying to interpret a floating-point number in memory as an integer is utterly pointless; the 1099956224 you're seeing is just the bit pattern of a float read as an integer (it's 0x41900000, the encoding of 18.0).
Read up on floating-point support in x86. It's a whole different instruction set; historically, it was handled by a separate chip, one with an 87 model number. Specifically: to load floats from memory, use the FLD instruction; to compare, FCOM*; to conditionally assign, FCMOV. Also keep in mind the stack-like nature of the floating-point subsystem.
EDIT: alternatively, you can use SSE scalar instructions. On a modern CPU, they may perform better than the legacy x87 instructions. Use MOVSS to load floats from memory and to copy between XMM registers, and COMISS to compare and set EFLAGS.
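From C/C++ the same instructions are reachable through intrinsics; a minimal illustrative sketch (the function name is mine):

#include <xmmintrin.h>

// Returns the smaller of a[i] and a[j]; movss loads, comiss compares.
float smaller_of(const float *a, int i, int j)
{
    __m128 x = _mm_load_ss(&a[i]);
    __m128 y = _mm_load_ss(&a[j]);
    return _mm_comilt_ss(x, y) ? _mm_cvtss_f32(x) : _mm_cvtss_f32(y);
}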
Finally, as Aki points out, non-negative floats in IEEE-754 format sort lexicographically: of two valid non-negative floats, the bit pattern of the larger one also represents a larger integer (the sign-and-magnitude encoding breaks this for negative values). In real life, that's something you can leverage.
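A quick sketch of that trick:

#include <stdint.h>
#include <string.h>

// Valid for non-negative, non-NaN floats: comparing the bit patterns as
// unsigned integers gives the same ordering as comparing the floats.
int less_by_bits(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua < ub;
}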
Does anyone know of an algorithm similar to De Bruijn's LSB, but for the MSB? Or, alternatively, the most efficient way of determining the MSB?
I know Log_2(Val) will do this, but I don't know if it's the most efficient method.
The reason I need it is that I need to convert little-endian to big-endian. I know the standard algorithm for this; however, the input is 64-bit, but typically the numbers will be 16- or 24-bit, so swapping the whole 8 bytes around is unneeded 99.9% of the time.
Isn't this exactly http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn ?
If you want a fast method and are able/willing to use hardware-specific instructions, you should take a look at x86's BSR (Bit Scan Reverse) or a similar instruction.
Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand).
xor edx, edx        ; clear the destination first
mov eax, 0x12345678
bsr edx, eax        ; edx = index of the highest set bit
It should store 28 in edx (bit 28 is the MSB). If no set bit is found (i.e. the source is zero), edx is left unchanged, which is why it is zeroed first.
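If you'd rather stay in C/C++, GCC and Clang expose the same operation through a builtin; a small sketch:

#include <stdint.h>

// __builtin_clz counts leading zeros, so 31 - clz(x) is the MSB index;
// it compiles to bsr (or lzcnt) on x86. Undefined for x == 0.
int msb_index(uint32_t x)
{
    return 31 - __builtin_clz(x);
}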