I'm trying to find the two smallest numbers in an array of floats in x86, but I don't know how to read the elements of the array.
To access an integer I use:
mov eax, 0 ; eax is a counter
mov ecx, vector ; This stores the address of the vector in register ecx
mov esi, [ecx+4*eax] ; Stores element eax of the vector in esi
This works with an array of integers, but not with floats, and I don't know how to do it. I tried using 8 instead of 4 as the scale, but that doesn't work either.
EDIT: When I say that it doesn't work, I mean that the values are not read correctly; the number stored in
ESI is 1099956224, which is not correct.
Thanks!
You can only compare floating point numbers if they are in floating point registers. Trying to interpret a floating point number in memory as an integer is utterly pointless.
Read up on floating point support in x86. It's a whole different instruction set; historically, it used to be handled by a separate chip, one with an 87 model number. Specifically to load floats from memory, use the FLD command, to compare - FCOM*, to conditionally assign - FCMOV. Also, keep in mind the stack-y nature of the floating point subsystem.
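For example, here is a minimal x87 sketch (not your full solution) that keeps the smaller of the first two elements, using your vector label; fcomi/fcmovnb need a P6 or later CPU:
fld dword ptr [vector]       ; push the first float onto the FPU stack
fld dword ptr [vector+4]     ; push the second float (st(0) = second, st(1) = first)
fcomi st(0), st(1)           ; compare st(0) with st(1), setting EFLAGS
fcmovnb st(0), st(1)         ; if st(0) >= st(1), overwrite st(0) with st(1)
fstp st(1)                   ; keep the smaller value in st(0) and pop the extra entry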
EDIT: alternatively, you can use SSE scalar instructions. On a modern CPU, it may perform better than the legacy x87 instructions. Use MOVSS to load floats from memory and to copy between XMM registers, use COMISS to compare and set EFLAGS.
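Putting that together, a hedged sketch of a running-minimum loop over your array (assuming ecx holds the address of vector, edx holds the element count of at least 1, and the labels are placeholders):
    movss xmm0, [ecx]            ; the first element is the initial minimum
    mov eax, 1                   ; index of the next element
scan_loop:
    cmp eax, edx
    jge scan_done
    movss xmm1, [ecx+4*eax]      ; load the next float (4 bytes per element)
    comiss xmm1, xmm0            ; compare it against the current minimum
    jae keep_min                 ; not smaller: keep the current minimum
    movss xmm0, xmm1             ; smaller: it becomes the new minimum
keep_min:
    inc eax
    jmp scan_loop
scan_done:                       ; xmm0 now holds the smallest element
Tracking the two smallest values just means keeping a second register (say xmm2) and updating both of them at each comparison.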
Finally, as Aki points out, non-negative floats in IEEE-754 format sort the same way as their bit patterns: between two valid non-negative floats, the bit pattern of the larger one also represents the larger integer. In real life, that's something you can leverage.
How (in C/C++) can I load a 32-bit integer into the low 32 bits of an SSE register, while leaving the rest undefined? I mean something like vmovd xmm0, eax, with the same efficiency.
You are probably looking for the intrinsic _mm_cvtsi32_si128(int a). This copies the integer into the lower 32 bits of the vector; the upper bits are set to zero.
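For reference, on typical compilers this intrinsic lowers to a single instruction; a sketch of the expected codegen (not a guarantee):
movd xmm0, eax      ; copy the 32-bit integer into the low 32 bits of xmm0, zeroing the upper bits
vmovd xmm0, eax     ; the same operation with the AVX encoding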
Suppose I want to typecast 32-bit float data to a 32-bit unsigned integer.
float foo = 5.0f;
uint32 goo;
goo = (uint32)foo;
How does the compiler typecast the variable? Are there any intermediate steps, and if so, what are they?
This is going to depend on the hardware to a large degree (though things like the OS/compiler can and will change how the hardware is used).
For example, on older Intel processors (and current ones, in most 32-bit code) you use the x87 instruction set for (most) floating point. In this case, there's an fistp (floating point integer store and pop, though there's also a non-popping variety, in case you also need to continue using that floating point value) that supports storing to an integer, so that's what's typically used. There's a bit in the floating point control word that controls how that conversion will be done (rounding versus truncating) that has to be set correctly as well.
On current hardware (with a current compiler producing 64-bit code) you're typically going to be using SSE instructions, in which case conversion to a signed int can be done with cvtss2si.
Neither of these directly supports unsigned operands though (at least I'm sure SSE doesn't, and to the best of my recollection x87 doesn't either). For these, the compiler probably has a small function in the library, so it's done (at least partly) in software.
If your compiler supports generating AVX-512 instructions, it could use VCVTTSS2USI to convert from a 32-bit float to a 64-bit unsigned integer, then store the bottom 32 bits of that integer to your destination (and if the value was negative or too large to fit in a 32-bit unsigned integer, well... the standard says that gives undefined behavior, so you got what you deserved).
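Without AVX-512, a common trick on x86-64 is to go through a wider signed conversion; a hedged sketch (assumes the float is in xmm0):
cvttss2si rax, xmm0     ; truncate the float to a signed 64-bit integer
                        ; the low 32 bits (eax) are the correct uint32 result for any
                        ; value in [0, 2^32); out-of-range inputs are undefined behavior in C anyway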
Floating point and unsigned numbers are stored in completely different formats, so the compiler must change from one to the other, as opposed to casting from unsigned to signed where the format of the number is the same and the software just changes how the number is interpreted.
For example, the 32-bit float representation of the number 5 is 0x40a00000, while as an unsigned integer it is 0x00000005.
This conversion ends up being significant when you're working with weaker microcontrollers; on anything with an FPU it is just a few extra instructions.
Here is the wikipedia write up on floating point
Here is a floating point converter that shows the bit level data
The compiler will take care of this part. For example, on the Intel architecture with SSE FPU support, GCC defines some of these operations in "emmintrin.h".
A simple way to figure it out is to just compile a small C program with assembly output (e.g. gcc -S); you will get something like:
movss -4(%rbp), %xmm0        # load the 32-bit float from the stack
cvtps2pd %xmm0, %xmm0        # widen it to double precision
movsd .LC1(%rip), %xmm1      # load a double constant
addsd %xmm1, %xmm0           # add it
cvttsd2si %xmm0, %eax        # truncate the double to a 32-bit integer
The floating-point conversion instructions (cvtps2pd/cvttsd2si) are used here; the exact sequence depends on the target machine.
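For comparison, the signed case (a plain (int)foo truncation) typically needs only the following; a hedged sketch in Intel syntax, reusing the foo/goo names from the question:
movss xmm0, dword ptr [foo]      ; load the 32-bit float
cvttss2si eax, xmm0              ; truncate it to a 32-bit signed integer
mov dword ptr [goo], eax         ; store the integer result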
I am currently trying to multiply two floats, one that comes from a float vector (address stored in ebx), by the value I stored in ecx.
I have confirmed that the input values are correct; however, if I multiply 32 and 1, for example, the value in EAX changes to 00000000 and the one in EDX to 105F0000. From my understanding of MUL, this happens because it stores the high order bits of the result in EDX and the low order ones in EAX. The question is: how do I move the result into an output variable (returnValue)? Here is the code snippet in question:
AddColumnsIteration:
cmp esi, 4 // If we finished storing the data
jge NextColumn // Move to the next column
mov eax, [ebx][esi * SIZEOF_INT] // Get the current column
mul ecx // Multiply it by tx
add [returnValue][esi * SIZEOF_INT], eax // Add the data pointed at by eax to the running total
inc esi // Element x, y, z, or w of a Vec4
jmp AddColumnsIteration // Go back to check our loop condition
I am aware that if I used x87 commands or SSE instructions, this would be orders of magnitude easier, but the constraints of the problem require pure x86 assembly code. Sorry if this seems kinda basic, but I am still learning the idiosyncrasies of assembly.
Thank you in advance for the help and have a nice day
You’re multiplying the representations of the floating-point numbers as integers, rather than the floating-point numbers themselves:
1.0 = 0x3f800000
32.0 = 0x42000000
0x3f800000 * 0x42000000 = 0x105f000000000000
To actually do floating-point arithmetic, you need to do one of the following:
Use x87.
Use SSE.
Write your own software multiplication: separate each encoding into a sign bit, exponent, and significand; XOR the sign bits to get the sign of the product; add the exponents and adjust the bias to get the exponent of the product; multiply the significands and round to get the significand of the product; then assemble these pieces into the result.
Obviously, the first two options are much simpler, but it sounds like they aren’t an option for some reason or another (though I can’t really imagine why not; x87 and SSE are “pure x86 assembly code”, as they’ve been part of the ISA for a very long time now).
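For comparison, a hedged sketch of how the same loop body could look with SSE scalar instructions, assuming the tx value were held in xmm1 instead of ecx and returnValue held floats:
AddColumnsIteration:
    cmp esi, 4                                    // If we finished storing the data
    jge NextColumn                                // Move to the next column
    movss xmm0, [ebx][esi * SIZEOF_INT]           // Load the current column as a float
    mulss xmm0, xmm1                              // Multiply it by tx (kept in xmm1)
    addss xmm0, [returnValue][esi * SIZEOF_INT]   // Add the running total
    movss [returnValue][esi * SIZEOF_INT], xmm0   // Store the updated total
    inc esi                                       // Element x, y, z, or w of a Vec4
    jmp AddColumnsIteration                       // Go back to check our loop condition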
Does anyone know of an algorithm similar to De Bruijn's LSB trick, but for the MSB? Or, alternatively, the most efficient way of determining the MSB?
I know Log_2(Val) will do this, but I don't know if it's the most efficient method.
The reason I need it is I need to convert little-endian to big-endian. I know the standard algorithm for this. However, the input is 64 bit, but typically the numbers will be 16 or 24 bit, so swapping the whole 8 bytes around is unneeded 99.9% of the time.
Isn't this exactly http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn ?
If you want a fast method and are able/willing to use hardware specific instructions, you should take a look at x86 BSR (Bit Scan Reverse) or similar instruction.
Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand).
xor edx, edx          ; clear edx in case the source is zero
mov eax, 0x12345678   ; the test value
bsr edx, eax          ; edx = bit index of the most significant set bit
It should store 28 in edx (bit 28 is the most significant set bit of 0x12345678). If the source is zero, BSR sets ZF and leaves the destination formally undefined on Intel (in practice it is unchanged), which is why edx is cleared first.
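For the byte-swapping use case, a hedged sketch of how BSR can tell you how many bytes are actually significant (64-bit code, assumes a non-zero value in rax):
bsr rcx, rax      ; rcx = bit index of the most significant set bit (0..63)
shr rcx, 3        ; divide by 8: index of the highest non-zero byte (0..7)
inc rcx           ; rcx = number of significant bytes that need swapping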
For an integer that is never expected to take negative values, one could use either unsigned int or int.
From a compiler perspective, or from a pure CPU-cycle perspective, is there any difference on x86_64?
It depends. It might go either way, depending on what you are doing with that int as well as on the properties of the underlying hardware.
An obvious example in unsigned int's favor would be integer division by a constant. C/C++ integer division is required to round towards zero, while the cheap "optimized" replacements for integer division (arithmetic shifts, etc.) round towards negative infinity for negative operands. So, in order to satisfy the standard's requirements, the compiler is forced to adjust signed integer division results with additional machine instructions. For unsigned integer division this problem does not arise, which is why such division generally compiles to faster code for unsigned types than for signed types.
For example, consider this simple expression
rand() / 2
The code generated for this expression by the MSVC compiler will generally look as follows
call rand
cdq              ; sign-extend eax into edx (edx = 0 for non-negative, -1 for negative)
sub eax,edx      ; add 1 to negative values so the shift below rounds towards zero
sar eax,1        ; arithmetic shift right: divide by 2
Note that instead of a single shift instruction (sar) we see a whole bunch of instructions here: our sar is preceded by two extra instructions (cdq and sub). These extra instructions are there just to "adjust" the division in order to force it to generate the "correct" (from the C language's point of view) result. Note that the compiler does not know that your value will always be positive, so it has to generate these instructions always, unconditionally. They will never do anything useful, thus wasting CPU cycles.
Now take a look at the code for
(unsigned) rand() / 2
It is just
call rand
shr eax,1
In this case a single shift did the trick, thus providing us with astronomically faster code (for the division alone).
On the other hand, when you are mixing integer arithmetic and x87 FPU floating-point arithmetic, signed integer types might work faster, since the FPU instruction set has instructions that load/store signed integer values directly, but has no instructions for unsigned integer values.
To illustrate this one can use the following simple function
double zero() { return rand(); }
The generated code will generally be very simple
call rand
mov dword ptr [esp],eax    ; spill the result to memory
fild dword ptr [esp]       ; load it onto the FPU stack as a signed integer
But if we change our function to
double zero() { return (unsigned) rand(); }
the generated code will change to
call rand
test eax,eax               ; check the sign bit of the result
mov dword ptr [esp],eax
fild dword ptr [esp]       ; fild always interprets the value as signed
jge zero+17h               ; non-negative: the signed load was already correct
fadd qword ptr [__real#41f0000000000000 (4020F8h)]   ; negative: add 2^32 to compensate
This code is noticeably larger because the FPU instruction set cannot load unsigned integer types: the value has to be loaded as a signed integer and then, if its sign bit was set, adjusted by adding 2^32 (which is what the conditionally skipped fadd of __real#41f0000000000000 does).
There are other contexts and examples that can be used to demonstrate that it works either way. So, again, it all depends. But generally, all this will not matter in the big picture of your program's performance. I generally prefer to use unsigned types to represent unsigned quantities. In my code 99% of integer types are unsigned. But I do it for purely conceptual reasons, not for any performance gains.
Signed types are inherently more optimizable in most cases because the compiler can ignore the possibility of overflow and simplify/rearrange arithmetic in whatever ways it sees fit. On the other hand, unsigned types are inherently safer because the result is always well-defined (even if not to what you naively think it should be).
The one case where unsigned types are better optimizable is when you're writing division/remainder by a power of two. For unsigned types this translates directly to a bitshift and bitwise and. For signed types, unless the compiler can establish that the value is known to be positive, it must generate extra code to compensate for the off-by-one issue with negative numbers (according to C, -3/2 is -1, whereas an arithmetic shift, which rounds towards negative infinity, gives -2).
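As a rough illustration, here is what x / 8 can compile to in each case (a sketch, assuming x is in eax; real compiler output varies):
; unsigned x / 8:
shr eax, 3            ; a single logical shift
; signed x / 8 (truncation towards zero needs a bias for negative x):
lea ecx, [eax+7]      ; x + 7, the value to shift when x is negative
test eax, eax         ; check the sign of x
cmovs eax, ecx        ; use the biased value only if x is negative
sar eax, 3            ; the arithmetic shift now matches C's rounding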
It will almost certainly make no difference, but occasionally the compiler can play games with the signedness of types in order to shave a couple of cycles, but to be honest it probably is a negligible change overall.
For example suppose you have an int x and want to write:
if(x >= 10 && x < 200) { /* ... */ }
You (or better yet, the compiler) can transform this a little to do one less comparison:
if((unsigned int)(x - 10) < 190) { /* ... */ }
This makes the assumption that int is represented in 2's complement, so that if (x - 10) is less than 0 it becomes a huge value when viewed as an unsigned int. For example, on a typical x86 system, (unsigned int)-1 == 0xffffffff, which is clearly bigger than the 190 being tested.
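In x86 terms the saving is roughly this (a sketch, not exact compiler output; x is assumed to be in eax and out_of_range is a placeholder label):
; if (x >= 10 && x < 200) -- two compares and two branches:
cmp eax, 10
jl out_of_range
cmp eax, 200
jge out_of_range
; if ((unsigned int)(x - 10) < 190) -- one compare and one branch:
lea ecx, [eax-10]
cmp ecx, 190
jae out_of_range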
This is micro-optimization at best, and best left up to the compiler; instead, you should write code that expresses what you mean, and if it is too slow, profile and decide where it really is necessary to get clever.
I don't imagine it would make much difference in terms of CPU or the compiler. One possible case would be if it enabled the compiler to know that the number would never be negative and optimize away code.
However it IS useful to a human reading your code so they know the domain of the variable in question.
From the ALU's point of view, adding (or whatever) signed or unsigned values doesn't make any difference, since they're both represented by a group of bits. 0100 + 1011 is always 1111, but you choose whether that is 4 + (-5) = -1 or 4 + 11 = 15.
So I agree with @Mark: you should choose the data type that best helps others understand your code.