Does anyone know of an algorithm similar to De Bruijn's LSB, but for MSB? Or alternately the most efficient way of determining the MSB?
I know Log_2(Val) will do this, but I don't know if it's the most efficient method.
The reason I need it is that I need to convert little-endian to big-endian. I know the standard algorithm for this. However, the input is 64-bit, but typically the numbers will be 16 or 24 bit, so swapping the whole 8 bytes around is unnecessary 99.9% of the time.
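For reference, the "standard algorithm" mentioned above, a plain byte-by-byte 64-bit swap, can be sketched in C like this (compilers generally recognize the pattern and emit a single bswap instruction):

```c
#include <stdint.h>

/* Unconditional full-width 64-bit endian swap: move each byte to its
   mirrored position.  Always touches all 8 bytes. */
static uint64_t bswap64(uint64_t v)
{
    return ((v & 0x00000000000000FFULL) << 56) |
           ((v & 0x000000000000FF00ULL) << 40) |
           ((v & 0x0000000000FF0000ULL) << 24) |
           ((v & 0x00000000FF000000ULL) <<  8) |
           ((v & 0x000000FF00000000ULL) >>  8) |
           ((v & 0x0000FF0000000000ULL) >> 24) |
           ((v & 0x00FF000000000000ULL) >> 40) |
           ((v & 0xFF00000000000000ULL) >> 56);
}
```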
Isn't this exactly http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn ?
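The De Bruijn-based integer log2 from that page, roughly transcribed (the lookup table and the multiplier constant are taken from the linked bithacks page):

```c
#include <stdint.h>

/* MSB index (floor of log2) via a De Bruijn multiply: smear the MSB
   downward so the value becomes 2^(k+1)-1, then use the top 5 bits of
   a De Bruijn product as an index into a 32-entry table. */
static const int MultiplyDeBruijnBitPosition[32] =
{
     0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
};

static int msb_index(uint32_t v)   /* result is undefined for v == 0 */
{
    v |= v >> 1;   /* fill everything below the MSB with ones */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}
```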
If you want a fast method and are able/willing to use hardware-specific instructions, you should take a look at the x86 BSR (Bit Scan Reverse) or a similar instruction.
Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand).
xor edx, edx
mov eax, 0x12345678
bsr edx, eax
It should store 28 in edx (bit 28 is the MSB). If the source is zero, BSR sets ZF and leaves the destination formally undefined (in practice most CPUs leave it unchanged), which is why edx is zeroed first.
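If inline assembly is inconvenient, compilers expose this directly through intrinsics; for example GCC/Clang's __builtin_clz (MSVC has _BitScanReverse), which compiles down to BSR or LZCNT on x86. A sketch:

```c
#include <stdint.h>

/* MSB index via the GCC/Clang builtin: 31 minus the leading-zero
   count.  Like BSR, the input must be nonzero. */
static int msb_index_clz(uint32_t v)
{
    return 31 - __builtin_clz(v);
}
```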
Related
Need to correctly convert a YMM of 8 int32_t to an XMM with 8 uint8_t in the low bytes, using AVX intrinsics. It should be the analogue of static_cast<uint8_t>; that is, the C++ standard's modular-reduction rule applies, so we get truncation of the 2's-complement bit pattern.
For example, (int32_t)(-1) -> (uint8_t)(255), and +200 -> (uint8_t)(200), so we can't use signed or unsigned saturation to 8 bits (or even to 16 bits as an intermediate step).
I have this code as the example:
packssdw xmm0, xmm0
packuswb xmm0, xmm0
movd somewhere, xmm0
But these instructions saturate, so we get (int32_t)(-1) -> (uint8_t)(0).
I know vcvttss2si and it works correctly but only for one value. For the best performance I want to use vector registers.
Also I know about shuffling, but it's too slow for me.
So is there another way to convert from an int32_t YMM to uint8_t with static_cast<uint8_t> semantics?
UPD: The comment by @chtz answers my question.
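chtz's actual comment isn't reproduced here, but one standard way to get truncating (modular) narrowing despite the saturating packs is to mask each lane down to 0..255 first; then saturation can never trigger. A minimal SSE2 sketch for a single XMM of four int32_t (an AVX2 version would additionally need a cross-lane fix-up such as vpermq, since the 256-bit packs work per 128-bit lane):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Truncating int32 -> uint8 narrowing with static_cast semantics.
   The AND performs the modular reduction, after which every lane is
   in 0..255, so neither pack can clamp anything. */
static uint32_t pack_truncate4(__m128i v)   /* 4 x int32_t -> 4 x uint8_t */
{
    v = _mm_and_si128(v, _mm_set1_epi32(0xFF)); /* value mod 256 per lane */
    v = _mm_packs_epi32(v, v);                  /* 32 -> 16; 0..255 passes */
    v = _mm_packus_epi16(v, v);                 /* 16 -> 8;  0..255 passes */
    return (uint32_t)_mm_cvtsi128_si32(v);      /* the four result bytes */
}
```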
How (in C/C++) do I load a 32-bit integer into the low 32 bits of an SSE register, leaving the rest undefined? I mean something like vmovd xmm0, eax, with the same efficiency.
Probably you are looking for the intrinsic _mm_cvtsi32_si128(int a). It copies the integer into the lower 32 bits; the upper bits are set to zero.
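A quick round-trip sketch of the intrinsic (it maps to `movd xmm, r32`):

```c
#include <emmintrin.h>  /* SSE2 */

/* _mm_cvtsi32_si128 puts the integer in the low 32 bits of an XMM
   register and zeroes the upper 96; _mm_cvtsi128_si32 reads it back. */
static int movd_roundtrip(int x)
{
    __m128i v = _mm_cvtsi32_si128(x);   /* load into the low dword */
    return _mm_cvtsi128_si32(v);        /* extract the low dword */
}
```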
I am currently trying to multiply two floats: one that comes from a float vector (whose address is stored in ebx), and the value I stored in ecx.
I have confirmed that the input values are correct. However, if I multiply 32 and 1, for example, the value in EAX changes to 00000000 and the one in EDX to 105F0000. From my understanding of MUL, this happens because it stores the high-order bits of the result in EDX and the low-order ones in EAX. The question is, how do I move the result into an output variable (returnValue)? Here is the code snippet in question:
AddColumnsIteration:
cmp esi, 4 // If we finished storing the data
jge NextColumn // Move to the next column
mov eax, [ebx][esi * SIZEOF_INT] // Get the current column
mul ecx // Multiply it by tx
add [returnValue][esi * SIZEOF_INT], eax // Add the product in eax to the running total
inc esi // Element x, y, z, or w of a Vec4
jmp AddColumnsIteration // Go back to check our loop condition
I am aware that if I used x87 commands or SSE instructions, this would be orders of magnitude easier, but the constraints of the problem require pure x86 assembly code. Sorry if this seems kinda basic, but I am still learning the idiosyncrasies of assembly.
Thank you in advance for the help and have a nice day
You’re multiplying the representations of the floating-point numbers as integers, rather than the floating-point numbers themselves:
1.0 = 0x3f800000
32.0 = 0x42000000
0x3f800000 * 0x42000000 = 0x105f000000000000
To actually do floating-point arithmetic, you need to do one of the following:
Use x87.
Use SSE.
Write your own software multiplication which separates the encodings into a signbit, exponent, and significand, xors the signbits to get the sign bit of the product, adds the exponents and adjusts the bias to get the exponent of the product, multiplies the significands and rounds to get the significand of the product, then assembles them to produce the result.
Obviously, the first two options are much simpler, but it sounds like they aren't an option for some reason or another (though I can't really imagine why not; x87 and SSE are "pure x86 assembly code", as they've been part of the ISA for a very long time now).
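The registers observed in the question can be reproduced in C by redoing the integer multiply of the two encodings:

```c
#include <stdint.h>

/* MUL multiplied the raw IEEE-754 encodings of 1.0f and 32.0f as
   integers; the 64-bit product is exactly the EDX:EAX pair the asker
   saw (high half 0x105F0000, low half 0x00000000). */
static uint64_t encodings_product(void)
{
    return 0x3F800000ULL * 0x42000000ULL;  /* bit patterns of 1.0f, 32.0f */
}
```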
I'm trying to find the two smallest numbers in an array of floats in x86, but I don't know how to load the elements of the array.
To access an integer I use:
mov eax, 0 ; eax is a counter
mov ecx, vector ; Stores the address of vector in ecx
mov esi, [ecx+4*eax] ; Loads element eax of the vector into esi
This works with an array of integers, but not with floats; I don't know how to do it. I tried using 8 instead of 4, but it doesn't work.
EDIT: When I say that it doesn't work, I mean that the values are not read correctly; the number stored in ESI is 1099956224, which is not correct.
Thanks!
You can only compare floating-point numbers if they are in floating-point registers. Trying to interpret a floating-point number in memory as an integer is utterly pointless.
Read up on floating-point support in x86. It's a whole different instruction set; historically, it used to be handled by a separate chip, one with an 87 model number. Specifically: to load floats from memory, use the FLD command; to compare - FCOM*; to conditionally assign - FCMOV. Also, keep in mind the stack-based nature of the floating-point subsystem.
EDIT: alternatively, you can use SSE scalar instructions. On a modern CPU, it may perform better than the legacy x87 instructions. Use MOVSS to load floats from memory and to copy between XMM registers, use COMISS to compare and set EFLAGS.
Finally, as Aki points out, non-negative floats in IEEE-754 format sort lexicographically: among two valid non-negative floats, the bit pattern of the larger one represents a larger integer, too (negative floats, being sign-magnitude, sort in reverse of their bit patterns). In real life, that's something you can leverage.
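That ordering property is easy to check in C, and it also explains the asker's mystery value: 1099956224 is 0x41900000, the IEEE-754 encoding of 18.0f read as an integer.

```c
#include <stdint.h>
#include <string.h>

/* Returns the IEEE-754 bit pattern of a float.  For finite,
   non-negative floats, larger value implies larger bit pattern, so an
   integer compare of the encodings orders them correctly. */
static uint32_t bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);  /* well-defined way to read the encoding */
    return u;
}
```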
Many CPUs have a single assembly instruction for returning the high-order bits of a 32-bit integer multiplication. Normally, multiplying two 32-bit integers produces a 64-bit result, but this is truncated to the low 32 bits if you store it in a 32-bit integer.
For example, on PowerPC, the mulhw opcode returns the high 32 bits of the 64-bit result of a 32x32-bit multiply in one clock. This is exactly what I'm looking for, but more portably. There's a similar intrinsic, __umulhi(), in NVIDIA CUDA.
In C/C++, is there an efficient way to return the high order bits of the 32x32 multiply?
Currently I compute it by casting to 64 bits, something like:
unsigned int umulhi32(unsigned int x, unsigned int y)
{
unsigned long long xx=x;
xx*=y;
return (unsigned int)(xx>>32);
}
but this is over 11 times slower than a regular 32 by 32 multiply because I'm using overkill 64 bit math even for the multiply.
Is there a faster way to compute the high order bits?
This is clearly not best solved with a BigInteger library (which is overkill and will have huge overhead).
SSE seems to have PMULHUW, a 16x16 -> top 16 bit version of this, but not a 32x32 -> top 32 version like I'm looking for.
gcc 4.3.2, with -O1 optimisation or higher, translates your function exactly as you showed it into IA-32 assembly like this:
umulhi32:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %eax
mull 8(%ebp)
movl %edx, %eax
popl %ebp
ret
Which is just doing a single 32 bit mull and putting the high 32 bits of the result (from %edx) into the return value.
That's what you wanted, right? Sounds like you just need to turn up the optimisation on your compiler ;) It's possible you could push the compiler in the right direction by eliminating the intermediate variable:
unsigned int umulhi32(unsigned int x, unsigned int y)
{
return (unsigned int)(((unsigned long long)x * y)>>32);
}
I don't think there's a way to do this in standard C/C++ better than what you already have. What I'd do is write up a simple assembly wrapper that returned the result you want.
Not that you're asking about Windows, but as an example even though Windows has an API that sounds like it does what you want (a 32 by 32 bit multiply while obtaining the full 64 bit result), it implements the multiply as a macro that does what you're doing:
#define UInt32x32To64( a, b ) (ULONGLONG)((ULONGLONG)(DWORD)(a) * (DWORD)(b))
On 32-bit Intel, a multiply writes its output to two registers. That is, the full 64 bits are available whether you want them or not. It's just a question of whether the compiler is smart enough to take advantage of that.
Modern compilers do amazing things, so my suggestion is to experiment with optimization flags some more, at least on Intel. You would think that the optimizer might know that the processor produces a 64 bit value from 32 by 32 bits.
That said, at some point I tried to get the compiler to use the remainder as well as the quotient of a division, but the old Microsoft compiler from 1998 was not smart enough to realize that the same instruction produced both results.
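Modern compilers do handle that case: x86 DIV produces quotient and remainder together, and writing both operations side by side typically gets fused into a single division.

```c
#include <stdint.h>

/* Quotient and remainder in one step.  A modern compiler will usually
   compute both with a single DIV, since the instruction leaves the
   quotient in EAX and the remainder in EDX anyway. */
static void divmod(uint32_t n, uint32_t d, uint32_t *q, uint32_t *r)
{
    *q = n / d;
    *r = n % d;
}
```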