8-bit FFT for CPU architectures? - c++

I am looking for an FFT engine that can handle 8-bit real to complex transforms (of size 65K). The need for this is to accelerate a real-time signal processing engine. It is currently limited by 8-bit -> FP32 and FP32 -> 8-bit conversions, as well as the actual FFT being memory bandwidth bound (we're using FFTW at the moment).
I thought that the Spiral project (http://spiral.net) might be able to do this, but the only code that seems to be available on their webpage is for single- or double-precision transforms.
Anyone know of any C or C++ libraries that can do this?

Some time ago I encountered the same problem. FFTW for my dataframe was executed in 14 ms (forward, some calculations, and backward), while straightforward byte (or short) to float array conversion took 12-19 ms. So I wrote an SSE function to convert bytes to floats (4 elements per loop iteration) and got a significant speed gain: the conversion now takes 2.2-5 ms.
If your compiler can autovectorize, try that first.
If not, write a simple conversion function with intrinsics.
I've used inline assembler (MOVD, PUNPCKLBW, PUNPCKLWD, CVTDQ2PS, MOVAPS command sequence).
procedure BytesToSingles(Src, Dst: Pointer; Count: Integer);
asm
  //EAX = Src pointer to byte array
  //EDX = Dst pointer to float array !!! 16 byte-aligned !!!
  //ECX = Count (multiple of four)
  SHR ECX, 2 // 4 elements per cycle
  JZ @@Exit
  PXOR XMM7, XMM7 // zeros
@@Cycle:
  MOVD XMM1, [EAX] // load 4 bytes
  PUNPCKLBW XMM1, XMM7 // unpack to words
  PUNPCKLWD XMM1, XMM7 // words to int32
  CVTDQ2PS XMM0, XMM1 // convert integers to 4 floats
  MOVAPS [EDX], XMM0 // store 4 floats to destination array
  ADD EAX, 4 // move array pointers
  ADD EDX, 16
  LOOP @@Cycle
@@Exit:
end;
Note that an FFT implemented on 8-bit data will suffer from numerical error, as Paul R wrote in a comment.

You do not want to do all the processing in fixed point. Your data will turn to mush in an FFT of that size. Technically, you could use 32-bit fixed point and keep all your dynamics, but you'd still have to convert the data and it will be slower than using floats (you tagged SSE, so I assume you are on an Intel machine with an FPU). I base my opinions on my work creating kissfft.
Focus instead on speeding up the type conversion.
I've not run MBo's assembly code, but it looks like the right approach. I think unrolling might make it faster.
If you are not accustomed to assembly, use SSE2 compiler intrinsics instead. It will be just as fast (assuming a decent compiler) and it will make your code more readable and maintainable. This answer will give you most of what you need.
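As a rough illustration of the intrinsics route, here is a minimal sketch of the same MOVD / PUNPCKLBW / PUNPCKLWD / CVTDQ2PS sequence written in C with SSE2 intrinsics. The function name bytes_to_floats_sse2 is made up for this example, the bytes are assumed to be unsigned, and an unaligned store is used so the destination does not have to be 16-byte aligned (the assembly version above requires alignment).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert count unsigned bytes to floats, 4 per iteration. */
void bytes_to_floats_sse2(const uint8_t *src, float *dst, size_t count)
{
    const __m128i zero = _mm_setzero_si128();
    assert(count % 4 == 0);  /* same restriction as the assembly version */
    for (size_t i = 0; i < count; i += 4) {
        int32_t packed;
        memcpy(&packed, src + i, sizeof packed);     /* load 4 bytes */
        __m128i v = _mm_cvtsi32_si128(packed);       /* MOVD */
        v = _mm_unpacklo_epi8(v, zero);              /* bytes -> 16-bit words */
        v = _mm_unpacklo_epi16(v, zero);             /* words -> 32-bit ints */
        _mm_storeu_ps(dst + i, _mm_cvtepi32_ps(v));  /* ints -> 4 floats, store */
    }
}
```

Unrolling this loop, or loading 16 bytes at a time and unpacking both halves, would be the natural next step if the conversion is still a bottleneck.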


Reinterpret casting from __m256i to __m256 [duplicate]

Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?
None of the answers appear to actually answer the question, why does it return int.
The reason is, the extractps instruction actually copies a component of the vector to a general register. It does seem pretty silly for it to return an int but that's what's actually happening - the raw floating point value ends up in a general register (which holds integers).
If your compiler is configured to generate SSE for all floating point operations, then the closest thing to "extracting" a value to a register would be to shuffle the value into the low component of the vector, then cast it to a scalar float. This should cause that component of the vector to remain in an SSE register:
/* returns the third component of the vector (element 2) */
float foo(__m128 b)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
}
The _mm_cvtss_f32 intrinsic is free: it does not generate instructions, it only makes the compiler reinterpret the xmm register as a float so it can be returned as such.
The _mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro generates an immediate operand for the resulting shufps instruction.
The 2 in the example gets the float from bits 95:64 of the 127:0 register (the 3rd 32-bit component from the beginning, in memory order) and places it in the 31:0 component of the register (the beginning, in memory order).
The resulting generated code will most likely return the value naturally in a register, like any other floating point value return, with no inefficient writing out to memory and reading it back.
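To make the element numbering concrete, here is a small sketch that wraps the same shuffle-then-_mm_cvtss_f32 pattern for each of the four positions. The helper name get_component and the switch on a runtime index are assumptions for illustration; the shuffle immediate itself must be a compile-time constant, which is why each case spells it out.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_shuffle_ps, _mm_cvtss_f32 */

/* Hypothetical helper: return element i of v, counted in memory order. */
float get_component(__m128 v, int i)
{
    assert(i >= 0 && i < 4);
    switch (i) {
    case 0:  return _mm_cvtss_f32(v);  /* already the low element */
    case 1:  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 1)));
    case 2:  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 2)));
    default: return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 3)));
    }
}
```

With v = _mm_setr_ps(10, 11, 12, 13), index 2 yields 12, the third float in memory order.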
If you're generating code that uses the x87 FPU for floating point (for normal C code that isn't SSE optimized), this would probably result in inefficient code being generated - the compiler would probably store out the component of the SSE vector then use fld to read it back into the x87 register stack. In general 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler is vectorizing).
I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use a const __m128 & and this kind of code would be in a header, so the compiler can inline.
Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This is not the only case where the intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the _mm_extract_ps result to float usually results in horrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).
The name makes sense if you think of _mm_extract_ps as extracting an IEEE 754 binary32 float bit pattern out of the FP domain of the CPU into the integer domain (as a C scalar int), instead of manipulating FP bit patterns with integer vector ops. According to my testing with gcc, clang, and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles into good asm across all compilers. Anything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Notice that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register-destination form is more obviously useful. EXTRACTPS is not the reverse of INSERTPS xmm1, xmm2/m32, imm8.
Perhaps this similarity with PEXTRD makes the internal implementation simpler without hurting the extract-to-memory use-case (for asm, not intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions that have an XMM source and a memory-or-xmm destination, like MOVSS xmm2/m32, xmm, so this kind of instruction would not be new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In assembly, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper elements of the XMM don't even have to be zeroed for instructions like ADDSS to work without raising any extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the high elements of XMM0 on return. (More info).
As @doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem in SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang). That linked question also further discusses the failure of intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS, and PMOVZX that have a memory operand narrower than 128b (thus not requiring alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
Without AVX 3-operand SHUFPS, x86 doesn't provide a fully efficient and general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the previous value of the destination even when all the original values are replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, and have lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge. Also Silvermont if you include low-power CPUs). The code-size is slightly worse, though.
Using _mm_extract_ps and then type-punning the result back to float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then load back into an XMM register) on either gcc or icc. Clang seems to be immune to braindead behaviour and does its usual shuffle-compiling with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1, and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type-punning violates strict aliasing in C99 and C++, even with GNU extensions.
#include <immintrin.h>

// gcc:bad clang:good icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
    // violates strict aliasing
    *(int*)p = _mm_extract_ps(v, 2);
}
gcc: # others extractps with a memory dest
    extractps eax, xmm0, 2
    mov DWORD PTR [rdi], eax

// gcc:good clang:good icc:bad
void extr_pun(__m128 v, float *p) {
    // union type punning is safe in C99 (and GNU C and GNU C++)
    union floatpun { int i; float f; } fp;
    fp.i = _mm_extract_ps(v, 2);
    *p = fp.f; // compiles to an extractps straight to memory
}
icc:
    vextractps eax, xmm0, 2
    mov DWORD PTR [rdi], eax

// gcc:good clang:good icc:horrible
void extr_gnu(__m128 v, float *p) {
    // gcc uses extractps with a memory dest, icc does extr_store
    *p = v[2];
}
gcc/clang:
    extractps DWORD PTR [rdi], xmm0, 2
icc:
    vmovups XMMWORD PTR [-24+rsp], xmm0
    mov eax, DWORD PTR [-16+rsp] # reload from red-zone tmp buffer
    mov DWORD PTR [rdi], eax

// gcc:good clang:good icc:poor
void extr_shuf(__m128 v, float *p) {
    __m128 e2 = _mm_shuffle_ps(v,v, 2);
    *p = _mm_cvtss_f32(e2); // gcc uses extractps
}
icc: (others: extractps right to memory)
    vshufps xmm1, xmm0, xmm0, 2
    vmovss DWORD PTR [rdi], xmm1
When you want the final result in an xmm register, it's up to the compiler to optimize away your extractps and do something completely different. Gcc and clang both succeed, but ICC doesn't.
// gcc:good clang:good icc:bad
float ret_pun(__m128 v) {
    union floatpun { int i; float f; } fp;
    fp.i = _mm_extract_ps(v, 2);
    return fp.f;
}
gcc / clang: # a single shuffle, one of
    unpckhps xmm0, xmm0
    shufpd xmm0, xmm0, 1
icc:
    vextractps DWORD PTR [-8+rsp], xmm0, 2
    vmovss xmm0, DWORD PTR [-8+rsp]
Note that icc did poorly for extr_pun, too, so it doesn't like union-based type-punning for this.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2), and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + separate store is still only 2 uops on Intel CPUs, just larger code size and needs a tmp register. Without AVX, though, it would cost a MOVAPS to not destroy the original vector :/
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: Usually just a shuffle uop + a MOVD uop to move data from the vector domain to gp-integer. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops, and runs in 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (The +1 is for the bypass delay between vec-int and general-purpose integer regs, I think).
Nehalem cares a lot about vector FP vs. integer domains; SnB-family CPUs have smaller (sometimes zero) bypass delay latencies between domains.
The memory-dest versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible. Micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would build execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 which is shorter and faster than pextrd eax, xmm0, 0.
Interestingly, they perform differently on Nehalem (which cares a lot about vector FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from a bypass delay between FP and integer domain). PEXTRD is 2 uops, and runs in 2+1 cycle latency.
From the MSDN docs, I believe you can cast the result to a float.
Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]).
Update: I strongly recommend the answers from @doug65536 and @PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers.
Try _mm_storeu_ps, or any of the variations of SSE store operations.
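For completeness, a minimal sketch of the store-based approach: write the whole vector to a small array and index it. The helper name get_component_via_store is hypothetical. Unlike the shuffle-based answers this goes through memory, so it is mainly attractive when the index is only known at run time or when several elements are needed anyway.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_storeu_ps */

/* Hypothetical helper: read element i of v by storing the vector to memory. */
float get_component_via_store(__m128 v, int i)
{
    float tmp[4];
    assert(i >= 0 && i < 4);
    _mm_storeu_ps(tmp, v);  /* unaligned store, so tmp needs no special alignment */
    return tmp[i];
}
```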

Arbitrary position 2-input shuffling using SSE

I have two 4 component vectors which I load into two __m128 variables.
Then I need to shuffle those so that the result looks like this:
__m128 mmMin = _mm_load_ps(&glm::vec4(-1.0f,-2.0f,-3.0f,-4.0f)[0]);
__m128 mmMax = _mm_load_ps(&glm::vec4(1.0f,2.0f,3.0f,4.0f)[0]);
I want the result of the shuffle to look like this:
// {mmMin.x,mmMax.x,mmMin.x,mmMax.x}
But I see it is not possible to do with _mm_shuffle_ps.
From SSE docs I see _mm_shuffle_ps masks always insert into the result 2 values from the lower 2 components of __m128 first, then 2 from the high 2 components.
SPU intrinsics have si_shufb method which allows defining qword based mask and shuffle whatever position I wish. Is there a similar method in SSE?
I am using SSE2, but will be happy also to see how it can be done with other versions, including AVX.
With only SSE2 I think you need 2 shuffles: unpcklps to interleave and then unpcklpd same,same or shufps same,same to broadcast the low 64 bits.
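A minimal sketch of that two-shuffle version with intrinsics, assuming the inputs are plain arrays of four floats (the function name interleave_x_sse is made up for this example). It uses unpcklps followed by the shufps same,same variant:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: unpcklps and shufps */

/* Hypothetical helper: returns {mn[0], mx[0], mn[0], mx[0]}. */
__m128 interleave_x_sse(const float *mn, const float *mx)
{
    assert(mn && mx);
    __m128 a  = _mm_loadu_ps(mn);
    __m128 b  = _mm_loadu_ps(mx);
    __m128 lo = _mm_unpacklo_ps(a, b);                       /* {a0, b0, a1, b1} */
    return _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(1, 0, 1, 0));  /* {a0, b0, a0, b0} */
}
```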
With AVX512F, vpermt2ps can do this in one shuffle (using a control vector); I don't think there are any 2-source shuffles in AVX2 or earlier with fine enough granularity and flexible enough source locations. And there are no fixed shuffles that duplicate an element along with interleaving.
2-source shuffles are rare until AVX512: mostly fixed shuffles like unpckl/h* and palignr. It's mostly just [v]shufps / [v]shufpd until then. Variable-control shuffles are also rare: until AVX, the only one is pshufb. AVX1/2 added some variable-control dword-element shuffles, but only for 1 source. There are no variable-control 2-source shuffles until AVX512.
An immediate shuffle would need four 3-bit indices (12 bits) to select each result element from the concatenation of two 4-element vectors, but x86 SIMD instructions have at most an 8-bit immediate operand, which only fits four 2-bit indices. Unfortunately there is also no broadcast-immediate like ARM has that could efficiently create a vector of 1.0f or whatever.
Since you only need 1 element from each vector, instead of loading a whole vector you can use an AVX broadcast-load and then vblendps
Broadcast-loads are the same cost as normal loads on Intel CPUs (don't cost you a uop for the shuffle port, purely handled in the load port). They can't fold into memory operands for ALU instructions until AVX512F, but they do avoid shuffle-port bottlenecks. AMD CPUs may still need an ALU uop but they have more shuffle ALUs so shuffle throughput isn't a bottleneck nearly as much. (https://agner.org/optimize/)
Ryzen vbroadcastss xmm, [mem] is 2 separate uops for the front-end unfortunately, but it still has 2-per-clock throughput.
Blend-immediate on dword and larger elements is very efficient and can run on any port on Haswell and later, or 2 ports on SnB/IvB and Ryzen. But it is still single uop / 1c latency even on Nehalem.
#include <immintrin.h>
__m128 broadcast_interleave_scalars_avx(const float *min, const float *max) {
    __m128 minx = _mm_broadcast_ss(min);
    __m128 maxx = _mm_broadcast_ss(max);
    return _mm_blend_ps(minx, maxx, 0b1010);
}
On Godbolt, clang's asm comments confirm that I got the blend constant right:
vbroadcastss xmm0, dword ptr [rdi]
vbroadcastss xmm1, dword ptr [rsi]
vblendps xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
If your data was already in registers, not freshly loaded, you might want to just use 2 shuffles.
With SSE4.1 you might be able to do 2x movddup loads to broadcast 64 bits from memory (including the 32 bits you care about) then blendps. The first load will load 32 bits past the float you care about, the 2nd will load 32 bits before the float you care about.
To get a C++ compiler to emit this for you you'll have to pointer-cast to double* for the __m128d _mm_loaddup_pd (double const* mem_addr) loads, and then use _mm_castpd_ps to get __m128 from __m128d.
https://www.felixcloutier.com/x86/movsldup could also be useful to set up for unpcklps.

Sorting 64-bit structs using AVX?

I have a 64-bit struct which represents several pieces of data, one of which is a floating point value:
struct MyStruct {
    uint16_t a;
    uint16_t b;
    float f;
};
and I have four of these structs in, let's say, a std::array<MyStruct, 4>.
Is it possible to use AVX to sort the array by the float member MyStruct::f?
Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication.
I have 4 separate ideas:
1. Normal sorting, but moving the struct as a 64bit unit
2. Vectorized insertion-sort as a building block for qsort
3. Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps/maxps. The extra overhead might kill the speedup, though.
4. Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps/maxps comparator and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. Also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).
This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first, then you could movq a whole struct into an xmm reg, and the float value would be in the low32 for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around:
icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64b load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
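A minimal scalar sketch of the "move the struct as one 64-bit unit" idea, assuming the key-first layout suggested above. The typedef and the function name sort_by_f are illustrative only, and whether the memcpy calls actually become single 64-bit loads and stores depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: key first, so the float sits in the low 32 bits. */
typedef struct {
    float    f;
    uint16_t a;
    uint16_t b;
} MyStruct;

/* Insertion sort by f that copies each element as one 64-bit value. */
void sort_by_f(MyStruct *v, size_t n)
{
    assert(sizeof(MyStruct) == sizeof(uint64_t));
    for (size_t i = 1; i < n; ++i) {
        uint64_t cur;
        float key = v[i].f;
        size_t j = i;
        memcpy(&cur, &v[i], sizeof cur);          /* pick up the whole struct */
        while (j > 0 && v[j - 1].f > key) {       /* scalar FP compare on the key */
            uint64_t tmp;
            memcpy(&tmp, &v[j - 1], sizeof tmp);  /* shift one struct up */
            memcpy(&v[j], &tmp, sizeof tmp);
            --j;
        }
        memcpy(&v[j], &cur, sizeof cur);          /* drop it into place */
    }
}
```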
Speeding up insertion-sort
Big sorts usually divide-and-conquer with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any. Do you smell vectors? I smell vectors.
# RSI points to struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######
.new_elem:
    vbroadcastsd ymm0, [rdi] # broadcast the whole struct
    mov rdx, rdi
.search_loop:
    sub rdx, 32
    vmovups ymm1, [rdx] # load some sorted data
    vcmplt_oqps ymm2, ymm0, ymm1 # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
    vmovups [rdx+8], ymm1 # shuffle it over to make space, usual insertion-sort style
    cmp rdx, rsi
    jbe .endsearch # below-or-equal (addresses are unsigned)
    vmovmskps eax, ymm2
    test al, 0b01010101 # test only the compare results for the float elements
    jz .search_loop # [rdi] wasn't less than any of the 4 elements
.endsearch:
    # TODO: scalar loop to find out where the new element goes.
    # All we know is that it's less than one of the elements in ymm1, but not which
    add rdi, 8
    vmovsd [rdx], xmm0
    cmp rdi, r8 # pointer to the end of the buf
    jle .new_elem
# worse alternative to movmskps / test:
# vtestps ymm2, ymm7 # where ymm7 is loaded with 1s in the odd (float) elements, and 0s in the even (payload) elements.
# vtestps is like PTEST, but only tests the high bit. If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is certainly full of bugs, and I should have just written it in C with intrinsics.
This is an insertion sort with probably more overhead than most, that might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (which don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop so we haven't stored ymm1 until the next iteration (or after breaking out) would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups, and the low one in another reg for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of more work after breaking out of the search loop, but it's still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone does develop this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (insertion-sort on a mostly-sorted array is fast.)
Which leads us to:
Sorting Networks
There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed to not require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
Now, the trick is to sort pairs of packed 64b elements based on the comparison of only half of each element (i.e. keeping the payload with the sort key). We could potentially sort other things this way, by sorting an array of (key, payload) pairs, where the payload can be an index or 32bit pointer (mmap(MAP_32BIT), or x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps elements between registers, or not.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; } inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i]) (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.
Intel Haswell: Latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles. (p5 bottleneck).
Intel Sandybridge/Ivybridge: Latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles. (p0/p5 bottleneck).
AMD Bulldozer/Piledriver: (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c. (bottleneck on vector P1)
AMD Steamroller: (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. or maybe 1 higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
VBLENDVPD is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit). The most fair comparison is probably to sorting 64bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; } inputs in ymm0, ymm1
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2) # extend the mask to the payload part. There's no mask function, I just don't want to work out the result in my head.
vpand ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor ymm2, ymm10, ymm11 # ymm2 = min_packed_mystruct(ymm0, ymm1)
vpandn ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor ymm3, ymm10, ymm11 # ymm3 = max_packed_mystruct(ymm0, ymm1)
# result: !(ymm2[i] > ymm3[i])
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7. (And can overlap with operations on ymm2). The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for a vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.
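For readers who prefer intrinsics, here is a sketch of the same compare / extend-mask / AND-ANDN-OR comparator. To keep it short it is scaled down to 128-bit SSE2 registers, so it handles two structs per register instead of four; that scaling, the KeyFirst typedef, and the function name compare_exchange2 are all assumptions of this example, not part of the answer's AVX2 code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Assumed layout: 8 bytes, key in the low 32 bits of each 64-bit slot. */
typedef struct { float f; uint16_t a, b; } KeyFirst;

/* After the call, x[i] holds the struct with the smaller key and y[i] the other. */
void compare_exchange2(KeyFirst x[2], KeyFirst y[2])
{
    assert(sizeof(KeyFirst) == 8);
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    /* FP compare on all four dwords; only the key lanes (0 and 2) are meaningful. */
    __m128 lt = _mm_cmplt_ps(_mm_castsi128_ps(vx), _mm_castsi128_ps(vy));
    /* Extend each key-lane result over its payload lane: {m0, m0, m2, m2}. */
    __m128i m = _mm_castps_si128(_mm_shuffle_ps(lt, lt, _MM_SHUFFLE(2, 2, 0, 0)));
    /* Blend "manually": lo = m ? vx : vy, hi = m ? vy : vx. */
    __m128i lo = _mm_or_si128(_mm_and_si128(m, vx), _mm_andnot_si128(m, vy));
    __m128i hi = _mm_or_si128(_mm_andnot_si128(m, vx), _mm_and_si128(m, vy));
    _mm_storeu_si128((__m128i *)x, lo);
    _mm_storeu_si128((__m128i *)y, hi);
}
```

The payload dwords also go through the FP compare, but their results are overwritten by the shuffle, which is the same trick the assembly version relies on.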
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys, but instead generate a mask from the compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys either in your data structure, or every time you load a struct.
See the appendix of his paper (Fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements CHANGED. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
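A rough sketch of that idea as described here, not a transcription of the paper's code: keys and payloads are kept in separate arrays of four (an assumed layout), the keys go through min/max, and the "which lanes changed" mask is ANDed into an xor-swap of the payloads. The function name compare_exchange_kv is hypothetical, and the keys are assumed not to be NaN.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Lane-wise compare-exchange: ka/pa end up with the smaller key and its payload. */
void compare_exchange_kv(float ka[4], float kb[4], uint32_t pa[4], uint32_t pb[4])
{
    assert(ka && kb && pa && pb);
    __m128 a  = _mm_loadu_ps(ka);
    __m128 b  = _mm_loadu_ps(kb);
    __m128 mn = _mm_min_ps(a, b);
    __m128 mx = _mm_max_ps(a, b);
    /* All-ones in lanes where the minimum differs from the original a,
       i.e. the lanes whose keys were swapped. */
    __m128i swapped = _mm_castps_si128(_mm_cmpneq_ps(mn, a));
    __m128i x = _mm_loadu_si128((const __m128i *)pa);
    __m128i y = _mm_loadu_si128((const __m128i *)pb);
    /* xor-swap with the mask applied in the middle: unswapped lanes see t = 0. */
    __m128i t = _mm_and_si128(_mm_xor_si128(x, y), swapped);
    x = _mm_xor_si128(x, t);
    y = _mm_xor_si128(y, t);
    _mm_storeu_ps(ka, mn);
    _mm_storeu_ps(kb, mx);
    _mm_storeu_si128((__m128i *)pa, x);
    _mm_storeu_si128((__m128i *)pb, y);
}
```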
Unfortunately, original AVX has very limited shuffling across its 128-bit halves (i.e. lanes), so it is hard to sort contents of a full 256-bit register. However, AVX2 has shuffling operations without such limitations, so we can perform a sort of 4 structs in vectorized way.
I'll use the idea of this solution. In order to sort an array we have to do enough element comparisons to surely determine the permutation we need to apply. Given that no element is NaN, it is enough to check for each pair of different elements a and b whether a < b and whether a > b. Having this information, we can fully compare any two elements, which must be enough to determine final sorting order. This is 6 pairs of 32-bit elements and two comparison modes, so we can end up doing two shuffles and two comparisons in AVX. If you are absolutely sure that all the elements are distinct, then you can avoid a > b comparisons and reduce size of LUT.
For repacking of elements within a register we can use _mm256_permutevar8x32_ps. One instruction performs an arbitrary shuffle at 32-bit granularity. Note that in the code I assume that the sorting key f is the first member of your struct (just as @PeterCordes proposed), but you can trivially use this solution for your current struct if you change the shuffling mask accordingly.
After we perform the comparisons, we have two AVX registers containing boolean results as 32-bit masks. The first six masks in each register are important, the last two are not. Then we want to convert these masks to a small integer in a general-purpose register to be used as an index into a lookup table. In the general case we might have to create a perfect hash for it, but it is not necessary here. We can use _mm256_movemask_ps to get an 8-bit integer mask in a general-purpose register from an AVX register. Since the last two masks per register are not important, we can ensure that they are always zero. Then the resulting index is in the range [0..2^12).
Finally, we load a shuffling mask from precomputed LUT with 4096 elements and pass it to _mm256_permutevar8x32_ps. As a result we obtain an AVX register with 4 properly sorted structs of your type. Precomputing the LUT is your home assignment =)
Here is the final code:
__m256i lut[4096]; // 128 KB LUT, must be precomputed
__m256 Sort4(__m256 val) {
    __m256 aaabbcaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(0, 0, 0, 2, 2, 4, 0, 0));
    __m256 bcdcddaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(2, 4, 6, 4, 6, 6, 0, 0));
    __m256 cmpLt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_LT_OQ);
    __m256 cmpGt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_GT_OQ);
    int idxLt = _mm256_movemask_ps(cmpLt);
    int idxGt = _mm256_movemask_ps(cmpGt);
    __m256i shuf = lut[idxGt * 64 + idxLt];
    __m256 res = _mm256_permutevar8x32_ps(val, shuf);
    return res;
}
Here you can see generated assembly. There are 14 instructions in total, 2 of them are for loading constant shuffling masks, and one of them is due to useless 32-bit->64-bit conversion of movemask results. So in a tight loop it would be 11-12 instructions. IACA says that four calls in a loop have 16.40 cycles throughput on Haswell, so it seems to achieve throughput 4.1 cycles per call.
Of course a 128 KB lookup table is too much unless you are going to process even more input data in one batch. It may be possible to reduce the LUT size by adding perfect hashing (sacrificing speed, of course). It is hard to say how many orderings are possible on four elements, but clearly fewer than 4! * 2^3 = 192. I think a 256-element LUT is possible, maybe even a 128-element LUT. With perfect hashing it may be faster to combine the two AVX registers into one with shift and xor, then do _mm256_movemask_epi8 once (instead of doing two _mm256_movemask_ps and combining them afterwards).
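One possible way to fill the table left as an exercise is sketched below in portable C++ without intrinsics. It assumes the pair order and the index formula `idxGt * 64 + idxLt` used by Sort4 above; the names `lut_rows`, `kPairs` and `build_lut` are made up for this sketch. Each row holds the eight 32-bit lane indices that would be passed to _mm256_permutevar8x32_ps, and the function also reports how many table entries can actually be reached when no key is NaN.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical layout: lut_rows[idxGt * 64 + idxLt] holds eight lane indices.
static std::array<std::array<std::int32_t, 8>, 4096> lut_rows;

// Pair order must match the two shuffles in Sort4:
// (a,b) (a,c) (a,d) (b,c) (b,d) (c,d), where a..d are structs 0..3.
static const int kPairs[6][2] = {{0, 1}, {0, 2}, {0, 3}, {1, 2}, {1, 3}, {2, 3}};

// Fills the reachable rows and returns how many distinct indices occur.
int build_lut() {
    std::array<bool, 4096> seen{};
    int reachable = 0;
    // Try every assignment of ranks 0..3 to the four keys; this covers all
    // orderings, including ties.
    for (int code = 0; code < 256; ++code) {
        int rank[4];
        for (int i = 0; i < 4; ++i) rank[i] = (code >> (2 * i)) & 3;
        int idxLt = 0, idxGt = 0;
        for (int p = 0; p < 6; ++p) {
            const int x = rank[kPairs[p][0]];
            const int y = rank[kPairs[p][1]];
            if (x < y) idxLt |= 1 << p;   // bit p of movemask(cmpLt)
            if (x > y) idxGt |= 1 << p;   // bit p of movemask(cmpGt)
        }
        const int idx = idxGt * 64 + idxLt;
        if (seen[idx]) continue;
        seen[idx] = true;
        ++reachable;
        // Sorted order of the structs; ties keep their original order.
        int order[4] = {0, 1, 2, 3};
        std::stable_sort(order, order + 4,
                         [&](int l, int r) { return rank[l] < rank[r]; });
        for (int k = 0; k < 4; ++k) {
            lut_rows[idx][2 * k]     = 2 * order[k];      // key lane
            lut_rows[idx][2 * k + 1] = 2 * order[k] + 1;  // payload lane
        }
    }
    return reachable;
}
```

With this layout a row could be loaded with an unaligned 256-bit load and used as the shuffle mask. The return value counts the orderings with ties of four elements, which gives a lower bound on how small a perfectly hashed table could be.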

x86 assembly instructions optimisation

I'm trying to optimize a block of instructions in a loop, called thousands of times, which is the bottleneck in my algorithm.
This block of code computes the product of N 3x3 matrices (iA array) with N 3-vectors (iV array) and stores the N results in the oV array. (N is not fixed and is usually between 3000 and 15000.)
Each row of the matrices and each vector is 128-bit aligned (4 floats) to exploit SSE optimisation (the 4th floating value is ignored).
C++ code :
__m128* ip = (__m128*)iV;
__m128* op = (__m128*)oV;
__m128* A = (__m128*)iA;
__m128 res1, res2, res3;
int i;
for (i=0; i<N; i++)
{
    res1 = _mm_dp_ps(*A++, *ip, 0x71);
    res2 = _mm_dp_ps(*A++, *ip, 0x72);
    res3 = _mm_dp_ps(*A++, *ip++, 0x74);
    *op++ = _mm_or_ps(res1, _mm_or_ps(res2, res3));
}
The compiler generates these instructions :
000007FEE7DD4FE0 movaps xmm2,xmmword ptr [rsi] //move "ip" in register
000007FEE7DD4FE3 movaps xmm1,xmmword ptr [rdi+10h] //move second line of A in register
000007FEE7DD4FE7 movaps xmm0,xmmword ptr [rdi+20h] //move third line of A in register
000007FEE7DD4FEB inc r11d //i++
000007FEE7DD4FEE add rbp,10h //op++
000007FEE7DD4FF2 add rsi,10h //ip++
000007FEE7DD4FF6 dpps xmm0,xmm2,74h //dot product of 3rd line of A against ip
000007FEE7DD4FFC dpps xmm1,xmm2,72h //dot product of 2nd line of A against ip
000007FEE7DD5002 orps xmm0,xmm1 //"merge" of the result of the two dot products
000007FEE7DD5005 movaps xmm3,xmmword ptr [rdi] //move first line of A in register
000007FEE7DD5008 add rdi,30h //A+=3
000007FEE7DD500C dpps xmm3,xmm2,71h //dot product of 1st line of A against ip
000007FEE7DD5012 orps xmm0,xmm3 //"merge" of the result
000007FEE7DD5015 movaps xmmword ptr [rbp-10h],xmm0 //move result in memory (op)
000007FEE7DD5019 cmp r11d,dword ptr [rbx+28h] //compare i
000007FEE7DD501D jl MyFunction+370h (7FEE7DD4FE0h) //loop
I'm not very familiar with low-level optimisations, so the question is : Do you see some possible optimisations if I write assembly code myself ?
For example, will it run faster if I change:
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
to:
movaps xmmword ptr [rbp],xmm0
add rbp,10h
I have also read that ADD instruction is faster than INC...
Calculating an indirect address with an offset, such as rbp-10h, is very cheap, because there is special hardware for this sort of calculation in the "effective address calculation" unit (usually called the address generation unit, or AGU).
There is, however, a dependency between the add rbp,10h and [rbp-10h], which could possibly cause a problem - but I doubt it in this particular case. In your case, there is a long distance between updating rbp and using rbp-10h, so it's not an issue. The compiler is probably putting the add that far up because it's "free" at that point, since the processor will be waiting for the data to come in from the outside into the xmm registers that were loaded earlier. In other words, any work we can stick between the loads into xmm0, xmm1 and xmm2 at the beginning of the loop and the dpps instructions using xmm0, xmm1 and xmm2 will be beneficial, because the processor will be waiting for that data to "arrive" before it can compute the dpps result.
I've done lots of x86 assembly optimizations, and I can tell you it was a great learning experience. It taught me a lot about how compilers work as well, and the biggest thing I learned was that compilers are in general pretty good at what they do. I know that's a flippant comment, but it is true...
I also learned that optimizations you make can have a positive result on one processor family, and a negative result on another processor family. Things like pipelining, branch prediction, and processor cache play a huge role... so unless you're targeting a very specific hardware configuration, be careful about assumptions regarding improvements you make.
To your specific question about reordering the add to remove the rbp-10h offset... it looks like an obvious improvement, and you'd have to verify by reading the instruction manual, but I'd guess the -10h memory offset comes for free in that instruction. And moving the add may throw off a pipelined instruction and actually cause a clock cycle loss. You'd have to experiment.
There are a few things you could do to the above code to improve it. Generally, using a value after it has been altered incurs a processor stall as it waits for the result. So these lines would incur a penalty:-
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
but in the code snippet above those two lines are quite far apart, so that isn't really an issue. As others have already said, the rbp-10h is 'free' in that the address calculation hardware handles it.
You could move the movaps xmm3,xmmword ptr [rdi] up a line and maybe rearrange a couple of other lines.
Would it be worth it?
You'd be lucky to see any real performance gain from any of this because your algorithm is
<blink> memory bandwidth limited </blink>*
which means that the time taken to read the data from RAM into the CPU is greater than the time it takes the CPU to do its processing. At worst, reading a memory address can involve a page fault and a disk read. The prefetch instructions won't help either; it's called 'Streaming SIMD Extension' because it's optimised to stream data into the CPU (the memory interface can handle four separate streams IIRC).
If you were doing a lot of computation on a small set of data (an FFT perhaps) then you could gain a lot from hand-crafting the assembler. But your algorithm is quite simple so the CPU is idling a lot of the time waiting for the data to arrive. If you know the speed of your RAM you could work out the maximum throughput for the algorithm and use that to compare against what it's currently achieving (you'll never reach the maximum theoretical throughput though).
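As a back-of-the-envelope version of that estimate: with the layout from the question, each product reads three 16-byte matrix rows and one 16-byte vector and writes one 16-byte result, i.e. 80 bytes. The helper below is only a sketch; the function name is made up, and the bandwidth you pass in is an assumption you would have to measure or look up for your own machine.

```cpp
#include <cstddef>

// Bytes moved per matrix*vector product in the question's layout:
// 3 matrix rows + 1 input vector + 1 output vector, 16 bytes each.
constexpr std::size_t kBytesPerItem = (3 + 1 + 1) * 16;

// Upper bound on products per second for a given sustained memory
// bandwidth in bytes per second (ignores caches and any other traffic).
double max_items_per_second(double bandwidth_bytes_per_s) {
    return bandwidth_bytes_per_s / static_cast<double>(kBytesPerItem);
}
```

For example, with an assumed 10 GB/s of sustained bandwidth the bound is 10e9 / 80 = 125 million products per second; comparing the measured rate against that number tells you how much headroom is left.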
There are things you can do to minimise the memory stalling, and it's a higher level change rather than fiddling with individual instructions (often, optimising the algorithms gets better results). The simplest is to double buffer the input data. Divide the register set into two groups (possible to do here as you're only using four of the SIMD registers):-
load set 1
mainloop:
load set 2
do processing on set 1
save set 1 result
load set 1
do processing on set 2
save set 2 result
goto mainloop
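The scheme above can be sketched in C++ as a software-pipelined loop. This is a portable scalar stand-in, not the SSE code from the question: `Vec4`, `mul3x3` and `transform_all` are names invented for the sketch, and it rotates one "current" and one "next" set instead of unrolling by two as the pseudocode does. A compiler is free to reschedule the loads anyway, so treat it as an illustration of the ordering rather than a guaranteed speed-up.

```cpp
#include <cstddef>

// Four floats per row/vector; w is padding, as in the question.
struct Vec4 { float x, y, z, w; };

// Scalar stand-in for the three dpps + two orps of the original loop.
static Vec4 mul3x3(const Vec4& r0, const Vec4& r1, const Vec4& r2, const Vec4& v) {
    Vec4 out;
    out.x = r0.x * v.x + r0.y * v.y + r0.z * v.z;
    out.y = r1.x * v.x + r1.y * v.y + r1.z * v.z;
    out.z = r2.x * v.x + r2.y * v.y + r2.z * v.z;
    out.w = 0.0f;
    return out;
}

// A holds 3 rows per matrix, in holds one vector per matrix.
void transform_all(const Vec4* A, const Vec4* in, Vec4* out, std::size_t n) {
    if (n == 0) return;
    // Prologue: load the first set.
    Vec4 r0 = A[0], r1 = A[1], r2 = A[2], v = in[0];
    for (std::size_t i = 1; i < n; ++i) {
        // Issue the loads for the next set before computing on the current one.
        const Vec4 n0 = A[3 * i], n1 = A[3 * i + 1], n2 = A[3 * i + 2];
        const Vec4 nv = in[i];
        out[i - 1] = mul3x3(r0, r1, r2, v);
        r0 = n0; r1 = n1; r2 = n2; v = nv;
    }
    // Epilogue: process the last set that was loaded.
    out[n - 1] = mul3x3(r0, r1, r2, v);
}
```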
Hopefully that's given you some ideas. Even if it doesn't go much faster, it's still an interesting exercise and you can learn a lot from it.
RIP blink.

x86 MUL Instruction from VS 2008/2010

Will modern (2008/2010) incantations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where they appear in compiled code, even when using unsigned types.
If VS does not compile using MUL, is there a rationale why?
imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src. i.e. a 32x32b => 64b full multiply (or 64x64b => 128b).
186 added an imul dest(reg), src(reg/mem), immediate form, and 386 added an imul r32, r/m32 form, both of which only compute the lower half of the result. (According to NASM's appendix B, see also the x86 tag wiki)
When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)
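A small check of this claim, as a sketch: the two helpers below (names invented here) compute the full 64-bit product of the same 32-bit patterns, once treating them as unsigned and once as signed, and return the two halves separately. The conversion from uint32_t to int32_t is assumed to wrap modulo 2^32, which is guaranteed from C++20 and is what mainstream compilers did before that.

```cpp
#include <cstdint>

struct Halves { std::uint32_t lo, hi; };

// Full product with the operands interpreted as unsigned (what mul computes).
Halves full_mul_unsigned(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t p = static_cast<std::uint64_t>(a) * b;
    return { static_cast<std::uint32_t>(p), static_cast<std::uint32_t>(p >> 32) };
}

// Full product with the same bit patterns interpreted as signed
// (what one-operand imul computes).
Halves full_mul_signed(std::uint32_t a, std::uint32_t b) {
    const std::int64_t p = static_cast<std::int64_t>(static_cast<std::int32_t>(a)) *
                           static_cast<std::int32_t>(b);
    const std::uint64_t bits = static_cast<std::uint64_t>(p);
    return { static_cast<std::uint32_t>(bits), static_cast<std::uint32_t>(bits >> 32) };
}
```

For a = 0xFFFFFFFF and b = 2, both versions give a low half of 0xFFFFFFFE, while the high half is 1 for the unsigned product and 0xFFFFFFFF for the signed one.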
In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.
As a second step, since C compilers use the multiple-operand form of imul a lot, Intel and AMD invest effort into making it as fast as possible. It only writes one output register, not e/rdx:e/rax, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes imul even more attractive.
The one-operand form of mul/imul is useful when implementing big number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, long long has only 64 bits, not 128, and the compiler will simply use imul.
There are three different types of multiply instructions on x86. The first is MUL reg, which does an unsigned multiply of EAX by reg and puts the (64-bit) result into EDX:EAX. The second is IMUL reg, which does the same with a signed multiply. The third type is either IMUL reg1, reg2 (multiplies reg1 with reg2 and stores the 32-bit result into reg1) or IMUL reg1, reg2, imm (multiplies reg2 by imm and stores the 32-bit result into reg1).
Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of MUL/IMUL if you actually use the full 64-bit results, e.g. here:
unsigned long long prod(unsigned int a, unsigned int b)
{
    return (unsigned long long) a * b;
}
The 2-operand (and 3-operand) versions of IMUL are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using Microcode if necessary. Also, MUL/IMUL writes two registers, which again is usually resolved by breaking it into multiple instructions internally - it's much easier for the instruction reordering hardware to keep track of two dependent instructions that each write one register (most x86 instructions look like that internally) than it is to keep track of one instruction that writes two.
According to http://gmplib.org/~tege/x86-timing.pdf, the IMUL instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (that's assuming that IMUL and MUL always produce the same output).
I don't have Visual Studio handy, so I tried to get something else with GCC. I also always get some variation of IMUL.
unsigned int func(unsigned int a, unsigned int b)
{
    return a * b;
}
Assembles to this (with -O2):
pushq %rbp
movq %rsp, %rbp
movl %esi, %eax
imull %edi, %eax
movzbl %al, %eax
My intuition tells me that the compiler chose IMUL arbitrarily (or whichever was faster of the two), since the low 32 bits will be the same whether it uses unsigned MUL or signed IMUL. A full 32x32-bit multiplication yields a 64-bit result, which the one-operand forms spread over two registers, EDX:EAX. The upper half in EDX is irrelevant here since we only care about the 32-bit result in EAX. The two-operand imull used above computes only that low half and does not write EDX at all.
Right after I looked at this question I found MULQ in my generated code when dividing.
The full code is turning a large binary number into chunks of a billion so that it can be easily converted to a string.
C++ code:
for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000; // equivalent to Remainder %= DecimalConvert
});
Optimized Generated Assembly
00007FF7715B18E8 lea r9,[rsi-4]
00007FF7715B18EC mov r13,12E0BE826D694B2Fh
00007FF7715B18F6 nop word ptr [rax+rax]
00007FF7715B1900 shl r8,20h
00007FF7715B1904 mov eax,dword ptr [r9]
00007FF7715B1907 add r8,rax
00007FF7715B190A mov rax,r13
00007FF7715B190D mul rax,r8
00007FF7715B1910 mov rcx,r8
00007FF7715B1913 sub rcx,rdx
00007FF7715B1916 shr rcx,1
00007FF7715B1919 add rcx,rdx
00007FF7715B191C shr rcx,1Dh
00007FF7715B1920 imul rax,rcx,3B9ACA00h
00007FF7715B1927 sub r8,rax
00007FF7715B192A mov dword ptr [r9],ecx
00007FF7715B192D lea r9,[r9-4]
00007FF7715B1931 lea rax,[r9+4]
00007FF7715B1935 cmp rax,r14
00007FF7715B1938 jne NumToString+0D0h (07FF7715B1900h)
Notice the MUL instruction 5 lines down.
This generated code is extremely unintuitive, I know; in fact it looks nothing like the source code. But DIV is extremely slow (~25 cycles for a 32-bit div, and ~75 according to this chart on modern PCs) compared with MUL or IMUL (around 3 or 4 cycles), so it makes sense to try to get rid of DIV even if you have to add all sorts of extra instructions.
I don't fully understand the optimization here, but if you would like to see a rationale and a mathematical explanation of using multiplication by a compile-time constant to divide by constants, see this paper.
This is an example of the compiler making use of the performance and capability of the full, untruncated 64-by-64-bit multiply without showing the C++ coder any sign of it.
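The instruction sequence in the listing can be transcribed back into C++ to see what it computes. The sketch below reuses the constant 12E0BE826D694B2Fh and the shift amounts from the disassembly; the helper names `mulhi64` and `div_by_1e9` are invented here, and the high half of the 64x64 product is built from 32-bit pieces so that the code stays portable.

```cpp
#include <cstdint>

// High 64 bits of the 128-bit product a * b (what mul leaves in rdx).
std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    const std::uint64_t ll = a_lo * b_lo;
    const std::uint64_t lh = a_lo * b_hi;
    const std::uint64_t hl = a_hi * b_lo;
    const std::uint64_t hh = a_hi * b_hi;
    // Carry out of the low 64 bits.
    const std::uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

// n / 1000000000 without a div, following the listing step by step.
std::uint64_t div_by_1e9(std::uint64_t n) {
    const std::uint64_t hi = mulhi64(n, 0x12E0BE826D694B2FULL); // mul rax,r8
    std::uint64_t t = n - hi;                                   // sub rcx,rdx
    t >>= 1;                                                    // shr rcx,1
    t += hi;                                                    // add rcx,rdx
    return t >> 29;                                             // shr rcx,1Dh
}
```

The remainder then follows as n - div_by_1e9(n) * 1000000000, which is what the imul by 3B9ACA00h and the following sub do in the listing.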
As already explained C/C++ does not do word*word to double-word operations which is what the mul instruction is best for. But there are cases where you want word*word to double-word so you need an extension to C/C++.
GCC, Clang, and ICC provide a builtin type __int128 which you can use to indirectly get the mul instruction.
MSVC provides the _umul128 intrinsic (since at least VS 2010), which generates the mul instruction. With this intrinsic along with the _addcarry_u64 intrinsic you could build your own efficient __int128 type with MSVC.
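A minimal wrapper over the two extensions might look like the sketch below. The function name `mul_64x64` is made up; the MSVC branch assumes a 64-bit target (where _umul128 is available), and the other branch assumes a 64-bit GCC, Clang or ICC target where unsigned __int128 exists.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

// Full 64x64 -> 128-bit unsigned multiply.
// Returns the low 64 bits and stores the high 64 bits in *hi.
std::uint64_t mul_64x64(std::uint64_t a, std::uint64_t b, std::uint64_t* hi) {
#if defined(_MSC_VER) && defined(_M_X64)
    return _umul128(a, b, hi);
#else
    // GCC/Clang/ICC extension; the compiler typically emits a single mul.
    const unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
    *hi = static_cast<std::uint64_t>(p >> 64);
    return static_cast<std::uint64_t>(p);
#endif
}
```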