I have a question regarding the AVX _mm256_blend_pd function.
I want to optimize my code, which heavily uses the _mm256_blendv_pd function. This unfortunately has a pretty high latency and low throughput. It takes three __m256d variables as input, where the last one is the mask used to select between the first two.
I found another function (_mm256_blend_pd) which takes a bit mask instead of a __m256d variable as the mask. When the mask is static I could simply pass something like 0b0111 to take the first element from the first variable and the last 3 elements of the second variable. However, in my case the mask is computed with _mm256_cmp_pd, which returns a __m256d variable. I found out that I can use _mm256_movemask_pd to turn the mask into an int, but when passing that into _mm256_blend_pd I get an error: error: the last argument must be a 4-bit immediate.
Is there a way to pass this integer using its first 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd? Or is there another approach I can use to avoid having a cmp, movemask and blend that would be more efficient for this use case?
_mm256_blend_pd is the intrinsic for vblendpd which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.)
In C++ terms, the control arg must be constexpr so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.
It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through). (uops.info). And on Skylake those uops can run on any of the 3 vector ALU ports. (Worse on Haswell/Broadwell, though, limited to port 5 only, competing for it with shuffles). Zen can even run them as a single uop.
There's nothing better for the general case until AVX512 makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).
In some special cases you can usefully use _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after.
TL:DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.
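To illustrate the difference (the function names and the compare predicate below are just for the example, not from the question's code):
#include <immintrin.h>

// Runtime-variable blend: the mask comes from a compare, so vblendvpd is the tool.
__m256d blend_by_compare(__m256d a, __m256d b, __m256d x, __m256d y) {
    __m256d mask = _mm256_cmp_pd(x, y, _CMP_LT_OQ);  // all-ones where x < y
    return _mm256_blendv_pd(a, b, mask);             // take b where the mask is set, else a
}

// Compile-time-constant blend: the control is an immediate, so vblendpd works.
__m256d blend_low_element(__m256d a, __m256d b) {
    return _mm256_blend_pd(a, b, 0b0001);            // element 0 from b, elements 1..3 from a
}

// The special case mentioned above: conditionally zero an input with AND
// before an add, instead of blending the sum afterwards.
__m256d add_where(__m256d acc, __m256d addend, __m256d mask) {
    return _mm256_add_pd(acc, _mm256_and_pd(addend, mask));
}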
Related
I think the SIMD shuffle function is not a real shuffle: for the int32_t case, the left and right halves would be shuffled separately.
I want a real shuffle function as following:
Assume we have a __m256i and we want to shuffle 8 int32_t values.
__m256i to_shuffle = _mm256_set_epi32(17, 18, 20, 21, 25, 26, 29, 31);
const int imm8 = 0b10101100;
__m256i shuffled = _mm256_shuffle(to_shuffle, imm8);
I hope that shuffled = {17, 20, 25, 26, -, -, -, -}, where - represents a value that is not relevant and can be anything.
So I hope that each int whose position has its bit set to 1 would be placed in shuffled.
(In our case: 17, 20, 25, 26 are sitting at the positions with a 1 in the imm8).
Is such a function offered by Intel?
How could such a function be implemented efficiently?
EDIT: The - values can be ignored. Only the ints whose bit is set to 1 are needed.
(I'm assuming you got your immediate backwards (selector for 17 should be the low bit, not high bit) and your vectors are actually written in low-element-first order).
How could such a function be implemented efficiently?
In this case with AVX2 vpermd ( _mm256_permutevar8x32_epi32 ). It needs a control vector not an immediate, to hold 8 selectors for the 8 output elements. So you'd have to load a constant and use that as the control operand.
Since you only care about the low half of your output vector, your vector constant can be only __m128i, saving space. vmovdqa xmm, [mem] zero-extends into the corresponding YMM vector. It's probably inconvenient to write this in C with intrinsics but _mm256_castsi128_si256 should work. Or even _mm256_broadcastsi128_si256 because a broadcast-load would be just as cheap. Still, some compilers might pessimize it to an actual 32-byte constant in memory by doing constant-propagation. If you know assembly, compiler output is frequently disappointing.
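A minimal sketch of that approach for this specific case (the function name is mine, and I'm reading the bitmap low-bit-first as discussed above, so the kept elements are at source positions 0, 2, 4, 5):
#include <immintrin.h>

__m256i compress_0b00110101(__m256i to_shuffle) {
    // Source indices for the low 4 output elements; the upper 4 selectors are don't-cares.
    const __m256i ctrl = _mm256_setr_epi32(0, 2, 4, 5, 0, 0, 0, 0);
    return _mm256_permutevar8x32_epi32(to_shuffle, ctrl);  // vpermd
}

// With to_shuffle = _mm256_setr_epi32(17, 18, 20, 21, 25, 26, 29, 31),
// the low half of the result is {17, 20, 25, 26}.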
If you want to take an actual integer bitmap in your source, you could probably use C++ templates to convert that at compile time into the right vector constant. Agner Fog's Vector Class Library (now Apache-licensed, previously GPL) has some related things like that, turning integer constants into a single blend or sequence of blend instructions depending on the constant and what target ISA is supported, using C++ templates. But its shuffle template takes a list of indices, not a bitmap.
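If you do want a bitmap in the source, here is a rough sketch of that idea (my own illustration, not how VCL does it; it requires C++17 and still uses vpermd at run time, only the control vector is built at compile time):
#include <array>
#include <immintrin.h>

template <unsigned Bitmap>   // low bit = keep element 0
__m256i compress_epi32(__m256i v) {
    constexpr std::array<int, 8> idx = [] {
        std::array<int, 8> a{};
        int out = 0;
        for (int i = 0; i < 8; ++i)
            if (Bitmap & (1u << i)) a[out++] = i;  // pack the kept source indices
        return a;
    }();
    const __m256i ctrl = _mm256_setr_epi32(idx[0], idx[1], idx[2], idx[3],
                                           idx[4], idx[5], idx[6], idx[7]);
    return _mm256_permutevar8x32_epi32(v, ctrl);
}

// compress_epi32<0b00110101>(v) packs elements 0, 2, 4, 5 into the low half.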
But I think you're trying to ask about why / how x86 shuffles are designed the way they are.
Is such a function offered by Intel?
Yes, in hardware with AVX512F (plus AVX512VL to use it on 256-bit vectors).
You're looking for vpcompressd, the vector-element equivalent of BMI2 pext. (But it takes the control operand as a mask register value, not an immediate constant.) The intrinsic is
__m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
It's also available in a version that merges into the bottom of an existing vector instead of zeroing the top elements.
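A minimal usage sketch (assuming AVX512F + AVX512VL, with the same low-bit-first keep-mask as above):
#include <immintrin.h>

__m256i compress_avx512(__m256i v) {
    const __mmask8 keep = 0b00110101;              // keep source elements 0, 2, 4, 5
    return _mm256_maskz_compress_epi32(keep, v);   // vpcompressd, zeroing the top elements
}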
As an immediate shuffle, no.
All x86 shuffles use a control operand that has indices into the source, not a bitmap of which elements to keep. (Except vpcompressd/q and vpexpandd/q). Or they use an implicit control, like _mm256_unpacklo_epi32 for example which interleaves 32-bit elements from 2 inputs (in-lane in the low and high halves).
If you're going to provide a shuffle with a control operand at all, it's usually most useful if any element can end up at any position. So the output doesn't have to be in the same order as the input. Your compress shuffle doesn't have that property.
Also, having a source index for each output element is what shuffle hardware naturally wants. My understanding is that each output element is fed by its own MUX (multiplexer), where the MUX takes N input elements and one binary selector to select which one to output. (And is as wide as the element width of course.) See Where is VPERMB in AVX2? for more discussion of building muxers.
Having the control operand in some format other than a list of selectors would require preprocessing before it could be fed to shuffle hardware.
For an immediate, the format is either 2x1-bit or 4x2-bit fields, or a byte-shift count for _mm_bslli_si128 and _mm_alignr_epi8. Or index + zeroing bitmask for insertps. There are no SIMD instructions with an immediate wider than 8 bits. Presumably this keeps the hardware decoders simple.
(Or 1x1-bit for vextractf128 xmm, ymm, 0 or 1, which in hindsight would be better with no immediate at all. Using it with 0 is always worse than vmovdqa xmm, xmm. Although AVX512 does use the same opcode for vextractf32x4 with an EVEX prefix for the 1x2-bit immediate, so maybe this had some benefit for decoder complexity. Anyway, there are no immediate shuffles with selector fields wider than 2 bits because 8x 3-bit would be 24 bits.)
For wider 4x2 in-lane shuffles like _mm256_shuffle_ps (vshufps ymm, ymm, ymm, imm8), the same 4x2-bit selector pattern is reused for both lanes. For wider 2x1 in-lane shuffles like _mm256_shuffle_pd (vshufpd ymm, ymm, ymm, imm8), we get 4x 1-bit immediate fields that still select in-lane.
There are lane-crossing shuffles with 4x 2-bit selectors, vpermq and vpermpd. Those work exactly like pshufd xmm (_mm_shuffle_epi32) but with 4x qword elements across a 256-bit register instead of 4x dword elements across a 128-bit register.
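For a concrete example of the 4x 2-bit immediate style (the wrapper name is mine):
#include <immintrin.h>

// AVX2 vpermq: lane-crossing qword shuffle with an imm8 of 4x 2-bit selectors.
// _MM_SHUFFLE(3, 1, 2, 0) swaps the two middle 64-bit elements.
__m256i swap_middle_qwords(__m256i v) {
    return _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
}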
As far as narrowing / only caring about part of the output:
A normal immediate would need 4x 3-bit selectors to each index one of the 8x 32-bit source elements. But much more likely 8x 3-bit selectors = 24 bits, because why design a shuffle instruction that can only ever write half an output vector? (Other than vextractf128 xmm, ymm, 1).
In general, the paradigm for more-granular shuffles is to take a control vector, rather than some funky immediate encoding.
AVX512 did add some narrowing shuffles like VPMOVDB xmm/[mem], x/y/zmm that truncate (or signed/unsigned saturate) 32-bit elements down to 8-bit. (And all other combinations of sizes are available).
They're interesting because they're available with a memory destination. Perhaps this is motivated by some CPUs (like Xeon Phi KNL / KNM) not having AVX512VL, so they can only use AVX512 instructions with ZMM vectors. Still, they have AVX1 and 2 so you could compress into an xmm reg and use a normal VEX-encoded store. But it does allow doing a narrow byte-masked store with AVX512F, which would only be possible with AVX512BW if you had the packed data in an XMM register.
There are some 2-input shuffles like shufps that treat the low and high half of the output separately, e.g. the low half of the output can select from elements of the first source, the high half of the output can select from elements of the second source register.
With SSE2 or AVX, comparison operations return bitmasks of all zeros or all ones (e.g. _mm_cmpge_pd returns a __m128d).
I cannot find an equivalent with AVX512. Comparison operations seem to only return short bitmasks. Has there been a fundamental change in semantics, or am I missing something?
Yes, the semantics are a bit different in AVX512. The comparison instructions return the results in mask registers. This has a couple advantages:
The (8) mask registers are entirely separate from the [xyz]mm register set, so you don't waste a vector register for the comparison result.
There are masked versions of nearly the entire AVX512 instruction set, allowing you a lot of flexibility with how you use the comparison results.
It does require slightly different code versus a legacy SSE/AVX implementation, but it isn't too bad.
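For example, here is a small sketch (names and the compare predicate are mine) of using the mask directly in a later masked operation instead of materializing a vector of all-zeros/all-ones:
#include <immintrin.h>

__m512d add_where_ge(__m512d a, __m512d b, __m512d c) {
    __mmask8 k = _mm512_cmp_pd_mask(a, b, _CMP_GE_OQ);  // compare result goes to a mask register
    // Merge-masking: add c only in lanes where the compare was true,
    // and pass a through unchanged elsewhere.
    return _mm512_mask_add_pd(a, k, a, c);
}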
Edit: If you want to emulate the old behavior, you could do something like this:
// do comparison, store results in mask register
__mmask8 k = _mm512_cmp_pd_mask(...);
// broadcast a mask of all ones to a vector register, then use the mask
// register to zero out the elements that have a mask bit of zero (i.e.
// the corresponding comparison was false)
__m512d k_like_sse = _mm512_maskz_mov_pd(k,
    _mm512_castsi512_pd(_mm512_set1_epi64(-1LL)));
There might be a more optimal way to do this, but I'm relatively new to using AVX512 myself. The mask of all ones could be precalculated and reused, so you're essentially just adding in one extra masked move instruction to generate the vector result that you're looking for.
Edit 2: As suggested by Peter Cordes in the comment below, you can use _mm512_movm_epi64() instead to simplify the above even more:
// do comparison, store results in mask register
__mmask8 k = _mm512_cmp_pd_mask(...);
// expand the mask to all-0/1 masks like SSE/AVX comparisons did
__m512d k_like_sse = _mm512_castsi512_pd(_mm512_movm_epi64(k));
Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both called n times. This happens every time I run the profiler, on both a Westmere Xeon and a Haswell laptop that I'm using right now (compiled with SSE because that's what I'm targeting and learning right now).
What am I missing?
Ignore the poor concurrency, this is most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not an example of micro-optimisation, since those three added together amount to a decent % of the total time, and I'm really interested in the possible cause of this behavior.
Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...
Same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group come from your gather. This probably cache-misses a lot, and those costs aren't going to get charged to those SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted.)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation as the loads have to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).
calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (guaranteed store-forwarding failure, see Agner Fog's optimization guides and other links from the x86 tag wiki).
It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget.)
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).
Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS). (See the sketch after this list.)
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!
Use a vector compare instead of scalar prev_x / prev_y / prev_z, and use MOVMASKPS to get the result into an integer you can test. You only care about the lower 3 elements, so test it with compare_mask & 0b0111 (true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps). See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html
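As a rough sketch of the manual 2x2 gather suggested above (base_0 and dim_x are the names from your code; I'm assuming base_0 is a pointer to short, and using memcpy to sidestep the pointer-aliasing problem):
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

static inline __m128 load_2x2_as_float(const int16_t* base_0, ptrdiff_t dim_x) {
    uint32_t lo, hi;
    std::memcpy(&lo, base_0, sizeof(lo));            // two adjacent shorts (a MOVD load)
    std::memcpy(&hi, base_0 + dim_x, sizeof(hi));    // two shorts from the next row
    __m128i pairs = _mm_cvtsi32_si128((int)lo);      // element 0 = first pair
    pairs = _mm_insert_epi32(pairs, (int)hi, 1);     // PINSRD: second pair into element 1
    __m128i wide = _mm_cvtepi16_epi32(pairs);        // PMOVSXWD: 4x int16 -> 4x int32
    return _mm_cvtepi32_ps(wide);                    // CVTDQ2PS: 4x int32 -> 4x float
}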
Well, when analyzing assembly code, please note that running time is attributed to the next instruction, so the per-instruction data you're looking at needs to be interpreted carefully. There is a corresponding note in the VTune Release Notes:
Running time is attributed to the next instruction (200108041)
To collect the data about time-consuming running regions of the target, the Intel® VTune™ Amplifier interrupts executing target threads and attributes the time to the context IP address.
Due to the collection mechanism, the captured IP address points to an instruction AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely, to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source - the time may be erroneously attributed to the source line AFTER the actual hot line.
In case the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to a wrong function since the next instruction can belong to a different function in tightly inlined code.
The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back. Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?
Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?
Most of these instructions are reflected in atomic operations (built-in intrinsics) described in the programming guide. If you compile any of those atomic intrinsics, you will find atom or red instructions emitted by the compiler at the PTX or SASS level in your generated code.
The red instruction type will generally be used when you don't explicitly use the return value from one of the atomic intrinsics. If you use the return value explicitly, then the compiler usually emits the atom instruction.
Thus, it should be clear that this instruction by itself does not perform a complete classical parallel reduction, but certainly could be used to implement one if you wanted to depend on atomic hardware (and associated limitations) for your reduction operations. This is generally not the fastest possible implementation for parallel reductions.
If you want direct access to these instructions, the usual advice would be to use inline PTX where desired.
As requested, to elaborate using atomicAdd() as an example:
If I perform the following:
atomicAdd(&x, data);
perhaps because I am using it for a typical atomic-based reduction into the device variable x, then the compiler would emit a red (PTX) or RED (SASS) instruction taking the necessary arguments (the pointer to x and the variable data, i.e. 2 logical registers).
If I perform the following:
int offset = atomicAdd(&buffer_ptr, buffer_size);
perhaps because I am using it not for a typical reduction but instead to reserve a space (buffer_size) in a buffer shared amongst various threads in the grid, which has an offset index (buffer_ptr) to the next available space in the shared buffer, then the compiler would emit an atom (PTX) or ATOM (SASS) instruction, including 3 arguments (offset, &buffer_ptr, and buffer_size, in registers).
The red form can be issued by the thread/warp, which may then continue without normally stalling on this instruction, since it normally creates no dependencies for subsequent instructions. The atom form, OTOH, implies modification of one of its 3 arguments (one of 3 logical registers). Therefore, subsequent use of the data in that register (i.e. the return value of the intrinsic, i.e. offset in this case) can result in a thread/warp stall until the return value is actually delivered by the atomic hardware.
When I compile an HLSL shader with pow(foo, 6) or pow(foo, 8), the compiler creates assembly that has about 10 more instructions than if I create the same shader with pow(foo, 9) or pow(foo,10) or pow(foo,7).
Why is that?
Instructions or instruction slots?
The pow instruction takes 3 slots, whereas the mul instruction takes only 1.
(Reference: instruction sets for: vs_2_0, ps_2_0, vs_3_0, ps_3_0)
When you are writing a shader you generally want to keep the instruction slot count down, because you have a limited number of instruction slots as defined by the shader model. It is also a reasonable way to approximate the computational complexity of your shader (ie: how fast it will run).
A power of 1 is obviously a no-op. A power of 2 requires one mul instruction. Powers of 3 and 4 can be done with two mul instructions. Powers of 5, 6, and 8 can be done with three mul instructions.
(I imagine the math behind this optimisation is explained by the link that Jim Lewis posted.)
The likely reason the compiler is choosing three mul instructions over a single pow instruction (both use the same number of instruction slots) is that the pow instruction with a constant exponent will also require the allocation of a constant register to store that exponent. Obviously using up three instruction slots and no constant registers is better than using up three slots and one constant register.
(Why you are getting 10 more instructions? I am not sure, it will depend on your shader code. The HLSL compiler does many weird and wonderful things in the name of optimisation.)
If you use the shader compiler (fxc) in the DirectX SDK with the options /Cc /Fc output.html, it will give you a nice assembly reading you can examine, including a count of the number of instruction slots used.
It might be doing some sort of exponentiation by squaring optimization, where the number of operations depends on the number of bits set to 1 in the exponent, and their positions. (That doesn't quite match what you're describing, though: you'd expect powers of two to be more efficient than exponents with more bits set, in a pure square-and-multiply implementation.)
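For illustration, here is a small C++ sketch of square-and-multiply that handles the leading bit first; it reproduces the multiply counts from the other answer (2 for exponents 3 and 4; 3 for 5, 6 and 8; 4 for 7, 9 and 10), though the HLSL compiler's actual strategy may differ:
#include <cstdio>

// Left-to-right square-and-multiply for a positive integer exponent:
// one square per bit below the leading 1, plus one multiply per extra set bit.
float pow_const(float x, unsigned n, int* muls_used) {
    int top = 31;
    while (top > 0 && !((n >> top) & 1u)) --top;      // find the leading set bit
    float result = x;
    int muls = 0;
    for (int bit = top - 1; bit >= 0; --bit) {
        result *= result; ++muls;                     // square for every remaining bit
        if ((n >> bit) & 1u) { result *= x; ++muls; } // multiply where the bit is set
    }
    if (muls_used) *muls_used = muls;
    return result;
}

int main() {
    const unsigned exps[] = {3, 4, 5, 6, 7, 8, 9, 10};
    for (unsigned e : exps) {
        int muls = 0;
        pow_const(2.0f, e, &muls);
        std::printf("x^%u needs %d mul(s)\n", e, muls);
    }
}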