Since there is no AVX version of _mm_movelh_ps, I usually use _mm256_shuffle_ps(a, b, 0x44) as a replacement for AVX registers. However, I remember reading in other questions that swizzle instructions without a control integer (like _mm256_unpacklo_ps or _mm_movelh_ps) should be preferred if possible (for some reason I don't know). Yesterday it occurred to me that another alternative might be the following:
_mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Also, if that is truly the case, it would be nice if somebody could explain in simple words (I have a very limited understanding of assembly and microarchitecture) why one should prefer instructions without a control integer.
Thanks in advance
Additional note:
Clang actually optimizes the shuffle to vunpcklpd: https://godbolt.org/z/9XFP8D
So it seems that my idea is not too bad. However, GCC and ICC create a shuffle instruction.
Avoiding an immediate saves 1 byte of machine-code size; that's all. It's at the bottom of the list for performance considerations, but all else equal shuffles like _mm256_unpacklo_pd with an implicit "control" are very slightly better than an immediate control byte for that reason.
(But taking the control operand in another vector, as vpermilps can or vpermd requires, is usually worse, unless you have some weird front-end bottleneck in a long-running loop and can load the shuffle control outside the loop. That's not very plausible, and at that point you'd have to be writing asm by hand to care that much about code size/alignment; in C++ it's still not something you can really control directly.)
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Ice Lake has 2/clock vshufps vs. 1/clock vunpcklpd, according to testing by uops.info on real hardware, running on port 1 or port 5. Definitely use _mm256_shuffle_ps. The trivial extra code-size cost probably doesn't actually hurt at all on earlier CPUs, and is probably worth it for the future benefit on ICL, unless you're sure that port 5 won't be a bottleneck.
Ice Lake has a 2nd shuffle unit on port 1 that can handle some common XMM and in-lane YMM shuffles, including vpshufb and apparently some 2-input shuffles like vshufps. I have no idea why it doesn't just decode vunpcklpd as a vshufps with that control vector, or otherwise manage to run that shuffle on port 1. We know the shuffle HW itself can do the shuffle so I guess it's just a matter of control hardware to set up implicit shuffles, mapping an opcode to a shuffle control somehow.
Other than that, it's equal or better on older AVX CPUs; no CPUs have penalties for using PD shuffles between other PS instructions. The only difference on any existing CPUs is code size. Old CPUs like K8 and Core 2 had faster pd shuffles than ps, but no CPUs with AVX have shuffle units with that weakness. Also, AVX's non-destructive instructions level out any differences in which operand has to be the destination.
As you can see from the Godbolt link, there are zero extra instructions before/after the shuffle. The "cast" intrinsics aren't doing conversion, just reinterpreting to keep the C++ type system happy, because Intel decided to have separate types for __m256 vs. __m256d (vs. __m256i) instead of one generic YMM type. They chose not to have separate uint8x16 vs. uint32x4 types the way ARM did, though; for integer SIMD there is just __m256i.
So there's no need for compilers to emit extra instructions for casts, and in practice that's true; they don't introduce extra vmovaps/apd register copies or anything like that.
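For example, here is a minimal sketch (the helper names are mine): both forms compile to a single shuffle instruction, and the casts vanish entirely.

#include <immintrin.h>

// movelh-style result per 128-bit lane: { a0,a1,b0,b1 | a4,a5,b4,b5 }
static inline __m256 movelh_shuffle(__m256 a, __m256 b) {
    return _mm256_shuffle_ps(a, b, 0x44);        // one vshufps
}

static inline __m256 movelh_unpack(__m256 a, __m256 b) {
    // the casts are pure reinterprets; this is a single vunpcklpd
    return _mm256_castpd_ps(
        _mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
}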
If you're using clang you can just write it conveniently and let clang's shuffle optimizer emit vunpcklpd for you. Or, in other cases, it will do whatever it's going to do anyway; sometimes it makes worse choices than the source, but often it does a good job.
Clang gets this wrong with -march=icelake-client, still using vunpcklpd even if you write _mm256_shuffle_ps. (Or depending on surrounding code, might optimize that shuffle into part of something else.)
Related bug report.
Related
Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing:
KSHIFT{L/R}
KADD
KTEST
The Intel developer manual claims that intrinsics are not necessary as they are auto generated by the compiler. How does one do this though? If it means that __mmask* types can be treated as regular integers, it would make a lot of sense, but testing something like mask << 4 seems to cause the compiler to move the mask to a regular register, shift it, then move back to a mask. This was tested using Godbolt's latest GCC and ICC with -O2 -mavx512bw.
It's also interesting to note that the intrinsics only deal with __mmask16 and not the other types. I haven't tested much, but it looks like ICC doesn't mind taking an incorrect type, while GCC does seem to try to ensure that there are only 16 bits in the mask if you use the intrinsics.
Am I overlooking the correct intrinsics for the above instructions, as well as the other __mmask* type variants, or is there another way to achieve the same thing without resorting to inline assembly?
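For concreteness, this is roughly the test I mean (the function name is arbitrary):

#include <immintrin.h>

// With -O2 -mavx512bw, GCC and ICC move the mask to a general register,
// shift it there, and move it back, rather than emitting a single kshiftlw.
__mmask16 shift_mask_left(__mmask16 mask) {
    return mask << 4;
}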
Intel's documentation saying, "not necessary as they are auto generated by the compiler" is in fact correct. And yet, it's unsatisfying.
But to understand why it is the way it is, you need to look at the history of AVX512. While none of this information is official, it's strongly implied based on evidence.
The reason the mask intrinsics got into the mess they're in now is probably that AVX512 was "rolled out" in multiple phases without sufficient forward planning for the next phase.
Phase 1: Knights Landing
Knights Landing added 512-bit registers which only have 32-bit and 64-bit data granularity. Therefore the mask registers never needed to be wider than 16 bits.
When Intel was designing this first set of AVX512 intrinsics, they went ahead and added intrinsics for almost everything - including the mask registers. This is why the mask intrinsics that do exist are only 16 bits, and why they only cover the instructions that exist in Knights Landing. (Though I can't explain why KSHIFT is missing.)
On Knights Landing, mask operations were fast (2 cycles). But moving data between mask registers and general registers was really slow (5 cycles). So it mattered where the mask operations were being done and it made sense to give the user finer-grained control about moving stuff back-and-forth between mask registers and GPRs.
Phase 2: Skylake Purley
Skylake Purley extends AVX512 to cover byte-granular elements, and this increased the width of the mask registers to the full 64 bits. This second round also added KADD and KTEST, which didn't exist in Knights Landing.
These new mask instructions (KADD, KTEST, and 64-bit extensions of existing ones) are the ones that are missing their intrinsic counterparts.
While we don't know exactly why they are missing, there is some strong evidence pointing to the likely reasons:
Compiler/Syntax:
On Knights Landing, the same mask intrinsics were used for both 8-bit and 16-bit masks. There was no way to distinguish between them. Extending them to 32-bit and 64-bit would have made the mess worse. In other words, Intel didn't design the mask intrinsics correctly to begin with, and they decided to drop them completely rather than fix them.
Performance Inconsistencies:
Bit-crossing mask instructions on Skylake Purley are slow. While all bit-wise instructions are single-cycle, KADD, KSHIFT, KUNPACK, etc... are all 4 cycles. But moving between mask and GPR is only 2 cycles.
Because of this, it's often faster to move masks into GPRs, do the work there, and move them back. But the programmer is unlikely to know this. So rather than giving the user full control of the mask registers, Intel opted to just have the compiler make this decision.
Having the compiler make this decision means the compiler needs to have such logic. The Intel Compiler currently does, as it will generate kadd and family in certain (rare) cases. But GCC does not: on GCC, all but the most trivial mask operations are moved to GPRs and done there instead.
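As an illustration, here is a sketch of that round-trip written explicitly, using the newer _cvtmask64_u64/_cvtu64_mask64 conversion intrinsics from the same new-style family mentioned in the update below:

#include <immintrin.h>
#include <cstdint>

// The kind of code the compiler may generate instead of a 4-cycle kshift:
// kmov the mask out, do the bit-crossing work in a GPR, kmov it back.
// With several chained mask operations, the 2-cycle moves amortize nicely.
__mmask64 shift_mask_via_gpr(__mmask64 m) {
    std::uint64_t bits = _cvtmask64_u64(m);   // kmovq rax, k0
    bits <<= 3;                               // shl rax, 3
    return _cvtu64_mask64(bits);              // kmovq k0, rax
}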
Final Thoughts:
Prior to the release of Skylake Purley, I personally had a lot of AVX512 code written up which includes a lot of AVX512 mask code. These were written with certain performance assumptions (single-cycle latency) that turned out to be false on Skylake Purley.
From my own testing on Skylake X, some of my mask-intrinsic code which relied on bit-crossing operations turned out to be slower than the compiler-generated versions that moved them to GPRs and back. The reason, of course, is that KADD and KSHIFT were 4 cycles instead of 1.
Of course, I'd prefer that Intel provide the intrinsics to give us the control we want. But it's very easy to go wrong here (in terms of performance) if you don't know what you're doing.
Update:
It's unclear when this happened, but the latest version of the Intel Intrinsics Guide has a new set of mask intrinsics with a new naming convention that covers all the instructions and widths. These new intrinsics supersede the old ones.
So this solves the entire problem. Though the extent of compiler support is still uncertain.
Examples:
_kadd_mask64()
_kshiftri_mask32()
_cvtmask16_u32() supersedes _mm512_mask2int()
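A quick sketch using the new names (assuming your compiler already supports them; AVX512BW/DQ is needed for the wider mask widths):

#include <immintrin.h>

__mmask64 add_masks(__mmask64 a, __mmask64 b) {
    return _kadd_mask64(a, b);         // kaddq
}

__mmask32 shift_right4(__mmask32 m) {
    return _kshiftri_mask32(m, 4);     // kshiftrd
}

unsigned int mask_to_int(__mmask16 m) {
    return _cvtmask16_u32(m);          // kmovw, replaces _mm512_mask2int
}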
I am trying to use Intel SIMD intrinsics to accelerate a query-answer program. Suppose query_cnt is input-dependent but is always smaller than the SIMD register count (i.e. there are enough SIMD registers to hold them). Since queries are the hot data in my application, instead of loading them each time they are needed, may I load them once up front and keep them in registers?
Suppose queries are float type, and AVX256 is supported. Now I have to use something like:
std::vector<__m256> vec_queries(query_cnt / 8);
for (int i = 0; i < query_cnt / 8; ++i) {
    vec_queries[i] = _mm256_loadu_ps((float const *)(curr_query_ptr));
    curr_query_ptr += 8;
}
I know it is not good practice since there is potential load/store overhead, but at least there is a slight chance that vec_queries[i] can be optimized so that the values are kept in registers. Still, I don't think it is a good way to do it.
Any better ideas?
From the code sample you posted, it looks like you're just doing a variable-length memcpy. Depending on what the compiler does, and the surrounding code, you might get better results from just actually calling memcpy. E.g. for aligned copies with a size that's a multiple of 16B, the break-even point between a vector loop and rep movsb is maybe as low as ~128 bytes on Intel Haswell. Check Intel's optimization manual for some implementation notes on memcpy, and a graph of size vs. cycles for a couple of different strategies. (Links in the x86 tag wiki.)
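For instance (just a sketch under the question's assumptions: query_cnt is a multiple of 8 and curr_query_ptr points at query_cnt contiguous floats):

#include <cstddef>
#include <cstring>
#include <vector>
#include <immintrin.h>

std::vector<__m256> load_queries(const float* curr_query_ptr, std::size_t query_cnt) {
    std::vector<__m256> vec_queries(query_cnt / 8);
    // same effect as the explicit load loop; the compiler/libc picks the copy strategy
    std::memcpy(vec_queries.data(), curr_query_ptr, query_cnt * sizeof(float));
    return vec_queries;
}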
You didn't say what CPU, so I'm just assuming recent Intel.
I think you're too worried about registers. Loads that hit in L1 cache are extremely cheap. Haswell (and Skylake) can do two __m256 loads per clock (and a store in the same cycle). Previous to that, Sandybridge/IvyBridge can do two memory operations per clock, with a max of one of them being a store. Or under ideal conditions (256b loads/stores), they can manage 2x 16B loaded and 1x 16B stored per clock. So loading/storing 256b vectors is more expensive than on Haswell, but still very cheap if they're aligned and hot in L1 cache.
I mentioned in comments that GNU C global register variables might be a possibility, but mostly in a "this is technically possible in theory" sense. You probably don't want multiple vector registers dedicated to this purpose for the entire run-time of your program (including library function calls, so you'd have to recompile them).
In reality, just make sure the compiler can inline (or at least see while optimizing) the definitions for every function you use inside any important loops. That way it can avoid having to spill/reload vector regs across function calls (since both the Windows and System V x86-64 ABIs have no call-preserved YMM (__m256) registers).
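For example (a sketch only; score is a made-up stand-in for whatever your per-query math is):

#include <immintrin.h>

// Keep the definition visible to the compiler (header + inline, or LTO) so it
// can be inlined into the hot loop and the query vectors stay in YMM registers
// instead of being spilled/reloaded around a call.
static inline __m256 score(__m256 query, __m256 candidate) {
    return _mm256_mul_ps(query, candidate);
}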
See Agner Fog's microarch pdf to learn even more about the microarchitectural details of modern CPUs, at least the details that are possible to measure by experiment and tune for.
Each CPU instruction consumes a number of bytes. The smaller the size, the more instructions can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of far jumps (jumps to code at distant addresses): because the offset is a smaller number, a smaller operand can be used and the overall instruction is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t, because int32_t is the native CPU integer size and therefore gives more efficient CPU instructions.
Or is this something which can only be done when writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get The Code Working Robustly & Correct first.
Debugging optimized code is more difficult than debugging code that is not optimized: with unoptimized code, the symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate code, which gets the executable out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, Intel processors have block instructions (which perform operations on blocks of data). These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors will use large instructions to communicate with other processors, such as a floating point processor. Reducing the floating point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches are changes of execution to a new location, such as loops and function calls. Processors love data instructions, because they don't have to reload the instruction pipeline. Increasing the proportion of data instructions and reducing the number of branches will improve performance, usually without regard to the instruction sizes.
I believe it is common to have code like this in C++:
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];
One commonly advocated alternative is:
double* pA=A,pB=B,pC=C;
for(size_t i=0;i<ARRAY_SIZE;++i)
*pA++=(*pB++)*(*pC++);
What I am wondering is the best way to improve this code, as IMO the following things need to be considered:
CPU cache: how do CPUs fill up their caches to get the best hit rate?
I suppose SSE could improve this?
The other thing is, what if the code could be parallelized? E.g. using OpenMP. In that case, the pointer trick may not be usable.
Any suggestions would be appreciated!
My g++ 4.5.2 produces absolutely identical code for both loops (after fixing the error in the second version: it should be double *pA=A, *pB=B, *pC=C;), and it is:
.L3:
movapd B(%rax), %xmm0
mulpd C(%rax), %xmm0
movapd %xmm0, A(%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
(where my ARRAY_SIZE was 10000)
The compiler authors know these tricks already. OpenMP and other concurrent solutions are worth investigating, though.
The rules for performance are:
don't optimize yet
set a target
measure
get an idea of how much improvement is possible and verify that it is worthwhile to spend the time to get it.
This is even more true for modern processors. About your questions:
The simple index-to-pointer transformation is often done by compilers, and when they don't do it they may have good reasons.
Processors are already optimized for sequential access to the cache: straightforward code generation will often give the best performance.
SSE can perhaps improve this, but not if you are already bandwidth-limited. So we are back to the measure-and-determine-bounds stage.
Parallelization: same thing as SSE. Using multiple cores of a single processor won't help if you are bandwidth-limited. Using different processors may help, depending on the memory architecture.
Manual loop unrolling (suggested in a now-deleted answer) is often a bad idea. Compilers know how to do it when it is worthwhile (for instance, when they can do software pipelining), and with modern OOO processors it often isn't worth it (it increases the pressure on instruction and trace caches, while OOO execution, speculation over jumps, and register renaming automatically bring most of the benefit of unrolling and software pipelining).
The first form is exactly the sort of structure that your compiler will recognize and optimize, almost certainly emitting SSE instructions automatically.
For this kind of trivial inner loop, cache effects are irrelevant, because you are iterating through everything. If you have nested loops, or a sequence of operations (like g(f(A,B),C)), then you might try to arrange to access small blocks of memory repeatedly to be more cache-friendly.
Do not unroll the loop by hand. Your compiler will already do that, too, if it is a good idea (which it may not be on a modern CPU).
OpenMP can maybe help if the loop is huge and the operations within are complicated enough that you are not already memory-bound.
In general, write your code in a natural and straightforward way, because that is what your optimizing compiler is most likely to understand.
When to start considering SSE or OpenMP? If both of these are true:
If you find that code similar to yours appears 20 times or more in your project:
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];
or some similar operations
If ARRAY_SIZE is routinely bigger than 10 million, or if the profiler tells you that this operation is becoming a bottleneck.
Then,
First, make it into a function: void array_mul(double* pa, const double* pb, const double* pc, size_t count) { for (...) } (see the sketch at the end of this answer).
Second, if you can afford to find a suitable SIMD library, change your function to use it.
Good portable SIMD library
SIMD C++ library
As a side note, if you have a lot of operations that are only slightly more complicated than this, e.g. A[i] = B[i] * C[i] + D[i], then a library which supports expression templates will be useful too.
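Here is a minimal sketch of the array_mul function from the first step above (no SIMD library; with optimization enabled, most compilers will auto-vectorize this loop on their own):

#include <cstddef>

void array_mul(double* pa, const double* pb, const double* pc, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        pa[i] = pb[i] * pc[i];   // typically vectorized to mulpd/vmulpd
}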
You can use some easy parallelization method. CUDA is hardware-dependent, but SSE is almost standard on every CPU. You can also use multiple threads; with multiple threads you can still use the pointer trick, though it is not very important, since such simple optimizations can be done by the compiler as well. If you are using Visual Studio 2010, you can use parallel_invoke to execute functions in parallel without dealing with Windows threads. On Linux, the pthread library is quite easy to use.
I think valarrays are specialised for such calculations. I am not sure if it will improve the performance.
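A small sketch of what that would look like (whether it actually beats the plain loop depends on the library implementation and compiler):

#include <cstddef>
#include <valarray>

int main() {
    const std::size_t n = 10000;
    std::valarray<double> B(1.0, n), C(2.0, n);
    std::valarray<double> A = B * C;   // element-wise multiply; valarray forbids aliasing
    return A[0] == 2.0 ? 0 : 1;
}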
Is there any sort of performance difference between the arithmetic operators in C++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of the operators, for example whether any of them take time that is linear in their inputs.
About your edit, the language says nothing about the architecture it's running on. Your question is platform dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, the work to add or multiply numbers grows at least linearly with the number of bits in the numbers. But because the hardware operates on a fixed number of bits, each operation is O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write the code that does what you need it to do? If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.
No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea, on a modern processor you may be able to do 200 integer additions in the time it takes to make one memory access, but only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.
What you are asking is: what basic operations get transformed into which assembly instructions, and what is the performance of those instructions on my specific architecture? And this is also your answer: the code they get translated to depends on your compiler and its knowledge of your architecture, and their performance depends on your architecture.
Mind you: in C++, operators can be overloaded for user-defined types. They can behave differently from built-in types, and the implementation of the overload can be non-trivial (not just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for gcc is -S. If you use some other compiler have a look at their documentation.
The best answer is to time it with your compiler.
Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and when necessary by examining the individual instructions. (When timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account: if the data is already in the CPU cache, it'll run much faster; if it has to be read from physical memory to begin with, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind what exactly you want to measure.)
For your specific examples, why should ++i be faster than i += 1? They do exactly the same thing. Sometimes it may make a difference whether you're adding a constant or a variable, but here you're adding the constant 1 in both cases.
And in general, instructions take a fixed, constant time regardless of their operands. Adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication or division.
But if you care about performance, you need to understand how the entire technology stack works, rather than relying on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=". (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case.)
Here is a twist on your evaluations: try loop unrolling. Loop unrolling is where you repeat the same statements inside the loop body to reduce the number of iterations of the loop.
Most modern processors hate branch instructions. Processors have a queue of pre-fetched instructions, which speeds up processing; after a branch, the processor has to clear out the queue and reload it, and this takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which can occur in loop constructs and decision constructs.
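A sketch of what manual unrolling looks like (assuming the count is a multiple of 4; measure before keeping it, since compilers often do this themselves):

#include <cstddef>

void add_one_unrolled(int* a, std::size_t n) {
    // 4 elements per iteration, so roughly 1/4 as many loop branches
    for (std::size_t i = 0; i < n; i += 4) {
        a[i]     += 1;
        a[i + 1] += 1;
        a[i + 2] += 1;
        a[i + 3] += 1;
    }
}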
It depends on the architecture. The built-in operators for integer arithmetic translate directly to assembly (as I understand it); ++, +=1, and +=10000 are probably equally fast. Multiplication would depend on the platform, and overloaded operators would depend on you.
Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
Unless you are writing extremely time-critical software, you should probably worry about other things.
Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.
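For example, with optimizations on, these two compile to the same instruction sequence (easy to check with -S or a disassembler):

int inc_a(int i) { return ++i; }
int inc_b(int i) { i += 1; return i; }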
"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. E.g. memory bandwidth, CPU pipeline stalls due to data dependencies, etc.
The performance problems caused by C++ operators come neither from the operators themselves nor from the operators' implementations. They come from the syntax, from hidden code being run without you knowing.
The best example is implementing quicksort on an object which has operator[] implemented but internally uses a linked list. Now instead of O(n log n) [1] you will get O(n^2 log n).
The problem with performance is that you cannot know exactly what your code will eventually be.
[1] I know that quicksort is actually O(n^2) in the worst case, but it rarely gets there; the average case will give you O(n log n).