How to generate a KSHIFTRW (Shift Right Mask Registers) [duplicate] - c++

Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing:
KSHIFT{L/R}
KADD
KTEST
The Intel developer manual claims that intrinsics are not necessary as they are auto generated by the compiler. How does one do this though? If it means that __mmask* types can be treated as regular integers, it would make a lot of sense, but testing something like mask << 4 seems to cause the compiler to move the mask to a regular register, shift it, then move back to a mask. This was tested using Godbolt's latest GCC and ICC with -O2 -mavx512bw.
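For example, here's a minimal sketch of the kind of test I mean (the function name is just illustrative):
#include <immintrin.h>

__mmask16 shift_mask(__mmask16 m) {
    // Hoped for: a single kshiftlw k, k, 4
    // What GCC/ICC actually emit: kmovw to a GPR, shl, kmovw back
    return m << 4;
}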
Also interesting to note that the intrinsics only deal with __mmask16 and not other types. I haven't tested much, but it looks like ICC doesn't mind taking in an incorrect type, whereas GCC does seem to try to ensure that there are only 16 bits in the mask if you use the intrinsics.
Am I overlooking the correct intrinsics for the above instructions, as well as variants for the other __mmask* types, or is there another way to achieve the same thing without resorting to inline assembly?

Intel's documentation saying, "not necessary as they are auto generated by the compiler" is in fact correct. And yet, it's unsatisfying.
But to understand why it is the way it is, you need to look at the history of AVX512. While none of this information is official, it's strongly implied by the evidence.
The reason the mask intrinsics got into the mess they are in now is probably that AVX512 was "rolled out" in multiple phases without sufficient forward planning for the next phase.
Phase 1: Knights Landing
Knights Landing added 512-bit registers which only have 32-bit and 64-bit data granularity. Therefore the mask registers never needed to be wider than 16 bits.
When Intel was designing this first set of AVX512 intrinsics, they went ahead and added intrinsics for almost everything - including the mask registers. This is why the mask intrinsics that do exist are only 16 bits. And they only cover the instructions that exist in Knights Landing. (though I can't explain why KSHIFT is missing)
On Knights Landing, mask operations were fast (2 cycles). But moving data between mask registers and general registers was really slow (5 cycles). So it mattered where the mask operations were being done and it made sense to give the user finer-grained control about moving stuff back-and-forth between mask registers and GPRs.
Phase 2: Skylake Purley
Skylake Purley extends AVX512 to cover byte-granular lanes, and this increased the width of the mask registers to the full 64 bits. This second round also added KADD and KTEST, which didn't exist on Knights Landing.
These new mask instructions (KADD, KTEST, and 64-bit extensions of existing ones) are the ones that are missing their intrinsic counterparts.
While we don't know exactly why they are missing, there is some strong evidence pointing to the reasons:
Compiler/Syntax:
On Knights Landing, the same mask intrinsics were used for both 8-bit and 16-bit masks. There was no way to distinguish between them. Extending them to 32-bit and 64-bit would only have made the mess worse. In other words, Intel didn't design the mask intrinsics correctly to begin with, and decided to drop them completely rather than fix them.
Performance Inconsistencies:
Bit-crossing mask instructions on Skylake Purley are slow. While all bit-wise instructions are single-cycle, KADD, KSHIFT, KUNPACK, etc... are all 4 cycles. But moving between mask and GPR is only 2 cycles.
Because of this, it's often faster to move masks into GPRs, do the work there, and move them back. But the programmer is unlikely to know this. So rather than giving the user full control of the mask registers, Intel opted to just have the compiler make this decision.
Making the compiler make this decision means that the compiler needs to have such logic. The Intel Compiler currently does; it will generate kadd and family in certain (rare) cases. But GCC does not. On GCC, all but the most trivial mask operations will be moved to GPRs and done there instead.
Final Thoughts:
Prior to the release of Skylake Purley, I personally had a lot of AVX512 code written up which includes a lot of AVX512 mask code. These were written with certain performance assumptions (single-cycle latency) that turned out to be false on Skylake Purley.
From my own testing on Skylake X, some of my mask-intrinsic code which relied on bit-crossing operations turned out to be slower than the compiler-generated versions that moved them to GPRs and back. The reason of course is that KADD and KSHIFT were 4 cycles instead of 1.
Of course, I'd prefer it if Intel did provide the intrinsics to give us the control we want. But it's very easy to go wrong here (in terms of performance) if you don't know what you're doing.
Update:
It's unclear when this happened, but the latest version of the Intel Intrinsics Guide has a new set of mask intrinsics with a new naming convention that covers all the instructions and widths. These new intrinsics supersede the old ones.
So this solves the entire problem. Though the extent of compiler support is still uncertain.
Examples:
_kadd_mask64()
_kshiftri_mask32()
_cvtmask16_u32() supersedes _mm512_mask2int()
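A rough sketch of how the new-style intrinsics look in use (exact availability depends on your compiler and header version):
#include <immintrin.h>

__mmask64 add_masks(__mmask64 a, __mmask64 b) {
    return _kadd_mask64(a, b);        // intended to map to kaddq
}

__mmask32 shift_mask32(__mmask32 m) {
    return _kshiftri_mask32(m, 4);    // intended to map to kshiftrd
}

unsigned mask_to_int(__mmask16 m) {
    return _cvtmask16_u32(m);         // replaces _mm512_mask2int()
}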

Related

AVX equivalent for _mm_movelh_ps

Since there is no AVX version of _mm_movelh_ps, I usually use _mm256_shuffle_ps(a, b, 0x44) for AVX registers as a replacement. However, I remember reading in other questions that swizzle instructions without a control integer (like _mm256_unpacklo_ps or _mm_movelh_ps) should be preferred if possible (for some reason I don't know). Yesterday, it occurred to me that another alternative might be using the following:
_mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Also, if it is truly the case, it would be nice if somebody could explain in simple words (I have very limited understanding of assembly and microarchitecture) why one should prefer instructions without a control integer.
Thanks in advance
Additional note:
Clang actually optimizes the shuffle to vunpcklpd: https://godbolt.org/z/9XFP8D
So it seems that my idea is not too bad. However, GCC and ICC create a shuffle instruction.
Avoiding an immediate saves 1 byte of machine-code size; that's all. It's at the bottom of the list for performance considerations, but all else equal shuffles like _mm256_unpacklo_pd with an implicit "control" are very slightly better than an immediate control byte for that reason.
(But taking the control operand in another vector, as vpermilps can and vpermd requires, is usually worse, unless you have some weird front-end bottleneck in a long-running loop and can load the shuffle control outside the loop. Not very plausible, and at that point you'd have to be writing by hand in asm to care that much about code size/alignment; in C++ that's still not something you can really control directly.)
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Ice Lake has 2/clock vshufps vs. 1/clock vunpcklpd, according to testing by uops.info on real hardware: vshufps can run on port 1 or port 5. Definitely use _mm256_shuffle_ps. The trivial extra code-size cost probably doesn't actually hurt at all on earlier CPUs, and is probably worth it for the future benefit on ICL, unless you're sure that port 5 won't be a bottleneck.
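For example, a movelh-style helper along those lines might look like this (the wrapper name is just illustrative):
#include <immintrin.h>

// Per-128-bit-lane equivalent of _mm_movelh_ps:
// takes the low two floats from each lane of a and b.
static inline __m256 movelh_avx(__m256 a, __m256 b) {
    return _mm256_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 1, 0));   // imm8 = 0x44
}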
Ice Lake has a 2nd shuffle unit on port 1 that can handle some common XMM and in-lane YMM shuffles, including vpshufb and apparently some 2-input shuffles like vshufps. I have no idea why it doesn't just decode vunpcklpd as a vshufps with that control vector, or otherwise manage to run that shuffle on port 1. We know the shuffle HW itself can do the shuffle so I guess it's just a matter of control hardware to set up implicit shuffles, mapping an opcode to a shuffle control somehow.
Other than that, it's equal or better on older AVX CPUs; no CPUs have penalties for using PD shuffles between other PS instructions. The only difference on any existing CPUs is code-size. Old CPUs like K8 and Core 2 had faster pd shuffles than ps, but no CPUs with AVX have shuffle units with that weakness. Also, AVX's non-destructive instructions level out the old differences over which operand has to be the destination.
As you can see from the Godbolt link, there are zero extra instructions before/after the shuffle. The "cast" intrinsics aren't doing conversion, just reinterpret to keep the C++ type system happy because Intel decided to have separate types for __m256 vs. __m256d (vs. __m256i), instead of having one generic YMM type. They chose not to have separate uint8x16 vs. uint32x4 vectors the way ARM did, though; for integer SIMD just __m256i.
So there's no need for compilers to emit extra instructions for casts, and in practice that's true; they don't introduce extra vmovaps/apd register copies or anything like that.
If you're using clang you can just write it conveniently and let clang's shuffle optimizer emit vunpcklpd for you. Or in other cases, do whatever it's going to do anyway; sometimes it makes worse choices than the source, often it does a good job.
Clang gets this wrong with -march=icelake-client, still using vunpcklpd even if you write _mm256_shuffle_ps. (Or depending on surrounding code, might optimize that shuffle into part of something else.)
Related bug report.

SSE gives no speedup for C++ number crunching

I have a heavy number-crunching program that does image processing. It is mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7 4900MQ (with SSE up to SSE4.2 and AVX2).
When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.
When I use doubles instead of floats, there is no slowdown.
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Yes, you are.
The compiler is only as good as your code - remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "turn the switch on and enjoy a 100% performance boost".
First of all, compile your code with -ftree-vectorizer-verbose=N to see what was actually vectorized by the compiler.
N is the verbosity level, make that 5 to see all available output (more info can be found here).
Also, you may want to read about GCC's vectorizer.
And keep in mind that for performance-critical sections of code, using SSE/AVX intrinsics (brilliantly documented here) directly may be the best option.
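As an illustration only (not the asker's code), here is the shape of a loop GCC's vectorizer handles well, together with the flags discussed above; note that newer GCC replaced -ftree-vectorizer-verbose with -fopt-info-vec:
// Build with something like:
//   g++ -O3 -march=native -mfpmath=sse -ftree-vectorizer-verbose=5 kernel.cpp
#include <cstddef>

void scale_add(float* __restrict dst, const float* __restrict src,
               float k, std::size_t n) {
    // Independent per-element work, contiguous accesses, no loop-carried
    // dependency: this is the kind of loop the vectorizer can handle.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}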
There is no code and no description of the test procedure, but it generally can be explained this way:
It's not all about being CPU-bound; it can also be bounded by memory speed.
Image processing usually has a large working set that exceeds the amount of cache of your non-Xeon CPU. Eventually the CPU encounters starvation, meaning the overall throughput can be bounded by memory speed.
You may be using an algorithm that is not friendly for vectorization.
Not every algorithm benefits from being vectorized. There are many conditions that have to be met: flow dependencies, memory layout, etc.

C++ techniques for reducing CPU instruction sizes?

Each CPU instruction consumes a number of bytes. The smaller the size, the more instructions that can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of FAR jumps (literally, jumps to code across larger addresses). Because the offset is a smaller number, the type used is smaller and the overall instruction is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t, because it is the native CPU integer size and therefore gives more efficient CPU instructions.
Or is this something which can only be done whilst writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get The Code Working Robustly & Correct first.
Debugging optimized code is more difficult than debugging code that is not optimized: without optimization, the symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate code, which gets the machine code out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, the Intel Processors have block instructions (perform operations on blocks of data). These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors will use large instructions to communicate with other processors, such as a floating point processor. Reduction of floating point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches are a change of execution to a new location, such as loops and function calls. Processors love data instructions, because they don't have to reload the instruction pipeline. Increasing the amount of data instructions and reducing the quantity of branches will improve performance, usually regardless of instruction size.
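As a hedged illustration (not from the original answer), here is a branchy clamp next to a branchless one that the compiler can typically turn into a compare plus conditional move:
#include <cstdint>
#include <algorithm>

// Branchy version: may compile to a compare and a conditional jump.
int32_t clamp_hi_branchy(int32_t x, int32_t hi) {
    if (x > hi) return hi;
    return x;
}

// Branchless version: typically a compare plus cmov, no pipeline flush.
int32_t clamp_hi_branchless(int32_t x, int32_t hi) {
    return std::min(x, hi);
}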

ARM7tdmi processor testing methodology

I'm currently writing a video game console emulator based on the ARM7TDMI processor, and I am almost at the stage where I wish to test whether the processor is functioning correctly. I have only developed the CPU and memory parts of the entire console, so the only possible way to debug the processor is using a logging (console) system. So far, I've only tested it simply by fetching dummy opcodes and executing random instructions. Is there an actual ARM7 program (or other methodology) specifically designed for this kind of purpose, to make sure the processor is functioning correctly? Thanks in advance.
I used Dummy Opcodes such as,
ADD r0, r0, r1, LSL#2
MOV r1, r0
But in 32-bit opcode format.
I also wrote some tests and found some bugs in a GBA emulator. I have also written my own emulators (and have worked in the processor business testing processors and boards).
I have a few things that I do regularly. These are my general test methodologies.
There are a number of open source libraries out there, for example zlib and other compression libraries, jpeg, mp3, etc. It is not hard to run these bare metal: fake an fopen, fread, fwrite with chunks of data and a pointer. The compression libs, as well as encryption and hashes, you can self-test on the target processor: compress something, decompress it, and compare the original with the uncompressed. I often will also run the code under test on a host and compute the checksum of the compressed and decompressed versions, which gives me a hardcoded check value that I then check on the target platform. For jpeg or mp3 or hash algorithms, I use a host version of the code under test to produce a golden value that I then compare on the target platform.
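A rough sketch of that golden-value idea using zlib (illustrative only):
#include <zlib.h>
#include <cstring>

// Compress a known buffer, round-trip it, and CRC the results. Run this
// once on a host to capture "golden" CRC values, hardcode them, then run
// the same test inside the emulated target and compare.
int compression_self_test() {
    static const char src[] = "the quick brown fox jumps over the lazy dog";
    unsigned char comp[128], back[128];
    uLongf clen = sizeof(comp), blen = sizeof(back);

    if (compress(comp, &clen, reinterpret_cast<const Bytef*>(src), sizeof(src)) != Z_OK) return 1;
    if (uncompress(back, &blen, comp, clen) != Z_OK) return 1;

    unsigned long comp_crc = crc32(0L, comp, clen);
    (void)comp_crc;   // compare against the hardcoded host-side value here
    return (blen == sizeof(src) && memcmp(src, back, sizeof(src)) == 0) ? 0 : 1;
}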
Before doing any of that though: the flags are very tricky to get right, the carry flag in particular (and signed overflow). Some processors invert the carry-out flag when it is a subtract operation (subtract is an add with the second operand ones-complemented and the carry-in ones-complemented; a normal add without carry is a carry-in of zero, so subtract without carry is an add with the second operand inverted and a carry-in of 1). And that inversion of the carry-out matters again if the instruction set has a subtract-with-borrow, depending on whether or not carry is inverted on the way in.
It is sometimes obvious from the conditional branch definitions (if C is this and V is that, if C is this and Z is that) for the unsigned and signed variations of less than, greater than, etc. how that particular processor manages the carry (unsigned overflow) and signed overflow flags, without having to experiment on real silicon. I don't memorize what processor does what; I figure it out per instruction set, so I don't know what ARM does.
ARM has nuances with the shift operations that you have to be careful were implemented properly; read the pseudo code under each instruction: if shift amount == 32 then do this, if shift amount == 0 then do that, otherwise do this other thing. With the arm7 you could do unaligned accesses if the fault was disabled and it would rotate the data around within the 32 bits, or something like that. If the 32 bits at address 0 were 0x12345678, then a 16 bit read at address 1 would give you something like 0x78123456 on the bus and the destination would then get 0x3456. Hopefully most folks didn't rely on that. But that and other "UNPREDICTABLE RESULTS" comments in the ARM ARM changed from ARM ARM to ARM ARM (if you have some of the different hardcopy manuals this will be more obvious: the old white-covered one (the skinny one as well as the thick one) and the blue-covered one). So depending on the manual you read (for those armv4 processors) you were sometimes allowed to do something and sometimes not allowed to do something. So you might find code/binaries that do things you think are unpredictable, if you only rely on one manual.
Different compilers generate different code sequences, so if you can find different arm compilers (clang/llvm and gcc being the obvious first choices), get some eval copies of other compilers if you can (Keil is probably a good choice; now owned by arm, I think it contains both the Keil and the RVCT arm compilers). Compile the same test code with different optimization settings, test every one of those variations, and repeat that for each compiler. If you only use one compiler for testing you will leave gaps in instruction sequences, as well as a number of instructions or variations that will never be tested because the compiler never generates them. I hit this exact problem once. Using open source code you get different programmer habits too; whether it is asm or C or other languages, different individuals have different programming habits and as a result generate different instruction sequences and mixes of instructions, which can hide or expose processor bugs. If this is a single-person hobby project, you will eventually have to rely on others. The good thing here, being a gba or ds or whatever emulator, is that when you start using roms you will have a large volume of other people's code; unfortunately debugging that is difficult.
I heard some hearsay once that intel/x86 design validation folks use operating systems, various ones, to beat on their processors; it creates a lot of chaos and variation. It beats up on the processor, but like the roms it is extremely difficult to debug if something goes wrong. I have personal experience with that with caches and such, running linux on the processors I have worked on. We didn't find the bug until we had Linux ported and booting, and the bug was crazy hard to find... Fortunately the arm7tdmi does not have a cache. If you have a cache, then take those combinations I described above (test code multiplied by optimization level multiplied by different compilers), and then add to that, in the bootstrap or other places, compiling a version with one, two, three, four nops or other data so that the alignment of the binary changes relative to the cache lines, causing the same program to exercise the cache differently.
In this case, where there is real hardware you are trying to emulate, you can do things like have a program that generates random alu machine code: generate dozens of instructions with randomly chosen source and destination registers, randomize add, subtract, and, or, not, etc., randomize the flags on and off, and so on. Pre-load all the registers, set the flags to a known state, run that chunk of code, and then capture the registers and flags and see how it compares to real hardware. You can produce an infinite amount of separate tests, of various lengths, etc. It is easier to debug this than to debug a code sequence that does some data or flag thing that is part of a more complicated program.
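A minimal sketch of that idea (the opcode list and register range are arbitrary choices, and only the register-operand data-processing form is covered):
#include <cstdint>
#include <cstdlib>
#include <vector>

// Encode an ARM data-processing instruction with a register operand2 and
// no shift: cond = AL, S = 1 so every instruction updates the flags.
static uint32_t encode_dp(uint32_t opcode, uint32_t rd, uint32_t rn, uint32_t rm) {
    return (0xEu << 28) | (opcode << 21) | (1u << 20) |
           (rn << 16) | (rd << 12) | rm;
}

std::vector<uint32_t> random_alu_block(std::size_t count) {
    static const uint32_t ops[] = {0x0, 0x1, 0x2, 0x4, 0xC, 0xE}; // AND EOR SUB ADD ORR BIC
    std::vector<uint32_t> code;
    for (std::size_t i = 0; i < count; ++i)
        code.push_back(encode_dp(ops[std::rand() % 6], std::rand() % 8,
                                 std::rand() % 8, std::rand() % 8));
    // Preload r0-r7 and the flags to known values, run this block on both
    // the emulator and a trusted reference, then compare registers and CPSR.
    return code;
}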
Take that combination of test programs, multiplied by optimization setting, multiplied by compiler, etc., and beat on it with interrupts. Vary the rate of the interrupts. Since this is a simulator, you can do something I once had hardware for: in the interrupt, examine the return address, compute an address that is some number of instructions ahead of that address, and remember that address. Return from the interrupt; when you see that address being fetched, fire a prefetch abort; stop watching that address once the prefetch abort fires (in the simulation), and have the prefetch abort handler return to where the abort happened (per the arm arm) and let it continue. I was able to create a fair amount of pain on the processor under test with this setup... particularly with the caches on... which you don't have on an arm7tdmi.
Note that a high percentage of the gba games are thumb mode because on that platform, which used mostly 16 bit wide data busses, thumb mode ran (much) faster than arm mode even though thumb code takes about 10-15% more instructions, as well as taking less rom space for the binary. Carefully examine the blx instruction, as I think there are different implementations based on architecture (armv4 is different than armv6 or 7), so if you are using an armv6 or 7 manual as a reference, or such hardware for validating against, understand those differences.
Blah, blah, blah, TL;DR. Sorry for rambling, this is a fun topic for me...

Using AVX CPU instructions: Poor performance without "/arch:AVX"

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.
To use AVX, it is necessary to include this:
#include "immintrin.h"
and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps etc.
The problem is that by default, VS2010 produces code that works very slowly and shows the warning:
warning C4752: found Intel(R) Advanced Vector Extensions; consider
using /arch:AVX
It seems VS2010 actually does not use AVX instructions, but instead emulates them. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX commands everywhere when possible. So my code may crash on a CPU that does not support AVX!
So the question is how to make the VS2010 compiler produce AVX code, but only when I specify AVX intrinsics directly. For SSE it works: I just use SSE intrinsic functions and it produces SSE code without any compiler options like /arch:SSE. But for AVX it does not work for some reason.
2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.
The behavior that you are seeing is the result of expensive state-switching.
See page 102 of Agner Fog's manual:
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.
When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)
Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.
It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher to choose based on what hardware it's running on.
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
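A rough sketch of that structure (the kernel name and the cpu_supports_avx() check are made-up names for illustration; a real check would use the CPUID/XGETBV intrinsics):
// avx_kernel.cpp -- compiled with /arch:AVX
#include <immintrin.h>
#include <cstddef>

void mul_avx(float* dst, const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_mul_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        _mm256_storeu_ps(dst + i, v);
    }
    for (; i < n; ++i) dst[i] = a[i] * b[i];   // scalar tail
    _mm256_zeroupper();  // leave a clean state before any legacy-SSE caller code
}

// dispatch.cpp -- compiled without /arch:AVX
// if (cpu_supports_avx()) mul_avx(dst, a, b, n); else mul_sse(dst, a, b, n);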
tl;dr for old versions of MSVC only
Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.
In modern MSVC (and the other mainstream compilers, GCC/clang/ICC), the compiler knows when to use a vzeroupper asm instruction. Forcing extra vzerouppers with intrinsics can hurt performance when inlining. See Do I need to use _mm256_zeroupper in 2021?
Cause
I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:
Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.
Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):
128-bit intrinsic instructions
SSE inline assembly
C/C++ floating point code that is compiled to Intel® SSE
Calls to functions or libraries that include any of the above
This means there may even be penalties when linking with external code using SSE.
Details
There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states.
When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.
You can also find details about the state-switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes", in Agner Fog's optimization guide (version updated 2014-08-07), referenced in Mysticial's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.
Skylake has a different dirty-upper mechanism that causes false dependencies for legacy-SSE with dirty uppers, rather than one-time penalties. Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Resolution
To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX-encoded 128-bit form (if the compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128 bits of all 16 YMM registers (by issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:
This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.
In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.
Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.
One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.
In the wild
This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:
/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
See Intel Optimization Manual (April 2011, version 248966), Section
11.3 */
#define VLEAVE _mm256_zeroupper
Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.
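For illustration, a function using that convention might look roughly like this (a sketch in the same style, not code copied from FFTW):
#include <immintrin.h>
#define VLEAVE _mm256_zeroupper   /* as defined above */

/* Sum n doubles (n a multiple of 4) with AVX, then clear the upper YMM
   halves so legacy-SSE code in the caller pays no transition penalty. */
static double sum_avx(const double* p, int n) {
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(p + i));
    double out[4];
    _mm256_storeu_pd(out, acc);
    VLEAVE();
    return out[0] + out[1] + out[2] + out[3];
}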