How can I prefetch infrequently used code? - c++

I want to prefetch some code into the instruction cache. The code path is used infrequently but I need it to be in the instruction cache or at least in L2 for the rare cases that it is used. I have some advance notice of these rare cases. Does _mm_prefetch work for code? Is there a way to get this infrequently used code in cache? For this problem I don't care about portability so even asm would do.

The answer depends on your CPU architecture.
That said, if you are using gcc or clang, you can use the __builtin_prefetch instruction to try to generate a prefetch instruction. On Pentium 3 and later x86-type architectures, this will generate a PREFETCHh instruction, which requests a load into the data cache hierarchy. Since these architectures have unified L2 and higher caches, it may help.
The function looks like this:
__builtin_prefetch(const void *address, int locality);
The locality argument should be in the range 0...3. Assuming locality maps directly to the h part of the PREFETCHh instruction, you want to pass 1 or 2, which ask for the data to be loaded into the L2 and higher caches. See Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 2B: Instruction Set Reference, M-Z (PDF) page 4-277. (Find other volumes here.)
If you're using another compiler that doesn't have __builtin_prefetch, see whether it has the _mm_prefetch function. You may need to include a header file to get that function. For example, on OS X, that function, and constants for the locality argument, are declared in xmmintrin.h.

There isn't any (official [1] x86) instruction to prefetch code, only data. I find this a rather bizarre use-case, where the code-path is known beforehand, but executes rarely, and there is a significant benefit in prefetching the code. It would be great to understand where you've come to the conclusion that there is a significant benefit in pre-loading the code for this special case, since it would require not only analyzing that the code is significantly slower when it's not been hit for a long time, but also determining that there is spare bus-cycles to actually load the code before the processor can prefetch it by it's normal mechanism for loading code.
You may be able to use the prefetch instructions that fetch into L2, which is typically shared between I- and D-cache.
[1] I know there are some "secret" instructions that allow the processor to manipulate cache-content, but since those would require a lot of extra work, even if you could use them in user-mode code [and I expect this is not some kernel-mode code].

Related

Understanding `_mm_prefetch`

The answer What are _mm_prefetch() locality hints? goes into details on what the hint means.
My question is: which one do I WANT?
I work on a function that is called repeatedly, billions of times, with some int parameter among others. First thing I do is to look up some cached value using that parameter (its low 32 bits) as a key into 4GB cache. Based on the algorithm from where this function is called, I know that most often that key will be doubled (shifted left by 1 bit) from one call to the next, so I am doing:
int foo(int key) {
uint8_t value = cache[key];
_mm_prefetch((const char *)&cache[key * 2], _MM_HINT_T2);
// ...
The goal is to have this value in a processor cache by the next call to this function.
I am looking for confirmation on my understanding of two points:
The call to _mm_prefetch is not going to delay the processing of the instructions immediately following it.
There is no penalty for pre-fetching wrong location, just a lost benefit from guessing it right.
That function is using a lookup table of 128 128-bit values (2 KB total). Is there a way to “force” it to be cached? The index into that lookup table is incremented sequentially; should I pre-fetch them too? I should probably use another hint, to point to another level of cache? What is the best strategy here?
If you do anything related to performance, the best and ultimate way to know what you need is to try it. Fortunately, you know exactly what to try, and there are just a few possibilities.
Regarding your understanding — yes, it is correct. However, there is a cost to anything (e.g. if you add any instruction to your code, the processor will waste a nanosecond executing it). You should verify your idea of prefetching by measuring the performance before and after. For very irregular access patterns, it is very likely to work.
Regarding prefetching any sequential data — you should probably not bother. Caches hold data at 64-byte granularity, so for sequential data, prefetching will usually not help. In addition, some (all?) caches have predictive loading — they prefetch ahead even when not told to.
As I noted in the comments, there's some risk to prefetching the wrong address - a useful address will be evicted from the cache, potentially causing a cache miss.
That said:
_mm_prefetch compiles into the PREFETCHn instruction. I looked up the instruction in the AMD64 Architecture Programmer's Manual published by AMD. (Note that all of this information is necessarily chipset specific; you may need to find your CPU's docs).
AMD says (my emphasis):
The operation of this instruction is implementation-dependent. The processor implementation can ignore or change this instruction. The size of the cache line also depends on the implementation, with a minimum size of 32 bytes. AMD processors alias PREFETCH1 and PREFETCH2 to PREFETCH0
What that appears to mean is that if you're running on an AMD, then the hint is ignored, and the memory is loaded into the all levels of the cache -- unless it's a hint that it's a NTA (Non-Temporal-Access, attempts to load memory with minimal cache pollution).
Here's the full page for the instruction
I think in the end, the guidance is what the other answer says: brainstorm, implement, test, and measure. You're on the bleeding edge of perf here, and there's not going to be a one size fits all answer.
Another resource that may help you is Agner Fog's Optimization manuals, which will help you optimize for your specific CPU.

Is there a way to flush the entire CPU cache related to a program?

On x86-64 platforms, the CLFLUSH assembly instruction allows to flush the cache line corresponding to a given address. Instead of flushing the cache related to a specific address, would there be a way to flush the entire cache (either the cache related to the program being executed, or the entire cache), for example by making it full of dummy contents (or any other approach I would not be aware of):
using only standard C++17?
using standard C++17 and compiler intrinsics if necessary?
What would be the contents of the following function: (the function should work regardless of compiler optimizations)?
void flush_cache()
{
// Contents
}
For links to related questions about clearing caches (especially on x86), see the first answer on WBINVD instruction usage.
No, you cannot do this reliably or efficiently with pure ISO C++. It doesn't know or care about CPU caches. The best you could do is touch a lot of memory so everything else ends up getting evicted1, but this is not what you're really asking for. (Of course, flushing all cache is by definition inefficient...)
See Flushing the cache to prevent benchmarking fluctiations for some tips about implementation details if you go that route.
CPU cache management functions / intrinsics / asm instructions are implementation-specific extensions to the C++ language. But other than inline asm, no C or C++ implementations that I'm aware of provide a way to flush all cache, rather than a range of addresses. That's because it's not a normal thing to do.
On x86, for example, the asm instruction you're looking for is wbinvd. It writes-back any dirty lines before evicting, unlike invd (which drops cache without write-back, useful when leaving cache-as-RAM mode). So in theory wbinvd has no architectural effect, only microarchitectural, but it's so slow that's it's a privileged instruction. As Intel's insn ref manual entry for wbinvd points out, it will increase interrupt latency, because it is not itself interruptible and may have to wait for 8 MiB or more of dirty L3 cache to be flushed. i.e. delaying interrupts for that long can be considered an architectural effect, unlike most timing effects. It's also complicated on a multi-core system because it has to flush caches for all cores.
I don't think there's any way to use it in user-space (ring 3) on x86. Unlike cli / sti and in/out, it's not enabled by the IO-privilege level (which you can set on Linux with an iopl() system call). So wbinvd only works when actually running in ring 0 (i.e. in kernel code). See Privileged Instructions and CPU Ring Levels.
But if you're writing a kernel (or freestanding program that runs in ring0) in GNU C or C++, you could use asm("wbinvd" ::: "memory");. On a computer running actual DOS, normal programs run in real mode (which doesn't have any lower-privilege levels; everything is effectively kernel). That would be another way to run a microbenchmark that needs to run privileged instructions to avoid kernel<->userspace transition overhead for wbinvd, and also has the convenience of running under an OS so you can use a filesystem. Putting your microbenchmark into a Linux kernel module might be easier than booting FreeDOS from a USB stick or something, though. Especially if you want control of turbo frequency stuff.
The only reason I can think of that you might want this is for some kind of experiment to figure out how the internals of a specific CPU are designed. So the details of exactly how it's done are critical. It doesn't make sense to me to even want a portable / generic way to do this.
Or maybe in a kernel before reconfiguring physical memory layout, e.g. so there's now an MMIO region for an ethernet card where there used to be normal DRAM. But in that case your code is already totally arch-specific.
Normally when you want / need to flush caches for correctness reasons, you know which address range needs flushing. e.g. when writing drivers on architectures with DMA that isn't cache coherent, so write-back happens before a DMA read, and doesn't step on a DMA write. (And the eviction part is important for DMA reads, too: you don't want the old cached value). But x86 has cache-coherent DMA these days, because modern designs build the memory controller into the CPU die so system traffic can snoop L3 on the way from PCIe to memory.
The major case outside of drivers where you need to worry about caches is with JIT code-generation on non-x86 architectures with non-coherent instruction caches. If you (or a JIT library) write some machine code into a char[] buffer and cast it to a function pointer, architectures like ARM don't guarantee that code-fetch will "see" that newly-written data.
This is why gcc provides __builtin__clear_cache. It doesn't necessarily flush anything, only makes sure it's safe to execute that memory as code. x86 has instruction caches that are coherent with data caches and supports self-modifying code without any special syncing instructions. See godbolt for x86 and AArch64, and note that __builtin__clear_cache compiles to zero instructions for x86, but has an effect on surrounding code: without it, gcc can optimize away stores to a buffer before casting to a function pointer and calling. (It doesn't realize that data is being used as code, so it thinks they're dead stores and eliminates them.)
Despite the name, __builtin__clear_cache is totally unrelated to wbinvd. It needs an address-range as args so it's not going to flush and invalidate the entire cache. It also doesn't use use clflush, clflushopt, or clwb to actually write-back (and optionally evict) data from cache.
When you need to flush some cache for correctness, you only want to flush a range of addresses, not slow the system down by flushing all the caches.
It rarely if ever makes sense to intentionally flush caches for performance reasons, at least on x86. Sometimes you can use pollution-minimizing prefetch to read data without as much cache pollution, or use NT stores to write around cache. But doing "normal" stuff and then clflushopt after touching some memory for the last time is generally not worth it in normal cases. Like a store, it has to go all the way through the memory hierarchy to make sure it finds and flushes any copy of that line anywhere.
There isn't a light-weight instruction designed as a performance hint, like the opposite of _mm_prefetch.
The only cache-flushing you can do in user-space on x86 is with clflush / clflushopt. (Or with NT stores, which also evict the cache line if it was hot before hand). Or of course creating conflict evictions for known L1d size and associativity, like writing to multiple lines at multiples of 4kiB which all map to the same set in a 32k / 8-way L1d.
There's an Intel intrinsic _mm_clflush(void const *p) wrapper for clflush (and another for clflushopt), but these can only flush cache lines by (virtual) address. You could loop over all the cache lines in all the pages your process has mapped... (But that can only flush your own memory, not cache lines that are caching kernel data, like the kernel stack for your process or its task_struct, so the first system-call will still be faster than if you had flushed everything).
There's a Linux system call wrapper to portably evict a range of addresses: cacheflush(char *addr, int nbytes, int flags). Presumably the implementation on x86 uses clflush or clflushopt in a loop, if it's supported on x86 at all. The man page says it first appeared in MIPS Linux "but
nowadays, Linux provides a cacheflush() system call on some other
architectures, but with different arguments."
I don't think there's a Linux system call that exposes wbinvd, but you could write a kernel module that adds one.
Recent x86 extensions introduced more cache-control instructions, but still only by address to control specific cache lines. The use-case is for non-volatile memory attached directly to the CPU, such as Intel Optane DC Persistent Memory. If you want to commit to persistent storage without making the next read slow, you can use clwb. But note that clwb is not guaranteed to avoid eviction, it's merely allowed to. It might run the same as clflushopt, like may be the case on SKX.
See https://danluu.com/clwb-pcommit/, but note that pcommit isn't required: Intel decided to simplify the ISA before releasing any chips that need it, so clwb or clflushopt + sfence are sufficient. See https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.
Anyway, this is the kind of cache-control that's relevant for modern CPUs. Whatever experiment you're doing requires ring0 and assembly on x86.
Footnote 1: Touching a lot of memory: pure ISO C++17
You could maybe allocate a very large buffer and then memset it (so those writes will pollute all the (data) caches with that data), then unmap it. If delete or free actually returns the memory to the OS right away, then it will no longer be part of your process's address space, so only a few cache lines of other data will still be hot: probably a line or two of stack (assuming you're on a C++ implementation that uses a stack, as well as running programs under an OS...). And of course this only pollutes data caches, not instruction caches, and as Basile points out, some levels of cache are private per-core, and OSes can migrate processes between CPUs.
Also, beware that using an actual memset or std::fill function call, or a loop that optimizes to that, could be optimized to use cache-bypassing or pollution-reducing stores. And I also implicitly assumed that your code is running on a CPU with write-allocate caches, instead of write-through on store misses (because all modern CPUs are designed this way). x86 supports WT memory regions on a per-page basis, but mainstream OSes use WB pages for all "normal" memory.
Doing something that can't optimize away and touches a lot of memory (e.g. a prime sieve with a long array instead of a bitmap) would be more reliable, but of course still dependent on cache pollution to evict other data. Just reading large amounts of data isn't reliable either; some CPUs implement adaptive replacement policies that reduce pollution from sequential accesses, so looping over a big array hopefully doesn't evict lots of useful data. E.g. the L3 cache in Intel IvyBridge and later does this.
The answer is no, there is no standard C++ way to do this (even with some compiler intrinsics). GCC has __builtin__clear_cache and __builtin_prefetch and Clang probably has them also.
As Johan commented, x86-64 has a privileged instruction for doing what you want, but __builtin__clear_cache doesn't use it (and is a no-op on x86-64, because instruction caches are coherent with data caches on that architecture so hardware takes care of syncing recently-stored data before executing it as code).
On Linux, you might (perhaps) use the cacheflush(2) Linux specific system call. I never used it, and I don't know if it is implemented on x86-64.
BTW, you should not reason on programs, but on processes. Each has its own virtual address space.
Your question lacks some motivation. If you care about micro-benchmarking, be aware that the kernel scheduler is allowed to reschedule and move your thread or process to some other core at arbitrary machine code instruction (be however aware of processor affinity).
(the function should work regardless of compiler optimizations)?
No, optimizing compilers are reordering and rescheduling machine code instructions and often mix several computations related to different C++ statements. They are allowed to do some computations at compile-time. Read more about the as-if rule. See CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.

Portable explicit prefetch

I am in need of a simple and portable way to explicitly prefetch data. I do not want to use the specific feature of any specific compiler or platform, just something generic enough to work across different platforms and compilers.
One very naive solution that comes to mind is just move a byte/int from the memory location to a register, that "should" bring up that memory segment into the CPU cache to fill a line, at least this is what I logically assume. But maybe it won't be that easy? One possibility is for the compiler to optimize away the operation if that data is not accessed in the particular scope, so no prefetching will occur.
Generally speaking, prefetching and memory loads are not exactly the same operations. There are a few fundamental differences:
Prefetching invalid address does not generate faults whereas attempting to read, write or execute invalid address generates a fault (if CPU has MPU/MMU, of course).
Prefetching can be done for reading and/or writing whereas just reading a byte into register is just reading a byte into register.
You can (theoretically) specify memory locality when prefetching.
CPU might have special instructions for prefetching that are not the same as memory load instructions.
So just stick with __builtin_prefetch and let the compiler do the hard work.
Also, keep in mind that optimizing compilers may generate prefetch instructions automatically. I guess if they do, then you'd have to make sure you do not interfere with that.
Another interesting thing is that, in general, explicit prefetching does not improve performance but slightly degrades it instead. See this LWN article for details and explanation why prefetching was totally removed from the Linux kernel.
Hope it helps. Good Luck!

How can i know my array is in cache?

Lets say my array is 32KB, L1 is 64 KB. Does Windows use some of it while program is running? Maybe I am not able to use L1 because windows is making other programs work? Should I set priority of my program to use all cache?
for(int i=0;i<8192;i++)
{
array_3[i]+=clock()*(rand()%256);//clock() and rand in cache too?
//how many times do I need to use a variable to make it stay in cache?
//or cache is only for reading? look below plz
temp_a+=array_x[i]*my_function();
}
The program is in C/C++.
Same thing for L2 too please.
Also are functions kept in cache? Cache is read only? (If I change my array then it loses the cache bond?)
Does the compiler create the asm codes to use cache more yield?
Thanks
How can i know my array is in cache?
In general, you can't. Generally speaking, the cache is managed directly by hardware, not by Windows. You also can't control whether data resides in the cache (although it is possible to specify that an area of memory shouldn't be cached).
Does windows use some of it while program is running? Maybe i am not able to use L1 because windows is making other programs work? Should i set priority of my program to use all cache?
The L1 and L2 caches are shared by all processes running on a given core. When your process is running, it will use all of cache (if it needs it). When there's a context switch, some or all of the cache will be evicted, depending on what the second process needs. So next time there's a context switch back to your process, the cache may have to be refilled all over again.
But again, this is all done automatically by the hardware.
also functions are kept in cache?
On most modern processors, there is a separate cache for instructions. See e.g. this diagram which shows the arrangement for the Intel Nehalem architecture; note the shared L2 and L3 caches, but the separate L1 caches for instructions and data.
cache is read only?(if i change my array then it loses the cache bond?)
No. Caches can handle modified data, although this is considerably more complex (because of the problem of synchronising multiple caches in a multi-core system.)
does the compiler create the asm codes to use cache more yield?
As cache activity is generally all handled automatically by the hardware, no special instructions are needed.
Cache is not directly controlled by the operating system, it is done
in hardware
In case of a context switch, another application may modify the
cache, but you should not care about this. It is more important to
handle cases when your program behaves cache unfriendly.
Functions are kept in cache (I-Cahce , instruction cache)
Cache is not read only, when you write something it goes to [memory
and] the cache.
The cache is primarily controlled by the hardware. However, I know that Windows scheduler tends to schedule execution of a thread to the same core as before specifically because of the caches. It understands that it will be necessary to reload them on another core. Windows is using this behavior at least since Windows 2000.
As others have stated, you generally cannot control what is in cache. If you are writing code for high-performance and need to rely on cache for performance, then it is not uncommon to write your code so that you are using about half the space of L1 cache. Methods for doing so involve a great deal of discussion beyond the scope of StackOverflow questions. Essentially, you would want to do as much work as possible on some data before moving on to other data.
As a matter of what works practically, using about half of cache leaves enough space for other things to occur that most of your data will remain in cache. You cannot rely on this without cooperation from the operating system and other aspects of the computing platform, so it may be a useful technique for speeding up research calculations but it cannot be used where real-time performance must be guaranteed, as in operating dangerous machinery.
There are additional caveats besides how much data you use. Using data that maps to the same cache lines can evict data from cache even though there is plenty of cache unused. Matrix transposes are notorious for this, because a matrix whose row length is a multiple of a moderate power of two will have columns in which elements map to a small set of cache lines. So learning to use cache efficiently is a significant job.
As far as I know, you can't control what will be in the cache. You can declare a variable as register var_type a and then access to it will be in a single cycle(or a small number of cycles). Moreover, the amount of cycles it will take you to access a chunk of memory also depends on virtual memory translation and TLB.
It should be noted that the register keyword is merely a suggestion and the compiler is perfectly free to ignore it, as was suggested by the comment.
Even though you may not know which data is in cache and which not, you still may get an idea how much of the cache you are utilizing. Modern processor have quite many performance counters and some of them related to cache. Intel's processors may tell you how many L1 and L2 misses there were. Check this for more details of how to do it: How to read performance counters on i5, i7 CPUs

Programmatically counting cache faults

I need to evaluate the time taken by a C++ function in a bunch of hypothesis about memory hierarchy efficiency (e.g: time taken when we have a cache miss, a cache hit or page fault when reading a portion of an array), so I'd like to have some libraries that let me count the cache miss / page faults in order to be capable of auto-generating a performance summary.
I know there are some tools like cachegrind that gives some related statistics on a given application execution, but I'd like a library, as I've already said.
edit Oh, I forgot: I'm using Linux and I'm not interested in portability, it's an academic thing.
Any suggestion is welcome!
Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
Either way, you can count things like cache hits, cache misses, memory requests, data prefetches, etc. They have pretty specific selectors, so you could get a count of (for example) the number of reads on the L2 cache to fill lines in the L1 instruction cache (while still excluding L2 reads to fill lines in the L1 data cache).
There is a Linux kernel module to give access to MSRs (Model-specific registers). Offhand, I don't know whether it gives access to the performance monitor registers, but I'd expect it probably does.
It looks like now there is exactly what I was searching for: perf_event_open.
It lets you do interesting things like initializing/enabling/disabling some performance counters for subsequently fetching their values through an uniform and intuitive API (it gives you a special file descriptor which hosts a struct containing the previously requested informations).
It is a linux-only solution and the functionalities varies depending on the kernel version, so be careful :)
Intel VTune is a performance tuning tool that does exactly what you are asking for;
Of course it works with Intel processors, as it access the internal processor counters, as explained by Jerry Coffin, so this probably not work on an AMD processor.
It expose literally undreds of counters, like cache hit/misses, branch prediction rates, etc. the real issue with it is understanding which counters to check ;)
The cache misses cannot be just counted easily. Most tools or profilers simulate the memory access by redirecting memory accesses to a function that provides this feature. That means these kind of tools instrument the code at all places where a memory access is done and makes your code run awfully slowly. This is not what your intent is I guess.
However depending on the hardware you might have some other possibilities. But even if this is the case the OS should support it (because otherwise you would get system global stats not the ones related to a process or thread)
EDIT: I could find this interesting article that may help you: http://lwn.net/Articles/417979/