Understanding `_mm_prefetch`

Understanding `_mm_prefetch` - c++

The answer What are _mm_prefetch() locality hints? goes into details on what the hint means.
My question is: which one do I WANT?
I work on a function that is called repeatedly, billions of times, with some int parameter among others. First thing I do is to look up some cached value using that parameter (its low 32 bits) as a key into 4GB cache. Based on the algorithm from where this function is called, I know that most often that key will be doubled (shifted left by 1 bit) from one call to the next, so I am doing:
int foo(int key) {
uint8_t value = cache[key];
_mm_prefetch((const char *)&cache[key * 2], _MM_HINT_T2);
// ...
The goal is to have this value in a processor cache by the next call to this function.
I am looking for confirmation on my understanding of two points:
The call to _mm_prefetch is not going to delay the processing of the instructions immediately following it.
There is no penalty for pre-fetching wrong location, just a lost benefit from guessing it right.
That function is using a lookup table of 128 128-bit values (2 KB total). Is there a way to “force” it to be cached? The index into that lookup table is incremented sequentially; should I pre-fetch them too? I should probably use another hint, to point to another level of cache? What is the best strategy here?

If you do anything related to performance, the best and ultimate way to know what you need is to try it. Fortunately, you know exactly what to try, and there are just a few possibilities.
Regarding your understanding — yes, it is correct. However, there is a cost to anything (e.g. if you add any instruction to your code, the processor will waste a nanosecond executing it). You should verify your idea of prefetching by measuring the performance before and after. For very irregular access patterns, it is very likely to work.
Regarding prefetching any sequential data — you should probably not bother. Caches hold data at 64-byte granularity, so for sequential data, prefetching will usually not help. In addition, some (all?) caches have predictive loading — they prefetch ahead even when not told to.

As I noted in the comments, there's some risk to prefetching the wrong address - a useful address will be evicted from the cache, potentially causing a cache miss.
That said:
_mm_prefetch compiles into the PREFETCHn instruction. I looked up the instruction in the AMD64 Architecture Programmer's Manual published by AMD. (Note that all of this information is necessarily chipset specific; you may need to find your CPU's docs).
AMD says (my emphasis):
The operation of this instruction is implementation-dependent. The processor implementation can ignore or change this instruction. The size of the cache line also depends on the implementation, with a minimum size of 32 bytes. AMD processors alias PREFETCH1 and PREFETCH2 to PREFETCH0
What that appears to mean is that if you're running on an AMD, then the hint is ignored, and the memory is loaded into the all levels of the cache -- unless it's a hint that it's a NTA (Non-Temporal-Access, attempts to load memory with minimal cache pollution).
Here's the full page for the instruction
I think in the end, the guidance is what the other answer says: brainstorm, implement, test, and measure. You're on the bleeding edge of perf here, and there's not going to be a one size fits all answer.
Another resource that may help you is Agner Fog's Optimization manuals, which will help you optimize for your specific CPU.

Related

How can I prefetch infrequently used code?

I want to prefetch some code into the instruction cache. The code path is used infrequently but I need it to be in the instruction cache or at least in L2 for the rare cases that it is used. I have some advance notice of these rare cases. Does _mm_prefetch work for code? Is there a way to get this infrequently used code in cache? For this problem I don't care about portability so even asm would do.

The answer depends on your CPU architecture.
That said, if you are using gcc or clang, you can use the __builtin_prefetch instruction to try to generate a prefetch instruction. On Pentium 3 and later x86-type architectures, this will generate a PREFETCHh instruction, which requests a load into the data cache hierarchy. Since these architectures have unified L2 and higher caches, it may help.
The function looks like this:
__builtin_prefetch(const void *address, int locality);
The locality argument should be in the range 0...3. Assuming locality maps directly to the h part of the PREFETCHh instruction, you want to pass 1 or 2, which ask for the data to be loaded into the L2 and higher caches. See Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 2B: Instruction Set Reference, M-Z (PDF) page 4-277. (Find other volumes here.)
If you're using another compiler that doesn't have __builtin_prefetch, see whether it has the _mm_prefetch function. You may need to include a header file to get that function. For example, on OS X, that function, and constants for the locality argument, are declared in xmmintrin.h.

There isn't any (official [1] x86) instruction to prefetch code, only data. I find this a rather bizarre use-case, where the code-path is known beforehand, but executes rarely, and there is a significant benefit in prefetching the code. It would be great to understand where you've come to the conclusion that there is a significant benefit in pre-loading the code for this special case, since it would require not only analyzing that the code is significantly slower when it's not been hit for a long time, but also determining that there is spare bus-cycles to actually load the code before the processor can prefetch it by it's normal mechanism for loading code.
You may be able to use the prefetch instructions that fetch into L2, which is typically shared between I- and D-cache.
[1] I know there are some "secret" instructions that allow the processor to manipulate cache-content, but since those would require a lot of extra work, even if you could use them in user-mode code [and I expect this is not some kernel-mode code].

Programmatically counting cache faults

I need to evaluate the time taken by a C++ function in a bunch of hypothesis about memory hierarchy efficiency (e.g: time taken when we have a cache miss, a cache hit or page fault when reading a portion of an array), so I'd like to have some libraries that let me count the cache miss / page faults in order to be capable of auto-generating a performance summary.
I know there are some tools like cachegrind that gives some related statistics on a given application execution, but I'd like a library, as I've already said.
edit Oh, I forgot: I'm using Linux and I'm not interested in portability, it's an academic thing.
Any suggestion is welcome!

Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
Either way, you can count things like cache hits, cache misses, memory requests, data prefetches, etc. They have pretty specific selectors, so you could get a count of (for example) the number of reads on the L2 cache to fill lines in the L1 instruction cache (while still excluding L2 reads to fill lines in the L1 data cache).
There is a Linux kernel module to give access to MSRs (Model-specific registers). Offhand, I don't know whether it gives access to the performance monitor registers, but I'd expect it probably does.

It looks like now there is exactly what I was searching for: perf_event_open.
It lets you do interesting things like initializing/enabling/disabling some performance counters for subsequently fetching their values through an uniform and intuitive API (it gives you a special file descriptor which hosts a struct containing the previously requested informations).
It is a linux-only solution and the functionalities varies depending on the kernel version, so be careful :)

Intel VTune is a performance tuning tool that does exactly what you are asking for;
Of course it works with Intel processors, as it access the internal processor counters, as explained by Jerry Coffin, so this probably not work on an AMD processor.
It expose literally undreds of counters, like cache hit/misses, branch prediction rates, etc. the real issue with it is understanding which counters to check ;)

The cache misses cannot be just counted easily. Most tools or profilers simulate the memory access by redirecting memory accesses to a function that provides this feature. That means these kind of tools instrument the code at all places where a memory access is done and makes your code run awfully slowly. This is not what your intent is I guess.
However depending on the hardware you might have some other possibilities. But even if this is the case the OS should support it (because otherwise you would get system global stats not the ones related to a process or thread)
EDIT: I could find this interesting article that may help you: http://lwn.net/Articles/417979/

How can I prefetch a memory region most easily?

Background: I've implemented a stochastic algorithm that requires random ordering for best convergence. Doing so obviously destroys memory locality, however. I've found that by prefetching the next iteration's data, the performance drop is minimized.
I can prefetch n cache lines using _mm_prefetch in a simple, mostly OS+compiler-portable fashion - but what's the length of a cache line? Right now, I'm using a hardcoded value of 64, which seems to be the norm nowadays on x64 processors - but I don't know how to detect this at runtime, and a question about this last year found no simple solution.
I've seen GetLogicalProcessorInformation on windows but I'm leery of using such a complex API for something so simple, and that won't work on macs or linux anyhow.
Perhaps there's some entirely other API/intrinsic that could prefetch a memory region identified in terms of bytes (or words, or whatever) and allows me to prefetch without knowing the cache line length?
Basically, is there a reasonable alternative to _mm_prefetch with #define CACHE_LINE_LEN 64?

There's a question asking just about the same thing here. You can read it from the CPUID if you feel like delving into some assembly. You'll have to write platform specific code for this of course.
You're probably already familiar with Agner Fog's manuals for optimization which gives the cache information for many popular processors. If you are able to determine the expected CPU's you'll encounter you can just hard-code the cache line sizes and look up the CPU vendor information to set the line size.

C++ cache aware programming

is there a way in C++ to determine the CPU's cache size? i have an algorithm that processes a lot of data and i'd like to break this data down into chunks such that they fit into the cache. Is this possible?
Can you give me any other hints on programming with cache-size in mind (especially in regard to multithreaded/multicore data processing)?
Thanks!

According to "What every programmer should know about memory", by Ulrich Drepper you can do the following on Linux:
Once we have a formula for the memory
requirement we can compare it with the
cache size. As mentioned before, the
cache might be shared with multiple
other cores. Currently {There
definitely will sometime soon be a
better way!} the only way to get
correct information without hardcoding
knowledge is through the /sys
filesystem. In Table 5.2 we have seen
the what the kernel publishes about
the hardware. A program has to find
the directory:
/sys/devices/system/cpu/cpu*/cache
This is listed in Section 6: What Programmers Can Do.
He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.
There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a system call on Linux which is supposed to return the L2 cache size although it doesn't seem to be well documented.

C++ itself doesn't "care" about CPU caches, so there's no support for querying cache-sizes built into the language. If you are developing for Windows, then there's the GetLogicalProcessorInformation()-function, which can be used to query information about the CPU caches.

Preallocate a large array. Then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when cache miss occurs. Then you can calculate your L1 Cache. It might not work but worth trying.

read the cpuid of the cpu (x86) and then determine the cache-size by a look-up-table. The table has to be filled with the cache sizes the manufacturer of the cpu publishes in its programming manuals.

Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.
TBB includes cache aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation, I couldn't find any direct link).

Interestingly enough, I wrote a program to do this awhile ago (in C though, but I'm sure it will be easy to incorporate in C++ code).
http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c
The get_cache_line function is the interesting one, which returns the location of right before the biggest spike in timing data of array accesses. It correctly guessed on my machine! If anything else, it can help you make your own.
It's based off of this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/

You can see this thread: http://software.intel.com/en-us/forums/topic/296674
The short answer is in this other thread:
On modern IA-32 hardware, the cache line size is 64. The value 128 is
a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium
D) where 64-byte lines are paired into 128-byte sectors. When a line
in a sector is fetched, the hardware automatically fetches the other
line in the sector too. So from a false sharing perspective, the
effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)

IIRC, GCC has a __builtin_prefetch hint.
http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html
has an excellent section on this. Basically, it suggests:
__builtin_prefetch (&array[i + LookAhead], rw, locality);
where rw is a 0 (prepare for read) or 1 (prepare for a write) value, and locality uses the number 0-3, where zero is no locality, and 3 is very strong locality.
Both are optional. LookAhead would be the number of elements to look ahead to. If memory access were 100 cycles, and the unrolled loops are two cycles apart, LookAhead could be set to 50 or 51.

There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?
Determining the cache-size at compile-time
For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simplify looking up the size and hard-coding it is an option (could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.
If you want to avoid that complexity or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It will provide a compile-time estimation for the cache line, but be aware of its limitations. Note that the compiler cannot guess perfectly when it creates the binary, as the size of the cache-line is, in general, architecture dependent.
Determining the cache-size at runtime
At runtime, you have the advantage that you know the exact machine, but you will need platform specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, MacOS). In a similar fashion, you can also read the L2 cache size at runtime.
I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.
Combining both approaches
If you have to ship one binary and the machines that it will later run on features a range of different architectures with varying cache sizes, you could create specialized code parts for each cache size, and then dynamically (at application startup) choose the best fitting one.

The cache will usually do the right thing. The only real worry for normal programmer is false sharing, and you can't take care of that at runtime because it requires compiler directives.

Is it possible to lock some data in CPU cache?

I have a problem....
I'm writing a data into array in the while-loop. And the point is that I'm doing it really frequently. It seems to be that this writing is now a bottle-neck in the code. So as i presume it's caused by the writing to memory. This array is not really large (smth like 300 elements). The question is it possible to do it in that way: to store it in the cache and update in the memory only after the while-loop is finished?
[edit - copied from an answer added by Alex]
double* array1 = new double[1000000]; // this array has elements
unsigned long* array2 = unsigned long[300];
double varX,t,sum=0;
int iter=0,i=0;
while(i<=max_steps)
{
varX+=difX;
nm0 = int(varX);
if(nm1!=nm0)
{
array2[iter] = nm0; // if you comment this string application works more then 2 times faster :)
nm1=nm0;
t = array1[nm0]; // if you comment this string , there is almost no change in time
++iter;
}
sum+=t;
++i;
}
For the first I'd like to thank all of you for answers. Indeed, it was a little dumb not to place a code. So i decided to do it now.
double* array1 = new double[1000000]; // this array has elements
unsigned long* array2 = unsigned long[300];
double varX,t,sum=0;
int iter=0,i=0;
while(i<=max_steps)
{
varX+=difX;
nm0 = int(varX);
if(nm1!=nm0)
{
array2[iter] = nm0; // if you comment this string application works more then 2 times faster :)
nm1=nm0;
t = array1[nm0]; // if you comment this string , there is almost no change in time
++iter;
}
sum+=t;
++i;
}
So that was it. It would be nice if someone will have any ideas. Thank you very much again.
Sincerely
Alex

Not intentionally, no. Among other things, you have no idea how big the cache is, so you have no idea of what's going to fit. Furthermore, if the app were allowed to lock off part of the cache, the effects on the OS might be devastating to overall system performance. This falls squarely onto my list of "you can't do it because you shouldn't do it. Ever."
What you can do is to improve your locality of reference - try to arrange the loop such that you don't access the elements more than once, and try to access them in order in memory.
Without more clues about your application, I don't think more specific advice can be given.

The CPU does not usually offer fine-grained cache control, you're not allowed to choose what is evicted or to pin things in cache. You do have a few cache operations on some CPUs. Just as a bit of info on what you can do: Here's some interesting cache related instructions on newer x86{-64} CPUs (Doing things like this makes portability hell, but I figured you may be curious)
Software Data Prefecth
The non-temporal instruction is
prefetchnta, which fetches the data
into the second-level cache,
minimizing cache pollution.
The temporal instructions are as
follows:
* prefetcht0 – fetches the data into all cache levels, that is, to the
second-level cache for the Pentium® 4 processor.
* prefetcht1 – Identical to prefetcht0
* prefetcht2 – Identical to prefetcht0
Additionally there are a set of instructions for accessing data in memory but explicitly tell the processor to not insert the data into the cache. These are called non-temporal instructions. An example of one is here: MOVNTI.
You could use the non temporal instructions for every piece of data you DON'T want in cache, in the hopes that the rest will always stay in cache. I don't know if this would actually improve performance as there are subtle behaviors to be aware of when it comes to the cache. Also it sounds like it'd be relatively painful to do.

I have a problem.... I'm writing a data into array in the while-loop. And the point is that I'm doing it really frequently. It seems to be that this writing is now a bottle-neck in the code. So as i presume it's caused by the writing to memory. This array is not really large (smth like 300 elements). The question is it possible to do it in that way: to store it in the cache and update in the memory only after the while-loop is finished?
You don't need to. The only reason why it might get pushed out of the cache is if some other data is deemed more urgent to put in the cache.
Apart from this, an array of 300 elements should fit in the cache with no trouble (assuming the element size isn't too crazy), so most likely, your data is already in cach.
In any case, the most effective solution is probably to tweak your code. Use lots of temporaries (to indicate to the compiler that the memory address isn't important), rather than writing/reading into the array constantly. Reorder your code so loads are performed once, at the start of the loop, and break up dependency chains as much as possible.
Manually unrolling the loop gives you more flexibility to achieve these things.
And finally, two obvious tools you should use rather than guessing about the cache behavior:
A profiler, and cachegrind if available. A good profiler can tell you a lot of statistics on cache misses, and cachegrind give you a lot of information too.
Us here at StackOverflow. If you post your loop code and ask how its performance can be improved, I'm sure a lot of us will find it a fun challenge.
But as others have mentioned, when working with performance, don't guess. You need hard data and measurements, not hunches and gut feelings.

Unless your code does something completely different in between writing to the array, then most of the array will probably be kept in the cache.
Unfortunately there isn't anything you can do to affect what is in the cache, apart from rewriting your algorithm with the cache in mind. Try to use as little memory as possible in between writing to the memory: don't use lot's of variables, don't call many other functions, and try to write to the same region of the array consecutively.

I doubt that this is possible, at least on a high-level multitasking operating system. You can't guarantee that your process won't be pre-empted, and lose the CPU. If your process then owns the cache, other processes can't use it, which would make their exeucution very slow, and complicate things a great deal. You really don't want to run a modern several-GHz processor without cache, just because one application has locked all others out of it.

In this case, array2 will be quite "hot" and stay in cache for that reason alone. The trick is keeping array1 out of cache (!). You're reading it only once, so there is no point in caching it. The SSE instruction for that is MOVNTPD, intrinsic void_mm_stream_pd(double *destination, __m128i source)

Even if you could, it's a bad idea.
Modern desktop computers use multiple-core CPUs. Intel's chips are the most common chips in Desktop machines... but the Core and Core 2 processors don't share an on-die cache.
That is, didn't share a cache until the Core 2 i7 chips were released, which share an on-die 8MB L3 cache.
So, if you were able to lock data in the cache on the computer I'm typing this from, you can't even guarantee this process will be scheduled on the same core, so that cache lock may be totally useless.

If your writes are slow, make sure that no other CPU core is writing in the same memory area at the same time.

When you have a performance problem, don't assume anything, measure first. For example, comment out the writes, and see if the performance is any different.
If you are writing to an array of structures, use a structure pointer to cache the address of the structure so you are not doing the array multiply each time you do an access. Make sure you are using the native word length for the array indexer variable for maximum optimisation.

As other people have said, you can't control this directly, but changing your code may indirectly enable better caching. If you're running on linux and want to get more insight into what's happening with the CPU cache when your program runs, you can use the Cachegrind tool, part of the Valgrind suite. It's a simulation of a processor, so it's not totally realistic, but it gives you information that is hard to get any other way.

It might be possible to use some assembly code, or as onebyone pointed out, assembly intrinsics, to prefetch lines of memory into the cache, but that would cost a lot of time to tinker with it.
Just for trial, try to read in all the data (in a manner that the compiler won't optimize away), and then do the write. See if that helps.

In the early boot phases of CoreBoot (formerly LinuxBIOS) since they have no access to RAM yet (we are talking about BIOS code, and therefore the RAM hasn't been initialized yet) they set up something they call Cache-as-RAM (CAR), i.e. they use the processor cache as RAM even if not backed by actual RAM.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js