More TLB misses when process memory size larger? - c++

I have a program which I have written in C++. On Linux, the process is allocated a certain amount of memory: part is the Stack, part the Heap, part Text, and part BSS.
Is the following true:
The larger the amount of memory allocated to the Heap component of my process, the greater the chance of Translation Lookaside Buffer (TLB) misses?
And, generally speaking, the more memory my application process consumes, the greater the chance of TLB misses?

I think there is no direct relationship between the amount of memory allocated and the TLB miss rate. As far as I know, as long as your program has good locality, TLB misses will remain low.
There are several reasons that can lead to a high TLB miss rate:
1. Not enough memory and too many running processes;
2. Low locality in your program;
3. An inefficient pattern of visiting array elements in the loops of your code (see the sketch below).
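For instance, here is a minimal sketch of point 3, contrasting row-major and column-major traversal of a 2D array (the matrix size and element type are just illustrative):

#include <cstddef>
#include <vector>

// Illustrative 4096 x 4096 matrix stored row-major in one contiguous block.
constexpr std::size_t N = 4096;

long sum_row_major(const std::vector<int>& m)
{
    long sum = 0;
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            sum += m[row * N + col];   // walks memory sequentially: good cache and TLB locality
    return sum;
}

long sum_column_major(const std::vector<int>& m)
{
    long sum = 0;
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            sum += m[row * N + col];   // jumps 16 KiB (4 pages) per step: poor locality
    return sum;
}

The second loop lands on a different 4 KiB page on nearly every iteration, so both cache lines and TLB entries are evicted long before they can be reused.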

Programs are usually divided into phases that exhibit completely different memory and execution characteristics - your code may allocate a huge chunk of memory at some point, then be off doing some other unrelated computations. In that case, your TLBs (that are basically just caches for address translation) would age away the unused pages and eventually drop them. While you're not using these pages, you shouldn't care about that.
The real question is: when you get to some performance-critical phase, are you going to work with more pages than your TLBs can sustain simultaneously? On one hand, modern CPUs have large TLBs, often with 2 levels of caching - the L2 TLB of a modern Intel CPU should have (IIRC) 512 entries - that's 2 MB worth of data if you're using 4k pages (with large pages that would have been more, but TLBs usually don't like to work with them due to potential conflicts with smaller pages...).
It's quite possible for an application to work with more than 2 MB of data, but you should avoid touching it all at the same time if possible - either by doing cache tiling or by changing the algorithms. That's not always possible (e.g. when streaming from memory or from IO), but then the TLB misses are probably not your main bottleneck. When working with the same set of data and accessing the same elements multiple times, you should always attempt to keep them cached as close as possible.
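As a rough illustration of the cache-tiling idea, here is a hedged sketch of a tiled matrix transpose (the sizes are placeholders, N is assumed to be a multiple of BLOCK, and the block size would need tuning against a real cache and TLB):

#include <cstddef>

constexpr std::size_t N = 8192;    // placeholder matrix dimension
constexpr std::size_t BLOCK = 64;  // placeholder tile size; assumes N % BLOCK == 0

// A naive transpose writes dst with a stride of N elements, touching a new cache line
// (and often a new page) on every store. Working in BLOCK x BLOCK tiles keeps the rows
// of src and the columns of dst within a small, cache-resident footprint per tile.
void transpose_tiled(const float* src, float* dst)   // both N*N, row-major
{
    for (std::size_t bi = 0; bi < N; bi += BLOCK)
        for (std::size_t bj = 0; bj < N; bj += BLOCK)
            for (std::size_t i = bi; i < bi + BLOCK; ++i)
                for (std::size_t j = bj; j < bj + BLOCK; ++j)
                    dst[j * N + i] = src[i * N + j];
}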
It's also possible to use software prefetches to make the CPU perform the TLB misses (and the following page walks) earlier in time, preventing them from blocking your progress. On some CPUs, hardware prefetchers are already doing this for you.
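A hedged sketch of the software-prefetch idea on x86, using the _mm_prefetch intrinsic on an indirectly indexed access pattern (the prefetch distance of 16 iterations is a guess that would need tuning):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0 (x86 only)
#include <cstddef>

// Sum elements selected by an index array. The pattern is irregular from the hardware
// prefetcher's point of view, but the addresses are known in advance, so we can request
// them a few iterations early and overlap the miss latency with useful work.
long gather_sum(const int* data, const std::size_t* idx, std::size_t n)
{
    constexpr std::size_t AHEAD = 16;   // prefetch distance: tune empirically
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + AHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(&data[idx[i + AHEAD]]), _MM_HINT_T0);
        sum += data[idx[i]];
    }
    return sum;
}

Whether a prefetch that misses the TLB actually triggers a page walk varies by microarchitecture, so measure before relying on this.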

Related

Data Orientated Design; how do I optimize a data structure in c++ for performance?

I would like to have a class of varying number n of objects which are easily iterated over as a group, with each object member having a large list (20+) of individually modified variables influencing class methods. Before I started learning OOP, I would just make a 2D array and load the variable values into each row, corresponding to each object, and then append/delete rows as needed. Is this still a good solution? Is there a better solution?
Again, in this case I am more interested in pushing processor performance than in preserving abstraction, modularity, etc. In this respect, I am very confused about the way the data container is ultimately read into the L1 cache, and how to ensure that I do not induce page inefficiency or cache misses. If, for example, I have a 128 KB cache, I assume the entire container should fit into this cache to be efficient, correct?
According to Agner Fog's optimization manual, the C++ Standard Template Library is rather inefficient, because it makes extensive use of dynamic memory allocation. However, a fixed size array that is made larger than necessary (e.g. because the needed size is not known at compile time) can also be bad for performance, because a larger size means that it won't fit into the cache as easily. In such situations, the STL's dynamic memory allocation could perform better.
Generally, it is best to store your data in contiguous memory. You can use a fixed size array or an std::vector for this. However, before using std::vector, you should call std::vector::reserve() for performance reasons, so that the memory does not have to be reallocated too often. If you reallocate too often, the heap could become fragmented, which is also bad for cache performance.
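A minimal sketch of that pattern, with a hypothetical Particle type standing in for whatever per-object data you have:

#include <cstddef>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };   // hypothetical per-object data

std::vector<Particle> make_particles(std::size_t expected_count)
{
    std::vector<Particle> particles;
    particles.reserve(expected_count);   // one allocation up front instead of repeated growth
    for (std::size_t i = 0; i < expected_count; ++i)
        particles.push_back(Particle{static_cast<float>(i), 0, 0, 0, 0, 0});   // elements stay contiguous
    return particles;
}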
Ideally, the data that you are working on will fit entirely into the Level 1 data cache (which is about 32 KB on modern desktop processors). However, even if it doesn't fit, the Level 2 cache is much larger (about 512 KB) and the Level 3 Cache is several Megabytes. The higher-level caches are still significantly faster than reading from main memory.
It is best if your memory access patterns are predictable, so that the hardware prefetcher can do its work best. Sequential memory accesses are easiest for the hardware prefetcher to predict.
The CPU cache works best if you access the same data several times and if the data is small enough to be kept in the cache. However, even if the data is used only once, the CPU cache can still make the memory access faster, by making use of prefetching.
A cache miss will occur if
the data is being accessed for the first time and the hardware prefetcher was not able to predict and prefetch the needed memory address in time, or
the data is no longer cached, because the cache had to make room for other data, due to the data being too large to fit in the cache.
In addition to the hardware prefetcher attempting to predict needed memory addresses in advance (which is automatic), it is also possible for the programmer to explicitly issue a software prefetch. However, from what I have read, it is hard to get significant performance gains from doing this, except under very special circumstances.

Shorter loop, same coverage, why do I get more Last Level Cache Misses in c++ with Visual Studio 2013?

I'm trying to understand what creates cache misses and, eventually, how much they cost in terms of performance in our application. But with the tests I'm doing now, I'm quite confused.
Assuming that my L3 cache is 4MB, and my LineSize is 64 bytes, I would expect that this loop (loop 1):
int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3); ++i )
{
++aArr[i];
}
...and this loop (loop 2):
int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3 / 64u); ++i )
{
++aArr[i * 64];
}
give roughly the same number of Last Level Cache Misses, but a different number of Inclusive Last Level Cache References.
However, the numbers that the Visual Studio 2013 profiler gives me are unsettling.
With loop 1:
Inclusive Last Level Cache References: 53,000
Last Level Cache Misses: 17,000
With loop 2:
Inclusive Last Level Cache References: 69,000
Last Level Cache Misses: 35,000
I have tested this with a dynamically allocated array, and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Why don't I get the same amount of cache misses, and why do I have more references in a shorter loop?
Incrementing every byte of int8_t aArr[SIZE_L3]; separately is slow enough that hardware prefetchers are probably able to keep up pretty well a lot of the time. Out-of-order execution can keep a lot of read-modify-writes in flight at once to different addresses, but the best-case is still one byte per clock of stores. (Bottleneck on store-port uops, assuming this was a single-threaded test on a system without a lot of other demands for memory bandwidth).
Intel CPUs have their main prefetch logic in the L2 cache (as described in Intel's optimization guide; see the x86 tag wiki). So a successful hardware prefetch into L2 cache before the core issues a load means that the L3 cache never sees a miss.
John McCalpin's answer on this Intel forum thread confirms that L2 hardware prefetches are NOT counted as LLC references or misses by the normal perf events like MEM_LOAD_UOPS_RETIRED.LLC_MISS. Apparently there are OFFCORE_RESPONSE events you can look at.
IvyBridge introduced next-page HW prefetch. Intel Microarchitectures before that don't cross page boundaries when prefetching, so you still get misses every 4k. And maybe TLB misses if the OS didn't opportunistically put your memory in a 2MiB hugepage. (But speculative page-walks as you approach a page boundary can probably avoid much delay there, and hardware definitely does do speculative page walks).
With a stride of 64 bytes, execution can touch memory much faster than the cache / memory hierarchy can keep up. You bottleneck on L3 / main memory. Out-of-order execution can keep about the same number of read/modify/write ops in flight at once, but the same out-of-order window covers 64x more memory.
Explaining the exact numbers in more detail
For array sizes right around L3, IvyBridge's adaptive replacement policy probably makes a significant difference.
Until we know the exact uarch, and more details of the test, I can't say. It's not clear if you only ran that loop once, or if you had an outer repeat loop and those miss / reference numbers are an average per iteration.
If it's only from a single run, that's a tiny noisy sample. I assume it was somewhat repeatable, but I'm surprised the L3 references count was so high for the every-byte version. 4 * 1024^2 / 64 = 65536, so there was still an L3 reference for most of the cache lines you touched.
Of course, if you didn't have a repeat loop, and those counts include everything the code did besides the loop, maybe most of those counts came from startup / cleanup overhead in your program. (i.e. your program with the loop commented out might have 48k L3 references, IDK.)
I have tested this with a dynamically allocated array
Totally unsurprising, since it's still contiguous.
and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Did this test use a larger array? Or did you use a 4MiB array on a CPU with an 8MiB L3?
Your assumption that "If I skip over more elements in the array, making for fewer iterations of the loop and fewer array accesses, that I should have fewer cache misses" seems to be ignoring the way that data gets fetched into the cache.
When you access memory, more data is kept in the cache than just the specific data you accessed. If I access intArray[0], then intArray[1] and intArray[2] are likely going to be fetched as well at the same time. This is one of the optimizations that allows the cache to help us work faster. So if I access those three memory locations in a row, it's sort of like having only 1 memory read that you need to wait for.
If you increase the stride so that you instead access intArray[0], then intArray[100] and intArray[200], the data may require 3 separate reads, because the second and third memory accesses might not be in cache, resulting in cache misses.
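To make that concrete, here is a rough sketch of the two kinds of access (the 64-byte line size matches the question; array contents and sizes are placeholders):

#include <cstddef>
#include <cstdint>

// With 64-byte cache lines, 16 consecutive int32_t elements share one line: the first
// access pays for the miss and the next 15 hit in cache. A stride of 16 ints (64 bytes)
// touches a new line on every single access instead.
constexpr std::size_t INTS_PER_LINE = 64 / sizeof(std::int32_t);   // 16

long sum_sequential(const std::int32_t* a, std::size_t n)
{
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];                       // roughly 1 miss per 16 accesses
    return s;
}

long sum_strided(const std::int32_t* a, std::size_t n)
{
    long s = 0;
    for (std::size_t i = 0; i < n; i += INTS_PER_LINE)
        s += a[i];                       // roughly 1 miss per access
    return s;
}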
All of the exact details of your specific problem depend on your computer architecture. I would assume you are running an Intel x86-based architecture, but when we are talking about hardware at this low a level I should not assume (I think you can get Visual Studio to run on other architectures, can't you?); and I don't remember all of the specifics for that architecture anyway.
Because you generally don't know exactly what the caching system will be like on the hardware your software runs on, and it can change over time, it is usually better to read up on caching principles in general and try to write code that is likely to produce fewer misses. Trying to make the code perfect for the specific machine you're developing on is usually a waste of time. The exceptions to this are certain embedded control systems and other types of low-level systems which are not likely to change on you; unless this describes your work, I suggest you just read some good articles or books about computer caches.

How does hardware affect the performance of malloc/new on Windows

I am working on the performance of a C++ application on Windows 7, which does a lot of computation and a lot of small allocations. Basically, I observed a bottleneck using the Visual Studio sampling profiler, and it came down to the parsing of a file and the creation of a huge tree structure of the type
class TreeStruct : std::map<key, TreeStructPtr>
{
SomeMetadata p;
int* buff;
int buffsize;
};
There are tens of thousands of these structures created during the parsing.
The buffers are not that big, 1 byte to a few hundred bytes.
The profiler reports that the most costly functions are:
free (13 000 exclusive samples, 38% Exclusive Samples)
operator new (13 000 exclusive samples, 38% Exclusive Samples)
realloc (4000 exclusive samples, 13% Exclusive Samples)
I managed to optimize and to reduce allocations to
operator new (2200 exclusive samples, 48% Exclusive Samples)
free (1770 exclusive samples, 38% Exclusive Samples)
some function (73 exclusive samples, 1.5% Exclusive Samples)
When I measure the client waiting time (i.e. a client waits for the action to complete, measured with a stopwatch), the installed version on my machine went from 85s of processing time to 16s, which is great. I proceeded to test on the most powerful machine we have and was stunned that the non-optimized version took only 3.5s while the optimized one took around 2s. Same executable, same operating system...
Question: How is such a disparity possible on two modern machines?
Here are the specs:
85s to 16s machine
3.5s to 2s machine
The processing is mono-threaded.
As others have commented, frequent small allocations are a waste of time and memory.
For every allocation, there is overhead:
Function call preparation.
Function call (a break in the execution path; possible reload of the execution pipeline).
Algorithm to find a memory block (searching, perhaps).
Allocating the memory (marking the block as unavailable).
Placing the address into a register.
Returning from the function (another break in sequential execution).
Regardless of your machine's speed, the above process is a lot of execution to allocate a small block of memory.
Modern processors love to keep their data close (as in a data cache). Their performance increases when they can fetch data from the cache and avoid fetching from outside the processor (access times slow down the further away the values are: memory on chip but outside the core; memory off chip on the same board; memory on other boards; memory on devices such as Flash and hard drives). Reallocating memory defeats the effectiveness of the data cache.
The Operating System may get involved and slow down your program. In the allocation or delete functions, the O.S. may check for paging. Paging, in a simple form, is the swapping of memory areas with areas on the hard drive. This may occur when other higher priority tasks are running and demand more memory.
An algorithm for speeding up data access:
Load data from memory into local variables (registers if possible).
Process the data in the local variables (registers).
Store the finished data.
If you can, place data into structures. Load all the structure members at once. Structures allow for data to be placed into contiguous memory (which reduces the need to reload the cache).
Lastly, reduce branching and changes in execution flow. Research "loop unrolling". Your compiler may perform this optimization at higher optimization settings.
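One way to act on the small-allocations point, sketched with a hypothetical node type rather than the asker's actual TreeStruct: keep all nodes in one contiguous std::vector and link them by index, so parsing performs one large allocation (plus occasional growth) instead of tens of thousands of small ones.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flattened tree: nodes live in one contiguous vector and refer to each
// other by index rather than by pointer, so there is no per-node new/delete.
struct Node
{
    int key;
    std::size_t first_child;    // index into the pool, SIZE_MAX means "none"
    std::size_t next_sibling;
};

class NodePool
{
public:
    explicit NodePool(std::size_t expected) { nodes_.reserve(expected); }

    std::size_t add(int key)
    {
        nodes_.push_back(Node{key, SIZE_MAX, SIZE_MAX});
        return nodes_.size() - 1;   // indices stay valid even when the vector grows
    }

    Node& operator[](std::size_t i) { return nodes_[i]; }
    const Node& operator[](std::size_t i) const { return nodes_[i]; }

private:
    std::vector<Node> nodes_;
};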

Flushing the cache to prevent benchmarking fluctuations

I am running someone else's C++ code to do benchmarking on a dataset. The issue I have is that I often get one timing for the first run, and the numbers change massively (e.g. from 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to the CPU's automatic caching. Is there a way to flush the cache, or to prevent these fluctuations somehow?
Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, it's completely different instructions for each processor architecture.
All current x86 processors do have a clflush instruction, which flushes one cache line, but to use it you have to have the address of the data (or code) you want to flush. That is fine for small and simple data structures, not so good if you have a binary tree that is scattered all over the place. And of course, it is not at all portable.
In most environments, reading and writing a large block of other data works, e.g. something like:
#include <cstddef>   // size_t
#include <cstdlib>   // rand()

// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" the cache.
for (size_t i = 0; i < bigger_than_cachesize; i++)
{
    p[i] = rand();
}
Using rand will be much slower than filling with something constant/known, but the compiler can't optimise the call away, which means the code is (almost) guaranteed to stay.
The above won't flush instruction caches - that is a lot more difficult to do; basically, you have to run some (large enough) other piece of code to do it reliably. However, instruction caches tend to have less effect on overall benchmark performance (the instruction cache is EXTREMELY important for a modern processor's performance, that's not what I'm saying, but in the sense that the code for a benchmark is typically small enough that it all fits in cache, and the benchmark runs many times over the same code, so it's only slower on the first iteration).
Other ideas
Another way to simulate "non-cache" behaviour is to allocate a new area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark, or using an array containing the data and the output results, such that each run has its own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...
Here's my basic approach:
Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically); if you can't, use some reasonable multiple of the largest LLC size on the platform of interest [1].
memset the memory region to some non-zero value: 1 will do just fine.
"Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
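A hedged sketch of that recipe (the 8 MiB LLC size, the RNG, and the touch count are all placeholders; changing the final loop to write instead of read would leave lines in the modified state, as discussed below):

#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

volatile std::size_t sink;   // step 3: writing here keeps the compiler from removing the work

// Evict (most of) the cache hierarchy by reading random lines from a region 2x the LLC size.
void flush_caches(std::size_t llc_bytes = 8 * 1024 * 1024)   // placeholder LLC size
{
    const std::size_t region_bytes = 2 * llc_bytes;           // step 1: 2x the LLC
    const std::size_t lines = region_bytes / 64;
    std::vector<char> region(region_bytes);

    std::memset(region.data(), 1, region_bytes);              // step 2: touch every page with non-zero data
    sink = reinterpret_cast<std::size_t>(region.data());      // step 3: sink the pointer

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<std::size_t> pick(0, lines - 1);
    std::size_t sum = 0;
    for (std::size_t i = 0; i < lines * 10; ++i)              // step 4: ~10 random touches per line on average
        sum += static_cast<unsigned char>(region[pick(rng) * 64]);
    sink = sum;                                               // sink the reads as well
}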
Here are some notes on why this generally works and why doing less may not work - the details are x86-centric, but similar concerns will apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets will be under-represented, so not all existing values can be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs, like the LLC on recent Intel CPUs, detect when a "streaming" pattern is present and switch from LRU to MRU, since LRU is about the worst possible replacement policy for such a load. The effect is that no matter how many times you stream through memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just an LLC's worth of memory because (a) of the same reason you allocate more than the LLC size (virtual access vs. physical caching), (b) random access needs more accesses before you have a high likelihood of hitting every set enough times, and (c) caches are usually only pseudo-LRU, so you need more than the number of accesses you'd expect under exact LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2 MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size [2].
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be and replicate that.
[1] Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
[2] I just throw that out there as if it were "easy", but in reality it may be very difficult depending on your exact problem.

How to test main memory access time?

Looking for a C/C++ program to test how long it takes to access a fixed piece of memory, specifically in RAM.
How do I ensure I am testing the access time of RAM, and not of cached or TLB-resident data?
For example, can I "disable" all cache/TLB?
Or can I specify a specific address in RAM to write/read only?
On the other hand, how would I ensure I am only testing cache?
Are there ways to tell the compiler where to save and read from, cache/ram?
For example, is there a well-known standard program (in one of these books?) that is known for this test?
I did see this, but I do not understand how, by adjusting the size of the list, you can control whether the memory accesses hit the L1 cache, the L2 cache, or main memory: measuring latencies of memory
How can one correctly program this test?
Basically, as the list grows you'll see the performance worsen in steps as another layer of caching is overwhelmed. The idea is simple: if the cache holds the last N units of memory you've accessed, then looping around a buffer of even N+1 units should ensure constant cache misses. (There are more details/caveats in the "measuring latencies of memory" answer you link to in your question.)
You should be able to get some idea of the potential size of the largest cache that might front your RAM from the hardware documentation - as long as you operate on more memory than that, you should be measuring physical RAM times.
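A minimal pointer-chasing sketch of that idea (buffer sizes, the iteration count, and the size-to-level mapping in main are placeholders): each load depends on the previous one, so the time per step approximates the latency of whichever level of the hierarchy the working set spills into.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Chase a random single-cycle permutation of indices. Growing `elements` past each
// cache level's capacity makes the time per step jump from L1 to L2 to L3 to DRAM.
double ns_per_access(std::size_t elements, std::size_t steps = 20'000'000)
{
    std::vector<std::size_t> order(elements);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});   // randomness defeats prefetching

    std::vector<std::size_t> next(elements);
    for (std::size_t i = 0; i + 1 < elements; ++i)
        next[order[i]] = order[i + 1];
    next[order[elements - 1]] = order[0];   // one cycle visiting every element

    std::size_t pos = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        pos = next[pos];                    // each load depends on the previous result
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t keep = pos;        // keep the chain from being optimized away
    (void)keep;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main()
{
    for (std::size_t kib : {16, 256, 4096, 65536})   // roughly L1, L2, L3, DRAM on many desktop CPUs
        std::cout << kib << " KiB working set: "
                  << ns_per_access(kib * 1024 / sizeof(std::size_t)) << " ns/access\n";
}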