How to test main memory access time? - c++

Looking for a C/C++ program to test how long it takes to access a fixed piece of memory, specifically in RAM.
How do I ensure that the access time I measure is for RAM and not for data already in the cache or TLB?
For example, can I "disable" all cache/TLB?
Or can I specify a specific address in RAM to write/read only?
On the other hand, how would I ensure I am only testing cache?
Are there ways to tell the compiler where to store and read data from (cache vs. RAM)?
For example, is there a well-known standard program (in one of these books?) that is known for this test?
I did see this, but I do not understand how, by adjusting the size of the list, you can control whether the memory accesses hit the L1 cache, the L2 cache, or main memory: measuring latencies of memory
How can one correctly program this test?

Basically, as the list grows you'll see the performance worsen in steps as another layer of caching is overwhelmed. The idea is simple: if the cache holds the last N units of memory you've accessed, then looping around a buffer of even N+1 units should ensure constant cache misses. (There are more details/caveats in the "measuring latencies of memory" answer you link to in your question.)
You should be able to get some idea of the potential size of the largest cache that might front your RAM from hardware documentation - as long as you operate on more memory than that, you should be measuring physical RAM times.
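For a concrete starting point, here is a minimal sketch of such a test using pointer chasing over a randomly linked buffer, so the hardware prefetcher can't hide the latency. The 64-byte line size, the iteration count, and the size sweep are assumptions you should adapt to your hardware; pinning the thread and taking the minimum of several runs will also tighten the numbers.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Pointer-chasing latency test: each visited slot stores the index of the next
// slot in a random cycle, so every load depends on the previous one and the
// prefetcher can't run ahead. Sweep `bytes` from L1-sized up past the LLC to
// see the latency climb in steps (L1 -> L2 -> L3 -> RAM).
double measure_latency_ns(std::size_t bytes)
{
    const std::size_t step = 64 / sizeof(std::size_t);   // one slot per assumed 64-byte line
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> next(n, 0);

    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < n; i += step) order.push_back(i);
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    for (std::size_t i = 0; i + 1 < order.size(); ++i) next[order[i]] = order[i + 1];
    next[order.back()] = order.front();                   // close the cycle

    const std::size_t iters = 10000000;
    std::size_t p = order.front();
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) p = next[p];  // chain of dependent loads
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = p; (void)sink;            // keep the loop from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main()
{
    for (std::size_t kb = 4; kb <= 128 * 1024; kb *= 2)   // 4 KB up to 128 MB
        std::cout << kb << " KB: " << measure_latency_ns(kb * 1024) << " ns/load\n";
}
As the buffer grows past each cache level's size, the ns/load figure should jump; the plateau you reach once the buffer is several times the LLC size approximates DRAM latency.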


What data will be cached? [closed]

The description of cache in the book is always very general. I am a student in the field of architecture. I want to understand the behavior of the cache in more detail.
In C/C++ code, what data will be loaded from memory into the cache? Is it loaded into the cache when it is frequently used? For example, when I write a for loop in C, I often use the variables i, j, and k. Will these also be loaded into the cache? Local variables in C are generally placed in the stack area and global variables in the data area; will these be loaded into the cache first when they are used? Does the data have to go through the cache to reach a register and then the CPU?
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?
Normally all the memory your C++ program uses (code and data) is in cacheable memory.
Any access (read or write) to any C++ object1 will result in the cache line containing it being hot in cache, assuming a normal CPU cache: set-associative, write-back / write-allocate2, even if it was previously not hot.
The simplest design is that each level of cache fetches data through the next outer layer, so after a load miss, data is hot in all levels of cache. But you can have outer caches that don't read-allocate, and act as victim caches. Or outer levels that are Exclusive of inner caches, to avoid wasting space caching the same data twice (https://en.wikipedia.org/wiki/Cache_inclusion_policy). But whatever happens, right after a read or write, at least the inner-most (closest to that CPU core) level of cache will have the data hot, so accessing it again right away (or an adjacent item in the same cache line) will be fast. Different design choices affect the chances of a line still being hot if the next access is after a bunch of other accesses. And if hot, which level of cache you may find it in. But the super basics are that any memory compiler-generated code touches ends up in cache. CPU cache transparently caches physical memory.
Many cache lines can be hot at the same time, not aliasing each other. i.e. caches have many sets. Some access patterns are pessimal, like multiple pointers all offset from each other by 4k which will make all accesses alias the same set in L1d cache, as well as sometimes having extra 4k-aliasing penalties in the CPU's memory disambiguation logic. (Assuming a 4k page size like on x86). e.g. L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes - memory performance effects can get very complicated. Knowing some theory is enough to understand what's generally good, but the exact details can be very complex. (Sometimes even for true experts like Dr. Bandwidth, e.g. this case).
Footnote 1: Loosely, an object is a named variable or dynamically allocated memory pointed to by a pointer.
Footnote 2: Write-back cache with a write-allocate policy is near universal for modern CPUs, also a pseudo-LRU replacement policy; see wikipedia. A few devices have access patterns that benefit from caches that only allocate on read but not write, but CPUs benefit from write-allocate. A modern CPU will almost always have a multi-level cache hierarchy, with each level being set-associative with some level of associativity. Some embedded CPUs may only have 1 level, or even no cache, but you'd know if you were writing code specifically for a system like that.
Modern large L3 caches sometimes use an adaptive replacement policy instead of plain pseudo-LRU.
Of course, optimization can mean that some local variables (especially loop counters or array pointers) can get optimized into a register and not exist in memory at all. Registers are not part of the CPU cache or memory at all, they're a separate storage space. People often describe things as "compiler caches the value in a register", but do not confuse that with CPU cache. (related: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ and When to use volatile with multi threading?)
If you want to see what the compiler is making the CPU do, look at the compiler's asm output. How to remove "noise" from GCC/clang assembly output?. Every memory access in the asm source is an access in computer-architecture terms, so you can apply what you know about cache state given an access pattern to figure out what will happen with a set-associative cache.
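As a small illustration (the compiler flags shown are common GCC/Clang ones, and the exact codegen will vary by compiler and version), you can compile a function like this and read the generated assembly:
// Compile with:  g++ -O2 -S -masm=intel -fverbose-asm sum.cpp
// then open sum.s. At -O2 you should see the loop counter and the running
// total kept in registers, with the only memory accesses in the loop body
// being the loads of a[i] (possibly vectorized).
int sum(const int* a, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}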
Also related:
Which cache mapping technique is used in intel core i7 processor?
Modern Microprocessors: A 90-Minute Guide!
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? - why we have multi-level caches, and some real numbers for the cache hierarchies of Haswell (like Skylake) and Piledriver (fortunately obsolete, but an interesting example of a possible design).
Generally, the most recently used cache lines will be stored in the cache. For short loops, loop counter variables are normally stored in a CPU register. For longer loops, loop counter variables will probably be stored in the cache, unless one loop iteration runs for such a long time that the loop counter gets evicted from the cache due to the CPU doing other work.
Most variables will generally be cached after the first access (or beforehand if the cache prefetcher does a good job), irrespective of how often they are used. A higher frequency of usage will only prevent the memory from being evicted from the cache, but won't influence it being cached in the first place. However, some CPU architectures offer so-called non-temporal read and write instructions, which bypass the cache. These instructions are useful if the programmer knows in advance that a memory location will only be accessed once, and therefore should not be cached. But generally, these instructions should not be used, unless you know exactly what you are doing.
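For example, on x86 with SSE2, a cache-bypassing copy might be sketched like this (illustrative only; the destination must be 16-byte aligned, the size a multiple of 16, and for most code a plain memcpy is the right choice):
#include <cstddef>
#include <emmintrin.h>   // SSE2 intrinsics, x86/x86-64 only

// Copy with non-temporal stores: the written lines are not pulled into the
// cache, which only helps if the destination won't be read again soon.
void stream_copy(void* dst, const void* src, std::size_t n)
{
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < n / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();        // order the streaming stores before returning
}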
The CPU cache does not care whether variables are stored on the heap or stack. Memory is simply cached according to a "most recently used" algorithm, or, to be more accurate, the cache is evicted based on a "least recently used" algorithm, whenever new room in the cache is required for a new memory access.
In the case of local variables stored on the stack, there is a high chance that the cache line of that variable is already cached due to the program using that stack cache line recently for something else. Therefore, local variables generally have good cache performance. Also, the cache prefetcher works very well with the stack, because the stack grows in a linear fashion.
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?
Yes, first, the cache line containing p will be loaded into the cache (if it is not already cached or stored in a CPU register). Then, the cache line containing *p will be loaded into the cache.
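As a tiny illustration, assuming p itself has not been kept in a register, the dereference compiles to two dependent loads:
struct Data { int value; };

int read_through(Data* const* pp)
{
    Data* p = *pp;     // load 1: the cache line holding the pointer variable is fetched
    return p->value;   // load 2: the cache line holding the pointed-to object is fetched
}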

Flushing the cache to prevent benchmarking fluctuations

I am running someone else's C++ code to benchmark it on a dataset. The issue I have is that I often get one timing for the first run, and the numbers massively change (e.g. from 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to the CPU's automatic caching. Is there a way to flush the cache, or prevent these fluctuations somehow?
Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, it's completely different instructions for each processor architecture.
All current x86 processors do have a clflush instruction that flushes one cache line, but to use it you have to have the address of the data (or code) you want to flush. Which is fine for small and simple data structures, not so good if you have a binary tree that is all over the place. And of course, not at all portable.
In most environments, reading and writing a large block of other data will do the trick, e.g. something like:
#include <cstddef>
#include <cstdlib>   // rand()

// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" the cache, overwrite the whole buffer.
for(size_t i = 0; i < bigger_than_cachesize; i++)
{
    p[i] = rand();
}
Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.
The above won't flush instruction caches - that is a lot more difficult to do; basically, you have to run some (large enough) other piece of code to do that reliably. However, instruction caches tend to have less effect on overall benchmark performance (the instruction cache is EXTREMELY important for a modern processor's performance - that's not what I'm saying - but in the sense that the code for a benchmark is typically small enough that it all fits in cache, and the benchmark runs many times over the same code, so it's only slower on the first iteration).
Other ideas
Another way to simulate "non-cached" behaviour is to allocate a new area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark, or using an array containing the data and output results, such that each run has its own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...
Here's my basic approach:
Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically), or if you don't, some reasonable multiple of the largest LLC size on the platform of interest1.
memset the memory region to some non-zero value: 1 will do just fine.
"Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
Here are some notes on why this generally works and why doing less may not work - the details are x86-centric, but similar concerns will apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets will be under-represented, so not all existing values can even be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs like the LLC on recent Intel detect when a "streaming" pattern is present, and switch from LRU to MRU since LRU is about the worst-possible replacement policy for such a load. The effect is that no matter how many times you stream though memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just the LLC's worth of memory because (a) of the same reason you allocate more than the LLC size (virtual access vs. physical caching), (b) random access needs more accesses before you have a high likelihood of hitting every set enough times, and (c) caches are usually only pseudo-LRU, so you need more than the number of accesses you'd expect under exact LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size2.
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be and replicate that.
1 Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
2 I just throw that out there as if it was "easy", but in reality may be very difficult depending on your exact problem.
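A minimal sketch of the four-step recipe above (the 32 MiB figure stands in for your measured LLC size, and 64 bytes for the line size; both are assumptions to replace with real values):
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

static const std::size_t kAssumedLlcBytes = 32 * 1024 * 1024;  // replace with your LLC size
static volatile std::uintptr_t g_sink;                          // a place to "sink" values (step 3)

void flush_llc()
{
    const std::size_t bytes = 2 * kAssumedLlcBytes;              // step 1: 2x the LLC
    std::vector<unsigned char> buf(bytes);
    std::memset(buf.data(), 1, bytes);                           // step 2: touch every page, non-zero
    g_sink = reinterpret_cast<std::uintptr_t>(buf.data());       // step 3: sink the pointer

    // Step 4: random reads until each 64-byte line has been hit ~10 times on average.
    const std::size_t lines = bytes / 64;
    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<std::size_t> pick(0, lines - 1);
    std::uintptr_t sum = 0;
    for (std::size_t i = 0; i < lines * 10; ++i)
        sum += buf[pick(rng) * 64];
    g_sink = sum;                                                // sink the accumulated reads as well
}
Changing the read in step 4 to a write gives the all-M variant discussed above.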

More TLB misses when process memory size larger?

I have a program which I have written in C++. On Linux, the process is allocated a certain amount of memory. Part is the Stack, part the Heap, part Text and part BSS.
Is the following true:
The larger the amount of memory allocated to the Heap component of my process, the greater the chance of Translation Lookaside Buffer (TLB) misses?
And generally speaking- the more memory my application process consumes, the greater the chance of TLB misses?
I think there is no direct relationship between the amount of memory allocated and the TLB miss rate. As far as I know, as long as your program has good locality, TLB misses will remain low.
There are several things that can lead to a high TLB miss rate:
1. Not enough memory and too many running processes;
2. Low locality in your program;
3. An inefficient way of visiting array elements in the loops in your code.
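To make points 2 and 3 concrete, here is an illustrative pair of loops over the same array; the first walks memory contiguously (one new page roughly every 4 KB), while the second jumps a full row per access and touches far more pages per unit of useful work:
#include <cstddef>
#include <vector>

double sum_row_major(const std::vector<double>& a, std::size_t rows, std::size_t cols)
{
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += a[r * cols + c];          // consecutive addresses: cache- and TLB-friendly
    return s;
}

double sum_column_major(const std::vector<double>& a, std::size_t rows, std::size_t cols)
{
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += a[r * cols + c];          // strides of cols * 8 bytes: poor locality, more TLB misses
    return s;
}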
Programs are usually divided into phases that exhibit completely different memory and execution characteristics - your code may allocate a huge chunk of memory at some point, then be off doing some other unrelated computations. In that case, your TLBs (that are basically just caches for address translation) would age away the unused pages and eventually drop them. While you're not using these pages, you shouldn't care about that.
The real question is - when you get to some performance-critical phase, are you going to work with more pages than your TLBs can sustain simultaneously? On one hand, modern CPUs have large TLBs, often with 2 levels of caching - the L2 TLB of a modern Intel CPU should have (IIRC) 512 entries - that's 2M worth of data if you're using 4k pages (with large pages that would have been more, but TLBs usually don't like to work with them due to potential conflicts with smaller pages...).
It's quite possible for an application to work with more than 2M of data, but you should avoid doing so at the same time if possible - either by doing cache tiling or by changing the algorithms. That's not always possible (e.g. when streaming from memory or from IO), but then the TLB misses are probably not your main bottleneck. When working with the same set of data and accessing the same elements multiple times, you should always attempt to keep them cached as close as possible.
It's also possible to use software prefetches to make the CPU perform the TLB misses (and following page walks) earlier in time, preventing them from blocking your progress. On some CPUs hardware prefetches are already doing this for you.
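With GCC or Clang, a hand-inserted prefetch can be sketched like this (the look-ahead distance of 16 is an arbitrary assumption to tune, and on many workloads the hardware prefetcher already makes this unnecessary):
#include <cstddef>

void scale(float* data, std::size_t n, float k)
{
    const std::size_t ahead = 16;   // assumed look-ahead distance; tune per workload
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], /*rw=*/0, /*locality=*/1);  // start the fetch (and any page walk) early
        data[i] *= k;
    }
}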

What is cache in C++ programming? [closed]

Firstly I would like to tell that I come from a non-Computer Science background & I have been learning the C++ language.
I am unable to understand what exactly is a cache?
It has different meaning in different contexts.
I would like to know what would be called as a cache in a C++ program?
For example, if I have some int data in a file. If I read it & store in an int array, then would this mean that I have 'cached' the data?
To me this seems like common sense to use the data this way, since reading from a file is always slower than reading from RAM.
But I am a little confused due to this article.
In a CPU there can be several caches, to speed up instructions in loops or to store often accessed data. These caches are small but very fast. Reading data from cache memory is much faster than reading it from RAM.
It says that reading data from cache is much faster than from RAM.
I thought RAM & cache were the same.
Can somebody please clear my confusion?
EDIT: I am updating the question because previously it was too broad.
My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file
What does cache in this context mean?
Any modern CPU has several layers of cache that are typically named things like L1, L2, L3 or even L4. This is called a multi-level cache. The lower the number, the faster the cache will be.
It's important to remember that the CPU runs at speeds that are significantly faster than the memory subsystem. It takes the CPU a tiny eternity to wait for something to be fetched from system memory, many, many clock-cycles elapse from the time the request is made to when the data is fetched, sent over the system bus, and received by the CPU.
There's no programming construct for dealing with caches, but if your code and data can fit neatly in the L1 cache, then it will be fastest. Next is if it can fit in the L2, and so on. If your code or data cannot fit at all, then you'll be at the mercy of the system memory, which can be orders of magnitude slower.
This is why counter-intuitive things like unrolling loops, which should be faster, might end up being slower because your code becomes too large to fit in cache. It's also why shaving a few bytes off a data structure could pay huge dividends even though the memory footprint barely changes. If it fits neatly in the cache, it will be faster.
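As a small illustration of the data-structure point (exact sizes depend on the ABI; the figures below are typical for 64-bit targets):
#include <cstdint>

// Same three members, different order. The first layout needs padding and is
// typically 24 bytes; the second packs into 16, so more elements of an array
// fit into each 64-byte cache line.
struct Padded {
    char         flag;    // 1 byte + 7 bytes padding before the double
    double       value;
    std::int32_t id;      // 4 bytes + 4 bytes tail padding
};

struct Compact {
    double       value;
    std::int32_t id;
    char         flag;    // 3 bytes tail padding
};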
The only way to know if you have a performance problem related to caching is to benchmark very carefully. Remember each processor type has varying amounts of cache, so what might work well on your i7 CPU might be relatively terrible on an i5.
It's only in extremely performance sensitive applications that the cache really becomes something you worry about. For example, if you need to maintain a steady 60FPS frame rate in a game, you'll be looking at cache problems constantly. Every millisecond counts here. Likewise, anything that runs the CPU at 100% for extended periods of time, such as rendering video, will want to pay very close attention to how much they could gain from adjusting the code that's emitted.
You do have control over how your code is generated with compiler flags. Some will produce smaller code, some theoretically faster by unrolling loops and other tricks. To find the optimal setting can be a very time-consuming process. Likewise, you'll need to pay very careful attention to your data structures and how they're used.
[Cache] has different meaning in different contexts.
Bingo. Here are some definitions:
Cache
Verb
Definition: To place data in some location from which it can be more efficiently or reliably retrieved than its current location. For instance:
Copying a file to a local hard drive from some remote computer
Copying data into main memory from a file on a local hard drive
Copying a value into a variable when it is stored in some kind of container type in your procedural or object oriented program.
Examples: "I'm going to cache the value in main memory", "You should just cache that, it's expensive to look up"
Noun 1
Definition: A copy of data that is presumably more immediately accessible than the source data.
Examples: "Please keep that in your cache, don't hit our servers so much"
Noun 2
Definition: A fast access memory region that is on the die of a processor, modern CPUs generally have several levels of cache. See cpu cache, note that GPUs and other types of processors will also have their own caches with different implementation details.
Examples: "Consider keeping that data in an array so that accessing it sequentially will be cache coherent"
My definition of a cache would be something that is available in a limited amount but is faster to access, since there is less area to look through. If you are talking about caching in any programming language, it means you are storing some information in the form of a variable (a variable is nothing but a way to locate your data in memory) in memory. Here, memory means both RAM and the physical cache (CPU cache).
The physical/CPU cache is nothing but memory that is even faster than RAM; it actually stores copies of some of the data in RAM which is used by the CPU very often. There is another level of categorisation after that as well: on-board cache (faster) and off-board cache. You can see this link.
I am updating the question because previously it was too broad. My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file
What does cache in this context mean?
This particular use means that RowData is held as a copy in memory, rather than reading (a little bit of) the row from a file every time we need some data from it. Reading from a file is a lot slower [1] than holding on to a copy of the data in our program's memory.
[1] Although in a modern OS, the actual data from the hard-disk is probably held in memory, in file-system cache, to avoid having to read the disk many times to get the same data over and over. However, this still means that the data needs to be copied from the file-system cache to the application using the data.
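In that spirit, a sketch of what such application-level caching can look like (Row, RowCache and read_row_from_file are made-up names for illustration, not the linked answer's actual code):
#include <string>
#include <unordered_map>

struct Row { std::string text; };   // stand-in for whatever a parsed row holds

Row read_row_from_file(const std::string& path, int index)
{
    // Stub standing in for real file I/O (open, seek, parse) - the slow part.
    return Row{path + ":" + std::to_string(index)};
}

class RowCache {
public:
    // Return the in-memory copy if we have already read this row; otherwise
    // read it once from the (slow) file and keep a copy for next time.
    const Row& get(const std::string& path, int index)
    {
        const std::string key = path + "#" + std::to_string(index);
        auto it = cache_.find(key);
        if (it == cache_.end())
            it = cache_.emplace(key, read_row_from_file(path, index)).first;
        return it->second;
    }

private:
    std::unordered_map<std::string, Row> cache_;
};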

Accessing a certain address in memory

Why can we access a certain place in our memory in O(1)?
Quick answer: You can't!
The system's main memory, the chips on the board, can however be addressed with a direct access. Just give the correct address and the bus will return the memory at that location (likely in a block).
Once you get into the CPU, however, memory access is very different. There are several caches, several cores with caches, and possibly other CPUs with caches. Though accessing main memory can be done directly, it is slow; that is why we have all these caches. But this now means that inside the CPU the memory isn't directly accessible.
When the CPU needs to access memory it thus goes into a lookup mode. It also has a locking system to share memory between the caches correctly. Different addresses will actually take different periods of time to access, depending on whether you are reading or writing, and where the most recent cached copy of that memory resides. This is something known as NUMA (non-uniform memory access). While the time complexity here is probably bound by a constant (so possibly/technically O(1)), it probably isn't what most people are thinking of as constant time.
It gets more complicated than this. The CPU provides page tables for memory so that the OS can provide virtual memory to the applications (that is, it can partition the address spaces) and load memory on demand. These tables are map-like structures. When you access memory the CPU decides if the address you want is loaded, or if the OS has to retrieve it first. These maps are a function of the total memory size, so are not linear time, though very likely amortized constant time. (If you're running a virtual machine you can add another layer of tables on top here -- one reason why VMs run slightly slower).
This is just a brief overview. Hopefully enough to give you the impression that memory access isn't really constant time and depends on many things. Keep in mind however that so much optimization is employed at these levels that a high-level C program will likely appear to have constant time access.
Memory in modern computer systems is random access, so as long as you know the address of memory you need to access, the computer can go directly to that memory location and read/write to that location.
This is opposed to some [older] systems such as tape memory, where the tape had to be physically spooled to access certain areas, so farther locations took longer to access.
Not sure what you mean by allocate in O(1) as allocating memory is typically not O(1) when dealing with typical heap on every day computers.
That depends on the computing model that you use. In the Turing machine model, neither operation is O(1); in the random access model, access is O(1), and since that is the case for most modern hardware using RAM, that model is useful. I assume you are using a model that, for the sake of simplicity, also allows O(1) allocation as a close approximation to most modern implementations on a machine under a light memory-usage load.
Why can you access in O(1)? Because memory access is by address. If you know the address you want to access, then the hardware can go directly to it and fetch whatever is there in a single operation.
As for allocations being O(1), I'm not sure that is always the case. It is up to the OS to allocate a new block of memory, and the algorithm that it uses to do that may not necessarily be O(1) in all cases. For instance, if you request a large block of memory and there is no contiguous block large enough to satisfy the request, the OS may do things like page out other data or relocate information from other processes so as to create a large enough contiguous block to satisfy the request.
Though if you want to take an extremely simplified view of allocation as being "returning the address of the first byte of available memory" then it's easy to see why that could be an O(1) operation. The system just needs to return the address of the last allocated byte + 1, so as long as it tracks what the last allocated byte is after every allocation and as long as you assume an unlimited memory space, then computing the next free address is always O(1).
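A toy sketch of that simplified view (a "bump" allocator; real allocators also handle alignment, freeing and thread safety, none of which appears here):
#include <cstddef>

// Allocation is just advancing a pointer, so each call is O(1).
class BumpAllocator {
public:
    BumpAllocator(char* buffer, std::size_t size)
        : next_(buffer), end_(buffer + size) {}

    void* allocate(std::size_t bytes)
    {
        if (bytes > static_cast<std::size_t>(end_ - next_))
            return nullptr;            // out of space - a real allocator would get more memory
        void* p = next_;
        next_ += bytes;                // "last allocated byte + 1" becomes the next free address
        return p;
    }

private:
    char* next_;
    char* end_;
};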