What is cache in C++ programming? [closed] - c++

First, I should say that I come from a non-Computer Science background and I have been learning the C++ language.
I am unable to understand what exactly a cache is.
It has different meanings in different contexts.
I would like to know what would be called a cache in a C++ program.
For example, suppose I have some int data in a file. If I read it and store it in an int array, does this mean that I have 'cached' the data?
To me this seems like common sense, since reading from a file is always slower than reading from RAM.
But I am a little confused due to this article.
In a CPU there can be several caches, to speed up instructions in
loops or to store often accessed data. These caches are small but very
fast. Reading data from cache memory is much faster than reading it
from RAM.
It says that reading data from cache is much faster than from RAM.
I thought RAM & cache were the same.
Can somebody please clear my confusion?
EDIT: I am updating the question because previously it was too broad.
My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are
simply used to cache information about a row in the file
What does cache in this context mean?

Any modern CPU has several layers of cache that are typically named things like L1, L2, L3 or even L4. This is called a multi-level cache. The lower the number, the faster the cache will be.
It's important to remember that the CPU runs at speeds that are significantly faster than the memory subsystem. It takes the CPU a tiny eternity to wait for something to be fetched from system memory: many, many clock cycles elapse from the time the request is made to when the data is fetched, sent over the system bus, and received by the CPU.
There's no programming construct for dealing with caches, but if your code and data can fit neatly in the L1 cache, then it will be fastest. Next is if it can fit in the L2, and so on. If your code or data cannot fit at all, then you'll be at the mercy of the system memory, which can be orders of magnitude slower.
This is why, counter-intuitively, things like unrolling loops, which should be faster, might end up being slower: your code becomes too large to fit in cache. It's also why shaving a few bytes off a data structure could pay huge dividends even though the memory footprint barely changes. If it fits neatly in the cache, it will be faster.
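To make the "shaving a few bytes off a data structure" point concrete, here is a minimal sketch. The struct names and the 64-byte line size are assumptions for illustration; exact sizes depend on your compiler and ABI, so check them with sizeof on your own platform.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record types, for illustration only.
// Members ordered carelessly: the compiler typically inserts padding
// after each char so that the double stays suitably aligned.
struct Loose {
    char   tag;
    double value;
    char   flag;
};

// Same members, largest first: the two chars now sit next to each
// other, so less padding is usually needed.
struct Compact {
    double value;
    char   tag;
    char   flag;
};

// How many objects of type T fit in one cache line of the assumed size.
template <typename T>
constexpr std::size_t per_cache_line(std::size_t line_bytes = 64) {
    return line_bytes / sizeof(T);
}
```

On a typical 64-bit platform Loose takes 24 bytes and Compact 16, so four Compact records fit in a 64-byte line instead of two. Same data, fewer cache lines touched.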
The only way to know if you have a performance problem related to caching is to benchmark very carefully. Remember each processor type has varying amounts of cache, so what might work well on your i7 CPU might be relatively terrible on an i5.
It's only in extremely performance sensitive applications that the cache really becomes something you worry about. For example, if you need to maintain a steady 60FPS frame rate in a game, you'll be looking at cache problems constantly. Every millisecond counts here. Likewise, anything that runs the CPU at 100% for extended periods of time, such as rendering video, will want to pay very close attention to how much they could gain from adjusting the code that's emitted.
You do have control over how your code is generated with compiler flags. Some will produce smaller code, some theoretically faster by unrolling loops and other tricks. To find the optimal setting can be a very time-consuming process. Likewise, you'll need to pay very careful attention to your data structures and how they're used.

[Cache] has different meaning in different contexts.
Bingo. Here are some definitions:
Verb
Definition: To place data in some location from which it can be more efficiently or reliably retrieved than its current location. For instance:
Copying a file to a local hard drive from some remote computer
Copying data into main memory from a file on a local hard drive
Copying a value into a variable when it is stored in some kind of container type in your procedural or object oriented program.
Examples: "I'm going to cache the value in main memory", "You should just cache that, it's expensive to look up"
Noun 1
Definition: A copy of data that is presumably more immediately accessible than the source data.
Examples: "Please keep that in your cache, don't hit our servers so much"
Noun 2
Definition: A fast access memory region that is on the die of a processor, modern CPUs generally have several levels of cache. See cpu cache, note that GPUs and other types of processors will also have their own caches with different implementation details.
Examples: "Consider keeping that data in an array so that accessing it sequentially will be cache friendly"
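As a rough illustration of that last example, the sketch below sums the same matrix in two orders. The function names are made up for this example, and how big the difference is (if any) depends on the matrix size and your CPU, so measure before drawing conclusions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The matrix is stored row by row in one contiguous vector.
// This loop walks memory in address order, so consecutive elements
// share cache lines.
long long sum_row_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but each step jumps `cols` elements ahead, so for a
// large matrix successive accesses tend to land on different cache lines.
long long sum_column_major(const std::vector<int>& m,
                           std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same sum; only the order in which memory is touched differs.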

My definition of a cache would be something that is limited in size but faster to access, as there is less area to search. If you are talking about caching in any programming language, then it means you are storing some information in memory in the form of a variable (a variable is nothing but a way to locate your data in memory). Here memory means both RAM and the physical cache (CPU cache).
The physical/CPU cache is nothing but memory that is even faster than RAM; it stores copies of the data in RAM that the CPU uses very often. There is another level of categorisation after that as well: on-board cache (faster) and off-board cache. You can see this link

I am updating the question because previously it was too broad. My
confusion started with this answer. He says
RowData and m_data are specific to my implementation,
but they are simply used to cache information about a row in the file
What does cache in this context mean?
This particular use means that RowData is held as a copy in memory, rather than reading (a little bit of) the row from a file every time we need some data from it. Reading from a file is a lot slower [1] than holding on to a copy of the data in our program's memory.
[1] Although in a modern OS, the actual data from the hard-disk is probably held in memory, in file-system cache, to avoid having to read the disk many times to get the same data over and over. However, this still means that the data needs to be copied from the file-system cache to the application using the data.
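A minimal sketch of this kind of caching, applied to the int-file example from the question: read everything once, then answer later lookups from memory. The function name load_ints is hypothetical, and the sketch assumes the data is whitespace-separated text.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <vector>

// Reads every whitespace-separated int from the stream once and keeps
// the values in memory, so later lookups never touch the stream again.
std::vector<int> load_ints(std::istream& in) {
    std::vector<int> cache;
    int value;
    while (in >> value)
        cache.push_back(value);
    return cache;
}
```

With a real file you would pass a std::ifstream instead of a string stream. After the call, cache[i] is an ordinary memory access rather than a file read.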


What data will be cached? [closed]

The description of cache in the book is always very general. I am a student in the field of architecture. I want to understand the behavior of the cache in more detail.
In the c/c++ language code, what data will be loaded from the memory to the cache? Will it be loaded into the cache when it is frequently used? For example, when I write a for loop in C language, I often use variables i, j, and k. Will these also be loaded into the cache? C language local variables are generally placed in the stack area, global variables will be placed in the data area? Will these be loaded into the cache first when they are used? Does the data have to go through the cache to reach the register and then to the CPU?
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?
Normally all the memory your C++ program uses (code and data) is in cacheable memory.
Any access (read or write) to any C++ object [1] will result in the cache line containing it being hot in cache, assuming a normal CPU cache (set-associative, write-back / write-allocate [2]), even if it was previously not hot.
The simplest design is that each level of cache fetches data through the next outer layer, so after a load miss, data is hot in all levels of cache. But you can have outer caches that don't read-allocate, and act as victim caches. Or outer levels that are Exclusive of inner caches, to avoid wasting space caching the same data twice (https://en.wikipedia.org/wiki/Cache_inclusion_policy). But whatever happens, right after a read or write, at least the inner-most (closest to that CPU core) level of cache will have the data hot, so accessing it again right away (or an adjacent item in the same cache line) will be fast. Different design choices affect the chances of a line still being hot if the next access is after a bunch of other accesses. And if hot, which level of cache you may find it in. But the super basics are that any memory compiler-generated code touches ends up in cache. CPU cache transparently caches physical memory.
Many cache lines can be hot at the same time, not aliasing each other. i.e. caches have many sets. Some access patterns are pessimal, like multiple pointers all offset from each other by 4k which will make all accesses alias the same set in L1d cache, as well as sometimes having extra 4k-aliasing penalties in the CPU's memory disambiguation logic. (Assuming a 4k page size like on x86). e.g. L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes - memory performance effects can get very complicated. Knowing some theory is enough to understand what's generally good, but the exact details can be very complex. (Sometimes even for true experts like Dr. Bandwidth, e.g. this case).
Footnote 1: Loosely, An object is a named variable or dynamically allocated memory pointed to by a pointer.
Footnote 2: Write-back cache with a write-allocate policy is near universal for modern CPUs, also a pseudo-LRU replacement policy; see wikipedia. A few devices have access patterns that benefit from caches that only allocate on read but not write, but CPUs benefit from write-allocate. A modern CPU will almost always have a multi-level cache hierarchy, with each level being set-associative with some level of associativity. Some embedded CPUs may only have 1 level, or even no cache, but you'd know if you were writing code specifically for a system like that.
Modern large L3 caches sometimes use an adaptive replacement policy.
Of course, optimization can mean that some local variables (especially loop counters or array pointers) can get optimized into a register and not exist in memory at all. Registers are not part of the CPU cache or memory at all, they're a separate storage space. People often describe things as "compiler caches the value in a register", but do not confuse that with CPU cache. (related: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ and When to use volatile with multi threading?)
If you want to see what the compiler is making the CPU do, look at the compiler's asm output. How to remove "noise" from GCC/clang assembly output?. Every memory access in the asm source is an access in computer-architecture terms, so you can apply what you know about cache state given an access pattern to figure out what will happen with a set-associative cache.
Also related:
Which cache mapping technique is used in intel core i7 processor?
Modern Microprocessors: A 90-Minute Guide!
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? - why we have multi-level caches, and some real numbers for the cache hierarchies of Haswell (like Skylake) and Piledriver (fortunately obsolete, but an interesting example of a possible design).
Generally, the most recently used cache lines will be stored in the cache. For short loops, loop counter variables are normally stored in a CPU register. For longer loops, loop counter variables will probably be stored in the cache, unless one loop iteration runs for such a long time that the loop counter gets evicted from the cache due to the CPU doing other work.
Most variables will generally be cached after the first access (or beforehand if the cache prefetcher does a good job), irrespective of how often they are used. A higher frequency of usage will only prevent the memory from being evicted from the cache, but won't influence it being cached in the first place. However, some CPU architectures offer so-called non-temporal read and write instructions, which bypass the cache. These instructions are useful if the programmer knows in advance that a memory location will only be accessed once, and therefore should not be cached. But generally, these instructions should not be used, unless you know exactly what you are doing.
The CPU cache does not care whether variables are stored on the heap or stack. Memory is simply cached according to a "most recently used" algorithm, or, to be more accurate, the cache is evicted based on a "least recently used" algorithm, whenever new room in the cache is required for a new memory access.
In the case of local variables stored on the stack, there is a high chance that the cache line of that variable is already cached due to the program using that stack cache line recently for something else. Therefore, local variables generally have good cache performance. Also, the cache prefetcher works very well with the stack, because the stack grows in a linear fashion.
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?
Yes, first, the cache line containing p will be loaded into the cache (if it is not already cached or stored in a CPU register). Then, the cache line containing *p will be loaded into the cache.
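The "load p, then load *p" dependency is easiest to see when following a chain of pointers. In this sketch (Node and both function names are made up for illustration), the list version cannot start fetching the next node until the current node's next pointer has arrived, while the array version reads addresses the hardware can predict.

```cpp
#include <cassert>
#include <vector>

// Hypothetical singly linked node, for illustration only.
struct Node {
    int         value;
    const Node* next;
};

// Each iteration must load p->next before the following node can be
// touched: a chain of dependent loads, each a potential cache miss.
long long sum_list(const Node* head) {
    long long total = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        total += p->value;
    return total;
}

// Elements are contiguous, so several of them arrive with each cache
// line and the access pattern is easy to prefetch.
long long sum_array(const std::vector<int>& v) {
    long long total = 0;
    for (int x : v)
        total += x;
    return total;
}
```

Whether the list is actually slower depends on where its nodes ended up in memory, so treat this as a picture of the access pattern, not a benchmark.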

Is using istream::seekg too much expensive?

In c++, how expensive is it to use the istream::seekg operation?
EDIT: How much can I get away with seeking around a file and reading bytes? What about frequency versus magnitude of offset?
I have a large file (4GB) that I am parsing, and I want to know if it's necessary to try to consolidate some of my seekg calls. I would assume that the magnitude of differences in file location play a role--like if you seek more than a page in memory away, it will impact performance--but small seeking is of no consequence. Is this correct?
This question is heavily dependent on your operating system and disk subsystem.
Obviously, the seek itself will take essentially zero time, since it just updates an offset. Actually reading will pull some data off of disk...
...but how much data depends on many things. Your disk has a cache which may have its own block size and may do some sort of read-ahead. Your RAID controller (if any) will have its own cache, possibly with its own block size and read-ahead.
Your kernel has a page cache -- all of free RAM, essentially -- and it will also probably do some sort of read-ahead. On Linux this is configurable, and the kernel will adapt it based on how sequential your access patterns appear to be, whether you have called posix_fadvise, etc.
All of these caches mean if you access some data, then access nearby data later, there is a chance the second access will not actually touch the disk at all.
If you have the option of coding so that you access the file sequentially, that is certainly going to be faster than random reads, especially small random reads. Seeking on a single mechanical disk takes something like 10ms, so you can do the math here. (Although seeking on a solid state drive is around 100 times faster.)
Large reads are generally better than small reads... Although processing data a few kilobytes at a time can be faster than larger blocks if it allows the processing to stay in cache.
In short, you will need to provide a lot more details about your system and your application to get a proper answer, and even then the most likely answer is "benchmark it".
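If you do decide to consolidate seeks, one possible shape is sketched below: seek once, read one larger block, then pick the small fields out of the in-memory block. The helper names read_block and field_from_block are hypothetical, and the right block size is something to benchmark rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads up to `len` bytes starting at absolute offset `pos`,
// using one seek and one read.
std::string read_block(std::istream& in, std::streamoff pos, std::size_t len) {
    std::string buf(len, '\0');
    in.clear();                    // drop a stale eof/fail state from earlier reads
    in.seekg(pos, std::ios::beg);
    in.read(&buf[0], static_cast<std::streamsize>(len));
    buf.resize(static_cast<std::size_t>(in.gcount()));  // shorter at end of file
    return buf;
}

// Extracts a small field from an already-loaded block instead of
// seeking in the stream again.
std::string field_from_block(const std::string& block,
                             std::size_t offset, std::size_t len) {
    assert(offset <= block.size());
    return block.substr(offset, len);
}
```

The same functions work on a std::ifstream opened in binary mode. Whether this beats many small seekg/read pairs on your system is exactly the kind of thing the caches described above make hard to predict.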

How can i know my array is in cache?

Let's say my array is 32 KB and L1 is 64 KB. Does Windows use some of it while my program is running? Maybe I am not able to use all of L1 because Windows is running other programs? Should I set the priority of my program so that it can use all of the cache?
for(int i=0;i<8192;i++)
array_3[i]+=clock()*(rand()%256);//clock() and rand in cache too?
//how many times do I need to use a variable to make it stay in cache?
//or cache is only for reading? look below plz
The program is in C/C++.
Same thing for L2 too please.
Also are functions kept in cache? Cache is read only? (If I change my array then it loses the cache bond?)
Does the compiler create the asm codes to use cache more yield?
How can i know my array is in cache?
In general, you can't. Generally speaking, the cache is managed directly by hardware, not by Windows. You also can't control whether data resides in the cache (although it is possible to specify that an area of memory shouldn't be cached).
Does windows use some of it while program is running? Maybe i am not able to use L1 because windows is making other programs work? Should i set priority of my program to use all cache?
The L1 and L2 caches are shared by all processes running on a given core. When your process is running, it will use all of cache (if it needs it). When there's a context switch, some or all of the cache will be evicted, depending on what the second process needs. So next time there's a context switch back to your process, the cache may have to be refilled all over again.
But again, this is all done automatically by the hardware.
also functions are kept in cache?
On most modern processors, there is a separate cache for instructions. See e.g. this diagram which shows the arrangement for the Intel Nehalem architecture; note the shared L2 and L3 caches, but the separate L1 caches for instructions and data.
cache is read only?(if i change my array then it loses the cache bond?)
No. Caches can handle modified data, although this is considerably more complex (because of the problem of synchronising multiple caches in a multi-core system.)
does the compiler create the asm codes to use cache more yield?
As cache activity is generally all handled automatically by the hardware, no special instructions are needed.
Cache is not directly controlled by the operating system, it is done
in hardware
In case of a context switch, another application may modify the
cache, but you should not care about this. It is more important to
handle cases when your program behaves cache unfriendly.
Functions are kept in cache (I-Cache, instruction cache)
Cache is not read only, when you write something it goes to [memory
and] the cache.
The cache is primarily controlled by the hardware. However, I know that the Windows scheduler tends to schedule a thread on the same core as before, specifically because of the caches: it understands that they would have to be reloaded on another core. Windows has behaved this way since at least Windows 2000.
As others have stated, you generally cannot control what is in cache. If you are writing code for high-performance and need to rely on cache for performance, then it is not uncommon to write your code so that you are using about half the space of L1 cache. Methods for doing so involve a great deal of discussion beyond the scope of StackOverflow questions. Essentially, you would want to do as much work as possible on some data before moving on to other data.
As a matter of what works practically, using about half of cache leaves enough space for other things to occur that most of your data will remain in cache. You cannot rely on this without cooperation from the operating system and other aspects of the computing platform, so it may be a useful technique for speeding up research calculations but it cannot be used where real-time performance must be guaranteed, as in operating dangerous machinery.
There are additional caveats besides how much data you use. Using data that maps to the same cache lines can evict data from cache even though there is plenty of cache unused. Matrix transposes are notorious for this, because a matrix whose row length is a multiple of a moderate power of two will have columns in which elements map to a small set of cache lines. So learning to use cache efficiently is a significant job.
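One common way to tame the transpose case is to work tile by tile, so that the rows and columns being touched at any moment fit in cache together. This is only a sketch: the function name and the default tile size of 32 are assumptions, and a good tile size depends on your cache and element size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes an n x n row-major matrix one tile at a time.
// `tile` is a tuning parameter, not a recommended value.
std::vector<double> transpose_blocked(const std::vector<double>& src,
                                      std::size_t n, std::size_t tile = 32) {
    assert(src.size() == n * n && tile > 0);
    std::vector<double> dst(n * n);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            const std::size_t i1 = std::min(i0 + tile, n);
            const std::size_t j1 = std::min(j0 + tile, n);
            // Within a tile, both the rows read and the rows written
            // span only a small number of cache lines.
            for (std::size_t i = i0; i < i1; ++i)
                for (std::size_t j = j0; j < j1; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
    }
    return dst;
}
```

Tiling does not remove the power-of-two aliasing problem by itself; padding each row by a few elements is another trick people use, but that changes the data layout and is not shown here.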
As far as I know, you can't control what will be in the cache. You can declare a variable as register var_type a and then access to it will be in a single cycle(or a small number of cycles). Moreover, the amount of cycles it will take you to access a chunk of memory also depends on virtual memory translation and TLB.
It should be noted that the register keyword is merely a suggestion and the compiler is perfectly free to ignore it, as a comment pointed out. In C++ the keyword was deprecated in C++11 and removed as a storage class specifier in C++17.
Even though you may not know which data is in cache and which is not, you can still get an idea of how much of the cache you are utilizing. Modern processors have quite a few performance counters, and some of them relate to the cache. Intel's processors can tell you how many L1 and L2 misses there were. Check this for more details on how to do it: How to read performance counters on i5, i7 CPUs

How many bytes does a Xeon bring into the cache per memory access?

I am working on a system, written in C++, running on a Xeon on Linux, that needs to run as fast as possible. There is a large data structure (basically an array of structs) held in RAM, over 10 GB, and elements of it need to be accessed periodically. I want to revise the data structure to work with the system's caching mechanism as much as possible.
Currently, accesses are done mostly randomly across the structure, and each time 1-4 32-bit ints are read. It is a long time before another read occurs in the same place, so there is no benefit from the cache.
Now I know that when you read a byte from a random location in RAM, more than just that byte is brought into the cache. My question is how many bytes are brought in? Is it 16, 32, 64, 4096? Is this called a cache line?
I am looking to redesign the data structure to minimize random RAM accesses and work with the cache instead of against it. Knowing how many bytes are pulled into the cache on a random access will inform the design choices I make.
Update (October 2014):
Shortly after I posed the question above the project was put on hold. It has since resumed and based on suggestions in answers below, I performed some experiments around RAM access, because it seemed likely that TLB thrash was happening. I revised the program to run with huge pages (2MB instead of the standard 4KB), and observed a small speedup, about 2.5%. I found great information about setting up for huge pages here and here.
Today’s CPUs fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache.
More here : http://igoro.com/archive/gallery-of-processor-cache-effects/
A cache line for any current Xeon processor is 64 bytes. One other thing that you might want to think about is the TLB. If you are really doing random accesses across 10GB of memory then you are likely to have a lot of TLB misses, which can potentially be as costly as cache misses. You can work around this with large pages, but it's something to keep in mind.
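Given a 64-byte line, one design option for the "1-4 32-bit ints per access" case is to group the ints that are read together into one record whose size divides the line size, so a single access never straddles two lines. The names below (Record, kAssumedLineSize, within_one_line) are hypothetical, and the 64 is an assumption you should confirm for your CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; typical for x86 but not guaranteed.
constexpr std::size_t kAssumedLineSize = 64;

// Four 32-bit ints that are always read together. Aligning the record
// to its own size (16 bytes) means it can never cross a line boundary.
struct alignas(16) Record {
    std::int32_t fields[4];
};

static_assert(sizeof(Record) == 16, "record should be exactly 16 bytes");
static_assert(kAssumedLineSize % sizeof(Record) == 0,
              "a whole number of records must fit in one line");

// True if the byte range [p, p + len) lies inside a single cache line.
bool within_one_line(const void* p, std::size_t len) {
    assert(len > 0);
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    return a / kAssumedLineSize == (a + len - 1) / kAssumedLineSize;
}
```

This only guarantees one line per random access. It does nothing about the TLB misses mentioned above, which need a separate fix such as large pages.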
An old SO question has some info that might be of use to you (in particular the first answer, which says where to look for Linux CPU info; the responder doesn't mention line size as such, but covers 'other info' on top of associativity etc.). The question is for x86, but the answers are more general. Worth a look.
Where is the L1 memory cache of Intel x86 processors documented?
You might want to head over to http://agner.org/optimize/ and grab the optimization PDFs available there - there's a lot of good (low-level) information in there. Pretty focused on assembly language level, but there's lessons to be learned for C/C++ programmers as well.
Volume 3, "The microarchitecture of Intel, AMD and VIA CPUs" should be of interest :-)
Good (long) article about organizing data structures to take cache and RAM hierarchy into account from GNU's libc maintainer: https://lwn.net/Articles/250967/ (full PDF here: http://www.akkadia.org/drepper/cpumemory.pdf)

Is it possible to lock some data in CPU cache?

I have a problem....
I'm writing data into an array in a while-loop, and I'm doing it really frequently. This writing now seems to be a bottleneck in the code, and I presume it's caused by the writes to memory. The array is not really large (something like 300 elements). The question is: is it possible to store it in the cache and update it in memory only after the while-loop is finished?
[edit - copied from an answer added by Alex]
double* array1 = new double[1000000]; // this array has elements
unsigned long* array2 = new unsigned long[300];
double varX, t, sum = 0;
int iter = 0, i = 0;
int nm0 = int(varX);
array2[iter] = nm0; // if you comment out this line, the application runs more than 2 times faster :)
t = array1[nm0]; // if you comment out this line, there is almost no change in time
First of all, I'd like to thank all of you for the answers. Indeed, it was a little dumb not to post the code, so I have added it to the question above. It would be nice if someone has any ideas. Thank you very much again.
Not intentionally, no. Among other things, you have no idea how big the cache is, so you have no idea of what's going to fit. Furthermore, if the app were allowed to lock off part of the cache, the effects on the OS might be devastating to overall system performance. This falls squarely onto my list of "you can't do it because you shouldn't do it. Ever."
What you can do is to improve your locality of reference - try to arrange the loop such that you don't access the elements more than once, and try to access them in order in memory.
Without more clues about your application, I don't think more specific advice can be given.
The CPU does not usually offer fine-grained cache control, you're not allowed to choose what is evicted or to pin things in cache. You do have a few cache operations on some CPUs. Just as a bit of info on what you can do: Here's some interesting cache related instructions on newer x86{-64} CPUs (Doing things like this makes portability hell, but I figured you may be curious)
Software Data Prefetch
The non-temporal instruction is prefetchnta, which fetches the data into the second-level cache, minimizing cache pollution.
The temporal instructions are as follows:
* prefetcht0 – fetches the data into all cache levels, that is, to the second-level cache for the Pentium® 4 processor.
* prefetcht1 – Identical to prefetcht0
* prefetcht2 – Identical to prefetcht0
Additionally there are a set of instructions for accessing data in memory but explicitly tell the processor to not insert the data into the cache. These are called non-temporal instructions. An example of one is here: MOVNTI.
You could use the non temporal instructions for every piece of data you DON'T want in cache, in the hopes that the rest will always stay in cache. I don't know if this would actually improve performance as there are subtle behaviors to be aware of when it comes to the cache. Also it sounds like it'd be relatively painful to do.
I have a problem.... I'm writing data into an array in a while-loop, and I'm doing it really frequently. This writing now seems to be a bottleneck in the code, and I presume it's caused by the writes to memory. The array is not really large (something like 300 elements). The question is: is it possible to store it in the cache and update it in memory only after the while-loop is finished?
You don't need to. The only reason why it might get pushed out of the cache is if some other data is deemed more urgent to put in the cache.
Apart from this, an array of 300 elements should fit in the cache with no trouble (assuming the element size isn't too crazy), so most likely, your data is already in cache.
In any case, the most effective solution is probably to tweak your code. Use lots of temporaries (to indicate to the compiler that the memory address isn't important), rather than writing/reading into the array constantly. Reorder your code so loads are performed once, at the start of the loop, and break up dependency chains as much as possible.
Manually unrolling the loop gives you more flexibility to achieve these things.
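Here is a small sketch of the "use temporaries" advice. Both function names are made up for illustration. In the first version the compiler may have to assume that out could alias an element of the vector, so it may store to memory on every iteration; in the second, the running total is a local that can live in a register and is written back once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Updates the output through a reference on every iteration.
void sum_into(const std::vector<double>& in, double& out) {
    out = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i)
        out += in[i];
}

// Accumulates in a local temporary and writes the result once.
void sum_via_local(const std::vector<double>& in, double& out) {
    double acc = 0.0;
    for (double x : in)
        acc += x;
    out = acc;
}
```

Both produce the same result. Whether the second is measurably faster depends on the compiler and optimization level, so check the generated assembly or time it.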
And finally, two obvious tools you should use rather than guessing about the cache behavior:
A profiler, and cachegrind if available. A good profiler can give you a lot of statistics on cache misses, and cachegrind gives you a lot of information too.
Us here at StackOverflow. If you post your loop code and ask how its performance can be improved, I'm sure a lot of us will find it a fun challenge.
But as others have mentioned, when working with performance, don't guess. You need hard data and measurements, not hunches and gut feelings.
Unless your code does something completely different in between writing to the array, then most of the array will probably be kept in the cache.
Unfortunately there isn't anything you can do to affect what is in the cache, apart from rewriting your algorithm with the cache in mind. Try to use as little memory as possible in between writing to the memory: don't use lots of variables, don't call many other functions, and try to write to the same region of the array consecutively.
I doubt that this is possible, at least on a high-level multitasking operating system. You can't guarantee that your process won't be pre-empted and lose the CPU. If your process then owns the cache, other processes can't use it, which would make their execution very slow, and complicate things a great deal. You really don't want to run a modern several-GHz processor without cache, just because one application has locked all others out of it.
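In this case, array2 will be quite "hot" and stay in cache for that reason alone. The trick is keeping array1 out of cache (!). You're reading it only once, so there is no point in caching it. The SSE2 instruction for non-temporal access is MOVNTPD, intrinsic void _mm_stream_pd(double *destination, __m128d source). Note that MOVNTPD is a store, so it helps with data you write once; for data you only read once, a prefetchnta hint is the closest equivalent.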
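For the curious, this is roughly what using that intrinsic looks like. It is a sketch only: fill_streaming and the HAVE_SSE2_STREAM macro are names invented for this example, the code falls back to plain stores when SSE2 is not available, and non-temporal stores can easily make things slower if the data is read again soon.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

// Fills dst[0..n) with `value`, using non-temporal (cache-bypassing)
// stores where the platform supports them.
void fill_streaming(double* dst, std::size_t n, double value) {
    assert(dst != nullptr || n == 0);
    std::size_t i = 0;
#ifdef HAVE_SSE2_STREAM
    // _mm_stream_pd needs a 16-byte aligned address: use ordinary
    // stores until we reach one.
    while (i < n && reinterpret_cast<std::uintptr_t>(dst + i) % 16 != 0)
        dst[i++] = value;
    const __m128d v = _mm_set1_pd(value);
    for (; i + 2 <= n; i += 2)
        _mm_stream_pd(dst + i, v);   // two doubles per streaming store
    _mm_sfence();                    // order the streaming stores before later stores
#endif
    for (; i < n; ++i)               // leftover element, or the whole array without SSE2
        dst[i] = value;
}
```

As the other answers say, measure before and after. For a 300-element array that is reused constantly, you want it in cache, so streaming stores are the wrong tool there.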
Even if you could, it's a bad idea.
Modern desktop computers use multi-core CPUs, and not every cache level is shared between cores. On Intel's Core i7 chips, for example, each core has its own L1 and L2 caches, and only the on-die 8MB L3 cache is shared.
So, if you were able to lock data in the cache on the computer I'm typing this from, you can't even guarantee this process will be scheduled on the same core, so that cache lock may be totally useless.
If your writes are slow, make sure that no other CPU core is writing in the same memory area at the same time.
When you have a performance problem, don't assume anything, measure first. For example, comment out the writes, and see if the performance is any different.
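A bare-bones way to do that kind of before/after measurement is sketched below. The helper name seconds_taken is hypothetical. A single run like this is noisy, so repeat it several times and compare the best or median times.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// steady_clock is used because it never jumps backwards.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

You would call it once with the writes in place and once with them commented out, for example seconds_taken([&] { run_loop(); }) where run_loop stands for your own code. Make sure the compiler cannot optimize the whole loop away once the writes are gone, or the comparison is meaningless.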
If you are writing to an array of structures, use a structure pointer to cache the address of the structure so you are not doing the array multiply each time you do an access. Make sure you are using the native word length for the array indexer variable for maximum optimisation.
As other people have said, you can't control this directly, but changing your code may indirectly enable better caching. If you're running on Linux and want to get more insight into what's happening with the CPU cache when your program runs, you can use the Cachegrind tool, part of the Valgrind suite. It's a simulation of a processor, so it's not totally realistic, but it gives you information that is hard to get any other way.
It might be possible to use some assembly code, or as onebyone pointed out, assembly intrinsics, to prefetch lines of memory into the cache, but that would cost a lot of time to tinker with it.
Just for trial, try to read in all the data (in a manner that the compiler won't optimize away), and then do the write. See if that helps.
In the early boot phases of coreboot (formerly LinuxBIOS), the RAM has not been initialized yet (this is BIOS code), so the firmware sets up something called Cache-as-RAM (CAR): it uses the processor cache as RAM even though it is not backed by actual RAM.