What data will be cached? [closed]

What data will be cached? [closed] - c++

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
The description of cache in the book is always very general. I am a student in the field of architecture. I want to understand the behavior of the cache in more detail.
In the c/c++ language code, what data will be loaded from the memory to the cache? Will it be loaded into the cache when it is frequently used? For example, when I write a for loop in C language, I often use variables i, j, and k. Will these also be loaded into the cache? C language local variables are generally placed in the stack area, global variables will be placed in the data area? Will these be loaded into the cache first when they are used? Does the data have to go through the cache to reach the register and then to the CPU?
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?

Normally all the memory your C++ program uses (code and data) is in cacheable memory.
Any access (read or write) to any C++ object1 will result in the cache line containing it being hot in case, assuming a normal CPU cache: set-associative, write-back / write-allocate1, even if it was previously not hot.
The simplest design is that each level of cache fetches data through the next outer layer, so after a load miss, data is hot in all levels of cache. But you can have outer caches that don't read-allocate, and act as victim caches. Or outer levels that are Exclusive of inner caches, to avoid wasting space caching the same data twice (https://en.wikipedia.org/wiki/Cache_inclusion_policy). But whatever happens, right after a read or write, at least the inner-most (closest to that CPU core) level of cache will have the data hot, so accessing it again right away (or an adjacent item in the same cache line) will be fast. Different design choices affect the chances of a line still being hot if the next access is after a bunch of other accesses. And if hot, which level of cache you may find it in. But the super basics are that any memory compiler-generated code touches ends up in cache. CPU cache transparently caches physical memory.
Many cache lines can be hot at the same time, not aliasing each other. i.e. caches have many sets. Some access patterns are pessimal, like multiple pointers all offset from each other by 4k which will make all accesses alias the same set in L1d cache, as well as sometimes having extra 4k-aliasing penalties in the CPU's memory disambiguation logic. (Assuming a 4k page size like on x86). e.g. L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes - memory performance effects can get very complicated. Knowing some theory is enough to understand what's generally good, but the exact details can be very complex. (Sometimes even for true experts like Dr. Bandwidth, e.g. this case).
Footnote 1: Loosely, An object is a named variable or dynamically allocated memory pointed to by a pointer.
Footnote 2: Write-back cache with a write-allocate policy is near universal for modern CPUs, also a pseudo-LRU replacement policy; see wikipedia. A few devices have access patterns that benefit from caches that only allocate on read but not write, but CPUs benefit from write-allocate. A modern CPU will almost always have a multi-level cache hierarchy, with each level being set-associative with some level of associativity. Some embedded CPUs may only have 1 level, or even no cache, but you'd know if you were writing code specifically for a system like that.
Modern large L3 caches sometimes use a replacement policy.
Of course, optimization can mean that some local variables (especially loop counters or array pointers) can get optimized into a register and not exist in memory at all. Registers are not part of the CPU cache or memory at all, they're a separate storage space. People often describe things as "compiler caches the value in a register", but do not confuse that with CPU cache. (related: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ and When to use volatile with multi threading?)
If you want to see what the compiler is making the CPU do, look at the compiler's asm output. How to remove "noise" from GCC/clang assembly output?. Every memory access in the asm source is an access in computer-architecture terms, so you can apply what you know about cache state given an access pattern to figure out what will happen with a set-associative cache.
Also related:
Which cache mapping technique is used in intel core i7 processor?
Modern Microprocessors: A 90-Minute Guide!
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? - why we have multi-level caches, and some real numbers for the cache heirarchies of Haswell (like Skylake) and Piledriver (fortunately obsolete, but an interesting example of a possible design).

Generally, the most recently used cache lines will be stored in the cache. For short loops, loop counter variables are normally stored in a CPU register. For longer loops, loop counter variables will probably be stored in the cache, unless one loop iteration runs for such a long time that the loop counter gets evicted from the cache due to the CPU doing other work.
Most variables will generally be cached after the first access (or beforehand if the cache prefetcher does a good job), irrespective of how often they are used. A higher frequency of usage will only prevent the memory from being evicted from the cache, but won't influence it being cached in the first place. However, some CPU architectures offer so-called non-temporal read and write instructions, which bypass the cache. These instructions are useful if the programmer knows in advance that a memory location will only be accessed once, and therefore should not be cached. But generally, these instructions should not be used, unless you know exactly what you are doing.
The CPU cache does not care whether variables are stored on the heap or stack. Memory is simply cached according to a "most recently used" algorithm, or, to be more accurate, the cache is evicted based on a "least recently used" algorithm, whenever new room in the cache is required for a new memory access.
In the case of local variables stored on the stack, there is a high chance that the cache line of that variable is already cached due to the program using that stack cache line recently for something else. Therefore, local variables generally have good cache performance. Also, the cache prefetcher works very well with the stack, because the stack grows in a linear fashion.
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?
Yes, first, the cache line containing p will be loaded into the cache (if it is not already cached or stored in a CPU register). Then, the cache line containing *p will be loaded into the cache.

Related

Scenarios when software prefetching manual instructions are reasonable

I have read about that on x86 and x86-64 Intel gcc provides special prefetching instructions:
#include <xmmintrin.h>
enum _mm_hint
{
_MM_HINT_T0 = 3,
_MM_HINT_T1 = 2,
_MM_HINT_T2 = 1,
_MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);
Programs can use the _mm_prefetch intrinsic on any
pointer in the program. And The different hints to be used with the _mm_prefetch
intrinsic are implementation defined. Generally said is that each of the hints have its own meaning.
_MM_HINT_T0
fetches data to all levels of the cache for inclusive caches
and to the lowest level cache for exclusive caches
_MM_HINT_T1 hint pulls the data into L2 and
not into L1d. If there is an L3 cache the _MM_HINT_T2
hints can do something similar for it
_MM_HINT_NTA, allows telling the processor to treat the prefetched cache line specially
So can someone describe examples when this instruction used?
And how to properly choose the hint?

The idea of prefetching is based upon these facts:
Accessing memory is very expensive the first time.
The first time a memory address1 is accessed is must be fetched from memory, it is then stored in the cache hierarchy2.
Accessing memory is inherently asynchronous.
The CPU doesn't need any resource from the core to perform the lengthiest part of a load/store3 and thus it can be easily done in parallel with other tasks4.
Thanks to the above it makes sense to try a load before it is actually needed so that when the code will actually need the data, it won't have to wait.
It is very worth nothing that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily deep; so sometimes it needs the help of the programmer to perform optimally.
The cache hierarchy is, by its very nature, an aspect of the micro-architecture not the architecture (read ISA). Intel or AMD cannot give strong guarantees on what these instructions do.
Furthermore using them correctly is not easy as the programmer must have clear in mind how many cycles each instruction can take.
Finally, the latest CPU are getting more and more good at hiding memory latency and lowering it.
So in general prefetching is a job for the skilled assembly programmer.
That said the only possible scenario is where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always update a state and it must perform as fast as possible, it is worth, when setting the hardware that uses such interrupt, to prefetch the state variable.
Regarding the different level of prefetching, my understanding is that different levels (L1 - L4) correspond to different amounts of sharing and polluting.
For example prefetch0 is good if the thread/core that executes the instruction is the same that will read the variable.
However, this will take a line in all the caches, eventually evicting other, possibly useful, lines.
You can use this for example when you know that you'll need the data surely in short.
prefetch1 is good to make the data quickly available for all core or core group (depending on how L2 is shared) without polluting L1.
You can use this if you know that you may need the data or that you'll need it after having done with another task (that takes priority in using the cache).
This is not as fast as having the data in L1 but much better than having it in memory.
prefetch2 can be used to take out most of the memory access latency since it moves the data in the L3 cache.
It doesn't pollute L1 or L2 and it is shared among cores, so it's good for data used by rare (but possible) code paths or for preparing data for other cores.
prefetchnta is the easiest to understand, it is a non-temporal move. It avoids creating an entry in every cache line for a data that is accessed only once.
prefetchw/prefetchwnt1 are like the others but makes the line Exclusive and invalidates other cores lines that alias this one.
Basically, it makes writing faster as it is in the optimal state of the MESI protocol (for cache coherence).
Finally, a prefetch can be done incrementally, first by moving into L3 and then by moving into L1 (just for the threads that need it).
In short, each instruction let you decide the compromise between pollution, sharing and speed of access.
Since these all require to keep track of the use of the cache very carefully (you need to know that it's not worth creating and entry in the L1 but it is in the L2) the use is limited to very specific environments.
In a modern OS, it's not possible to keep track of the cache, you can do a prefetch just to find your quantum expired and your program replaced by another one that evicts the just loaded line.
As for a concrete example I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used and interrupt to periodically monitor the event, in such case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.
Another, unorthodox, use of the prefetching is to move the data into the cache.
This is useful if you want to test the cache system or unmap a device from memory relying on the cache to keep the data a bit longer.
In this case moving to L3 is enough, not all CPU has an L3, so we may need to move to L2 instead.
I understand these examples are not very good, though.
1 Actually the granularity is "cache lines" not "addresses".
2 Which I assume you are familiar with. Shortly put: It, as present, goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads, L2 usually is like L1 but some model may have L2 shared across pairs of cores.
3 The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).
4 However any resource used to access the memory can become a critical issue as pointed out by #Leeor and proved by the Linux kernel developer.

Flushing the cache to prevent benchmarking fluctiations

I am running the c++ code of someone to do the benchmarking on a dataset. The issue I have is that often I get a timing for the first run, and these numbers massively change (i.e. 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to CPU's automatic caching. Is there a way to flush the cache, or prevent these fluctuations somehow?

Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, it's completely different instructions for each processor architecture.
All current x86 processors does have a clflush instruction, that flushes one cache-line, but to do that, you have to have the address of the data (or code) you want to flush. Which is fine for small and simple data structures, not so good if you have a binary tree that is all over the place. And of course, not at all portable.
In most environments, reading and writing a large block of alternative data, e.g. something like:
// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" cache.
for(int i = 0; i < bigger_than_cachesize; i++)
{
p[i] = rand();
}
Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.
The above won't flush instruction caches - that is a lot more difficult to do, basically, you have to run some (large enough) other piece of code to do that reliably. However, instruction caches tend to have less effect on overall benchmark performance (instruction cache is EXTREMELY important for modern processor's perforamnce, that's not what I'm saying, but in the sense that the code for a benchmark is typically small enough that it all fits in cache, and the benchmark runs many times over the same code, so it's only slower the first iteration)
Other ideas
Another way to simulate "non-cache" behaviour is allocate a new area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark or using an array containing the data, and output results, such that each run has it's own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...

Here's my basic approach:
Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically), or if you don't, some reasonable multiple of the largest LLC size on the platform of interest1.
memset the memory region to some non-zero value: 1 will do just fine.
"Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
Here are some notes on why this is generally works and why doing less may not work - the details are x86-centric, but similar concerns will apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets be under-represented and so not all existing values can even be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs like the LLC on recent Intel detect when a "streaming" pattern is present, and switch from LRU to MRU since LRU is about the worst-possible replacement policy for such a load. The effect is that no matter how many times you stream though memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just LLC amount of memory for (a) the same reason you allocate more than the LLC size (virtual access vs physical caching) and (b) because random access needs more accesses before you have a high likelihood of hitting every set enough times (c) caches are usually only pseudo-LRU, so you need more than the number of accesses you'd expect under exact-LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size2.
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states, and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be an replicate that.
1Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
2 I just throw that out there as if it was "easy", but in reality may be very difficult depending on your exact problem.

How to test main memory access time?

Looking for a C/C++ program to test how long it takes to access a fixed piece of memory, specifically in RAM.
How do I ensure testing access time is not of cache or TLB data?
For example, can I "disable" all cache/TLB?
Or can I specify a specific address in RAM to write/read only?
On the other hand, how would I ensure I am only testing cache?
Are there ways to tell the compiler where to save and read from, cache/ram?
For example, is there a well know standard program (in one of these books?) that is know for this test?
I did see this but I do not understand how adjusting the size of the list, you can control whether the memory accesses hit L1 cache, L2 cache, or main memory: measuring latencies of memory
How can one correctly program this test?

Basically, as the list grows you'll see the performance worsen in steps as another layer of caching is overwhelmed. The idea is simple... if the cache holds the last N units of memory you've accessed, then looping around a buffer of even N+1 units should ensure constant cache misses. (There're more details/caveats in the "measuring latencies of memory" answer you link to in your question).
You should be able to get some idea of the potential size of the the largest cache that might front your RAM from hardware documentation - as long as you operate on more memory than that you should be measuring physical RAM times.

What is cache in C++ programming? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
Firstly I would like to tell that I come from a non-Computer Science background & I have been learning the C++ language.
I am unable to understand what exactly is a cache?
It has different meaning in different contexts.
I would like to know what would be called as a cache in a C++ program?
For example, if I have some int data in a file. If I read it & store in an int array, then would this mean that I have 'cached' the data?
To me this seems like common sense to use the data since reading from a file is always bad than reading from RAM.
But I am a little confused due to this article.
In a CPU there can be several caches, to speed up instructions in
loops or to store often accessed data. These caches are small but very
fast. Reading data from cache memory is much faster than reading it
from RAM.
It says that reading data from cache is much faster than from RAM.
I thought RAM & cache were the same.
Can somebody please clear my confusion?
EDIT: I am updating the question because previously it was too broad.
My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are
simply used to cache information about a row in the file
What does cache in this context mean?

Any modern CPU has several layers of cache that are typically named things like L1, L2, L3 or even L4. This is called a multi-level cache. The lower the number, the faster the cache will be.
It's important to remember that the CPU runs at speeds that are significantly faster than the memory subsystem. It takes the CPU a tiny eternity to wait for something to be fetched from system memory, many, many clock-cycles elapse from the time the request is made to when the data is fetched, sent over the system bus, and received by the CPU.
There's no programming construct for dealing with caches, but if your code and data can fit neatly in the L1 cache, then it will be fastest. Next is if it can fit in the L2, and so on. If your code or data cannot fit at all, then you'll be at the mercy of the system memory, which can be orders of magnitude slower.
This is why counter-intuitive things like unrolling loops, which should be faster, might end up being slower because your code becomes too large to fit in cache. It's also why shaving a few bytes off a data structure could pay huge dividends even though the memory footprint barely changes. If it fits neatly in the cache, it will be faster.
The only way to know if you have a performance problem related to caching is to benchmark very carefully. Remember each processor type has varying amounts of cache, so what might work well on your i7 CPU might be relatively terrible on an i5.
It's only in extremely performance sensitive applications that the cache really becomes something you worry about. For example, if you need to maintain a steady 60FPS frame rate in a game, you'll be looking at cache problems constantly. Every millisecond counts here. Likewise, anything that runs the CPU at 100% for extended periods of time, such as rendering video, will want to pay very close attention to how much they could gain from adjusting the code that's emitted.
You do have control over how your code is generated with compiler flags. Some will produce smaller code, some theoretically faster by unrolling loops and other tricks. To find the optimal setting can be a very time-consuming process. Likewise, you'll need to pay very careful attention to your data structures and how they're used.

[Cache] has different meaning in different contexts.
Bingo. Here are some definitions:
Cache
Verb
Definition: To place data in some location from which it can be more efficiently or reliably retrieved than its current location. For instance:
Copying a file to a local hard drive from some remote computer
Copying data into main memory from a file on a local hard drive
Copying a value into a variable when it is stored in some kind of container type in your procedural or object oriented program.
Examples: "I'm going to cache the value in main memory", "You should just cache that, it's expensive to look up"
Noun 1
Definition: A copy of data that is presumably more immediately accessible than the source data.
Examples: "Please keep that in your cache, don't hit our servers so much"
Noun 2
Definition: A fast access memory region that is on the die of a processor, modern CPUs generally have several levels of cache. See cpu cache, note that GPUs and other types of processors will also have their own caches with different implementation details.
Examples: "Consider keeping that data in an array so that accessing it sequentially will be cache coherent"

My definition for Cache would be some thing that is in limited amount but faster to access as there is less area to look for. If you are talking about caching in any programming language then it means you are storing some information in form of a variable(variable is nothing a way to locate your data in memory) in memory. Here memory means both RAM and physical cache (CPU cache).
Physical/CPU cache is nothing but memory that is even more used than RAM, it actually stores copies of some data on RAM which is used by CPU very often. You have another level of categorisation after that as well which is on board cache(faster) and off-board cache. youu can see this link

I am updating the question because previously it was too broad. My
confusion started with this answer. He says
RowData and m_data are specific to my implementation,
but they are simply used to cache information about a row in the file
What does cache in this context mean?
This particular use means that RowData is held as a copy in memory, rather than reading (a little bit of) the row from a file every time we need some data from it. Reading from a file is a lot slower [1] than holding on to a copy of the data in our program's memory.
[1] Although in a modern OS, the actual data from the hard-disk is probably held in memory, in file-system cache, to avoid having to read the disk many times to get the same data over and over. However, this still means that the data needs to be copied from the file-system cache to the application using the data.

How can i know my array is in cache?

Lets say my array is 32KB, L1 is 64 KB. Does Windows use some of it while program is running? Maybe I am not able to use L1 because windows is making other programs work? Should I set priority of my program to use all cache?
for(int i=0;i<8192;i++)
{
array_3[i]+=clock()*(rand()%256);//clock() and rand in cache too?
//how many times do I need to use a variable to make it stay in cache?
//or cache is only for reading? look below plz
temp_a+=array_x[i]*my_function();
}
The program is in C/C++.
Same thing for L2 too please.
Also are functions kept in cache? Cache is read only? (If I change my array then it loses the cache bond?)
Does the compiler create the asm codes to use cache more yield?
Thanks

How can i know my array is in cache?
In general, you can't. Generally speaking, the cache is managed directly by hardware, not by Windows. You also can't control whether data resides in the cache (although it is possible to specify that an area of memory shouldn't be cached).
Does windows use some of it while program is running? Maybe i am not able to use L1 because windows is making other programs work? Should i set priority of my program to use all cache?
The L1 and L2 caches are shared by all processes running on a given core. When your process is running, it will use all of cache (if it needs it). When there's a context switch, some or all of the cache will be evicted, depending on what the second process needs. So next time there's a context switch back to your process, the cache may have to be refilled all over again.
But again, this is all done automatically by the hardware.
also functions are kept in cache?
On most modern processors, there is a separate cache for instructions. See e.g. this diagram which shows the arrangement for the Intel Nehalem architecture; note the shared L2 and L3 caches, but the separate L1 caches for instructions and data.
cache is read only?(if i change my array then it loses the cache bond?)
No. Caches can handle modified data, although this is considerably more complex (because of the problem of synchronising multiple caches in a multi-core system.)
does the compiler create the asm codes to use cache more yield?
As cache activity is generally all handled automatically by the hardware, no special instructions are needed.

Cache is not directly controlled by the operating system, it is done
in hardware
In case of a context switch, another application may modify the
cache, but you should not care about this. It is more important to
handle cases when your program behaves cache unfriendly.
Functions are kept in cache (I-Cahce , instruction cache)
Cache is not read only, when you write something it goes to [memory
and] the cache.

The cache is primarily controlled by the hardware. However, I know that Windows scheduler tends to schedule execution of a thread to the same core as before specifically because of the caches. It understands that it will be necessary to reload them on another core. Windows is using this behavior at least since Windows 2000.

As others have stated, you generally cannot control what is in cache. If you are writing code for high-performance and need to rely on cache for performance, then it is not uncommon to write your code so that you are using about half the space of L1 cache. Methods for doing so involve a great deal of discussion beyond the scope of StackOverflow questions. Essentially, you would want to do as much work as possible on some data before moving on to other data.
As a matter of what works practically, using about half of cache leaves enough space for other things to occur that most of your data will remain in cache. You cannot rely on this without cooperation from the operating system and other aspects of the computing platform, so it may be a useful technique for speeding up research calculations but it cannot be used where real-time performance must be guaranteed, as in operating dangerous machinery.
There are additional caveats besides how much data you use. Using data that maps to the same cache lines can evict data from cache even though there is plenty of cache unused. Matrix transposes are notorious for this, because a matrix whose row length is a multiple of a moderate power of two will have columns in which elements map to a small set of cache lines. So learning to use cache efficiently is a significant job.

As far as I know, you can't control what will be in the cache. You can declare a variable as register var_type a and then access to it will be in a single cycle(or a small number of cycles). Moreover, the amount of cycles it will take you to access a chunk of memory also depends on virtual memory translation and TLB.
It should be noted that the register keyword is merely a suggestion and the compiler is perfectly free to ignore it, as was suggested by the comment.

Even though you may not know which data is in cache and which not, you still may get an idea how much of the cache you are utilizing. Modern processor have quite many performance counters and some of them related to cache. Intel's processors may tell you how many L1 and L2 misses there were. Check this for more details of how to do it: How to read performance counters on i5, i7 CPUs

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js