Branch-mispredictions versus cache misses [closed] - c++

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
Consider the following two alternative pieces of code:
Alternative 1:
if (variable != new_val) // (1)
variable = new_val;
f(); // This function reads `variable`.
Alternative 2:
variable = new_val; // (2)
f(); // This function reads `variable`.
Which alternative is "statistically" faster? Assume variable is in the L1 cache before (1) or (2).
I guess that alternative (1) is faster even if the branch-misprediction rate is high, but I don't really know the cost of an "if". My guess is based on the assumption that cache misses are far more expensive than branch mispredictions, but I don't know that for sure.
What if variable wasn't in cache before (1) or (2)? Does it change the situation too much?
NOTE: Since the situation could change a lot among different CPUs, you can base your answer on an architecture you are familiar with, although widely used CPUs like modern Intel architectures are preferred. The goal of my question is really to learn a bit more about how CPUs work.

Normally, alternative 2 is faster because it's less machine code executing, and the store buffer will decouple unconditional stores from other parts of the core, even if they miss in cache.
If alternative 1 were consistently faster, compilers would emit asm that did that, but they don't, so it isn't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.
Reading variable in the first place already touches memory for both variables (if neither is in registers). If you expect new_val to almost always be the same (so the branch predicts well), and for that load to miss in cache, branch prediction plus speculative execution can help decouple later reads of variable from that cache-miss load. But the load still has to complete before the branch condition can be checked, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise you're hiding much of the cache-miss load penalty by making more later work independent of it, allowing OoO exec up to the limit of the ROB size.
Other than breaking the data dependency, if f() inlines and variable optimizes into a register, it would be pointless to branch. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in shared state and dirtying it would interfere with them reading values of other global vars. (False sharing)
Note that later reloads of variable can use the newly-stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if f() didn't inline and use the new_value load result directly, its use of variable still doesn't have to wait for a possible store miss on variable.
Avoiding false-sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.
Two questions linked in comments by @EOF discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with std::atomic variables, because false sharing is an even bigger deal there. (And stores with the default mo_seq_cst memory order are slow on most ISAs other than AArch64, draining the store buffer.)
Strange optimization? in `libuv`. Please explain
C optimization: conditional store to avoid dirtying a cache line
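The check-before-write pattern those questions discuss can be sketched minimally for an atomic flag. This is an illustrative sketch, not code from either question; the function and variable names are made up:

```cpp
#include <atomic>

// Skip the store when the flag already holds the value we'd write:
// a store would take the cache line in Modified state (an RFO),
// potentially false-sharing with other threads reading nearby data.
void set_flag_if_needed(std::atomic<bool>& flag) {
    // Cheap relaxed read first; only store if the value would change.
    if (!flag.load(std::memory_order_relaxed))
        flag.store(true, std::memory_order_relaxed);
}
```

Whether this wins depends on how often the store would be silent and on contention; as the answer says, confirm with perf counters rather than assuming.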


What data will be cached? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
The descriptions of caches in books are always very general. I am a student in the field of computer architecture, and I want to understand the behavior of caches in more detail.
In C/C++ code, what data will be loaded from memory into the cache? Is data loaded into the cache when it is frequently used? For example, when I write a for loop in C, I often use the variables i, j, and k. Will these also be loaded into the cache? Local variables in C are generally placed on the stack, and global variables in the data area; will these be loaded into the cache when they are used? Does data have to go through the cache to reach a register and then the CPU?
The pointer variable p stores the address of some data. If I use the pointer *p to access a variable, will p be loaded into the cache first, and then *p?
Normally all the memory your C++ program uses (code and data) is in cacheable memory.
Any access (read or write) to any C++ object¹ will result in the cache line containing it being hot in cache, even if it was previously not hot - assuming a normal CPU cache: set-associative, write-back / write-allocate².
The simplest design is that each level of cache fetches data through the next outer level, so after a load miss, data is hot in all levels of cache. But you can have outer caches that don't read-allocate and act as victim caches, or outer levels that are exclusive of inner caches, to avoid wasting space caching the same data twice (https://en.wikipedia.org/wiki/Cache_inclusion_policy). Whatever happens, right after a read or write, at least the inner-most (closest to that CPU core) level of cache will have the data hot, so accessing it again right away (or an adjacent item in the same cache line) will be fast. Different design choices affect the chances of a line still being hot if the next access comes after a bunch of other accesses, and, if hot, which level of cache you may find it in. But the super basics are that any memory compiler-generated code touches ends up in cache; CPU caches transparently cache physical memory.
Many cache lines can be hot at the same time, not aliasing each other. i.e. caches have many sets. Some access patterns are pessimal, like multiple pointers all offset from each other by 4k which will make all accesses alias the same set in L1d cache, as well as sometimes having extra 4k-aliasing penalties in the CPU's memory disambiguation logic. (Assuming a 4k page size like on x86). e.g. L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes - memory performance effects can get very complicated. Knowing some theory is enough to understand what's generally good, but the exact details can be very complex. (Sometimes even for true experts like Dr. Bandwidth, e.g. this case).
Footnote 1: Loosely, an object is a named variable or dynamically allocated memory pointed to by a pointer.
Footnote 2: Write-back cache with a write-allocate policy is near universal for modern CPUs, also a pseudo-LRU replacement policy; see wikipedia. A few devices have access patterns that benefit from caches that only allocate on read but not write, but CPUs benefit from write-allocate. A modern CPU will almost always have a multi-level cache hierarchy, with each level being set-associative with some level of associativity. Some embedded CPUs may only have 1 level, or even no cache, but you'd know if you were writing code specifically for a system like that.
Modern large L3 caches sometimes use an adaptive replacement policy rather than plain pseudo-LRU.
Of course, optimization can mean that some local variables (especially loop counters or array pointers) can get optimized into a register and not exist in memory at all. Registers are not part of the CPU cache or memory at all, they're a separate storage space. People often describe things as "compiler caches the value in a register", but do not confuse that with CPU cache. (related: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ and When to use volatile with multi threading?)
If you want to see what the compiler is making the CPU do, look at the compiler's asm output (see How to remove "noise" from GCC/clang assembly output?). Every memory access in the asm is an access in computer-architecture terms, so you can apply what you know about cache state under a given access pattern to figure out what will happen with a set-associative cache.
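For instance, a tiny function like the one below makes the point about registers vs. memory. Compiled with optimization (e.g. g++ -O2 -S), the counter and the running sum typically live entirely in registers; only the array element reads are memory accesses that involve the cache. (The function name is illustrative.)

```cpp
// At -O2, `i` and `sum` are normally kept in registers and never exist
// in memory at all; arr[i] is the only memory (and hence cache) access.
int sum_array(const int* arr, int n) {
    int sum = 0;                 // usually optimized into a register
    for (int i = 0; i < n; ++i)  // loop counter: register, not stack
        sum += arr[i];           // the actual memory accesses
    return sum;
}
```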
Also related:
Which cache mapping technique is used in intel core i7 processor?
Modern Microprocessors: A 90-Minute Guide!
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? - why we have multi-level caches, and some real numbers for the cache hierarchies of Haswell (like Skylake) and Piledriver (fortunately obsolete, but an interesting example of a possible design).
Generally, the most recently used cache lines will be stored in the cache. For short loops, loop counter variables are normally stored in a CPU register. For longer loops, loop counter variables will probably be stored in the cache, unless one loop iteration runs for such a long time that the loop counter gets evicted from the cache due to the CPU doing other work.
Most variables will generally be cached after the first access (or beforehand if the cache prefetcher does a good job), irrespective of how often they are used. A higher frequency of usage will only prevent the memory from being evicted from the cache, but won't influence it being cached in the first place. However, some CPU architectures offer so-called non-temporal read and write instructions, which bypass the cache. These instructions are useful if the programmer knows in advance that a memory location will only be accessed once, and therefore should not be cached. But generally, these instructions should not be used, unless you know exactly what you are doing.
The CPU cache does not care whether variables are stored on the heap or stack. Memory is simply cached according to a "most recently used" algorithm, or, to be more accurate, the cache is evicted based on a "least recently used" algorithm, whenever new room in the cache is required for a new memory access.
In the case of local variables stored on the stack, there is a high chance that the cache line of that variable is already cached due to the program using that stack cache line recently for something else. Therefore, local variables generally have good cache performance. Also, the cache prefetcher works very well with the stack, because the stack grows in a linear fashion.
The pointer variable p stores the address of the data. If I use the pointer *p to access a variable. Will p be loaded into the cache first, and then *p will be loaded into the cache?
Yes, first, the cache line containing p will be loaded into the cache (if it is not already cached or stored in a CPU register). Then, the cache line containing *p will be loaded into the cache.
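Those are two dependent loads: the address of the second load is the result of the first, so the CPU cannot start fetching *p until p itself has been read. A minimal illustration (the helper name is made up):

```cpp
// Load 1 fetches the pointer (the cache line holding p); load 2 fetches
// the pointee (usually a different cache line). Load 2's address depends
// on load 1's result, so the two misses cannot overlap.
int deref_chain(int** pp) {
    int* p = *pp;   // load 1: the line containing p becomes hot
    return *p;      // load 2: the line containing *p becomes hot
}
```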

setting and checking a 32 bit variable

I am wondering if setting a 32-bit variable after checking it will be faster than just setting it. E.g., variable a is of type uint32:
if( a != 0)
{
a = 0;
}
or
a = 0;
The code will be running in a loop many times, so I want to reduce the time it takes to run.
Note that variable a will be 0 most of the time, so the question can be shortened to: is it faster to check a 32-bit variable or to set it? Thank you in advance!
edit: Thank you all who commented on the question. I have created a for loop and tested both assigning and if-ing 100 thousand times. It turns out assigning is faster (54 ms for if-ing and 44 ms for assigning).
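That experiment can be sketched roughly as below. This is an illustrative microbenchmark, not the poster's exact code; the helper names are made up, and the 54 ms / 44 ms figures are machine-specific and won't reproduce exactly. `volatile` keeps the compiler from deleting the loops outright.

```cpp
#include <chrono>

volatile unsigned a = 0;  // `a` is 0 most of the time, as in the question

// Time the check-then-set variant; returns elapsed nanoseconds.
long long ns_check_then_set(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        if (a != 0) a = 0;            // only store if the value differs
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Time the unconditional-store variant; returns elapsed nanoseconds.
long long ns_plain_set(int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        a = 0;                        // always store
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Note that a microbenchmark like this measures a hot, uncontended cache line; real workloads can behave differently, as the answers below explain.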
What you describe is called a "silent store" optimization.
PRO: unnecessary stores are avoided.
This can reduce pressure on the store to load forwarding buffers, a component of a modern out-of-order CPU that is quite expensive in hardware, and, as a result, is often undersized, and therefore a performance bottleneck. On Intel x86 CPUs there are performance Event Monitoring counters (EMON) that you can use to investigate whether this is a problem in your program.
Interestingly, it can also reduce the number of loads your program does. First, in software: if the stores are not eliminated, the compiler may be unable to prove that they do not write to the memory occupied by a different variable (the so-called address and pointer disambiguation problem), so the compiler may generate unnecessary reloads of such possibly-but-not-actually conflicting memory locations. Eliminate the stores, and some of these loads may also be eliminated. Second, in hardware: most modern CPUs have store-to-load dependency predictors, and fewer stores increase their accuracy. If a dependency is predicted, the load may not actually be performed by the hardware, and may be converted into a register-to-register move. This was the subject of recent patent lawsuits that the University of Wisconsin asserted against Intel and Apple, with awards exceeding hundreds of millions of dollars.
But the most important reason to eliminate unnecessary stores is to avoid unnecessarily dirtying the cache. A dirty cache line eventually has to be written back to memory even if its value is unchanged, wasting power. In many systems it will eventually be written to flash or SSD, wasting power and consuming the limited write cycles of the device.
These considerations have motivated academic research in silent stores, such as http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8947&rep=rep1&type=pdf. However, a quick Google Scholar search shows these papers are mainly from 2000-2004, and I am aware of no modern CPU implementing true silent store elimination - actually having hardware read the old value. I suspect this lack of deployment is mainly because CPU design went on pause for more than a decade as focus changed from desktop PCs to cell phones. Now that cell phones have almost caught up to the sophistication of 2000-era desktop CPUs, it may arise again.
CON: Eliminating the silent store in software takes more instructions. Worse, it takes a branch. If the branch is not very predictable, the resulting branch mispredictions will consume any savings. Some machines have instructions that allow you to eliminate such stores without a branch: e.g. Intel's LRBNI vector store instructions with a conditional vector mask; I believe AVX has similar masked-store instructions. If you or your compiler can use such instructions, then the cost is just the load of the old value and a vector compare; if the old value is already in a register, just the compare.
By the way, you can get some benefit without completely eliminating the store, by redirecting it to a safe address. Instead of
if (a[i] != 0) a[i] = 0;
do
ptr = &a[i]; if (*ptr == 0) ptr = &safe; *ptr = 0;
You are still doing the store, but not dirtying as many cache lines. I have used this way of faking a conditional store instruction a lot. It is very unlikely that a compiler will do this sort of optimization for you.
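A compilable version of that redirection trick might look like the following; the name `safe` comes from the pseudocode above, everything else is illustrative. Compilers will often turn the pointer selection into a conditional move, so there is no store-skipping branch to mispredict:

```cpp
// The store always executes, but when the element is already 0 it is
// redirected to a scratch word, so a[i]'s cache line is never dirtied.
static int safe;  // dummy location we don't mind dirtying

void conditional_zero(int* a, int i) {
    int* ptr = &a[i];
    if (*ptr == 0) ptr = &safe;  // typically compiled to a cmov, not a branch
    *ptr = 0;                    // unconditional store, possibly redirected
}
```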
So, unfortunately, the answer is "it depends". If you are on a vector-mask machine or a GPU, and the silent stores are very common - say, more than 30% - it's worth thinking about. In scalar code, you probably need more like 90% silent.
Ideally, measure it yourself. Although it can be hard to make realistic measurements.
I would start with what is probably the best case for this optimization:
char a[1024*1024*1024]; // zero filled
const int cachelinesize = 64;
for (char* p = a; p < a + sizeof(a); p += cachelinesize)
    if (*p != 0) *p = 0;
Every store is eliminated here - make sure that the compiler still emits the loads and compares. Good branch prediction, etc.
If this limit case shows no benefit, your realistic code is unlikely to.
Come to think of it, I ran such a benchmark back in the last century. The silent-store code was 2x faster, since it was totally memory bound and the silent stores generate no dirty cache lines on a write-back cache. Recheck this, and then try a more realistic workload.
But first, measure whether you are memory bottlenecked or not.
By the way: if hardware implementations of silent store elimination become common, then you will never want to do it in software.
But at the moment I am aware of no hardware implementations of silent store elimination in commercially available CPUs.
As ECC becomes more common, silent store elimination becomes almost free - since you have to read the old bytes anyway to recalculate ECC in many cases.
The assignment would serve you better: firstly, the if statement is redundant, and omitting it makes the code clearer; also, the assignment alone should be faster. If you are not quite sure, you can create a simple function to test it with and without the if statement.

Flushing the cache to prevent benchmarking fluctuations

I am running someone else's C++ code to benchmark it on a dataset. The issue I have is that I often get one timing for the first run, and the numbers change massively (e.g. from 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to the CPU's automatic caching. Is there a way to flush the cache, or to prevent these fluctuations somehow?
Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, it's completely different instructions for each processor architecture.
All current x86 processors do have a clflush instruction that flushes one cache line, but to use it you have to have the address of the data (or code) you want to flush. That is fine for small and simple data structures, not so good if you have a binary tree that is all over the place. And of course it is not at all portable.
In most environments, reading and writing a large block of alternative data, e.g. something like:
// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" cache.
for(size_t i = 0; i < bigger_than_cachesize; i++)
{
p[i] = rand();
}
Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.
The above won't flush instruction caches - that is a lot more difficult to do; basically, you have to run some (large enough) other piece of code to do it reliably. However, instruction caches tend to have less effect on overall benchmark performance. That is not to say the instruction cache is unimportant for a modern processor's performance - it is EXTREMELY important - but the code for a benchmark is typically small enough that it all fits in cache, and the benchmark runs many times over the same code, so only the first iteration is slower.
Other ideas
Another way to simulate "non-cached" behaviour is to allocate a new area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark, or using an array for the input data and the output results, such that each run has its own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...
Here's my basic approach:
Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically), or if you don't, some reasonable multiple of the largest LLC size on the platform of interest1.
memset the memory region to some non-zero value: 1 will do just fine.
"Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
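Under stated assumptions (a 16 MiB LLC and 64-byte lines - substitute values for your actual target, or query them at runtime), the four steps above might be sketched like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

volatile uint64_t sink;  // step 3: volatile global to sink values into

// Returns the accumulated sum so the reads can't be optimized out.
uint64_t flush_llc() {
    const size_t llc_bytes = 16 * 1024 * 1024;    // assumed LLC size
    const size_t region = 2 * llc_bytes;          // step 1: 2x the LLC
    const size_t line = 64;                       // assumed line size
    std::vector<unsigned char> buf(region, 1);    // step 2: fill with 1s
    sink = reinterpret_cast<uintptr_t>(buf.data());  // step 3: sink pointer

    // Step 4: pseudo-random reads until each line is touched ~10x on average.
    const size_t lines = region / line;
    uint64_t sum = 0;
    uint32_t seed = 12345;
    for (size_t i = 0; i < lines * 10; ++i) {
        seed = seed * 1103515245u + 12345u;       // simple LCG
        sum += buf[(seed % lines) * line];
    }
    sink = sum;                                   // sink the sum too
    return sum;
}
```

Since every byte read is 1, the returned sum equals the number of reads, which is a cheap sanity check that the loop actually ran.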
Here are some notes on why this generally works, and why doing less may not - the details are x86-centric, but similar concerns apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets will be under-represented, so not all existing lines can be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs like the LLC on recent Intel detect when a "streaming" pattern is present, and switch from LRU to MRU since LRU is about the worst-possible replacement policy for such a load. The effect is that no matter how many times you stream though memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just an LLC's worth of memory because (a) of the same reason you allocate more than the LLC size (virtual access vs. physical caching), (b) random access needs more accesses before you have a high likelihood of hitting every set enough times, and (c) caches are usually only pseudo-LRU, so you need more accesses than you'd expect under exact LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size2.
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be and replicate that.
1Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
2 I just throw that out there as if it was "easy", but in reality may be very difficult depending on your exact problem.

Setting or consulting boolean. Which has the best performance?

Just out of curiosity, which operation is fastest: setting a value to a boolean (e.g. changing it from true to false) or simply checking its value (e.g. if(boolean)...)?
The problem I have with "which is fastest" is that they are too underspecified to actually be answered conclusively, and at the same time too broad to yield useful conclusions even if answered conclusively. The only productive avenue your curiosity can take is to build a mental model of the machine and run both cases through it.
foo = true stores the value true to the location allocated to foo. That raises the question: Where and how is foo stored? This is impossible to answer without actually running the complete source code through the compiler with the right settings. It could be anywhere in RAM, or it could be in a register, or it could use no storage at all, being completely eliminated by compiler optimizations. Depending on where foo resides, the cost of overwriting it can vary: Hundreds of CPU cycles (if in RAM and not in cache), a couple cycles (in RAM and cache), one cycle (register), zero cycles (not stored).
if (foo) generally means reading foo and then performing a conditional branch based on it. Regarding the aspects I'll discuss here (I have to omit many details and some major categories), reading is effectively like writing. The cost of the conditional branch that follows is even less predictable, as it depends on the run-time behavior of the program. If the branch is always taken, branch prediction will make it virtually free (a few cycles). If it's unpredictable, it may take tens of cycles and blow the pipeline (further reducing throughput). However, it's also possible that the conditional code will be predicated, invalidating most of the above concerns and replacing them with reasoning about instruction latency, data dependencies, and the gory details of the pipeline.
As you can see from the sheer volume written about it (despite omitting many details and even some important secondary effects), it's virtually impossible to really answer this in any generality. You need to look at concrete, complete programs to make any sort of prediction, and you need to know your whole system from top to bottom. Note that I had to assume a very specific kind of machine to even get this far: for a GPGPU, an embedded system, or an early-90s consumer CPU, I'd have to rewrite almost all of that.
I think that this is so dependent on the context:
the CPU/architecture
whether the boolean is stored in memory, cache, or a register
whether what's behind the if induces a jump or only a conditional move
whether the value of the boolean is predictable or not
that the question ultimately doesn't make sense.

What is cache in C++ programming? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
Firstly I would like to tell that I come from a non-Computer Science background & I have been learning the C++ language.
I am unable to understand what exactly is a cache?
It has different meaning in different contexts.
I would like to know what would be called as a cache in a C++ program?
For example, if I have some int data in a file. If I read it & store in an int array, then would this mean that I have 'cached' the data?
To me this seems like common sense, since reading from a file is always slower than reading from RAM.
But I am a little confused due to this article.
In a CPU there can be several caches, to speed up instructions in
loops or to store often accessed data. These caches are small but very
fast. Reading data from cache memory is much faster than reading it
from RAM.
It says that reading data from cache is much faster than from RAM.
I thought RAM & cache were the same.
Can somebody please clear my confusion?
EDIT: I am updating the question because previously it was too broad.
My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are
simply used to cache information about a row in the file
What does cache in this context mean?
Any modern CPU has several layers of cache that are typically named things like L1, L2, L3 or even L4. This is called a multi-level cache. The lower the number, the faster the cache will be.
It's important to remember that the CPU runs at speeds that are significantly faster than the memory subsystem. It takes the CPU a tiny eternity to wait for something to be fetched from system memory, many, many clock-cycles elapse from the time the request is made to when the data is fetched, sent over the system bus, and received by the CPU.
There's no programming construct for dealing with caches, but if your code and data can fit neatly in the L1 cache, then it will be fastest. Next is if it can fit in the L2, and so on. If your code or data cannot fit at all, then you'll be at the mercy of the system memory, which can be orders of magnitude slower.
This is why counter-intuitive things like unrolling loops, which should be faster, might end up being slower because your code becomes too large to fit in cache. It's also why shaving a few bytes off a data structure could pay huge dividends even though the memory footprint barely changes. If it fits neatly in the cache, it will be faster.
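To illustrate the "shaving a few bytes" point: on a typical 64-bit ABI (an assumption; exact sizes depend on the platform), merely reordering struct members removes alignment padding, so four of the smaller structs fit in one 64-byte cache line where only two of the larger ones would.

```cpp
// Same three members, different order. The char/double/char layout
// forces 7 bytes of padding after each char to keep the double aligned.
struct Padded { char a; double b; char c; };  // typically 24 bytes
struct Packed { double b; char a; char c; };  // typically 16 bytes
```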
The only way to know if you have a performance problem related to caching is to benchmark very carefully. Remember each processor type has varying amounts of cache, so what might work well on your i7 CPU might be relatively terrible on an i5.
It's only in extremely performance sensitive applications that the cache really becomes something you worry about. For example, if you need to maintain a steady 60FPS frame rate in a game, you'll be looking at cache problems constantly. Every millisecond counts here. Likewise, anything that runs the CPU at 100% for extended periods of time, such as rendering video, will want to pay very close attention to how much they could gain from adjusting the code that's emitted.
You do have control over how your code is generated with compiler flags. Some will produce smaller code, some theoretically faster by unrolling loops and other tricks. To find the optimal setting can be a very time-consuming process. Likewise, you'll need to pay very careful attention to your data structures and how they're used.
[Cache] has different meaning in different contexts.
Bingo. Here are some definitions:
Cache
Verb
Definition: To place data in some location from which it can be more efficiently or reliably retrieved than its current location. For instance:
Copying a file to a local hard drive from some remote computer
Copying data into main memory from a file on a local hard drive
Copying a value into a variable when it is stored in some kind of container type in your procedural or object oriented program.
Examples: "I'm going to cache the value in main memory", "You should just cache that, it's expensive to look up"
Noun 1
Definition: A copy of data that is presumably more immediately accessible than the source data.
Examples: "Please keep that in your cache, don't hit our servers so much"
Noun 2
Definition: A fast-access memory region on the die of a processor; modern CPUs generally have several levels of cache. See CPU cache; note that GPUs and other types of processors will also have their own caches, with different implementation details.
Examples: "Consider keeping that data in an array so that accessing it sequentially will be cache-friendly"
My definition of a cache would be something that exists in limited quantity but is faster to access because there is less area to search. If you are talking about caching in a programming language, it means you are storing some information in a variable (a variable being nothing but a way to locate your data in memory). Here, memory means both RAM and the physical cache (CPU cache).
The physical/CPU cache is nothing but memory that is even faster than RAM; it stores copies of data in RAM that the CPU uses very often. There is a further categorisation after that as well: on-board cache (faster) and off-board cache. You can see this link.
I am updating the question because previously it was too broad. My
confusion started with this answer. He says
RowData and m_data are specific to my implementation,
but they are simply used to cache information about a row in the file
What does cache in this context mean?
This particular use means that RowData is held as a copy in memory, rather than reading (a little bit of) the row from a file every time we need some data from it. Reading from a file is a lot slower [1] than holding on to a copy of the data in our program's memory.
[1] Although in a modern OS, the actual data from the hard disk is probably held in memory, in the file-system cache, to avoid having to read the disk many times for the same data. However, the data still needs to be copied from the file-system cache to the application using it.
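A minimal illustration of that use of "cache" might look like the following. The class and member names here are hypothetical, not taken from the linked answer: the idea is simply to pay the slow file read once and serve later lookups from the in-memory copy.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

class RowCache {
    std::vector<std::string> rows_;  // the cached copy of the file's rows
public:
    explicit RowCache(const std::string& path) {
        std::ifstream in(path);
        for (std::string line; std::getline(in, line); )
            rows_.push_back(line);   // read the file exactly once
    }
    // Later accesses hit RAM, not the disk.
    const std::string& row(std::size_t i) const { return rows_.at(i); }
    std::size_t size() const { return rows_.size(); }
};
```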