Flushing the cache to prevent benchmarking fluctuations - C++

I am running someone else's C++ code to benchmark it on a dataset. The issue is that I often get one timing for the first run, and the number changes massively (e.g. from 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to the CPU's automatic caching. Is there a way to flush the cache, or to prevent these fluctuations somehow?

Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not from your user-mode code. And of course, the instructions are completely different on each processor architecture.
All current x86 processors do have a clflush instruction that flushes one cache line, but to use it you have to know the address of the data (or code) you want to flush. That is fine for small and simple data structures, not so good if you have a binary tree that is scattered all over the place. And of course, it's not at all portable.
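For completeness, here is a minimal user-space sketch of the clflush approach, assuming the data sits in one contiguous buffer whose address you know; the _mm_clflush intrinsic from <immintrin.h> wraps the instruction, and the 64-byte line size is an assumption that holds on current x86 CPUs:
#include <immintrin.h>
#include <cstddef>

// Flush every cache line covering [p, p + bytes) out of all cache levels.
void flush_buffer(const void* p, std::size_t bytes)
{
    const char* begin = static_cast<const char*>(p);
    const char* end   = begin + bytes;
    for (const char* line = begin; line < end; line += 64)
        _mm_clflush(line);
    _mm_mfence();   // make the flushes globally visible before you start timing
}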
In most environments, reading and writing a large block of unrelated data does the job, e.g. something like:
#include <cstdlib>   // for rand()

// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" the cache.
for (size_t i = 0; i < bigger_than_cachesize; i++)
{
    p[i] = rand();
}
Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.
The above won't flush instruction caches - that is a lot more difficult to do; basically, you have to run some (large enough) other piece of code to do that reliably. However, instruction caches tend to have less effect on overall benchmark performance (not that the instruction cache is unimportant - it is EXTREMELY important for a modern processor's performance - but in the sense that the code for a benchmark is typically small enough to fit entirely in cache, and the benchmark runs over the same code many times, so only the first iteration is slower).
Other ideas
Another way to simulate "non-cached" behaviour is to allocate a new memory area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark, or using an array containing the input data and output results, such that each run has its own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...
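A minimal sketch of that hot-run approach, where run_benchmark() is a hypothetical stand-in for whatever workload you are measuring: run it once to warm things up, then time several further runs and report those.
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the workload being measured.
void run_benchmark()
{
    static std::vector<int> data(10000000, 1);
    long long sum = 0;
    for (int v : data) sum += v;
    volatile long long sink = sum;   // keep the work from being optimized away
    (void)sink;
}

int main()
{
    using clock = std::chrono::steady_clock;

    run_benchmark();   // cold run: warms caches, page tables, branch predictors; not timed

    for (int pass = 0; pass < 5; ++pass) {
        auto t0 = clock::now();
        run_benchmark();
        auto t1 = clock::now();
        std::printf("hot run %d: %.3f s\n", pass,
                    std::chrono::duration<double>(t1 - t0).count());
    }
}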

Here's my basic approach (a rough code sketch follows the numbered steps):
1. Allocate a memory region 2x the size of the LLC if you can determine the LLC size dynamically (or know it statically), or, if you can't, some reasonable multiple of the largest LLC size on the platform of interest1.
2. memset the memory region to some non-zero value: 1 will do just fine.
3. "Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
4. Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to step 3).
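A rough sketch of those four steps, with the caveat that the 64 MiB region size and the 64-byte line size are assumptions you should adjust for your own LLC:
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

volatile std::uintptr_t g_sink_addr;   // step 3: somewhere to "sink" the pointer
volatile std::uint64_t  g_sink_sum;    // step 4: somewhere to sink the read results

void flush_llc()
{
    const std::size_t region_bytes = 64 * 1024 * 1024;   // assumed >= 2x the LLC size
    const std::size_t line_bytes   = 64;
    static std::vector<std::uint8_t> buf(region_bytes);  // step 1: allocate the region

    std::memset(buf.data(), 1, buf.size());              // step 2: touch every page with non-zero data

    g_sink_addr = reinterpret_cast<std::uintptr_t>(buf.data());   // step 3: defeat the optimizer

    // Step 4: random reads until each line has been touched ~10 times on average.
    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<std::size_t> pick_line(0, region_bytes / line_bytes - 1);
    std::uint64_t sum = 0;
    const std::size_t touches = 10 * (region_bytes / line_bytes);
    for (std::size_t i = 0; i < touches; ++i)
        sum += buf[pick_line(rng) * line_bytes];
    g_sink_sum = sum;
}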
Here are some notes on why this generally works and why doing less may not work - the details are x86-centric, but similar concerns will apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets will be under-represented, so not all existing values can even be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs, like the LLC on recent Intel, detect when a "streaming" pattern is present and switch from LRU to MRU, since LRU is about the worst possible replacement policy for such a load. The effect is that no matter how many times you stream through memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just an LLC's worth of memory for (a) the same reason you allocate more than the LLC size (virtual access vs. physical caching), (b) because random access needs more accesses before you have a high likelihood of hitting every set enough times, and (c) because caches are usually only pseudo-LRU, so you need more than the number of accesses you'd expect under exact LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size2.
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be and replicate that.
1 Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
2 I just throw that out there as if it were "easy", but in reality it may be very difficult depending on your exact problem.

Related

Why are the relative performance results in Google Benchmark completely different from raw loops? [duplicate]

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething();
    drawSomething();
    doSomething2();
    sendSomething();
}
The main loop runs more than 60 times per second.
I want to see a performance breakdown of how much time each procedure takes.
My concern is that if I print the time interval at every entry and exit of each procedure, it would incur a huge performance overhead.
I am curious what the idiomatic way of measuring the performance is. Is printing or logging good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that; for very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
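A minimal sketch of that idea, with a hypothetical stand-in for drawSomething() (the other procedures from the question would be timed the same way): time each call with std::chrono, push the result into a pre-reserved vector, and only print after the loop.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for one of the procedures in the question's main loop.
void drawSomething()
{
    volatile double work = 0;
    for (int i = 0; i < 100000; ++i) work = work + i * 0.5;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const int frames = 600;                       // e.g. ~10 seconds of a 60 Hz loop
    std::vector<clock::duration> draw_times;
    draw_times.reserve(frames);                   // no allocation inside the timed loop

    for (int frame = 0; frame < frames; ++frame) {
        auto t0 = clock::now();
        drawSomething();
        draw_times.push_back(clock::now() - t0);
    }

    // Heavy-weight printing happens only after the loop, outside the measurement.
    for (std::size_t i = 0; i < draw_times.size(); ++i)
        std::printf("frame %zu: %lld us\n", i,
            static_cast<long long>(
                std::chrono::duration_cast<std::chrono::microseconds>(draw_times[i]).count()));
}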
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has a latency from inputs to outputs, and a separate (higher) throughput when run repeatedly with independent inputs. e.g. an add instruction on a Skylake CPU has 4/clock throughput but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
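For instance, a sketch of that volatile-sink trick; compute_into() and the array size are made up for illustration, and the point is that the compiler has to materialize the whole output array because it cannot know which element argc selects.
#include <cstddef>

// Hypothetical function under test: fills an output array.
void compute_into(double* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = i * 1.000001;
}

int main(int argc, char**)
{
    static double output[4096];

    for (int rep = 0; rep < 100000; ++rep)        // repeat loop for timing
        compute_into(output, 4096);

    // The value of argc is unknown at compile time, so the array can't be
    // dead-code-eliminated; one volatile read is enough to keep it all alive.
    volatile double sink = output[argc];
    (void)sink;
}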
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

Shorter loop, same coverage, why do I get more Last Level Cache Misses in c++ with Visual Studio 2013?

I'm trying to understand what creates cache misses and, eventually, how much they cost in terms of performance in our application. But with the tests I'm doing now, I'm quite confused.
Assuming that my L3 cache is 4MB, and my LineSize is 64 bytes, I would expect that this loop (loop 1):
int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3); ++i )
{
++aArr[i];
}
...and this loop (loop 2):
int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3 / 64u); ++i )
{
++aArr[i * 64];
}
would give roughly the same number of Last Level Cache Misses, but a different number of Inclusive Last Level Cache References.
However the numbers that the profiler of Visual Studio 2013 gives me are unsettling.
With loop 1:
Inclusive Last Level Cache References: 53,000
Last Level Cache Misses: 17,000
With loop 2:
Inclusive Last Level Cache References: 69,000
Last Level Cache Misses: 35,000
I have tested this with a dynamically allocated array, and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Why don't I get the same amount of cache misses, and why do I have more references in a shorter loop?
Incrementing every byte of int8_t aArr[SIZE_L3]; separately is slow enough that hardware prefetchers are probably able to keep up pretty well a lot of the time. Out-of-order execution can keep a lot of read-modify-writes in flight at once to different addresses, but the best-case is still one byte per clock of stores. (Bottleneck on store-port uops, assuming this was a single-threaded test on a system without a lot of other demands for memory bandwidth).
Intel CPUs have their main prefetch logic in the L2 cache (as described in Intel's optimization guide; see the x86 tag wiki). So a successful hardware prefetch into L2 cache before the core issues a load means that the L3 cache never sees a miss.
John McCalpin's answer on this Intel forum thread confirms that L2 hardware prefetches are NOT counted as LLC references or misses by the normal perf events like MEM_LOAD_UOPS_RETIRED.LLC_MISS. Apparently there are OFFCORE_RESPONSE events you can look at.
IvyBridge introduced next-page HW prefetch. Intel Microarchitectures before that don't cross page boundaries when prefetching, so you still get misses every 4k. And maybe TLB misses if the OS didn't opportunistically put your memory in a 2MiB hugepage. (But speculative page-walks as you approach a page boundary can probably avoid much delay there, and hardware definitely does do speculative page walks).
With a stride of 64 bytes, execution can touch memory much faster than the cache / memory hierarchy can keep up. You bottleneck on L3 / main memory. Out-of-order execution can keep about the same number of read/modify/write ops in flight at once, but the same out-of-order window covers 64x more memory.
Explaining the exact numbers in more detail
For array sizes right around L3, IvyBridge's adaptive replacement policy probably makes a significant difference.
Until we know the exact uarch, and more details of the test, I can't say. It's not clear if you only ran that loop once, or if you had an outer repeat loop and those miss / reference numbers are an average per iteration.
If it's only from a single run, that's a tiny noisy sample. I assume it was somewhat repeatable, but I'm surprised the L3 references count was so high for the every-byte version. 4 * 1024^2 / 64 = 65536, so there was still an L3 reference for most of the cache lines you touched.
Of course, if you didn't have a repeat loop, and those counts include everything the code did besides the loop, maybe most of those counts came from startup / cleanup overhead in your program. (i.e. your program with the loop commented out might have 48k L3 references, IDK.)
I have tested this with a dynamically allocated array
Totally unsurprising, since it's still contiguous.
and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Did this test use a larger array? Or did you use a 4MiB array on a CPU with an 8MiB L3?
Your assumption that "If I skip over more elements in the array, making for fewer iterations of the loop and fewer array accesses, that I should have fewer cache misses" seems to be ignoring the way that data gets fetched into the cache.
When you access memory, more data is kept in the cache than just the specific data you accessed. If I access intArray[0], then intArray[1] and intArray[2] are likely going to be fetched as well at the same time. This is one of the optimizations that allows the cache to help us work faster. So if I access those three memory locations in a row, it's sort of like having only 1 memory read that you need to wait for.
If you increase the stride, instead accessing intArray[0], then intArray[100] and intArray[200], the data may require 3 separate reads because the second and third memory accesses might not be in cache, resulting in a cache miss.
All of the exact details of your specific problem depend on your computer architecture. I would assume you are running an intel x86-based architecture, but when we are talking about hardware at this low of a level I should not assume (I think you can get Visual Studio to run on other architectures, can't you?); and I don't remember all of the specifics for that architecture anyway.
Because you generally don't know what exactly the caching system will be like on the hardware your software is run on, and it can change over time, it is usually better to just read up on caching principles in general and try to write in general code that is likely to produce fewer misses. Trying to make the code perfect on the specific machine you're developing on is usually a waste of time. The exceptions to this are for certain embedded control systems and other types of low-level systems which are not likely to change on you; unless this describes your work I suggest you just read some good articles or books about computer caches.

setting and checking a 32 bit variable

I am wondering whether setting a 32-bit variable only after checking it will be faster than just setting it unconditionally. E.g. variable a is of type uint32:
if( a != 0)
{
a = 0;
}
or
a = 0;
The code will be running in a loop that runs many times, so I want to reduce the time it takes.
Note that variable a will be 0 most of the time, so the question can possibly be shortened to whether it is faster to check a 32-bit variable or to set it. Thank you in advance!
edit: Thank you all who commented on the question. I have created a for loop and tested both assigning and if-ing 100 thousand times. It turns out assigning is faster (54 ms for if-ing and 44 ms for assigning).
What you describe is called a "silent store" optimization.
PRO: unnecessary stores are avoided.
This can reduce pressure on the store to load forwarding buffers, a component of a modern out-of-order CPU that is quite expensive in hardware, and, as a result, is often undersized, and therefore a performance bottleneck. On Intel x86 CPUs there are performance Event Monitoring counters (EMON) that you can use to investigate whether this is a problem in your program.
Interestingly, it can also reduce the number of loads that your program does. First, SW: if the stores are not eliminated, the compiler may be unable to prove that they do not write to the memory occupied by a different variable (the so-called address and pointer disambiguation problem), so the compiler may generate unnecessary reloads of such possibly-but-not-actually conflicting memory locations. Eliminate the stores, and some of these loads may also be eliminated. Second, HW: most modern CPUs have store-to-load dependency predictors, and fewer stores increase their accuracy. If a dependency is predicted, the load may actually not be performed by hardware, and may be converted into a register-to-register move. This was the subject of the recent patent lawsuits that the University of Wisconsin asserted against Intel and Apple, with awards exceeding hundreds of millions of dollars.
But the most important reason to eliminate the unnecessary stores is to avoid unnecessarily dirtying the cache. A dirty cache line eventually has to be written back to memory even if its contents did not actually change, which wastes power. In many systems it will eventually be written to flash or SSD, wasting more power and consuming the limited write cycles of the device.
These considerations have motivated academic research in silent stores, such as http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8947&rep=rep1&type=pdf. However, a quick Google Scholar search shows these papers are mainly from 2000-2004, and I am aware of no modern CPUs implementing true silent store elimination - actually having the hardware read the old value. I suspect, however, that this lack of deployment of silent stores is mainly because CPU design went on pause for more than a decade, as focus changed from desktop PCs to cell phones. Now that cell phones have almost caught up to the sophistication of 2000-era desktop CPUs, it may arise again.
CON: Eliminating the silent store in software takes more instructions. Worse, it takes a branch. If the branch is not very predictable, the resulting branch mispredictions will consume any savings. Some machines have instructions that allow you to eliminate such stores without a branch: e.g. Intel's LRBNI vector store instructions with a conditional vector mask; I believe AVX has similar masked-store instructions. If you or your compiler can use such instructions, then the cost is just the load of the old value and a vector compare - or, if the old value is already in a register, just the compare.
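As a hedged illustration of the branch-free idea on current x86, here is a sketch using AVX2 masked stores (assuming 32-bit elements and a length that is a multiple of 8; whether an all-zero mask really avoids dirtying the line on your particular CPU is something to verify with performance counters):
#include <immintrin.h>
#include <cstddef>

// Zero the non-zero elements of a[0..n); lanes that are already zero are not stored to.
// n is assumed to be a multiple of 8 for brevity.
void clear_nonzero_avx2(int* a, std::size_t n)
{
    const __m256i zero    = _mm256_setzero_si256();
    const __m256i all_one = _mm256_set1_epi32(-1);
    for (std::size_t i = 0; i < n; i += 8) {
        __m256i v       = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i is_zero = _mm256_cmpeq_epi32(v, zero);
        __m256i mask    = _mm256_xor_si256(is_zero, all_one);   // lanes where v != 0
        _mm256_maskstore_epi32(a + i, mask, zero);               // store only those lanes
    }
}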
By the way, you can get some benefit without completely eliminating the store, by redirecting it to a safe address instead. Instead of
if a[i] != 0 then a[i] := 0
do
ptr = &a[i]; if *ptr == 0 then ptr := &safe; *ptr := 0
You are still doing the store, but not dirtying so many cache lines. I have used this way of faking a conditional store instruction a lot. It is very unlikely that a compiler will do this sort of optimization for you.
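In C++ that redirection trick might look like the sketch below; safe_sink is a made-up dummy word, the store itself is unconditional, and the pointer select will often compile to a conditional move rather than a branch:
#include <cstddef>
#include <cstdint>

static std::uint32_t safe_sink;   // dummy target that absorbs the would-be silent stores

void clear_redirected(std::uint32_t* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t* p = &a[i];
        if (*p == 0)              // element already zero: aim the store elsewhere...
            p = &safe_sink;
        *p = 0;                   // ...so the line holding a[i] is not dirtied
    }
}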
So, unfortunately, the answer is "it depends". If you are on a vector-mask machine or a GPU and silent stores are very common - say, more than 30% - it is worth thinking about. In scalar code you probably need more like 90% of stores to be silent.
Ideally, measure it yourself. Although it can be hard to make realistic measurements.
I would start with what is probably the best case for this optimization:
static char a[1024*1024*1024]; // zero filled (static storage is zero-initialized)
const int cachelinesize = 64;
for (char* p = a; p < a + sizeof(a); p += cachelinesize)
    if (*p != 0)
        *p = 0;
Every store is eliminated here - make sure that the compiler still emits them. Good branch prediction, etc.
If this limit case shows no benefit, your realistic code is unlikely to.
Come to think of it, I ran such a benchmark back in the last century. The silent store code was 2x faster, since it was totally memory bound and the silent stores generate no dirty cache lines on a write-back cache. Recheck this, and then try a more realistic workload.
But first, measure whether you are memory bottlenecked or not.
By the way: if hardware implementations of silent store elimination become common, then you will never want to do it in software.
But at the moment I am aware of no hardware implementations of silent store elimination in commercially available CPUs.
As ECC becomes more common, silent store elimination becomes almost free - since you have to read the old bytes anyway to recalculate ECC in many cases.
The plain assignment would serve you better: the if statement is redundant, and omitting it makes the code clearer. The assignment alone should also be faster; and if you are not sure, you can simply write a small test function and time it with and without the if statement.

Alignment of data members and member functions for performance

Is it true that aligning data members of a struct/class no longer yields the benefits it used to, especially on Nehalem, because of hardware improvements? If so, is it still the case that alignment always gives better performance, just with much smaller, barely noticeable improvements compared with past CPUs?
Does alignment of member variables extend to member functions? I believe I once read (it could be on the wikibooks "C++ performance") that there are rules for "packing" member functions into various "units" (i.e. source files) for optimum loading into the instruction cache? (If I have got my terminology wrong here please correct me).
Processors are still much faster than what the RAM can deliver, so they still need caches. Caches still consist of fixed-size cache lines. Also, main memory is delivered in pages and pages are accessed using a translation lookaside buffer. This buffer, again, has a fixed size cache.
Which means that both spatial and temporal locality matter a lot (i.e. how you pack stuff, and how you access it). Packing structures well (sorted by padding/alignment requirements) as opposed to packing them in some haphazard order usually results in smaller structure sizes.
Smaller structure sizes mean, if you have loads of data:
more structures fit into one cache line (cache miss = 50-200 cycles)
fewer pages are needed (page fault = 10-20 million CPU cycles)
fewer TLB entries are needed, fewer TLB misses (TLB miss = 50-500 cycles)
Going linearly over a few gigabytes of tightly packed SoA data can be 3 orders of magnitude faster (or 8-10 orders of magnitude, if page faults are involved) than doing the same thing in a naive way with bad layout/packing.
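A quick illustration of the packing point; the sizes in the comments are what you would typically see with 8-byte doubles and a common 64-bit ABI, but the exact numbers depend on the platform:
#include <cstdint>

// Haphazard member order: padding is inserted after each small member.
struct Haphazard {
    char          tag;     // 1 byte  + 7 bytes padding (to align the double)
    double        value;   // 8 bytes
    std::uint16_t id;      // 2 bytes + 6 bytes tail padding
};                          // typically sizeof(Haphazard) == 24

// Same members sorted by alignment requirement: less padding, smaller struct.
struct Sorted {
    double        value;   // 8 bytes
    std::uint16_t id;      // 2 bytes
    char          tag;     // 1 byte  + 5 bytes tail padding
};                          // typically sizeof(Sorted) == 16
At 16 bytes, four objects fit in a 64-byte cache line; at 24 bytes, you only get two whole ones.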
Whether or not you hand-align individual 4-byte or 2-byte values (say, a typical int or short) to 2 or 4 bytes makes a very small difference on recent Intel CPUs (hardly noticeable). Insofar, it may seem tempting to "optimize" on that, but I strongly advise against doing so.
This is usually something one best doesn't worry about and leaves to the compiler to figure out. If for no other reason, then because the gains are marginal at best, but some other processor architectures will raise an exception if you get it wrong. Therefore, if you try to be too smart, you'll suddenly have unexplainable crashes once you compile on some other architecture. When that happens, you'll feel sorry.
Of course, if you don't have at least several dozen megabytes of data to process, you need not care at all.
Aligning data to suit the processor will never hurt, but some processors will have more notable drawbacks for unaligned data than others - I think that is the best way to answer this question.
Aligning functions into cache-line units seems a bit of a red herring to me. For small functions, what you really want is inlining if at all possible. If the code can't be inlined, then it's probably larger than a cache-line anyway [unless it's a virtual function, of course]. I don't think this has ever been a huge factor though - either code is generally called often, and thus normally in the cache, or it's not called very often, and rarely in the cache. I'm sure it's possible to come up with some code where calling one function, func1(), will also drag func2() into the cache, so if you always call func1() and func2() in short succession, it would have some benefit. But it's really not that great a benefit unless you have a lot of functions with pairs or groups of functions that are called close together. [By the way, I don't think the compiler is guaranteed to place your function code in any particular order, no matter which order you place it in the source file.]
Cache-alignment is a slightly different matter, since cache-lines can still have a HUGE effect if you get it right vs. getting it wrong. This is more important for multithreading than for general "loading data". The key here is to avoid sharing data in the same cache-line between processors. In a project I worked on some 10 or so years ago, a benchmark had a function that used an array of two integers to count up the number of iterations each thread did. When that got split into two separate cache-lines, the benchmark improved from 0.6x the speed of a single processor to 1.98x the speed of one processor. The same effect will happen on modern CPUs, even though they are much faster - the effect may not be exactly the same, but it will be a large slowdown (and the more processors sharing the data, the bigger the effect, so a quad-core system would be worse than a dual-core, etc). This is because every time a processor updates something in a cache-line, all other processors that have read that cache-line must reload it from the processor that updated it [or from memory in the old days].
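A sketch of the fix described above, assuming 64-byte cache lines: give each thread's counter its own line (via alignas) instead of packing the counters into one small array.
#include <atomic>
#include <thread>

// Each counter occupies its own 64-byte cache line, so two threads incrementing
// different counters never invalidate each other's line (no false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main()
{
    PaddedCounter counters[2];

    auto worker = [&counters](int idx) {
        for (long i = 0; i < 50000000; ++i)
            counters[idx].value.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
}
With the two counters packed into a plain long[2], they would share a line and every increment by one thread would force the other core to re-fetch it.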

Is it possible to lock some data in CPU cache?

I have a problem....
I'm writing data into an array in a while-loop, and I'm doing it really frequently. It seems that this writing is now a bottleneck in the code, so I presume it's caused by the writes to memory. The array is not really large (something like 300 elements). Is it possible to store it in the cache and update it in memory only after the while-loop is finished?
[edit - copied from an answer added by Alex]
// nm0, nm1, difX and max_steps are declared elsewhere in the real code.
double* array1 = new double[1000000];           // this array has 1000000 elements
unsigned long* array2 = new unsigned long[300];
double varX, t, sum = 0;
int iter = 0, i = 0;
while (i <= max_steps)
{
    varX += difX;
    nm0 = int(varX);
    if (nm1 != nm0)
    {
        array2[iter] = nm0;   // if you comment out this line the application runs more than 2 times faster :)
        nm1 = nm0;
        t = array1[nm0];      // if you comment out this line there is almost no change in time
        ++iter;
    }
    sum += t;
    ++i;
}
Not intentionally, no. Among other things, you have no idea how big the cache is, so you have no idea of what's going to fit. Furthermore, if the app were allowed to lock off part of the cache, the effects on the OS might be devastating to overall system performance. This falls squarely onto my list of "you can't do it because you shouldn't do it. Ever."
What you can do is to improve your locality of reference - try to arrange the loop such that you don't access the elements more than once, and try to access them in order in memory.
Without more clues about your application, I don't think more specific advice can be given.
The CPU does not usually offer fine-grained cache control: you're not allowed to choose what is evicted, or to pin things in the cache. You do have a few cache operations on some CPUs, though. Just as a bit of info on what you can do, here are some interesting cache-related instructions on newer x86-64 CPUs (doing things like this makes portability hell, but I figured you may be curious):
Software Data Prefetch
The non-temporal instruction is prefetchnta, which fetches the data into the second-level cache, minimizing cache pollution.
The temporal instructions are as follows:
* prefetcht0 – fetches the data into all cache levels, that is, to the second-level cache for the Pentium® 4 processor.
* prefetcht1 – identical to prefetcht0
* prefetcht2 – identical to prefetcht0
Additionally there are a set of instructions for accessing data in memory but explicitly tell the processor to not insert the data into the cache. These are called non-temporal instructions. An example of one is here: MOVNTI.
You could use the non-temporal instructions for every piece of data you DON'T want in cache, in the hope that the rest will always stay in cache. I don't know if this would actually improve performance, as there are subtle behaviors to be aware of when it comes to the cache. It also sounds like it'd be relatively painful to do.
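As a rough sketch, MOVNTI is exposed through the _mm_stream_si32 intrinsic (SSE2); this fills an array while hinting that its lines should not be kept in the cache - whether it actually helps depends entirely on whether you really never touch that data again soon:
#include <immintrin.h>
#include <cstddef>

// Write value into dst[0..n) with non-temporal (cache-bypassing) stores.
void fill_nontemporal(int* dst, std::size_t n, int value)
{
    for (std::size_t i = 0; i < n; ++i)
        _mm_stream_si32(dst + i, value);   // MOVNTI
    _mm_sfence();   // make the streaming stores visible before any later reads
}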
I have a problem.... I'm writing data into an array in a while-loop, and I'm doing it really frequently. It seems that this writing is now a bottleneck in the code, so I presume it's caused by the writes to memory. The array is not really large (something like 300 elements). Is it possible to store it in the cache and update it in memory only after the while-loop is finished?
You don't need to. The only reason why it might get pushed out of the cache is if some other data is deemed more urgent to put in the cache.
Apart from this, an array of 300 elements should fit in the cache with no trouble (assuming the element size isn't too crazy), so most likely your data is already in the cache.
In any case, the most effective solution is probably to tweak your code. Use lots of temporaries (to indicate to the compiler that the memory address isn't important), rather than writing/reading into the array constantly. Reorder your code so loads are performed once, at the start of the loop, and break up dependency chains as much as possible.
Manually unrolling the loop gives you more flexibility to achieve these things.
And finally, two obvious tools you should use rather than guessing about the cache behavior:
A profiler, and cachegrind if available. A good profiler can tell you a lot of statistics on cache misses, and cachegrind give you a lot of information too.
Us here at StackOverflow. If you post your loop code and ask how its performance can be improved, I'm sure a lot of us will find it a fun challenge.
But as others have mentioned, when working with performance, don't guess. You need hard data and measurements, not hunches and gut feelings.
Unless your code does something completely different in between writing to the array, then most of the array will probably be kept in the cache.
Unfortunately there isn't anything you can do to affect what is in the cache, apart from rewriting your algorithm with the cache in mind. Try to use as little memory as possible in between writes to the array: don't use lots of variables, don't call many other functions, and try to write to the same region of the array consecutively.
I doubt that this is possible, at least on a high-level multitasking operating system. You can't guarantee that your process won't be pre-empted and lose the CPU. If your process then owns the cache, other processes can't use it, which would make their execution very slow and complicate things a great deal. You really don't want to run a modern several-GHz processor without cache just because one application has locked all the others out of it.
In this case, array2 will be quite "hot" and stay in cache for that reason alone. The trick is keeping array1 out of cache (!). You're reading it only once, so there is no point in caching it. The SSE2 instruction for that is MOVNTPD, with the intrinsic void _mm_stream_pd(double *destination, __m128d source).
Even if you could, it's a bad idea.
Modern desktop computers use multi-core CPUs. Intel's chips are the most common chips in desktop machines... but the Core and Core 2 processors don't share an on-die cache.
That is, they didn't share a cache until the Core i7 chips were released, which share an on-die 8MB L3 cache.
So, if you were able to lock data in the cache on the computer I'm typing this from, you can't even guarantee this process will be scheduled on the same core, so that cache lock may be totally useless.
If your writes are slow, make sure that no other CPU core is writing in the same memory area at the same time.
When you have a performance problem, don't assume anything, measure first. For example, comment out the writes, and see if the performance is any different.
If you are writing to an array of structures, use a structure pointer to cache the address of the structure so you are not doing the array multiply each time you do an access. Make sure you are using the native word length for the array indexer variable for maximum optimisation.
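For instance (the Sample struct is made up, and an optimizing compiler will usually do this transformation for you anyway):
struct Sample { double x, y, z; };

void scale_all(Sample* samples, int n, double k)
{
    for (int i = 0; i < n; ++i) {
        Sample* s = &samples[i];   // compute the element's address once per iteration...
        s->x *= k;                 // ...and reuse it for every member access,
        s->y *= k;                 // instead of re-indexing samples[i] each time
        s->z *= k;
    }
}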
As other people have said, you can't control this directly, but changing your code may indirectly enable better caching. If you're running on linux and want to get more insight into what's happening with the CPU cache when your program runs, you can use the Cachegrind tool, part of the Valgrind suite. It's a simulation of a processor, so it's not totally realistic, but it gives you information that is hard to get any other way.
It might be possible to use some assembly code, or as onebyone pointed out, assembly intrinsics, to prefetch lines of memory into the cache, but that would cost a lot of time to tinker with it.
Just for trial, try to read in all the data (in a manner that the compiler won't optimize away), and then do the write. See if that helps.
In the early boot phases of CoreBoot (formerly LinuxBIOS), since they have no access to RAM yet (we are talking about BIOS code, and therefore the RAM hasn't been initialized yet), they set up something they call Cache-as-RAM (CAR), i.e. they use the processor cache as RAM even though it is not backed by actual RAM.