Speed of C program execution - c++

I got one problem at my exam for subject Principal of Programming Language. I thought for long time but i still did not understand the problem
Problem:
Below is a program C, that is executed in MSVC++ 6.0 environment on a PC with configuration ~ CPU Intel 1.8GHz, Ram 512MB
#define M 10000
#define N 5000
int a[M][N];
void main() {
int i, j;
time_t start, stop;
// Part A
start = time(0);
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
a[i][j] = 0;
stop = time(0);
printf("%d\n", stop - start);
// Part B
start = time(0);
for (j = 0; j < N; j++)
for (i = 0; i < M; i++)
a[i][j] = 0;
stop = time(0);
printf("%d\n", stop - start);
}
Explain why does part A only execute in 1s, but it took part B 8s to finish?

This has to do with how the array's memory is laid out and how it gets loaded into the cache and accessed: in version A, when accessing a cell of the array, the neighbors get loaded with it into the cache, and the code then immediately accesses those neighbors. In version B, one cell is accessed (and its neighbors loaded into the cache), but the next access is far away, on the next row, and so the whole cache line was loaded but only one value used, and another cache line must be filled for each access. Hence the speed difference.

Row-major order versus column-major order.
Recall first that all multi-dimensional arrays are represented in memory as a continguous block of memory. Thus the multidimensional array A(m,n) might be represented in memory as
a00 a01 a02 ... a0n a10 a11 a12 ... a1n a20 ... amn
In the first loop, you run through this block of memory sequentially. Thus, you run through the array traversing the elements in the following order
a00 a01 a02 ... a0n a10 a11 a12 ... a1n a20 ... amn
1 2 3 n n+1 n+2 n+3 ... 2n 2n+1 mn
In the second loop, you skip around in memory and run through the array traversing the elements in the following order
a00 a10 a20 ... am0 a01 a11 a21 ... am1 a02 ... amn
or, perhaps more clearly,
a00 a01 a02 ... a10 a11 a12 ... a20 ... amn
1 m+1 2m+1 2 m+2 2m+2 3 mn
All that skipping around really hurts you because you don't gain advantages from caching. When you run through the array sequentially, neighboring elements are loaded into the cache. When you skip around through the array, you don't get these benefits and instead keep getting cache misses harming performance.

Because of hardware architectural optimizations. Part A is executing operations on sequential memory addresses, which allows the hardware to substantially accelerate how the calculations are handled. Part B is basically jumping around in memory all the time, which defeats a lot of hardware optimizations.
Key concept for this specific case is processor cache.

The array you are declaring is laid out line-wise in memory. Basically you have a large block of M×N integers and C does a little trickery to make you believe that it's rectangular. But in reality it's flat.
So when you iterate through it line-wise (with M as the outer loop variable) then you are really going linearly through memory. Something which the CPU cache handles very well.
However, when you iterate with N in the outer loop, then you are always making more or less random jumps in memory (at least for the hardware it looks like that). You are accessing the first cell, then move M integers further and do the same, etc. Since your pages in memory are usually around 4 KiB large, this causes another page to be accessed for each iteration of the inner loop. That way nearly any caching strategy fails and you see a major slowdown.

The trouble is here, how your array is layed in memory.
In the computer memory, arrays are normally allocated such as, that first all columns of the first row are comming, then of the second row and so on.
Your computer memory is best viewed as a long stripe of bytes -- it is a one dimensional array of memory -- not two dimensional, so multi-dimensional arrays have to be allocated in the described way.
Now comes a further problem: Modern CPUs have caches. They have multiple caches and they have so called "cache-lines" for the first-level cache. What does this mean. Access to memory is fast, but not fast enough. Modern CPUs are much faster. So they have their on-chip caches which speed things up. Also they don't access single memory locations any more, but they fill one complete cache-line in one fetch. This also is for performance. But this behaviour gives all operations advantages which process data linearly. When you access first all columns in a row, then the next row and so on -- you are working linearly in fact. When you first process all first columns of all rows, you "jump" in the memory. Thus you always force a new cache-line to be filled, just few bytes can be processed, then the cache-line is possibly invalidated by your next jump ....
Thus column-major-order is bad for modern processors, since it does not work linearly.

Related

Access efficiency of C++ 2D array

I have a 2D array a1[10000][100] with 10000 rows and 100 columns, and also a 2D array a2[100][10000] which is the transposed matrix of a1.
Now I need to access 2 columns (eg. the 21th and the 71th columns) of a1 in the order of a1[0][20], a1[0][70], a1[1][20], a1[1][70], ..., a1[9999][20], a1[9999][70]. Or I can also access a2 to achive the same goal (the order: a2[20][0], a2[70][0], a2[20][1], a2[70][1], ..., a2[20][9999], a2[70][9999]). But the latter is much faster than the former. The related code is simplified as follows (size1 = 10000):
1 sum1 = 0;
2 for (i = 0; i < size1; ++i) {
3 x = a1[i][20];
4 y = a1[i][70];
5 sum1 = x + y;
6 } // this loop is slower
7 a_sum1[i] = sum1;
8
9 sum2 = 0;
10 for (i = 0; i < size1; ++i) {
11 x = a2[20][i];
12 y = a2[70][i];
14 sum2 = x + y;
15 } // this loop is faster
16 a_sum2[i] = sum2;
Accessing more rows (I have also tried 3, 4 rows rather than 2 rows in the example above) of a2 is also faster than accessing the same number of columns of a1. Of course I can also replace Lines 3-5 (or Lines 11-14) with a loop (by using an extra array to store the column/row indexes to be accessed), it also gets the same result that the latter is faster than the former.
Why is the latter much faster than the former? I know something about cache lines but I have no idea of the reason for this case. Thanks.
You can benefit from the memory cache if you access addresses within the same cache line in a short amount of time. The explanation below assumes your arrays contain 4-byte integers.
In your first loop, your two memory accesses in the loop are 50*4 bytes apart, and the next iteration jumps forward 400 bytes. Every memory access here is a cache miss.
In the second loop, you still have two memory accesses that are 50*400 bytes apart, but on the next loop iteration you access addresses that are right next to the previously fetched value. Assuming the common 64-byte cache line size, you only have two cache misses every 16 iterations of the loop, the rest can be served from two cache lines loaded at the start of such a cycle.
This is because C++ has a row-major order (https://en.wikipedia.org/wiki/Row-_and_column-major_order). You should avoid column-major access in C/C++ (https://www.appentra.com/knowledge/checks/pwr010/).
The reason is that the elements are stored by rows and the access by rows allows to better use cache lines, vectorization and other hardware features/techniques.
The reason is cache locality.
a2[20][0], a2[20][1], a2[20][2] ... are stored in memory next to each other. And a1[0][20], a1[1][20], a1[2][20] ... aren't (the same applies to a2[70][0], a2[70][1], a2[70][2] ...).
That means that accessing a1[0][20], a1[1][20], a1[2][20] would waste DRAM bandwidth, as it would use only 4 or 8 bytes of each 64-byte cache line loaded from DRAM.

How to speed up this histogram of LUT lookups?

First, I have an array int a[1000][1000]. All these integers are between 0 and 32767 ,and they are known constants: they never change during a run of the program.
Second, I have an array b[32768], which contains integers between 0 and 32. I use this array to map all arrays in a to 32 bins:
int bins[32]{};
for (auto e : a[i])//mapping a[i] to 32 bins.
bins[b[e]]++;
Each time, array b will be initialized with a new array, and I need to hash all those 1000 arrays in array a (each contains 1000 ints) to 1000 arrays each contains 32 ints represents for how many ints fall into its each bin .
int new_array[32768] = {some new mapping};
copy(begin(new_array), end(new_array), begin(b));//reload array b;
int bins[1000][32]{};//output array to store results .
for (int i = 0; i < 1000;i++)
for (auto e : a[i])//hashing a[i] to 32 bins.
bins[i][b[e]]++;
I can map 1000*1000 values in 0.00237 seconds. Is there any other way that I can speed up my code? (Like SIMD?) This piece of code is the bottleneck of my program.
This is essentially a histogram problem. You're mapping values 16-bit values to 5-bit values with a 32k-entry lookup table, but after that it's just histogramming the LUT results. Like ++counts[ b[a[j]] ];, where counts is bins[i]. See below for more about histograms.
First of all, you can use the smallest possible data-types to increase the density of your LUT (and of the original data). On x86, a zero or sign-extending load of 8-bit or 16-bit data into a register is almost exactly the same cost as a regular 32-bit int load (assuming both hit in cache), and an 8-bit or 16-bit store is also just as cheap as a 32-bit store.
Since your data size exceeds L1 d-cache size (32kiB for all recent Intel designs), and you access it in a scattered pattern, you have a lot to gain from shrinking your cache footprint. (For more x86 perf info, see the x86 tag wiki, especially Agner Fog's stuff).
Since a has less than 65536 entries in each plane, your bin counts will never overflow a 16-bit counter, so bins can be uint16_t as well.
Your copy() makes no sense. Why are you copying into b[32768] instead of having your inner loop use a pointer to the current LUT? You use it read-only. The only reason you'd still want to copy is to copy from int to uin8_t if you can't change the code that produces different LUTs to produce int8_t or uint8_t in the first place.
This version takes advantage of those ideas and a few histogram tricks, and compiles to asm that looks good (Godbolt compiler explorer: gcc6.2 -O3 -march=haswell (AVX2)):
// untested
//#include <algorithm>
#include <stdint.h>
const int PLANES = 1000;
void use_bins(uint16_t bins[PLANES][32]); // pass the result to an extern function so it doesn't optimize away
// 65536 or higher triggers the static_assert
alignas(64) static uint16_t a[PLANES][1000]; // static/global, I guess?
void lut_and_histogram(uint8_t __restrict__ lut[32768])
{
alignas(16) uint16_t bins[PLANES][32]; // don't zero the whole thing up front: that would evict more data from cache than necessary
// Better would be zeroing the relevant plane of each bin right before using.
// you pay the rep stosq startup overhead more times, though.
for (int i = 0; i < PLANES;i++) {
alignas(16) uint16_t tmpbins[4][32] = {0};
constexpr int a_elems = sizeof(a[0])/sizeof(uint16_t);
static_assert(a_elems > 1, "someone changed a[] into a* and forgot to update this code");
static_assert(a_elems <= UINT16_MAX, "bins could overflow");
const uint16_t *ai = a[i];
for (int j = 0 ; j<a_elems ; j+=4) { //hashing a[i] to 32 bins.
// Unrolling to separate bin arrays reduces serial dependencies
// to avoid bottlenecks when the same bin is used repeatedly.
// This has to be balanced against using too much L1 cache for the bins.
// TODO: load a vector of data from ai[j] and unpack it with pextrw.
// even just loading a uint64_t and unpacking it to 4 uint16_t would help.
tmpbins[0][ lut[ai[j+0]] ]++;
tmpbins[1][ lut[ai[j+1]] ]++;
tmpbins[2][ lut[ai[j+2]] ]++;
tmpbins[3][ lut[ai[j+3]] ]++;
static_assert(a_elems % 4 == 0, "unroll factor doesn't divide a element count");
}
// TODO: do multiple a[i] in parallel instead of slicing up a single run.
for (int k = 0 ; k<32 ; k++) {
// gcc does auto-vectorize this with a short fully-unrolled VMOVDQA / VPADDW x3
bins[i][k] = tmpbins[0][k] + tmpbins[1][k] +
tmpbins[2][k] + tmpbins[3][k];
}
}
// do something with bins. An extern function stops it from optimizing away.
use_bins(bins);
}
The inner-loop asm looks like this:
.L2:
movzx ecx, WORD PTR [rdx]
add rdx, 8 # pointer increment over ai[]
movzx ecx, BYTE PTR [rsi+rcx]
add WORD PTR [rbp-64272+rcx*2], 1 # memory-destination increment of a histogram element
movzx ecx, WORD PTR [rdx-6]
movzx ecx, BYTE PTR [rsi+rcx]
add WORD PTR [rbp-64208+rcx*2], 1
... repeated twice more
With those 32-bit offsets from rbp (instead of 8-bit offsets from rsp, or using another register :/) the code density isn't wonderful. Still, the average instruction length isn't so long that it's likely to bottleneck on instruction decode on any modern CPU.
A variation on multiple bins:
Since you need to do multiple histograms anyway, just do 4 to 8 of them in parallel instead of slicing the bins for a single histogram. The unroll factor doesn't even have to be a power of 2.
That eliminates the need for the bins[i][k] = sum(tmpbins[0..3][k]) loop over k at the end.
Zero bins[i..i+unroll_factor][0..31] right before use, instead of zeroing the whole thing outside the loop. That way all the bins will be hot in L1 cache when you start, and this work can overlap with the more load-heavy work of the inner loop.
Hardware prefetchers can keep track of multiple sequential streams, so don't worry about having a lot more cache misses in loading from a. (Also use vector loads for this, and slice them up after loading).
Other questions with useful answers about histograms:
Methods to vectorise histogram in SIMD? suggests the multiple-bin-arrays and sum at the end trick.
Optimizing SIMD histogram calculation x86 asm loading a vector of a values and extracting to integer registers with pextrb. (In your code, you'd use pextrw / _mm_extract_epi16). With all the load/store uops happening, doing a vector load and using ALU ops to unpack makes sense. With good L1 hit rates, memory uop throughput may be the bottleneck, not memory / cache latency.
How to optimize histogram statistics with neon intrinsics? some of the same ideas: multiple copies of the bins array. It also has an ARM-specific suggestion for doing address calculations in a SIMD vector (ARM can get two scalars from a vector in a single instruction), and laying out the multiple-bins array the opposite way.
AVX2 Gather instructions for the LUT
If you're going to run this on Intel Skylake, you could maybe even do the LUT lookups with AVX2 gather instructions. (On Broadwell, it's probably a break-even, and on Haswell it would lose; they support vpgatherdd (_mm_i32gather_epi32), but don't have as efficient an implementation. Hopefully Skylake avoids hitting the same cache line multiple times when there is overlap between elements).
And yes, you can still gather from an array of uint16_t (with scale factor = 2), even though the smallest gather granularity is 32-bit elements. It means you get garbage in the high half of each 32-bit vector element instead of 0, but that shouldn't matter. Cache-line splits aren't ideal, since we're probably bottlenecked on cache throughput.
Garbage in the high half of gathered elements doesn't matter because you're extracting only the useful 16 bits anyway with pextrw. (And doing the histogram part of the process with scalar code).
You could potentially use another gather to load from the histogram bins, as long as each element comes from a separate slice/plane of histogram bins. Otherwise, if two elements come from the same bin, it would only be incremented by one when you manually scatter the incremented vector back into the histogram (with scalar stores). This kind of conflict detection for scatter stores is why AVX512CD exists. AVX512 does have scatter instructions, as well as gather (already added in AVX2).
AVX512
See page 50 of Kirill Yukhin's slides from 2014 for an example loop that retries until there are no conflicts; but it doesn't show how get_conflict_free_subset() is implemented in terms of __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd) (which returns a bitmap in each element of all the preceding elements it conflicts with). As #Mysticial points out, a simple implementation is less simple than it would be if the conflict-detect instruction simply produced a mask-register result, instead of another vector.
I searched for but didn't find an Intel-published tutorial/guide on using AVX512CD, but presumably they think using _mm512_lzcnt_epi32 (vplzcntd) on the result of vpconflictd is useful for some cases, because it's also part of AVX512CD.
Maybe you're "supposed" to do something more clever than just skipping all elements that have any conflicts? Maybe to detect a case where a scalar fallback would be better, e.g. all 16 dword elements have the same index? vpbroadcastmw2d broadcasts a mask register to all 32-bit elements of the result, so that lets you line up a mask-register value with the bitmaps in each element from vpconflictd. (And there are already compare, bitwise, and other operations between elements from AVX512F).
Kirill's slides list VPTESTNM{D,Q} (from AVX512F) along with the conflict-detection instructions. It generates a mask from DEST[j] = (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)? 1 : 0. i.e. AND elements together, and set the mask result for that element to 1 if they don't intersect.
Possibly also relevant: http://colfaxresearch.com/knl-avx512/ says "For a practical illustration, we construct and optimize a micro-kernel for particle binning particles", with some code for AVX2 (I think). But it's behind a free registration which I haven't done. Based on the diagram, I think they're doing the actual scatter part as scalar, after some vectorized stuff to produce data they want to scatter. The first link says the 2nd link is "for previous instruction sets".
Avoid gather/scatter conflict detection by replicating the count array
When the number of buckets is small compared to the size of the array, it becomes viable to replicate the count arrays and unroll to minimize store-forwarding latency bottlenecks with repeated elements. But for a gather/scatter strategy, it also avoids the possibility of conflicts, solving the correctness problem, if we use a different array for each vector element.
How can we do that when a gather / scatter instruction only takes one array base? Make all the count arrays contiguous, and offset each index vector with one extra SIMD add instruction, fully replacing conflict detection and branching.
If the number of buckets isn't a multiple of 16, you might want to round up the array geometry so each subset of counts starts at an aligned offset. Or not, if cache locality is more important than avoiding unaligned loads in the reduction at the end.
const size_t nb = 32; // number of buckets
const int VEC_WIDTH = 16; // sizeof(__m512i) / sizeof(uint32_t)
alignas(__m512i) uint32_t counts[nb * VEC_WIDTH] = {0};
// then in your histo loop
__m512i idx = ...; // in this case from LUT lookups
idx = _mm512_add_epi32(idx, _mm512_setr_epi32(
0*nb, 1*nb, 2*nb, 3*nb, 4*nb, 5*nb, 6*nb, 7*nb,
8*nb, 9*nb, 10*nb, 11*nb, 12*nb, 13*nb, 14*nb, 15*nb));
// note these are C array indexes, not byte offsets
__m512i vc = _mm512_i32gather_epi32(idx, counts, sizeof(counts[0]));
vc = _mm512_add_epi32(vc, _mm512_set1_epi32(1));
_mm512_i32scatter_epi32(counts, idx, vc, sizeof(counts[0]));
https://godbolt.org/z/8Kesx7sEK shows that the above code actually compiles. (Inside a loop, the vector-constant setup could get hoisted, but not setting mask registers to all-one before each gather or scatter, or preparing a zeroed merge destination.)
Then after the main histogram loop, reduce down to one count array:
// Optionally with size_t nb as an arg
// also optionally use restrict if you never reduce in-place, into the bottom of the input.
void reduce_counts(int *output, const int *counts)
{
for (int i = 0 ; i < nb - (VEC_WIDTH-1) ; i+=VEC_WIDTH) {
__m512i v = _mm512_load_si512(&counts[i]); // aligned load, full cache line
// optional: unroll this and accumulate two vectors in parallel for better spatial locality and more ILP
for (int offset = nb; offset < nb*VEC_WIDTH ; offset+=nb) {
__m512i tmp = _mm512_loadu_si512(&counts[i + offset]);
v = _mm512_add_epi32(v, tmp);
}
_mm512_storeu_si512(&output[i], v);
}
// if nb isn't a multiple of the vector width, do some cleanup here
// possibly using a masked store to write into a final odd-sized destination
}
Obviously this is bad with too many buckets; you end up having to zero way more memory, and loop over a lot of it at the end. Using 256-bit instead of 512-bit gathers helps, you only need half as many arrays, but efficiency of gather/scatter instructions improves with wider vectors. e.g. one vpgatherdd per 5 cycles for 256-bit on Cascade Lake, one per 9.25 for 512-bit. (And both are 4 uops for the front-end)
Or on Ice Lake, one vpscatterdd ymm per 7 cycles, one zmm per 11 cycles. (vs. 14 for 2x ymm). https://uops.info/
In your bins[1000][32] case, you could actually use the later elements of bins[i+0..15] as extra count arrays, if you zero first, at least for the first 1000-15 outer loop iterations. That would avoid touching extra memory: zeroing for the next outer loop would start at the previous counts[32], effectively.
(This would be playing a bit fast and loose with C 2D vs. 1D arrays, but all the actual accesses past the end of the [32] C array type would be via memset (i.e. unsigned char*) or via _mm* intrinsics which are also allowed to alias anything)
Related:
Tiny histograms (like 4 buckets) can use count[0] += (arr[i] == 0) and so on, which you can vectorize with SIMD packed compares - Micro Optimization of a 4-bucket histogram of a large array or list This is interesting when the number of buckets is less than or equal to the number of elements in a SIMD vector.

cache friendly C++ operation on matrix in C++?

My application does some operations on matrices of large size.
I recently came accross the concept of cache & the performance effect it can have through this answer.
I would like to know what would be the best algorithm which is cache friendly for my case.
Algorithm 1:
for(int i = 0; i < size; i++)
{
for(int j = i + 1; j < size; j++)
{
c[i][j] -= K * c[j][j];//K is a constant double variable
}//c is a 2 dimensional array of double variables
}
Algorithm 2:
double *A = new double[size];
for(int n = 0; n < size; n++)
A[n] = c[n][n];
for(int i = 0; i < size; i++)
{
for(int j = i + 1; j < size; j++)
{
c[i][j] -= K * A[j];
}
}
The size of my array is more than 1000x1000.
Benchmarking on my laptop shows Algorithm 2 is better than 1 for size 5000x5000.
Please note that I have multi threaded my application such that a set of rows is operated by a thread.
For example: For array of size 1000x1000.
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
If your benchmarks show significant improvement for the second case, then it most likely is the better choice. But of course, to know for "an average CPU", we'd have to know that for a large number of CPU's that can be called average - there is no other way. And it will really depend on the definition of Average CPU. Are we talking "any x86 (AMD + Intel) CPU" or "Any random CPU that we can find in anything from a watch to the latest super-fast creation in the x86 range"?
The "copy the data in c[n][n]" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j], every j steps will jump sizeof(double) * (size * j + 1) bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decent size cache (bigger than size * sizeof(double)), it's a definite benefit. Even with smaller cache, it's quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].
In summary, the second algorithm is very likely better for nearly all options.
Algorithm2 benefits from what's called "spatial locality", moving the diagonal into a single dimension array makes it reside in memory in consecutive addresses, and thereby:
Enjoys the benefit of fetching multiple useful elements per a single cache line (presumably 64byte, depending on your CPU), better utilizing cache and memory BW (whereas c[n][n] would also fetch a lot of useless data since it's in the same lines).
Enjoys the benefits of a HW stream prefetchers (assuming such exist in your CPU), that aggressively run ahead of your code along the page and brings the data in advance to the lower cache levels, improving the memory latency.
It should be pointed that moving the data to A doesn't necessarily improve cacheability since A would still compete against a lot of data constantly coming from c and thrashing the cache. However, since it's used over and over, there's a high chance that a good LRU algorithm would make it stay in the cache anyway. You could help that by using streaming memory operations for array c. It should be noted that these are very volatile performance tools, and may on some scenarios lead to perf reduction if not used correctly.
Another potential benefit could come from mixing SW prefetches slightly ahead of reaching every new array line.

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottle neck of my code. Ideally I would parallelize the function that calls this but that is not possible.
//n is about 60
for (int k = 0;k < n;k++)
{
double fone = z[k*n+i+1];
double fzer = z[k*n+i];
z[k*n+i+1]= s*fzer+c*fone;
z[k*n+i] = c*fzer-s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the (i)th and (i+1)th column of a matrix stored in row-major order - probably a big matrix that doesn't as a whole fit into CPU cache. Basically, on every loop iteration the CPU has to wait for RAM (in the order of hundred cycles). After a few iteraterations, theoretically, the address prediction should kick in and the CPU should speculatively load the data items even before the loop acesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only cache-lines (64 bytes on current processors). Of every 64 byte cache-line loaded and stored your code only touches 16 bytes (or a quarter).
Transposing the matrix and accessing it in native major order would increase memory bus utilization four-fold. Since that is probably the bottle-neck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
I take it you are rotating something (or rather, lots of things, by the same angle (s being a sin, c being a cos))?
Counting backwards is always good fun and cuts out variable comparison for each iteration, and should work here. Making the counter the index might save a bit of time also (cuts out a bit of arithmetic, as said by others).
for (int k = (n-1) * n + i; k >= 0; k -= n)
{
double fone=z[k+1];
double fzer=z[k];
z[k+1]=s*fzer+c*fone;
z[k] =c*fzer-s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As first move i'd cache pointers in this loop:
//n is about 60
double *cur_z = &z[0*n+i]
for (int k = 0;k < n;k++)
{
double fone = *(cur_z+1);
double fzer = *cur_z;
*(cur_z+1)= s*fzer+c*fone;
*cur_z = c*fzer-s*fone;
cur_z += n;
}
Second, i think its better to make templatized version of this function. As a result, you can get good perfomance benefit if your matrix holds integer values (since FPU operations are slower).

Why are elementwise additions much faster in separate loops than in a combined loop?

Suppose a1, b1, c1, and d1 point to heap memory, and my numerical code has the following core loop.
const int n = 100000;
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
c1[j] += d1[j];
}
This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
c1[j] += d1[j];
}
Compiled on Microsoft Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on a Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds.
Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):
movsd xmm0,mmword ptr [edx+18h]
addsd xmm0,mmword ptr [ecx+20h]
movsd mmword ptr [ecx+20h],xmm0
movsd xmm0,mmword ptr [esi+10h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [edx+20h]
addsd xmm0,mmword ptr [ecx+28h]
movsd mmword ptr [ecx+28h],xmm0
movsd xmm0,mmword ptr [esi+18h]
addsd xmm0,mmword ptr [eax+38h]
Each loop of the double loop example produces this code (the following block is repeated about three times):
addsd xmm0,mmword ptr [eax+28h]
movsd mmword ptr [eax+28h],xmm0
movsd xmm0,mmword ptr [ecx+20h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [ecx+28h]
addsd xmm0,mmword ptr [eax+38h]
movsd mmword ptr [eax+38h],xmm0
movsd xmm0,mmword ptr [ecx+30h]
addsd xmm0,mmword ptr [eax+40h]
movsd mmword ptr [eax+40h],xmm0
The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:
Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?
It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.
Here is the full code. It uses TBB Tick_Count for higher resolution timing, which can be disabled by not defining the TBB_TIMING Macro:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>
//#define TBB_TIMING
#ifdef TBB_TIMING
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif
using namespace std;
//#define preallocate_memory new_cont
enum { new_cont, new_sep };
double *a1, *b1, *c1, *d1;
void allo(int cont, int n)
{
switch(cont) {
case new_cont:
a1 = new double[n*4];
b1 = a1 + n;
c1 = b1 + n;
d1 = c1 + n;
break;
case new_sep:
a1 = new double[n];
b1 = new double[n];
c1 = new double[n];
d1 = new double[n];
break;
}
for (int i = 0; i < n; i++) {
a1[i] = 1.0;
d1[i] = 1.0;
c1[i] = 1.0;
b1[i] = 1.0;
}
}
void ff(int cont)
{
switch(cont){
case new_sep:
delete[] b1;
delete[] c1;
delete[] d1;
case new_cont:
delete[] a1;
}
}
double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
allo(cont,n);
#endif
#ifdef TBB_TIMING
tick_count t0 = tick_count::now();
#else
clock_t start = clock();
#endif
if (loops == 1) {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
}
} else {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
c1[j] += d1[j];
}
}
}
double ret;
#ifdef TBB_TIMING
tick_count t1 = tick_count::now();
ret = 2.0*double(n)*double(m)/(t1-t0).seconds();
#else
clock_t end = clock();
ret = 2.0*double(n)*double(m)/(double)(end - start) *double(CLOCKS_PER_SEC);
#endif
#ifndef preallocate_memory
ff(cont);
#endif
return ret;
}
void main()
{
freopen("C:\\test.csv", "w", stdout);
char *s = " ";
string na[2] ={"new_cont", "new_sep"};
cout << "n";
for (int j = 0; j < 2; j++)
for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
cout << s << i << "_loops_" << na[preallocate_memory];
#else
cout << s << i << "_loops_" << na[j];
#endif
cout << endl;
long long nmax = 1000000;
#ifdef preallocate_memory
allo(preallocate_memory, nmax);
#endif
for (long long n = 1L; n < nmax; n = max(n+1, long long(n*1.2)))
{
const long long m = 10000000/n;
cout << n;
for (int j = 0; j < 2; j++)
for (int i = 1; i <= 2; i++)
cout << s << plain(n, m, j, i);
cout << endl;
}
}
It shows FLOP/s for different values of n.
Upon further analysis of this, I believe this is (at least partially) caused by the data alignment of the four-pointers. This will cause some level of cache bank/way conflicts.
If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.
This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.
EDIT: It does in fact look like you are allocating all the arrays separately.
Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.
Here's the test code:
int main(){
const int n = 100000;
#ifdef ALLOCATE_SEPERATE
double *a1 = (double*)malloc(n * sizeof(double));
double *b1 = (double*)malloc(n * sizeof(double));
double *c1 = (double*)malloc(n * sizeof(double));
double *d1 = (double*)malloc(n * sizeof(double));
#else
double *a1 = (double*)malloc(n * sizeof(double) * 4);
double *b1 = a1 + n;
double *c1 = b1 + n;
double *d1 = c1 + n;
#endif
// Zero the data to prevent any chance of denormals.
memset(a1,0,n * sizeof(double));
memset(b1,0,n * sizeof(double));
memset(c1,0,n * sizeof(double));
memset(d1,0,n * sizeof(double));
// Print the addresses
cout << a1 << endl;
cout << b1 << endl;
cout << c1 << endl;
cout << d1 << endl;
clock_t start = clock();
int c = 0;
while (c++ < 10000){
#if ONE_LOOP
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
#else
for(int j=0;j<n;j++){
a1[j] += b1[j];
}
for(int j=0;j<n;j++){
c1[j] += d1[j];
}
#endif
}
clock_t end = clock();
cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;
system("pause");
return 0;
}
Benchmark Results:
EDIT: Results on an actual Core 2 architecture machine:
2 x Intel Xeon X5482 Harpertown # 3.2 GHz:
#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206
#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116
//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894
//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993
Observations:
6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.
In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.
In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.
As #Stephen Cannon points out in the comments, there is a very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/~amplifierxe/pmw_dp/events/partial_address_alias.html
5 Regions - Explanations
Region 1:
This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.
Region 2:
Here, as the data sizes increase, the amount of relative overhead goes down and the performance "saturates". Here two loops is slower because it has twice as much loop and branching overhead.
I'm not sure exactly what's going on here... Alignment could still play an effect as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)
Region 3:
At this point, the data no longer fits in the L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.
Region 4:
The performance drop in the single-loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.
However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.
Region 5:
At this point, nothing fits in the cache. So you're bound by memory bandwidth.
OK, the right answer definitely has to do something with the CPU cache. But to use the cache argument can be quite difficult, especially without data.
There are many answers, that led to a lot of discussion, but let's face it: Cache issues can be very complex and are not one dimensional. They depend heavily on the size of the data, so my question was unfair: It turned out to be at a very interesting point in the cache graph.
#Mysticial's answer convinced a lot of people (including me), probably because it was the only one that seemed to rely on facts, but it was only one "data point" of the truth.
That's why I combined his test (using a continuous vs. separate allocation) and #James' Answer's advice.
The graphs below shows, that most of the answers and especially the majority of comments to the question and answers can be considered completely wrong or true depending on the exact scenario and parameters used.
Note that my initial question was at n = 100.000. This point (by accident) exhibits special behavior:
It possesses the greatest discrepancy between the one and two loop'ed version (almost a factor of three)
It is the only point, where one-loop (namely with continuous allocation) beats the two-loop version. (This made Mysticial's answer possible, at all.)
The result using initialized data:
The result using uninitialized data (this is what Mysticial tested):
And this is a hard-to-explain one: Initialized data, that is allocated once and reused for every following test case of different vector size:
Proposal
Every low-level performance related question on Stack Overflow should be required to provide MFLOPS information for the whole range of cache relevant data sizes! It's a waste of everybody's time to think of answers and especially discuss them with others without this information.
The second loop involves a lot less cache activity, so it's easier for the processor to keep up with the memory demands.
Imagine you are working on a machine where n was just the right value for it only to be possible to hold two of your arrays in memory at one time, but the total memory available, via disk caching, was still sufficient to hold all four.
Assuming a simple LIFO caching policy, this code:
for(int j=0;j<n;j++){
a[j] += b[j];
}
for(int j=0;j<n;j++){
c[j] += d[j];
}
would first cause a and b to be loaded into RAM and then be worked on entirely in RAM. When the second loop starts, c and d would then be loaded from disk into RAM and operated on.
the other loop
for(int j=0;j<n;j++){
a[j] += b[j];
c[j] += d[j];
}
will page out two arrays and page in the other two every time around the loop. This would obviously be much slower.
You are probably not seeing disk caching in your tests but you are probably seeing the side effects of some other form of caching.
There seems to be a little confusion/misunderstanding here so I will try to elaborate a little using an example.
Say n = 2 and we are working with bytes. In my scenario we thus have just 4 bytes of RAM and the rest of our memory is significantly slower (say 100 times longer access).
Assuming a fairly dumb caching policy of if the byte is not in the cache, put it there and get the following byte too while we are at it you will get a scenario something like this:
With
for(int j=0;j<n;j++){
a[j] += b[j];
}
for(int j=0;j<n;j++){
c[j] += d[j];
}
cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.
set a[1] = a[1] + b[1] in cache. Cost = 1 + 1.
Repeat for c and d.
Total cost = (100 + 100 + 1 + 1) * 2 = 404
With
for(int j=0;j<n;j++){
a[j] += b[j];
c[j] += d[j];
}
cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.
eject a[0], a[1], b[0], b[1] from cache and cache c[0] and c[1] then d[0] and d[1] and set c[0] = c[0] + d[0] in cache. Cost = 100 + 100.
I suspect you are beginning to see where I am going.
Total cost = (100 + 100 + 100 + 100) * 2 = 800
This is a classic cache thrash scenario.
It's not because of a different code, but because of caching: RAM is slower than the CPU registers and a cache memory is inside the CPU to avoid to write the RAM every time a variable is changing. But the cache is not big as the RAM is, hence, it maps only a fraction of it.
The first code modifies distant memory addresses alternating them at each loop, thus requiring continuously to invalidate the cache.
The second code don't alternate: it just flow on adjacent addresses twice. This makes all the job to be completed in the cache, invalidating it only after the second loop starts.
I cannot replicate the results discussed here.
I don't know if poor benchmark code is to blame, or what, but the two methods are within 10% of each other on my machine using the following code, and one loop is usually just slightly faster than two - as you'd expect.
Array sizes ranged from 2^16 to 2^24, using eight loops. I was careful to initialize the source arrays so the += assignment wasn't asking the FPU to add memory garbage interpreted as a double.
I played around with various schemes, such as putting the assignment of b[j], d[j] to InitToZero[j] inside the loops, and also with using += b[j] = 1 and += d[j] = 1, and I got fairly consistent results.
As you might expect, initializing b and d inside the loop using InitToZero[j] gave the combined approach an advantage, as they were done back-to-back before the assignments to a and c, but still within 10%. Go figure.
Hardware is Dell XPS 8500 with generation 3 Core i7 # 3.4 GHz and 8 GB memory. For 2^16 to 2^24, using eight loops, the cumulative time was 44.987 and 40.965 respectively. Visual C++ 2010, fully optimized.
PS: I changed the loops to count down to zero, and the combined method was marginally faster. Scratching my head. Note the new array sizing and loop counts.
// MemBufferMystery.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <string>
#include <time.h>
#define dbl double
#define MAX_ARRAY_SZ 262145 //16777216 // AKA (2^24)
#define STEP_SZ 1024 // 65536 // AKA (2^16)
int _tmain(int argc, _TCHAR* argv[]) {
long i, j, ArraySz = 0, LoopKnt = 1024;
time_t start, Cumulative_Combined = 0, Cumulative_Separate = 0;
dbl *a = NULL, *b = NULL, *c = NULL, *d = NULL, *InitToOnes = NULL;
a = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
b = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
c = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
d = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
InitToOnes = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
// Initialize array to 1.0 second.
for(j = 0; j< MAX_ARRAY_SZ; j++) {
InitToOnes[j] = 1.0;
}
// Increase size of arrays and time
for(ArraySz = STEP_SZ; ArraySz<MAX_ARRAY_SZ; ArraySz += STEP_SZ) {
a = (dbl *)realloc(a, ArraySz * sizeof(dbl));
b = (dbl *)realloc(b, ArraySz * sizeof(dbl));
c = (dbl *)realloc(c, ArraySz * sizeof(dbl));
d = (dbl *)realloc(d, ArraySz * sizeof(dbl));
// Outside the timing loop, initialize
// b and d arrays to 1.0 sec for consistent += performance.
memcpy((void *)b, (void *)InitToOnes, ArraySz * sizeof(dbl));
memcpy((void *)d, (void *)InitToOnes, ArraySz * sizeof(dbl));
start = clock();
for(i = LoopKnt; i; i--) {
for(j = ArraySz; j; j--) {
a[j] += b[j];
c[j] += d[j];
}
}
Cumulative_Combined += (clock()-start);
printf("\n %6i miliseconds for combined array sizes %i and %i loops",
(int)(clock()-start), ArraySz, LoopKnt);
start = clock();
for(i = LoopKnt; i; i--) {
for(j = ArraySz; j; j--) {
a[j] += b[j];
}
for(j = ArraySz; j; j--) {
c[j] += d[j];
}
}
Cumulative_Separate += (clock()-start);
printf("\n %6i miliseconds for separate array sizes %i and %i loops \n",
(int)(clock()-start), ArraySz, LoopKnt);
}
printf("\n Cumulative combined array processing took %10.3f seconds",
(dbl)(Cumulative_Combined/(dbl)CLOCKS_PER_SEC));
printf("\n Cumulative seperate array processing took %10.3f seconds",
(dbl)(Cumulative_Separate/(dbl)CLOCKS_PER_SEC));
getchar();
free(a); free(b); free(c); free(d); free(InitToOnes);
return 0;
}
I'm not sure why it was decided that MFLOPS was a relevant metric. I though the idea was to focus on memory accesses, so I tried to minimize the amount of floating point computation time. I left in the +=, but I am not sure why.
A straight assignment with no computation would be a cleaner test of memory access time and would create a test that is uniform irrespective of the loop count. Maybe I missed something in the conversation, but it is worth thinking twice about. If the plus is left out of the assignment, the cumulative time is almost identical at 31 seconds each.
It's because the CPU doesn't have so many cache misses (where it has to wait for the array data to come from the RAM chips). It would be interesting for you to adjust the size of the arrays continually so that you exceed the sizes of the level 1 cache (L1), and then the level 2 cache (L2), of your CPU and plot the time taken for your code to execute against the sizes of the arrays. The graph shouldn't be a straight line like you'd expect.
The first loop alternates writing in each variable. The second and third ones only make small jumps of element size.
Try writing two parallel lines of 20 crosses with a pen and paper separated by 20 cm. Try once finishing one and then the other line and try another time by writting a cross in each line alternately.
The Original Question
Why is one loop so much slower than two loops?
Conclusion:
Case 1 is a classic interpolation problem that happens to be an inefficient one. I also think that this was one of the leading reasons why many machine architectures and developers ended up building and designing multi-core systems with the ability to do multi-threaded applications as well as parallel programming.
Looking at it from this kind of an approach without involving how the hardware, OS, and compiler(s) work together to do heap allocations that involve working with RAM, cache, page files, etc.; the mathematics that is at the foundation of these algorithms shows us which of these two is the better solution.
We can use an analogy of a Boss being a Summation that will represent a For Loop that has to travel between workers A & B.
We can easily see that Case 2 is at least half as fast if not a little more than Case 1 due to the difference in the distance that is needed to travel and the time taken between the workers. This math lines up almost virtually and perfectly with both the benchmark times as well as the number of differences in assembly instructions.
I will now begin to explain how all of this works below.
Assessing The Problem
The OP's code:
const int n=100000;
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
And
for(int j=0;j<n;j++){
a1[j] += b1[j];
}
for(int j=0;j<n;j++){
c1[j] += d1[j];
}
The Consideration
Considering the OP's original question about the two variants of the for loops and his amended question towards the behavior of caches along with many of the other excellent answers and useful comments; I'd like to try and do something different here by taking a different approach about this situation and problem.
The Approach
Considering the two loops and all of the discussion about cache and page filing I'd like to take another approach as to looking at this from a different perspective. One that doesn't involve the cache and page files nor the executions to allocate memory, in fact, this approach doesn't even concern the actual hardware or the software at all.
The Perspective
After looking at the code for a while it became quite apparent what the problem is and what is generating it. Let's break this down into an algorithmic problem and look at it from the perspective of using mathematical notations then apply an analogy to the math problems as well as to the algorithms.
What We Do Know
We know is that this loop will run 100,000 times. We also know that a1, b1, c1 & d1 are pointers on a 64-bit architecture. Within C++ on a 32-bit machine, all pointers are 4 bytes and on a 64-bit machine, they are 8 bytes in size since pointers are of a fixed length.
We know that we have 32 bytes in which to allocate for in both cases. The only difference is we are allocating 32 bytes or two sets of 2-8 bytes on each iteration wherein the second case we are allocating 16 bytes for each iteration for both of the independent loops.
Both loops still equal 32 bytes in total allocations. With this information let's now go ahead and show the general math, algorithms, and analogy of these concepts.
We do know the number of times that the same set or group of operations that will have to be performed in both cases. We do know the amount of memory that needs to be allocated in both cases. We can assess that the overall workload of the allocations between both cases will be approximately the same.
What We Don't Know
We do not know how long it will take for each case unless if we set a counter and run a benchmark test. However, the benchmarks were already included from the original question and from some of the answers and comments as well; and we can see a significant difference between the two and this is the whole reasoning for this proposal to this problem.
Let's Investigate
It is already apparent that many have already done this by looking at the heap allocations, benchmark tests, looking at RAM, cache, and page files. Looking at specific data points and specific iteration indices were also included and the various conversations about this specific problem have many people starting to question other related things about it. How do we begin to look at this problem by using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.
Our Assertions:
We will let our loop and its iterations be a Summation that starts at 1 and ends at 100000 instead of starting with 0 as in the loops for we don't need to worry about the 0 indexing scheme of memory addressing since we are just interested in the algorithm itself.
In both cases we have four functions to work with and two function calls with two operations being done on each function call. We will set these up as functions and calls to functions as the following: F1(), F2(), f(a), f(b), f(c) and f(d).
The Algorithms:
1st Case: - Only one summation but two independent function calls.
Sum n=1 : [1,100000] = F1(), F2();
F1() = { f(a) = f(a) + f(b); }
F2() = { f(c) = f(c) + f(d); }
2nd Case: - Two summations but each has its own function call.
Sum1 n=1 : [1,100000] = F1();
F1() = { f(a) = f(a) + f(b); }
Sum2 n=1 : [1,100000] = F1();
F1() = { f(c) = f(c) + f(d); }
If you noticed F2() only exists in Sum from Case1 where F1() is contained in Sum from Case1 and in both Sum1 and Sum2 from Case2. This will be evident later on when we begin to conclude that there is an optimization that is happening within the second algorithm.
The iterations through the first case Sum calls f(a) that will add to its self f(b) then it calls f(c) that will do the same but add f(d) to itself for each 100000 iterations. In the second case, we have Sum1 and Sum2 that both act the same as if they were the same function being called twice in a row.
In this case we can treat Sum1 and Sum2 as just plain old Sum where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); } and now this looks like an optimization where we can just consider it to be the same function.
Summary with Analogy
With what we have seen in the second case it almost appears as if there is optimization since both for loops have the same exact signature, but this isn't the real issue. The issue isn't the work that is being done by f(a), f(b), f(c), and f(d). In both cases and the comparison between the two, it is the difference in the distance that the Summation has to travel in each case that gives you the difference in execution time.
Think of the for loops as being the summations that does the iterations as being a Boss that is giving orders to two people A & B and that their jobs are to meat C & D respectively and to pick up some package from them and return it. In this analogy, the for loops or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss is not from the actual mathematical algorithms directly but from the actual concept of Scope and Code Block within a routine or subroutine, method, function, translation unit, etc. The first algorithm has one scope where the second algorithm has two consecutive scopes.
Within the first case on each call slip, the Boss goes to A and gives the order and A goes off to fetch B's package then the Boss goes to C and gives the orders to do the same and receive the package from D on each iteration.
Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.
Since we are working with an 8-byte pointer and dealing with heap allocation let's consider the following problem. Let's say that the Boss is 100 feet from A and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of executions. In both cases, the Boss initially travels from A first then to B. This analogy isn't to say that this distance is exact; it is just a useful test case scenario to show the workings of the algorithms.
In many cases when doing heap allocations and working with the cache and page files, these distances between address locations may not vary that much or they can vary significantly depending on the nature of the data types and the array sizes.
The Test Cases:
First Case: On first iteration the Boss has to initially go 100 feet to give the order slip to A and A goes off and does his thing, but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration and every other iteration after the Boss has to go back and forth 500 feet between the two.
Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that, he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C because C is 500 feet from A. Since this Boss( Summation, For Loop ) is being called right after working with A he then just waits there as he did with A until all of C's order slips are done.
The Difference In Distances Traveled
const n = 100000
distTraveledOfFirst = (100 + 500) + ((n-1)*(500 + 500));
// Simplify
distTraveledOfFirst = 600 + (99999*1000);
distTraveledOfFirst = 600 + 99999000;
distTraveledOfFirst = 99999600
// Distance Traveled On First Algorithm = 99,999,600ft
distTraveledOfSecond = 100 + 500 = 600;
// Distance Traveled On Second Algorithm = 600ft;
The Comparison of Arbitrary Values
We can easily see that 600 is far less than approximately 100 million. Now, this isn't exact, because we don't know the actual difference in distance between which address of RAM or from which cache or page file each call on each iteration is going to be due to many other unseen variables. This is just an assessment of the situation to be aware of and looking at it from the worst-case scenario.
From these numbers it would almost appear as if algorithm one should be 99% slower than algorithm two; however, this is only the Boss's part or responsibility of the algorithms and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the Loop. So the boss's job only accounts for about 15 - 40% of the total work being done. The bulk of the work that is done through the workers has a slightly bigger impact towards keeping the ratio of the speed rate differences to about 50-70%
The Observation: - The differences between the two algorithms
In this situation, it is the structure of the process of the work being done. It goes to show that Case 2 is more efficient from both the partial optimization of having a similar function declaration and definition where it is only the variables that differ by name and the distance traveled.
We also see that the total distance traveled in Case 1 is much farther than it is in Case 2 and we can consider this distance traveled our Time Factor between the two algorithms. Case 1 has considerable more work to do than Case 2 does.
This is observable from the evidence of the assembly instructions that were shown in both cases. Along with what was already stated about these cases, this doesn't account for the fact that in Case 1 the boss will have to wait for both A & C to get back before he can go back to A again for each iteration. It also doesn't account for the fact that if A or B is taking an extremely long time then both the Boss and the other worker(s) are idle waiting to be executed.
In Case 2 the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.
The OP's Amended Question(s)
EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:
Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?
It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.
Regarding These Questions
As I have demonstrated without a doubt, there is an underlying issue even before the Hardware and Software becomes involved.
Now as for the management of memory and caching along with page files, etc. which all work together in an integrated set of systems between the following:
The architecture (hardware, firmware, some embedded drivers, kernels and assembly instruction sets).
The OS (file and memory management systems, drivers and the registry).
The compiler (translation units and optimizations of the source code).
And even the source code itself with its set(s) of distinctive algorithms.
We can already see that there is a bottleneck that is happening within the first algorithm before we even apply it to any machine with any arbitrary architecture, OS, and programmable language compared to the second algorithm. There already existed a problem before involving the intrinsics of a modern computer.
The Ending Results
However; it is not to say that these new questions are not of importance because they themselves are and they do play a role after all. They do impact the procedures and the overall performance and that is evident with the various graphs and assessments from many who have given their answer(s) and or comment(s).
If you paid attention to the analogy of the Boss and the two workers A & B who had to go and retrieve packages from C & D respectively and considering the mathematical notations of the two algorithms in question; you can see without the involvement of the computer hardware and software Case 2 is approximately 60% faster than Case 1.
When you look at the graphs and charts after these algorithms have been applied to some source code, compiled, optimized, and executed through the OS to perform their operations on a given piece of hardware, you can even see a little more degradation between the differences in these algorithms.
If the Data set is fairly small it may not seem all that bad of a difference at first. However, since Case 1 is about 60 - 70% slower than Case 2 we can look at the growth of this function in terms of the differences in time executions:
DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
//where
Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
// So when we substitute this back into the difference equation we end up with
DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
// And finally we can simplify this to
DeltaTimeDifference approximately = [0.6,0.7]*Loop2(time)
This approximation is the average difference between these two loops both algorithmically and machine operations involving software optimizations and machine instructions.
When the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than algorithm 2 which is evident when the Boss has to travel back and forth the maximum distance between A & C for every iteration after the first iteration while algorithm 2 the Boss has to travel to A once and then after being done with A he has to travel a maximum distance only one time when going from A to C.
Trying to have the Boss focusing on doing two similar things at once and juggling them back and forth instead of focusing on similar consecutive tasks is going to make him quite angry by the end of the day since he had to travel and work twice as much. Therefore do not lose the scope of the situation by letting your boss getting into an interpolated bottleneck because the boss's spouse and children wouldn't appreciate it.
Amendment: Software Engineering Design Principles
-- The difference between local Stack and heap allocated computations within iterative for loops and the difference between their usages, their efficiencies, and effectiveness --
The mathematical algorithm that I proposed above mainly applies to loops that perform operations on data that is allocated on the heap.
Consecutive Stack Operations:
If the loops are performing operations on data locally within a single code block or scope that is within the stack frame it will still sort of apply, but the memory locations are much closer where they are typically sequential and the difference in distance traveled or execution time is almost negligible. Since there are no allocations being done within the heap, the memory isn't scattered, and the memory isn't being fetched through ram. The memory is typically sequential and relative to the stack frame and stack pointer.
When consecutive operations are being done on the stack, a modern processor will cache repetitive values and addresses keeping these values within local cache registers. The time of operations or instructions here is on the order of nano-seconds.
Consecutive Heap Allocated Operations:
When you begin to apply heap allocations and the processor has to fetch the memory addresses on consecutive calls, depending on the architecture of the CPU, the bus controller, and the RAM modules the time of operations or execution can be on the order of micro to milliseconds. In comparison to cached stack operations, these are quite slow.
The CPU will have to fetch the memory address from RAM and typically anything across the system bus is slow compared to the internal data paths or data buses within the CPU itself.
So when you are working with data that needs to be on the heap and you are traversing through them in loops, it is more efficient to keep each data set and its corresponding algorithms within its own single loop. You will get better optimizations compared to trying to factor out consecutive loops by putting multiple operations of different data sets that are on the heap into a single loop.
It is okay to do this with data that is on the stack since they are frequently cached, but not for data that has to have its memory address queried every iteration.
This is where software engineering and software architecture design comes into play. It is the ability to know how to organize your data, knowing when to cache your data, knowing when to allocate your data on the heap, knowing how to design and implement your algorithms, and knowing when and where to call them.
You might have the same algorithm that pertains to the same data set, but you might want one implementation design for its stack variant and another for its heap-allocated variant just because of the above issue that is seen from its O(n) complexity of the algorithm when working with the heap.
From what I've noticed over the years, many people do not take this fact into consideration. They will tend to design one algorithm that works on a particular data set and they will use it regardless of the data set being locally cached on the stack or if it was allocated on the heap.
If you want true optimization, yes it might seem like code duplication, but to generalize it would be more efficient to have two variants of the same algorithm. One for stack operations, and the other for heap operations that are performed in iterative loops!
Here's a pseudo example: Two simple structs, one algorithm.
struct A {
int data;
A() : data{0}{}
A(int a) : data{a}{}
};
struct B {
int data;
B() : data{0}{}
A(int b) : data{b}{}
}
template<typename T>
void Foo( T& t ) {
// Do something with t
}
// Some looping operation: first stack then heap.
// Stack data:
A dataSetA[10] = {};
B dataSetB[10] = {};
// For stack operations this is okay and efficient
for (int i = 0; i < 10; i++ ) {
Foo(dataSetA[i]);
Foo(dataSetB[i]);
}
// If the above two were on the heap then performing
// the same algorithm to both within the same loop
// will create that bottleneck
A* dataSetA = new [] A();
B* dataSetB = new [] B();
for ( int i = 0; i < 10; i++ ) {
Foo(dataSetA[i]); // dataSetA is on the heap here
Foo(dataSetB[i]); // dataSetB is on the heap here
} // this will be inefficient.
// To improve the efficiency above, put them into separate loops...
for (int i = 0; i < 10; i++ ) {
Foo(dataSetA[i]);
}
for (int i = 0; i < 10; i++ ) {
Foo(dataSetB[i]);
}
// This will be much more efficient than above.
// The code isn't perfect syntax, it's only pseudo code
// to illustrate a point.
This is what I was referring to by having separate implementations for stack variants versus heap variants. The algorithms themselves don't matter too much, it's the looping structures that you will use them in that do.
It may be old C++ and optimizations. On my computer I obtained almost the same speed:
One loop: 1.577 ms
Two loops: 1.507 ms
I run Visual Studio 2015 on an E5-1620 3.5 GHz processor with 16 GB RAM.