Memory access comparison - c++

Which one of the 2 is faster (C++)?
for(i=0; i<n; i++)
{
sum_a = sum_a + a[i];
sum_b = sum_b + b[i];
}
Or
for(i=0; i<n; i++)
{
sum_a = sum_a + a[i];
}
for(i=0; i<n; i++)
{
sum_b = sum_b + b[i];
}
I am a beginner so I don't know whether this makes sense, but in the first version, array 'a' is accessed, then 'b', which might lead to many memory switches, since arrays 'a' and 'b' are at different memory locations. But in the second version, whole of array 'a' is accessed first, and then whole of array 'b', which means continuous memory locations are accessed instead of alternating between the two arrays.
Does this make any difference between the execution time of the two versions (even a very negligible one)?

I don't think there is a single correct answer to this question. In general, the second version has twice as many loop iterations (CPU execution overhead), while the first version has worse memory access (memory access overhead). Now imagine you run this code on a PC that has a slow clock but an insanely good cache. The memory overhead is reduced, but since the clock is slow, running the same loop twice makes execution much longer. The other way around: with a fast clock but bad memory, running two loops is not a problem, so it's better to optimize for memory access.
Here is a cool example of how you can profile your app: Link

Which one of the 2 is faster (C++)?
Either. It depends on
The implementation of operator+ and operator[] (in case they are overloaded)
Location of the arrays in memory (adjacent or not)
Size of the arrays
Size of the cpu caches
Associativity of caches
Cache speed in relation to memory speed
Possibly other factors
As Revolver_Ocelot mentioned in a comment, some compilers may even transform the written loop into the other form.
Does this make any difference between the execution time of the two versions (even a very negligible one)?
It can make a difference. The difference may be significant or negligible.
Your analysis is sound. Memory access is typically much slower than cache, and jumping between two memory locations may cause cache thrashing † in some situations. I would recommend using the separated approach by default, and only combine the loops if you have measured it to be faster on your target CPU.
† As MSalters points out, thrashing shouldn't be a problem on modern desktop processors (modern as in ~x86).
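To act on the "measure it on your target CPU" advice, a minimal sketch along these lines could be used. The names `Sums`, `sum_fused`, `sum_split` and `time_us` are hypothetical and not part of the question; the arrays are assumed to be `std::vector<int>` of equal length.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: the two running totals from the question.
struct Sums { long long a; long long b; };

// Version 1: one loop that alternates between the two arrays.
Sums sum_fused(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    Sums s{0, 0};
    for (std::size_t i = 0; i < a.size(); ++i) {
        s.a += a[i];
        s.b += b[i];
    }
    return s;
}

// Version 2: two loops, each walking one array from start to end.
Sums sum_split(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    Sums s{0, 0};
    for (std::size_t i = 0; i < a.size(); ++i) s.a += a[i];
    for (std::size_t i = 0; i < b.size(); ++i) s.b += b[i];
    return s;
}

// Time a single call of either version, in microseconds.
// The result is written to `out` so the work has a visible use.
template <class F>
long long time_us(F f, const std::vector<int>& a, const std::vector<int>& b, Sums& out) {
    auto start = std::chrono::steady_clock::now();
    out = f(a, b);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_us(sum_fused, a, b, out)` and `time_us(sum_split, a, b, out)` with arrays of the size you actually care about, and with optimizations enabled, gives numbers for your own machine; a single run is noisy, so repeating the measurement is advisable.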

Related

Using one loop vs two loops

I was reading this blog post: https://developerinsider.co/why-is-one-loop-so-much-slower-than-two-loops/ and decided to check it out using C++ and Xcode. I wrote the simple program below, and when I executed it I was surprised by the result: the 2nd function was slower than the first, contrary to what is stated in the article. Can anyone please help me figure out why this is the case?
#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;
void function1() {
const int n=100000;
int a1[n], b1[n], c1[n], d1[n];
for(int j=0;j<n;j++){
a1[j] = 0;
b1[j] = 0;
c1[j] = 0;
d1[j] = 0;
}
auto start = high_resolution_clock::now();
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << duration.count() << " Microseconds." << std::endl;
}
void function2() {
const int n=100000;
int a1[n], b1[n], c1[n], d1[n];
for(int j=0; j<n; j++){
a1[j] = 0;
b1[j] = 0;
c1[j] = 0;
d1[j] = 0;
}
auto start = high_resolution_clock::now();
for(int j=0; j<n; j++){
a1[j] += b1[j];
}
for(int j=0;j<n;j++){
c1[j] += d1[j];
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << duration.count() << " Microseconds." << std::endl;
}
int main(int argc, const char * argv[]) {
function1();
function2();
return 0;
}
TL;DR: The loops are basically the same, and if you are seeing differences, then your measurement is wrong. Performance measurement and more importantly, reasoning about performance requires a lot of computer knowledge, some scientific rigor, and much engineering acumen. Now for the long version...
Unfortunately, there is some very inaccurate information in the article to which you've linked, as well as in the answers and some comments here.
Let's start with the article. There won't be any disk caching that has any effect on the performance of these functions. It is true that virtual memory is paged to disk, when demand on physical memory exceeds what's available, but that's not a factor that you have to consider for programs that touch 1.6MB of memory (4 * 4 * 100K).
And if paging comes into play, the performance difference won't exactly be subtle either. If these arrays were paged to disk and back, the performance difference would be on the order of 1000x for the fastest disks, not 10% or 100%.
Paging, page faults, and their effect on performance are neither trivial nor intuitive. You need to read about them, and experiment with them seriously. What little information that article has is completely inaccurate to the point of being misleading.
The second is your profiling strategy and the micro-benchmark itself. Clearly, with such simple operations on data (an add), the bottleneck will be memory bandwidth itself (maybe instruction retire limits or something like that with such a simple loop). And since you only read memory linearly, and use all you read, whether it's in 4 interleaving streams or 2, you are making use of all the bandwidth that is available.
However, if you call your function1 or function2 in a loop, you will be measuring the bandwidth of different parts of the memory hierarchy depending on N, from L1 all the way to L3 and main memory. (You should know the size of all levels of cache on your machine, and how they work.) This is obvious if you know how CPU caches work, and really mystifying otherwise. Do you want to know how fast this is when you do it the first time, when the arrays are cold, or do you want to measure the hot access?
Is your real use case copying the same mid-sized array over and over again?
If not, what is it? What are you benchmarking? Are you trying to measure something or just experimenting?
Shouldn't you be measuring the fastest run through a loop, rather than the average since that can be massively affected by a (basically random) context switch or an interrupt?
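One way to report the fastest run instead of an average is sketched below. `best_of_ns` is a hypothetical helper name, and `work` stands for whatever loop you are benchmarking; this is an illustration under those assumptions, not a full benchmarking framework.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Run `work` `runs` times and return the fastest wall-clock time in nanoseconds.
// Taking the minimum discards runs that were disturbed by a context switch
// or an interrupt, which would otherwise inflate an average.
template <class Work>
long long best_of_ns(int runs, Work work) {
    assert(runs > 0);
    long long best = std::numeric_limits<long long>::max();
    for (int r = 0; r < runs; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best) best = ns;
    }
    return best;
}
```

Note that after the first run the arrays are likely to be warm in cache, so the minimum measures hot access; whether that matches your real use case is exactly the question raised above.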
Have you made sure you are using the correct compiler switches? Have you looked at the generated assembly code to make sure the compiler is not adding debug checks and what not, and is not optimizing stuff away that it shouldn't (after all, you are just executing useless loops, and an optimizing compiler wants nothing more than to avoid generating code that is not needed).
Have you looked at the theoretical memory/cache bandwidth number for your hardware? Your specific CPU and RAM combination will have theoretical limits. And be it 5, 50, or 500 GiB/s, it will give you an upper bound on how much data you can move around and work with. The same goes for the number of execution units, the IPC of your CPU, and a few dozen other numbers that will affect the performance of this kind of micro-benchmark.
If you are reading 4 integers (4 bytes each, from a, b, c, and d) and then doing two adds and writing the two results back, and doing it 100'000 times, then you are - roughly - looking at 2.4MB of memory read and write. If you do it 10 times in 300 micro-seconds, then your program's memory (well, store buffer/L1) throughput is about 80 GB/s. Is that low? Is that high? Do you know? (You should have a rough idea.)
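The back-of-the-envelope estimate above can be written out as a small calculation. The function names are hypothetical, and 1 GB is taken as 10^9 bytes.

```cpp
#include <cassert>

// Bytes touched by one pass of the loop: each element costs `accesses`
// loads/stores of `elem_bytes` bytes (here: 4 reads + 2 writes of 4-byte ints).
double bytes_per_pass(double elements, double accesses, double elem_bytes) {
    return elements * accesses * elem_bytes;
}

// Throughput in GB/s (1 GB = 1e9 bytes) when `passes` passes take `seconds`.
double throughput_gb_per_s(double bytes, double passes, double seconds) {
    assert(seconds > 0.0);
    return bytes * passes / seconds / 1e9;
}
```

With the numbers from the paragraph, `bytes_per_pass(100000, 6, 4)` is 2.4 million bytes, and `throughput_gb_per_s(2.4e6, 10, 300e-6)` comes out at roughly 80 GB/s.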
And let me tell you that the other two answers here at the time of this writing (namely this and this) do not make sense. I can't make heads or tails of the first one, and the second one is almost completely wrong (conditional branches in a 100'000-times for loop are bad? allocating an additional iterator variable is costly? cold access to array on stack vs. on the heap has "serious performance implications"?)
And finally, as written, the two functions have very similar performances. It is really hard separating the two, and unless you can measure a real difference in a real use case, I'd say write whichever one that makes you happier.
If you really really want a theoretical difference between them, I'd say the one with two separate loops is very slightly better because it is usually not a good idea to interleave accesses to unrelated data.
This has nothing to do with caching or instruction efficiency. Simple iterations over long vectors are purely a matter of bandwidth. (Google: stream benchmark.) And modern CPUs have enough bandwidth to satisfy not all of their cores, but a good deal.
So if you combine the two loops, executing them on a single core, there is probably enough bandwidth for all loads and stores at the rate that memory can sustain. But if you use two loops, you leave bandwidth unused, and the runtime will be a little less than double.
The reason the second is faster in your case (I do not think this holds on every machine) is better CPU caching. Once your CPU has enough cache to hold the arrays, the data your OS requires and so on, the second function will probably be much slower than the first from a performance standpoint. I doubt that the two-loop code will give better performance if enough other programs are running as well, because the second function is obviously less efficient than the first, and if enough other data is cached, the performance lead gained through caching will be eliminated.
I'll just chime in here with a little something to keep in mind when looking into performance - unless you are writing embedded software for a real-time device, the performance of such low level code as this should not be a concern.
In 99.9% of all other cases, they will be fast enough.

Which algorithm brings the best performance? [duplicate]

I have a piece of code that is really dirty.
I want to optimize it a little bit. Does it make any difference which of the following structures I use, or are they identical from a performance point of view in C++?
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
for end
....
or
for(unsigned int i = 1; i < entity.size(); ++i) begin
if
if ... else ...
if
if ... else ...
if
if ... else ...
....
for end
Thanks in Advance!
Both are O(n). As we do not know the guts of the various for loops it is impossible to say.
BTW - Mark it as pseudo code and not C++
The 1st one may spend less time incrementing/testing i and conditionally branching (assuming the compiler's optimiser doesn't reduce it to the equivalent of the second one anyway), but with loop unrolling the time taken for the i loop may be insignificant compared to the time spent within the loop anyway.
Countering that, it's easily possible that the choice of separate versus combined loops will affect the ratio of cache hits, and that could significantly impact either version: it really depends on the code. For example, if each of the three if/else statements accessed different arrays at index i, then they'll be competing for CPU cache and could slow each other down. On the other hand, if they accessed the same array at index i, doing different steps in some calculation, then it's probably better to do all three steps while those memory pages are still in cache.
There are potential impacts other than caches - from impact to register allocation, speed of I/O devices (e.g. if each loop operates on lines/records from a distinct file on different physical drives, it's very probably faster to process some of each file in a loop, rather than sequentially process each file), etc..
If you care, benchmark your actual application with representative data.
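As an illustration of the "same array, different steps" case described above, here is a minimal sketch. The three steps (add, multiply, clamp) are invented placeholders for the question's if/else blocks, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Three separate passes: each pass walks the whole vector again, so for a
// large vector every element may have left the cache before it is revisited.
void steps_separate(std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] += 1;
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= 2;
    for (std::size_t i = 0; i < v.size(); ++i) if (v[i] > 10) v[i] = 10;
}

// One fused pass: all three steps run while v[i] is still close at hand.
void steps_fused(std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] += 1;
        v[i] *= 2;
        if (v[i] > 10) v[i] = 10;
    }
}

// Sanity check: fusing is only valid when both orderings give the same result.
void check_equivalent(std::vector<int> input) {
    std::vector<int> other = input;
    steps_separate(input);
    steps_fused(other);
    assert(input == other);
}
```

Fusing is safe here because each step only touches element `i`; if a step read neighbouring elements, the two orderings could produce different results. Which version is faster still has to be measured.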
Just from the structure of the loop it is not possible to say which approach will be faster.
Algorithmically, both have the same complexity, O(n). However, they might have different performance numbers depending upon the kind of operation you are performing on the elements and the size of the container.
The size of the container may have an impact on locality and hence on performance. Generally speaking, you want to do as much work on the data as you can once it is in the cache, so I would prefer the second approach. To get a clear picture you should actually measure the performance of your approach.
The second is only slightly more efficient than the first. You save:
Initialization of loop index
Calling size()
Comparing the loop index with the size()
Incrementing the loop index
These are very minor optimizations. Do it if it doesn't impact readability.
I would expect the second approach to be at least marginally more optimal in most cases as it can leverage the locality of reference with respect to access to elements of the entity collection/set. Note that in the first approach, each for loop would need to start accessing elements from the beginning; depending on the size of the cache, the size of the list and the extent to which compiler can infer and optimize, this may lead to cache misses when a new for loop attempts to read an element even though that element would have been read already by a preceding loop.

Accelerate programme with multiple processors

I found that sometimes it's faster to divide one loop into two or more
for (i=0; i<AMT; i++) {
a[i] += c[i];
b[i] += d[i];
}
||
\/
for (i=0; i<AMT; i++) {
//a[i] += c[i];
b[i] += d[i];
}
for (i=0; i<AMT; i++) {
a[i] += c[i];
//b[i] += d[i];
}
On my desktop, win7, AMD Phenom(tm) x6 1055T, the two-loop version runs faster, taking around 1/3 less time.
But if I am dealing with assignment,
for (i=0; i<AMT; i++) {
b[i] = rand()%100;
c[i] = rand()%100;
}
dividing the assignment of b and c into two loops is no faster than in one loop.
I think that there are some rules the OS uses to determine if certain code
can be run by multiple processors.
I want to ask if my guess is right, and if I'm right, what are such rules or occasions that multiple processors will
be automatically (without thread programming) used to speed up my programmes?
It is possible that your compiler is vectorizing the simpler loops. In the assembler output you would see this as the compiled program using SIMD instructions (like Intel's SSE) to process larger chunks of data than one number a time. Automatic vectorization is a hard problem, and it's plausible that the compiler would not be able to vectorize the loop that updates both a and b at the same time. This could partially explain why breaking the complex loop into two would be faster.
In the "assignment" loops, each invocation to rand() depends on the output of the previous invocations, which means that vectorization is inherently impossible. Breaking the loop into two would not make it benefit from SIMD instructions like in the first case, so you wouldn't see it run any faster. Looking at the assembler code the compiler generates would tell you what optimizations the compiler performed and what instructions it used.
Even if the compiler is vectorizing the loop the program is not using more than one CPU or thread; there is no concurrency. What happens is that the one CPU that there is is capable of running the single thread of execution on multiple data points in parallel. The distinction between parallel and concurrent programming is subtle but important.
Cache locality might also explain why breaking the first loop into two makes it run faster, but not why breaking the "assignment" loop into two doesn't. It is possible that b and c in the "assignment" loop are sufficiently small so that they fit into the cache, which would mean that the loop already has optimal performance and breaking it further brings no benefit. If this were the case, making b and c larger would force the loop to start thrashing the cache and breaking the loop into two would have the expected benefit.
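To check the vectorization hypothesis rather than guess, the two loop shapes can be put into separate functions and the compiler asked to report what it did. The sketch below assumes `std::vector<int>` data; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Combined loop: four arrays are live in one loop body.
void add_fused(Vec& a, Vec& b, const Vec& c, const Vec& d) {
    assert(a.size() == c.size() && b.size() == d.size() && a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] += c[i];
        b[i] += d[i];
    }
}

// Split loops: each loop touches only two arrays, which is a simpler
// pattern for an auto-vectorizer to analyse.
void add_split(Vec& a, Vec& b, const Vec& c, const Vec& d) {
    assert(a.size() == c.size() && b.size() == d.size());
    for (std::size_t i = 0; i < b.size(); ++i) b[i] += d[i];
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += c[i];
}

// To see which loops were vectorized (flags as I understand them; check your
// compiler's documentation):
//   GCC:   g++ -O3 -fopt-info-vec-optimized -fopt-info-vec-missed file.cpp
//   Clang: clang++ -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize file.cpp
```

Whether a given compiler vectorizes one, both, or neither of these loops depends on its version and flags, so the report (or the generated assembly) is the only reliable answer.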
The optimization is done by the compiler (http://en.wikipedia.org/wiki/Loop_optimization).
If you are using GCC, check this page http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for the list of available optimization options.
On the other hand, note that you are using the rand() function, which consumes a lot of CPU time.
I want to ask if my guess is right, and if I'm right, what are such rules or occasions that multiple processors will be automatically (without thread programming) used to speed up my programmes?
No, the guess is not right. In all three cases the code is run on a single core.
It is for some other reason that splitting the first loop into two makes it faster. Perhaps your compiler is able to generate better code, or the CPU is having easier time prefetching the right data, etc. It is hard to tell without analysing the generated machine code.

C++ How to force prefetch data to cache? (array loop)

I have loop like this
start = __rdtsc();
unsigned long long count = 0;
for(int i = 0; i < N; i++)
for(int j = 0; j < M; j++)
count += tab[i][j];
stop = __rdtsc();
time = (stop - start) * 1/3;
I need to check how prefetching data influences efficiency. How can I force some values to be prefetched from memory into cache before they are counted?
For GCC only:
__builtin_prefetch((const void*)(prefetch_address),0,0);
prefetch_address can be invalid; there will be no segfault. If the difference between prefetch_address and the current location is too small, there might be no effect or even a slowdown. Try to set it at least 1k ahead.
First, I suppose that tab is a large 2D array such as a static array (e.g., int tab[1024*1024][1024*1024]) or a dynamically-allocated array (e.g., int** tab and following mallocs). Here, you want to prefetch some data from tab to the cache to reduce the execution time.
Simply, I don't think that you need to manually insert any prefetching to your code, where a simple reduction for a 2D array is performed. Modern CPUs will do automatic prefetching if necessary and profitable.
Two facts you should know for this problem:
(1) You already exploit the spatial locality of tab inside the innermost loop. Once tab[i][0] is read (after a cache miss, or a page fault), the data from tab[i][0] to tab[i][15] will be in your CPU caches, assuming 4-byte elements and a cache line size of 64 bytes.
(2) However, when the code moves from one row to the next, i.e., from tab[i][M-1] to tab[i+1][0], a cold cache miss is highly likely, especially when tab is a dynamically-allocated array where each row could be allocated in a fragmented way. However, if the array is statically allocated, each row will be located contiguously in memory.
So, prefetching makes sense only when you read (1) the first item of the next row and (2) j + CACHE_LINE_SIZE/sizeof(tab[0][0]) ahead of time.
You may do so by inserting a prefetch operation (e.g., __builtin_prefetch) in the upper loop. However, modern compilers may not always emit such prefetch instructions. If you really want to do that, you should check the generated binary code.
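A minimal sketch of prefetching the first item of the next row from the outer loop might look like this. It assumes a table of separately allocated rows (like `int** tab`), non-negative values, and a GCC- or Clang-compatible compiler for `__builtin_prefetch`; the function name `sum_rows` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Sum an n x m table stored as n separately allocated rows.
unsigned long long sum_rows(const int* const* tab, std::size_t n, std::size_t m) {
    assert(tab != nullptr || n == 0);
    unsigned long long count = 0;
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // Hint that the start of the next row will be read soon.
        // Arguments: address, 0 = prefetch for read, 3 = high temporal locality.
        if (i + 1 < n)
            __builtin_prefetch(tab[i + 1], 0, 3);
#endif
        for (std::size_t j = 0; j < m; ++j)
            count += tab[i][j];
    }
    return count;
}
```

The prefetch is only a hint, so the result is identical with or without it; any speed difference has to be measured, and for short rows the hint may arrive too late to help.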
However, as I said, I do not recommend you do that because modern CPUs will mostly do prefetching automatically, and that automatic prefetching will mostly outperform your manual code. For instance, Intel CPUs such as Ivy Bridge processors have multiple data prefetchers, which prefetch into the L1, L2, or L3 cache. (I don't think mobile processors have a fancy data prefetcher though). Some prefetchers will load adjacent cache lines.
If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are more friendly to caches. A notable example would be blocked (tiled) matrix multiply. A naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm significantly reduces cache misses by calculating on small subsets that fit in the caches. See some references like this.
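For reference, a blocked (tiled) multiply can be sketched as follows. It assumes square n x n matrices stored row-major in a `std::vector<double>`; the names `Matrix`, `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical, and the best block size depends on your cache sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major, n x n

// Naive triple loop: the inner loop strides through a column of b.
void matmul_naive(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// Blocked version: works on block x block tiles so that the tiles of
// a, b and c being combined can stay in cache while they are reused.
void matmul_blocked(const Matrix& a, const Matrix& b, Matrix& c,
                    std::size_t n, std::size_t block) {
    assert(block > 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0);
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t kk = 0; kk < n; kk += block)
            for (std::size_t jj = 0; jj < n; jj += block)
                for (std::size_t i = ii; i < std::min(ii + block, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + block, n); ++k) {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + block, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Both functions compute the same product; the blocked one only changes the order in which partial products are accumulated, so for general floating-point data the results may differ in the last bits.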
The easiest/most portable method is to simply read some data every cacheline bytes apart. Assuming tab is a proper two-dimensional array, you could:
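char *tptr = (char *)&tab[0][0];
tptr += 64;
char temp = 0;                  // initialized so the accumulation is well defined
volatile char keep_temp_alive;
for(int i = 0; i < N; i++)
{
    temp += *tptr;              // touch a byte one cache line ahead to pull that line into cache
    tptr += 64;
    for(int j = 0; j < M; j++)
        count += tab[i][j];
}
keep_temp_alive = temp;         // stops the compiler from discarding the loads above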
Something like that. However, it does depend on:
1. You don't end up reading outside the allocated memory [by too much].
2. The j loop does not cover much more than 64 bytes. If it does, you may want to add more steps of temp += *tptr; tptr += 64; at the beginning of the loop.
The keep_temp_alive after the loop is essential to prevent the compiler from completely removing temp as unnecessary loads.
Unfortunately, I'm too slow writing generic code to suggest the builtin instructions; the points for that go to Leonid.
The __builtin_prefetch instruction is pretty helpful, but is clang/gcc specific. If you are compiling to multiple compiler targets, I had luck using the x86 intrinsic _mm_prefetch with both clang and MSVC.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_prefetch
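A small wrapper can hide the compiler difference. The sketch below uses `_mm_prefetch` on 64-bit x86 and falls back to `__builtin_prefetch` or a no-op elsewhere; `prefetch_read`, `sum_with_prefetch`, the `HAVE_MM_PREFETCH` macro and the `ahead` distance are hypothetical names chosen for illustration, and the values are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#define HAVE_MM_PREFETCH 1
#endif

// Issue a read-prefetch hint for the cache line containing p.
inline void prefetch_read(const void* p) {
#if defined(HAVE_MM_PREFETCH)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#elif defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 0, 3);
#else
    (void)p;  // no known prefetch intrinsic: do nothing
#endif
}

// Sum n ints, hinting `ahead` elements in front of the current position.
unsigned long long sum_with_prefetch(const int* data, std::size_t n, std::size_t ahead) {
    assert(data != nullptr || n == 0);
    unsigned long long count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n) prefetch_read(data + i + ahead);
        count += data[i];
    }
    return count;
}
```

For a plain linear scan like this the hardware prefetcher usually does the job already, so the hint is mainly useful as an experiment; issuing it on every element also has a cost of its own.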

Slow writing to array in C++

I was just wondering if this is expected behavior in C++. The code below runs at around 0.001 ms:
for(int l=0;l<100000;l++){
int total=0;
for( int i = 0; i < num_elements; i++)
{
total+=i;
}
}
However if the results are written to an array, the time of execution shoots up to 15 ms:
int *values=(int*)malloc(sizeof(int)*100000);
for(int l=0;l<100000;l++){
int total=0;
for( unsigned int i = 0; i < num_elements; i++)
{
total+=i;
}
values[l]=total;
}
I can appreciate that writing to the array takes time but is the time proportionate?
Cheers everyone
The first example can be implemented using just CPU registers. Those can be accessed billions of times per second. The second example uses so much memory that it certainly overflows L1 and possibly L2 cache (depending on CPU model). That will be slower. Still, 15 ms / 100,000 writes comes out to 150 ns per outer iteration (about 6.7 million iterations per second), and each iteration includes the whole inner loop as well as the write. That's not slow.
It looks like the compiler is optimizing that loop out entirely in the first case.
The total effect of the loop is a no-op, so the compiler just removes it.
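To keep the first loop from being removed when benchmarking, its result needs an observable use. A minimal sketch, with hypothetical names `sum_below`, `g_sink` and `run_benchmark_body`:

```cpp
#include <cassert>

// Same inner loop as in the question: adds 0 + 1 + ... + (num_elements - 1).
int sum_below(unsigned num_elements) {
    int total = 0;
    for (unsigned i = 0; i < num_elements; ++i) total += i;
    return total;
}

// A store to a volatile object counts as observable behaviour, so the
// compiler may not drop it, and therefore must produce each result.
volatile int g_sink;

void run_benchmark_body(int repeats, unsigned num_elements) {
    assert(repeats >= 0);
    for (int l = 0; l < repeats; ++l)
        g_sink = sum_below(num_elements);
}
```

Even with the sink, an optimizer is still free to replace the inner loop with a closed-form expression or compute it once, so looking at the generated assembly remains the way to know what is actually being timed.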
It's very simple.
In the first case you have just 3 variables, which can easily be stored in GPRs (general purpose registers). That doesn't mean they are there all the time, but they are probably at least in L1 cache memory, which means that they can be accessed very fast.
In the second case you have more than 100k variables, and you need about 400 kB to store them. That is definitely too much for registers and L1 cache memory. In the best case it could be in L2 cache memory, but probably not all of it will fit in L2. If something is not in a register, L1, or L2 (I assume that your processor doesn't have L3), it has to be fetched from RAM, which takes much more time.
I would suspect that what you are seeing is an effect of virtual memory and possibly paging. The malloc call is going to allocate a decent sized chunk of memory that is probably represented by a number of virtual pages. Each page is linked into process memory separately.
You may also be measuring the cost of calling malloc depending on how you timed the loop. In either case, the performance is going to be very sensitive to compiler optimization options, threading options, compiler versions, runtime versions, and just about anything else. You cannot safely assume that the cost is linear with the size of the allocation. The only thing that you can do is measure it and figure out how to best optimize once it has been proven to be a problem.