What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access is random for different needs at every step, so rearranging the elements is an expensive option. The code below represents an important sample of the whole application.
#include <iostream>
#include "chrono"
#include <cstdlib>
#define NN 1000000
struct Astr{
double x[3], v[3];
int i, j, k;
long rank, p, q, r;
};
int main ()
{
struct Astr *key;
key = new Astr[NN];
int ii, *sequence;
sequence = new int[NN]; // access pattern is stored here
float frac ;
// create array of structs
// create array for random numbers between 0 to NN to access 'key'
for(int i=0; i < NN; i++){
key[i].x[1] = static_cast<double>(i);
key[i].p = static_cast<long>(i);
frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
sequence[i] = static_cast<int>(frac * static_cast<float>(NN - 1)); // NN - 1 keeps the index in range even when frac == 1.0
}
// part to check and improve
// =========================================Random=======================================================
std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
double tmp;
long rnk;
for(int j=0; j < 1000; j++)
for(int i=0; i < NN; i++){
ii = sequence[i];
tmp = key[ii].x[1];
rnk = key[ii].p;
key[ii].x[1] = tmp * 1.01;
key[ii].p = rnk * 1.01;
}
std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
double time_uni = static_cast<double>(duration.count()) / 1000000;
std::cout << "\n Random array access " << time_uni << "s \n" ;
// ==========================================Sequential======================================================
TstartMain = std::chrono::high_resolution_clock::now();
for(int j=0; j < 1000; j++)
for(int i=0; i < NN; i++){
tmp = key[i].x[1];
rnk = key[i].p;
key[i].x[1] = tmp * 1.01;
key[i].p = rnk * 1.01;
}
TendMain = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
time_uni = static_cast<double>(duration.count()) / 1000000;
std::cout << " Sequential array access " << time_uni << "s \n" ;
// ================================================================================================
delete [] key;
delete [] sequence;
}
As expected, sequential access is faster; the results on my machine are:
Random array access 21.3763s
Sequential array access 8.7755s
The main question is whether random access could be made any faster.
The improvement could be in terms of the container itself (e.g. a list/vector rather than a raw array). Could software prefetching be implemented?
In theory it is possible to help guide the prefetcher to speed up random access (well, on those CPUs that support it, e.g. _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will more often than not slow down your code.
The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two prior to using the value. There are however problems with this:
It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean that your prefetch optimisations are now actually slowing the performance down.
The additional prefetch instructions will end up using up more of your instruction cache, and most likely your uop cache as well. You may find this alone slows the code down.
This assumes the CPU actually pays attention to the _mm_prefetch instruction. It is only a hint, so there is no guarantee it will be respected by the CPU.
If you want to speed up random memory access, there are better methods than prefetching imho.
Reduce the size of the data (i.e. use shorts/float16s in place of int/float, eradicate any erroneous padding in your structs, etc). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!)
Sort your data so that instead of doing random access, you are processing the data sequentially. (A rough sketch of both ideas, using the struct from the question, follows below.)
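To make the first point concrete for the struct in the question: the timed loop only ever touches x[1] and p, so a hot/cold split that packs just those two fields shrinks the per-access footprint from sizeof(Astr) (roughly 96 bytes with 8-byte longs) down to 16 bytes, and more of each fetched cache line is data you actually use. This is only a sketch of the idea, not a drop-in replacement; the names AstrHot, key_hot and process are invented here, and the remaining Astr fields would live in a separate "cold" array:
#include <algorithm> // std::sort
struct AstrHot {       // hypothetical "hot" half of Astr: only the fields the loop touches
    double x1;         // was Astr::x[1]
    long   p;          // was Astr::p
};
void process(AstrHot *key_hot, int *sequence, int n, bool order_matters)
{
    // Second idea: if the processing order does not matter (as in the benchmark above),
    // sorting the index array turns the random walk into a near-sequential sweep.
    if (!order_matters)
        std::sort(sequence, sequence + n);
    for (int i = 0; i < n; i++) {
        AstrHot &k = key_hot[sequence[i]];
        k.x1 *= 1.01;
        k.p  *= 1.01;  // same implicit long <-> double conversion as the original code
    }
}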
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access (the only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required).
To give an example of what @robthebloke says, the following code makes a ~15% improvement on my machine:
#include <immintrin.h>
void do_it(struct Astr *key, const int *sequence) {
for(int i = 0; i < NN-8; ++i) {
_mm_prefetch(reinterpret_cast<const char*>(key + sequence[i+8]), _MM_HINT_NTA); // cast needed for MSVC, which declares the argument as char const*
struct Astr *ki = key+sequence[i];
ki->x[1] *= 1.01;
ki->p *= 1.01;
}
for(int i = NN-8; i < NN; ++i) {
struct Astr *ki = key+sequence[i];
ki->x[1] *= 1.01;
ki->p *= 1.01;
}
}
Related
I am working on a project where I implement two popular MST algorithms in C++ and then print how long each one takes to execute. Please ignore the actual algorithms; I have already tested them and am only interested in getting accurate measurements of how long they take.
void Graph::krushkalMST(bool e){
size_t s2 = size * size;
typedef struct{uint loc; uint val;} wcType; //struct used for storing a copy of the weights values to be sorted, with original locations
wcType* weightsCopy = new wcType[s2]; //copy of the weights which will be sorted.
for(int i = 0; i < s2; i++){
weightsCopy[i].loc = i;
weightsCopy[i].val = weights[i];
}
std::vector<uint> T(0); //List of edges in the MST
auto start = std::chrono::high_resolution_clock::now(); //time the program was started
typedef int (*cmpType)(const void*, const void*); //comparison function type
static cmpType cmp = [](const void* ua, const void* ub){ //Compare function used by the sort as a C++ lambda
uint a = ((wcType*)ua)->val, b = ((wcType*)ub)->val;
return (a == b) ? 0 : (a == NULLEDGE) ? 1 : (b == NULLEDGE) ? -1 : (a < b) ? -1 : 1;
};
std::qsort((void*)weightsCopy, s2, sizeof(wcType), cmp); //sort edges into ascending order using a quick sort (supposedly quick sort)
uint* componentRefs = new uint[size]; //maps nodes to what component they currently belong to
std::vector<std::vector<uint>> components(size); //vector of components, each component is a vector of nodes;
for(int i = 0; i < size; i++){
//unOptimize(components);
components[i] = std::vector<uint>({(uint)i});
componentRefs[i] = i;
}
for(int wcIndex = 0; components.size() >= 2 ; wcIndex++){
uint i = getI(weightsCopy[wcIndex].loc), j = getJ(weightsCopy[wcIndex].loc); //get pair of nodes with the smallest edge
uint ci = componentRefs[i], cj = componentRefs[j]; //locations of nodes i and j
if(ci != cj){
T.push_back(weightsCopy[wcIndex].loc); //push the edge into T
for(int k = 0; k < components[cj].size(); k++) //move each member in j's component to i's component
components[ci].push_back(components[cj][k]);
for(int k = 0; k < components[cj].size(); k++) //copy this change into the reference locations
componentRefs[components[cj][k]] = ci;
components.erase(components.begin() + cj); //delete j's component
for(int k = 0; k < size; k++)
if(componentRefs[k] >= cj)
componentRefs[k]--;
}
}
auto end = std::chrono::high_resolution_clock::now();
uint time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
std::cout<<"\nMST found my krushkal's Algorithm:\n";
printData(time, T, e);
delete[] weightsCopy;
delete[] componentRefs;
}
void Graph::primMST(bool e){
std::vector<uint> T(0); //List of edges in the MST
auto start = std::chrono::high_resolution_clock::now(); //Start calculating the time the algorithm takes
bool* visited = new bool[size]; //Maps each node to a visited value
visited[0] = true;
for(int i = 1; i < size; i++)
visited[i] = false;
for(uint numVisited = 1; numVisited < size; numVisited++){
uint index = 0; //index of the smallest cost edge to unvisited node
uint minCost = std::numeric_limits<uint>::max(); //cost of the smallest edge filling those conditions
for(int i = 0; i < size; i++){
if(visited[i]){
for(int j = 0; j < size; j++){
if(!visited[j]){
uint curIndex = i * size + j, weight = dweights[curIndex];
if(weight != NULLEDGE && weight < minCost){
index = curIndex;
minCost = weight;
}
}
}
}
}
T.push_back(index);
visited[getI(index)] = true;
}
auto end = std::chrono::high_resolution_clock::now();
uint time = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
std::cout<<"\nMST found my Prim's Algorithm:\n";
printData(time, T, e);
delete[] visited;
}
I initially used clock() from <ctime> to try to get an accurate measurement of how long this would take. My largest test file has a graph of 40 nodes with 780 edges (sufficiently large to warrant some compute time), and even then, on a slow computer using g++ with -O0, I would get either 0 or 1 milliseconds. On my desktop I was only ever able to get 0 ms; however, as I need a more accurate way to distinguish times between test cases, I decided to try the high_resolution_clock provided by the <chrono> library.
This is where the real trouble began: I would (and still do) consistently get that the program took 0 nanoseconds to execute.
In my search for a solution I came across multiple questions that deal with similar issues, most of which state that <chrono> is system dependent and you're unlikely to actually be able to get nanosecond or even microsecond values. Nevertheless, I tried using std::chrono::microseconds only to still consistently get 0. Eventually I found what I thought was someone who was having the same problem as me:
counting duration with std::chrono gives 0 nanosecond when it should take long
However, this is clearly a problem of an overactive optimizer which has deleted an unnecessary piece of code, whereas in my case the end result always depends on the results of a series of complex loops which must be executed in full. I am on Windows 10, compiling with GCC using -O0.
My best hypothesis is that I'm doing something wrong, or that Windows doesn't support anything smaller than milliseconds while using std::chrono and std::chrono::nanoseconds are actually just milliseconds padded with 0s on the end (as I observe when I put a system("pause") in the algorithm and unpause at arbitrary times). Please let me know if you find any way around this or if there is any other way I can achieve higher-resolution timing.
At the request of @Ulrich Eckhardt, I am including a minimal reproducible example as well as the results of the test I performed using it, and I must say it is rather insightful.
#include<iostream>
#include<chrono>
#include<cmath>
int main()
{
double c = 1;
for(int itter = 1; itter < 10000000; itter *= 10){
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < itter; i++)
c += sqrt(c) + log(c);
auto end = std::chrono::high_resolution_clock::now();
int time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
std::cout<<"calculated: "<<c<<". "<<itter<<" iterations took "<<time<<"ns\n";
}
system("pause");
}
For my loop I chose a random arbitrary mathematical formula and made sure to use the result of what the loop does so it's not optimized out of existence. Testing it with various iteration counts on my desktop yields:
This seems to imply that a certain threshold is required before it starts counting time: dividing the first non-zero timing by 10 should still give a non-zero time for the next-smaller iteration count, but that is not what the results show, despite that being how it should work assuming the whole loop takes O(n) time for n iterations. If anything, this small example baffles me even further.
Switch to steady_clock and you get the correct results for both MSVC and MinGW GCC.
You should avoid using high_resolution_clock, as it is just an alias for either steady_clock or system_clock. For measuring elapsed time in a stopwatch-like fashion, you always want steady_clock. high_resolution_clock is an unfortunate thing and should be avoided.
I just checked and MSVC has the following:
using high_resolution_clock = steady_clock;
while MinGW GCC has:
/**
* @brief Highest-resolution clock
*
* This is the clock "with the shortest tick period." Alias to
* std::system_clock until higher-than-nanosecond definitions
* become feasible.
*/
using high_resolution_clock = system_clock;
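For reference, a minimal sketch of the asker's test loop timed with std::chrono::steady_clock directly; nothing here is specific to MSVC or MinGW:
#include <chrono>
#include <cmath>
#include <iostream>
int main()
{
    double c = 1;
    for (int iter = 1; iter < 10000000; iter *= 10) {
        auto start = std::chrono::steady_clock::now();   // monotonic, stopwatch-style clock
        for (int i = 0; i < iter; i++)
            c += std::sqrt(c) + std::log(c);
        auto end = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        std::cout << "calculated: " << c << ". " << iter << " iterations took " << ns << "ns\n";
    }
}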
So I want to optimize the sum of a really big array, and in order to do that I have written multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2, 3 or 4 threads...
Can someone explain to me why this happens?
(Also I've only started coding in C++ this semester, until then I only knew C, so I'm sorry for possible dumb mistakes)
This is the thread code
*localSum = 0.0;
for (size_t i = 0; i < stop; i++)
*localSum += v[i];
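(Given how it is invoked in the main code below, the full thread function presumably has a signature along these lines; this is a reconstruction for readability, not the asker's exact code.)
void threadsum(int threadID, int numThreads, const double *v, double *localSum, size_t stop)
{
    // threadID and numThreads are passed in but unused in the body shown above
    *localSum = 0.0;
    for (size_t i = 0; i < stop; i++)
        *localSum += v[i];
}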
Main process code
int numThreads = atoi(argv[1]);
int N = 100000000;
// create the input vector v and put some values in v
vector<double> v(N);
for (int i = 0; i < N; i++)
v[i] = i;
// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0);
// create threads. Each thread will compute part of the sum and store
// its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
startChrono();
vector<thread> myThreads(numThreads);
for (int i = 0; i < numThreads; i++){
int start = i * v.size() / numThreads;
myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i],v.size()/numThreads);
}
for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));
// calculate global sum
double globalSum = 0.0;
for (int i = 0; i < numThreads; i++)
globalSum += localSum[i];
cout.precision(12);
cout << "Sum = " << globalSum << endl;
cout << "Runtime: " << stopChrono() << endl;
exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. A vectorized streaming add will be really hard to beat. You need a more complex function than add to really see results, or a very large array.
2- Related: the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and you can easily saturate the CPU's functional units. For a second thread to help, it can't even be a hyperthread on the same core; it would need to be on a different core entirely (as the hyperthreads would both compete for the floating-point units).
To test this, you can try to create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localSum variables are sharing the same cache line. It would be better to accumulate into a local variable on the stack and only write the result into the array at the end, instead of adding directly into the array (see the sketch after the padded-struct example below): https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
If for some reason, you need to keep the sum observable to others in that array, pad the localsum vector entries like this so they don't share the same cache line:
struct localsumentry {
double sum;
char pad[56];
};
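And a sketch of point 3 (accumulate into a stack local, write the shared slot once at the end), assuming the same thread-function shape as in the question:
void threadsum(int threadID, int numThreads, const double *v, double *localSum, size_t stop)
{
    double sum = 0.0;            // lives in a register / this thread's stack frame,
    for (size_t i = 0; i < stop; i++)
        sum += v[i];             // so no cache line bounces between cores during the loop
    *localSum = sum;             // a single write to the shared localSum array at the end
}
On C++11 and later, alignas(64) on a wrapper struct (or std::hardware_destructive_interference_size from <new> in C++17) expresses the same keep-off-my-cache-line intent as the manual pad[56] above.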
I performed a small test to determine the behavior of accessing a vector of pointers vs. a vector of values. It turns out that for small memory blocks both perform equally well; however, for large memory blocks there is a significant difference.
What is the explanation for such behavior?
For the code below, run on my PC, the difference for D=0 is about 35% and for D=10 it is unnoticeable.
int D = 0;
int K = 1 << (22 - D);
int J = 100 * (1 << D);
int sum = 0;
std::vector<int> a(K);
std::iota(a.begin(), a.end(), 0);
long start = clock();
for (int j = 0; j < J; ++j)
for (int i = 0; i < a.size(); ++i)
sum += a[i];
std::cout << double(clock() - start) / CLOCKS_PER_SEC << " " << sum << std::endl;
sum = 0;
std::vector<int*> b(a.size());
for (int i = 0; i < a.size(); ++i) b[i] = &a[i];
start = clock();
for (int j = 0; j < J; ++j)
for (int i = 0; i < b.size(); ++i)
sum += *b[i];
std::cout << double(clock() - start) / CLOCKS_PER_SEC << " " << sum << std::endl;
Getting data from main memory is slow, so the CPU has a small amount of really fast memory (cache) to help memory access keep up with the processor. When handling memory requests, your computer will try to speed up future requests to a single integer or pointer by fetching a whole bunch of values around the location you requested and storing them in cache. Once that fast memory is full, it has to get rid of its least-favourite bit whenever something new is requested.
Your small problems may fit entirely or substantially in cache, so memory access is super fast. Large problems can't fit in this fast memory, so you have a problem. The vector is stored as K consecutive memory locations. When you access a vector of int, it loads the int and a handful of its neighbours, which can be used right away. However, when you load an int*, it loads a pointer to the actual value as well as several other pointers. This takes up some cache. Then, when you dereference with *, it loads the actual value and possibly some actual values nearby. This takes up more cache. Not only do you have to perform more work, you are also filling up the cache faster. The actual increase in time will vary, as it is highly dependent on the architecture, the operation (in this case +), and memory speeds. Also, your compiler will work quite hard to minimize the delays.
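To put rough numbers on that (assuming 4-byte int and 8-byte pointers): with D = 0, K = 2^22, so the value array alone is 16 MiB and the pointer array adds another 32 MiB, well beyond a typical last-level cache, and the pointer version pays extra misses on every pass. With D = 10, K = 2^12, so about 16 KiB of values plus 32 KiB of pointers fit comfortably in the L1/L2 caches, and the two versions become indistinguishable.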
I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all random), and iterates through the array, branching based on whether the value at the current index is 0 or 1 and doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.
(X-axis is amount of time-wasting work, Y-axis is time-to-run)
Does anyone understand this behavior? The code I'm running is below:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;
int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
int* zeroNums = new int[pipelineLen];
int* oneNums = new int[pipelineLen];
for(int i = 0; i < pipelineLen; ++i)
zeroNums[i] = oneNums[i] = 0;
chrono::time_point<chrono::system_clock> start, end;
start = chrono::system_clock::now();
for(int i = 0; i < s_iArrayLen; ++i){
if(vals[i] == 0){
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
else{
for(int i = 0; i < pipelineLen; ++i)
++oneNums[i];
}
}
end = chrono::system_clock::now();
int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();
//This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
for(int i = 0; i < pipelineLen - 1; ++i)
if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
return -1;
delete[] zeroNums;
delete[] oneNums;
return elapsedMicroseconds;
}
struct TestMethod{
string name;
void (*func)(int, int&);
int* results;
TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};
int main(){
srand( (unsigned int)time(nullptr) );
vector<TestMethod> testMethods;
testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));
int* vals = new int[s_iArrayLen];
for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
int resultsSum = 0;
for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
//Generate a new array...
for(int i = 0; i < s_iArrayLen; ++i)
testMethods[currentMethod].func(i, vals[i]);
//And record how long it takes
resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
}
testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
}
}
cout << "\t";
for(int i = 0; i < s_iMaxPipelineLen; ++i){
cout << i << "\t";
}
cout << "\n";
for (int i = 0; i < (int)testMethods.size(); ++i){
cout << testMethods[i].name.c_str() << "\t";
for(int j = 0; j < s_iMaxPipelineLen; ++j){
cout << testMethods[i].results[j] << "\t";
}
cout << "\n";
}
int end;
cin >> end;
delete[] vals;
}
Pastebin link: http://pastebin.com/F0JAu3uw
I think you may be measuring the cache/memory performance more than the branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, etc.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
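A minimal sketch of that suggestion: replace the zeroNums/oneNums arrays in the timed region with volatile accumulators, so the 'work' no longer touches a growing chunk of memory (doWork and its parameters are made-up names mirroring the question's code):
// Variant of the timed region in doWorkAndReturnMicrosecondsElapsed,
// using volatile accumulators instead of the zeroNums/oneNums arrays.
void doWork(const int *vals, int arrayLen, int pipelineLen)
{
    volatile int zeroWork = 0, oneWork = 0;   // volatile: the loops cannot be optimized away
    for (int i = 0; i < arrayLen; ++i) {
        if (vals[i] == 0) {
            for (int k = 0; k < pipelineLen; ++k)
                zeroWork = zeroWork + 1;      // same amount of "work" per element,
        } else {                              // but no array to thrash the cache
            for (int k = 0; k < pipelineLen; ++k)
                oneWork = oneWork + 1;
        }
    }
}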
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
In addition to what JasonD pointed out, I would also like to note that there are conditions inside the for loop which may affect branch prediction:
if(vals[i] == 0)
{
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
i < pipelineLen; is a condition, just like your ifs. Of course the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure if this can explain wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
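(The article's own example is not reproduced above; purely as an illustration, unrolling the question's inner counting loop by four, assuming pipelineLen is a multiple of 4, would look something like this:)
// One loop branch per four increments instead of one per increment.
for (int k = 0; k < pipelineLen; k += 4) {
    ++zeroNums[k];
    ++zeroNums[k + 1];
    ++zeroNums[k + 2];
    ++zeroNums[k + 3];
}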
So your loops are not only measuring memory throughput; they are also affecting the BTB.
If you passed a 0-1 pattern in your list but then executed the for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than a literal explanation. Your CPU may have a much more sophisticated branch prediction architecture.
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
// some code
// some more code
// even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
// some code
}
for (i = 0; i < exchanges; i++)
{
// some more code
}
for (i = 0; i < exchanges; i++)
{
// even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could quantify the performance hit in terms of basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But I understand (and wouldn't be surprised) if it's not that simple.
Many thanks, in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
data[c] *= 10;
data[c] += 7;
data[c] &= 15;
}
And this is almost stating the obvious since you need to loop through 3 times instead of once and you make 3 passes over the entire array instead of 1.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies with different processors.
Here's the test code:
#include <time.h>
#include <cstdlib> // for system()
#include <iostream>
using namespace std;
int main(){
int size = 10000;
int *data = new int[size];
clock_t start = clock();
for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
for (int c = 0; c < size; c++){
data[c] *= 10;
data[c] += 7;
data[c] &= 15;
}
#else
for (int c = 0; c < size; c++){
data[c] *= 10;
}
for (int c = 0; c < size; c++){
data[c] += 7;
}
for (int c = 0; c < size; c++){
data[c] &= 15;
}
#endif
}
clock_t end = clock();
cout << (double)(end - start) / CLOCKS_PER_SEC << endl;
system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.
Another significant performance optimization is to organize the data so it fits into one of the processor's cache lines (or, more practically, into its cache). For example, you could have inner loops that process a single cache-sized block of data, while the outer loop moves on to the next block.
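A rough sketch of that blocking idea, applied to the data[c] loops from the earlier answer (the block size of 4096 ints is an assumption, chosen to fit comfortably in a 32 KiB L1 cache):
const int blockSize = 4096;                  // assumed to fit in L1
for (int base = 0; base < size; base += blockSize) {
    int end = base + blockSize;
    if (end > size) end = size;
    for (int c = base; c < end; c++)         // each pass below re-reads the same
        data[c] *= 10;                       // cached block before moving on
    for (int c = base; c < end; c++)
        data[c] += 7;
    for (int c = base; c < end; c++)
        data[c] &= 15;
}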
These optimizations should only be applied after the program runs correctly and robustly and the environment demands more performance. The environment is defined by observers (animation/movies), users (waiting for a response) or hardware (performing operations before a critical time event). Any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>
const int iterations = 100;
namespace
{
const int exchanges = 200;
template<typename TTest>
void Test(const std::string &name, TTest &&test)
{
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::duration<float, std::milli> ms;
std::array<float, iterations> timings;
for (auto i = 0; i != iterations; ++i)
{
auto t0 = Clock::now();
test();
timings[i] = ms(Clock::now() - t0).count();
}
auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations; // 0.0f, not 0: an int initial value would truncate every timing
std::cout << "Average time, " << name << ": " << avg << std::endl;
}
}
int main()
{
Test("single loop",
[]()
{
for (auto i = 0; i < exchanges; ++i)
{
// some code
// some more code
// even more code
}
});
Test("separated loops",
[]()
{
for (auto i = 0; i < exchanges; ++i)
{
// some code
}
for (auto i = 0; i < exchanges; ++i)
{
// some more code
}
for (auto i = 0; i < exchanges; ++i)
{
// even more code
}
});
}
The thing is quite simple: the first code is like taking a single lap on a race track, and the other code is like running a full 3-lap race, so more time is required for three laps than for one. However, if the loops are doing something that needs to be done in sequence and they depend on each other, then the second code will do the job: for example, if the first loop does some calculations and the second loop does some work with those calculations, then both loops need to be done in sequence; otherwise not...