Branch Prediction: Writing Code to Understand it; Getting Weird Results - c++

I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all rand), and iterates through the array branching based on if the current index is 0 or 1, doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.
(X-axis is amount of time-wasting work, Y-axis is time-to-run)
Does anyone understand this behavior? The code I'm running is below:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;
int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
int* zeroNums = new int[pipelineLen];
int* oneNums = new int[pipelineLen];
for(int i = 0; i < pipelineLen; ++i)
zeroNums[i] = oneNums[i] = 0;
chrono::time_point<chrono::system_clock> start, end;
start = chrono::system_clock::now();
for(int i = 0; i < s_iArrayLen; ++i){
if(vals[i] == 0){
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
else{
for(int i = 0; i < pipelineLen; ++i)
++oneNums[i];
}
}
end = chrono::system_clock::now();
int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();
//This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
for(int i = 0; i < pipelineLen - 1; ++i)
if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
return -1;
delete[] zeroNums;
delete[] oneNums;
return elapsedMicroseconds;
}
struct TestMethod{
string name;
void (*func)(int, int&);
int* results;
TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};
int main(){
srand( (unsigned int)time(nullptr) );
vector<TestMethod> testMethods;
testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));
int* vals = new int[s_iArrayLen];
for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
int resultsSum = 0;
for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
//Generate a new array...
for(int i = 0; i < s_iArrayLen; ++i)
testMethods[currentMethod].func(i, vals[i]);
//And record how long it takes
resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
}
testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
}
}
cout << "\t";
for(int i = 0; i < s_iMaxPipelineLen; ++i){
cout << i << "\t";
}
cout << "\n";
for (int i = 0; i < (int)testMethods.size(); ++i){
cout << testMethods[i].name.c_str() << "\t";
for(int j = 0; j < s_iMaxPipelineLen; ++j){
cout << testMethods[i].results[j] << "\t";
}
cout << "\n";
}
int end;
cin >> end;
delete[] vals;
}
Pastebin link: http://pastebin.com/F0JAu3uw

I think you may be measuring cache/memory performance more than branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, etc.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
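For example, summing one volatile variable into another might look roughly like this (a sketch of the idea only, reusing the question's vals, s_iArrayLen and pipelineLen; it is not my actual test code):
// Sum one volatile scalar into another instead of walking an array, so the
// timed loop still contains the data-dependent branch but no longer touches
// a growing block of memory.
volatile int step = 1;
volatile int zeroAcc = 0, oneAcc = 0;
for (int i = 0; i < s_iArrayLen; ++i) {
    if (vals[i] == 0) {
        for (int j = 0; j < pipelineLen; ++j)
            zeroAcc += step;
    } else {
        for (int j = 0; j < pipelineLen; ++j)
            oneAcc += step;
    }
}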
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.

In addition to what JasonD pointed out, I would also like to note that there are conditions inside the for loop which may affect branch prediction:
if(vals[i] == 0)
{
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
i < pipelineLen; is a condition just like your ifs. Of course, the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure if this can explain the wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
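The article's unrolling listing isn't reproduced here, but applied to the question's inner work loop the idea looks roughly like this (a sketch only):
// Unrolled by 4: the loop conditional is evaluated (and predicted) once per
// four increments instead of once per increment, with a remainder loop for
// lengths that aren't a multiple of 4.
int i = 0;
for (; i + 4 <= pipelineLen; i += 4) {
    ++zeroNums[i];
    ++zeroNums[i + 1];
    ++zeroNums[i + 2];
    ++zeroNums[i + 3];
}
for (; i < pipelineLen; ++i)
    ++zeroNums[i];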
So your loops are not only measuring memory throughput, they are also affecting the BTB.
If you passed a 0-1 pattern in your array but then executed a for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0, 1-1-1-0, 0-1-1-0, 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than a literal explanation. Your CPU may have a much more sophisticated branch prediction architecture.

Related

How to deal with large data, such as arrays or big numbers, that causes a stack overflow in C++?

It's my first time dealing with large numbers and arrays, and I can't avoid overflowing the stack. I tried to use long long to avoid it, but it shows me that the error is on the int main line:
CODE:
#include <iostream>
using namespace std;
int main()
{
long long n=0, city[100000], min[100000] = {10^9}, max[100000] = { 0 };
cin >> n;
for (int i = 0; i < n; i++) {
cin >> city[i];
}
for (int i = 0; i < n; i++)
{//min
for (int s = 0; s < n; s++)
{
if (city[i] != city[s])
{
if (min[i] >= abs(city[i] - city[s]))
{
min[i] = abs(city[i] - city[s]);
}
}
}
}
for (int i = 0; i < n; i++)
{//max
for (int s = 0; s < n; s++)
{
if (city[i] != city[s])
{
if (max[i] <= abs(city[i] - city[s]))
{
max[i] = abs(city[i] - city[s]);
}
}
}
}
for (int i = 0; i < n; i++) {
cout << min[i] << " " << max[i] << endl;
}
}
**ERROR:**
Severity Code Description Project File Line Suppression State
Warning C6262 Function uses '2400032' bytes of stack: exceeds /analyze:stacksize '16384'. Consider moving some data to heap.
then it opens chkstk.asm and shows the error at:
test dword ptr [eax],eax ; probe page.
Small optimistic remark:
100,000 is not a large number for your computer! (you're also not dealing with that many arrays, but arrays of that size)
The error message describes what goes wrong pretty well:
You're creating arrays on your current function's "scratchpad" (the stack). That has very limited size!
This is C++, so you really should do things the (modern-ish) C++ way and avoid manually handling large data objects when you can.
So, replace
long long n=0, city[100000], min[100000] = {10^9}, max[100000] = { 0 };
with (I don't see any case where you'd want to use long long; presumably, you want a 64-bit variable?)
(10^9 is "10 XOR 9", not "10 to the power of 9")
constexpr size_t size = 100000;
constexpr int64_t default_min = 1'000'000'000;
uint64_t n = 0;
std::vector<int64_t> city(size);
std::vector<int64_t> min_value(size, default_min);
std::vector<int64_t> max_value(size, 0);
Additional remarks:
Notice how I took your 100000 and your 10⁹ and made them constexpr constants? Do that! Whenever some non-zero "magic constant" appears in your code, it's a good time to ask yourself "will I ever need that value somewhere else, too?" and "would it make sense to give this number a name explaining what it is?". And if you answer one of them with "yes": make a new constexpr constant, even just directly above where you use it! The compiler will just deal with that as if you had written the literal number where you use it, so it won't cost any extra memory or CPU cycles.
As a matter of fact, even that fixed size is a bad idea! Pre-allocating not-really-large-but-still-unnecessarily-large arrays is just a bad idea. Instead, read n first, then use that n to make std::vectors of that size.
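A minimal sketch of that shape, reusing the names (and default_min) from the snippet above:
uint64_t n = 0;
std::cin >> n;
// Size the vectors from the actual input instead of a fixed 100000.
std::vector<int64_t> city(n);
std::vector<int64_t> min_value(n, default_min);
std::vector<int64_t> max_value(n, 0);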
Don't use using namespace std;, for multiple reasons, chief among them that your min and max variables would now shadow std::min and std::max, and when you call something, you never know whether you're actually calling what you mean to or just the function of the same name from the std:: namespace. Instead, using std::cout; using std::cin; would do for you here!
This might be beyond your current learning level (that's fine!), but
for (int i = 0; i < n; i++) {
cin >> city[i];
}
is inelegant, and with the std::vector approach, if you make your std::vector really have length n, it can be written nicely as:
for (auto &value: city) {
cin >> value;
}
This will also make sure you're not accidentally reading more values than you mean when changing the length of that city storage one day.
It looks as if you're trying to find the minimum and maximum absolute distance between city values. But you do it in an incredibly inefficient way, needing two nested-loop passes of 10⁵·10⁵ = 10¹⁰ iterations each.
Start with the maximum distance: assume your city vector, array (whatever!) were sorted. What are the two elements with the greatest absolute distance?
If you had a sorted array/vector: how would you find the two elements with the smallest distance?
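For what it's worth, the observation those hints point toward looks something like this (a sketch of the idea only, not a drop-in solution for the per-city output above, and it ignores the original code's skipping of equal values):
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>
// With the values sorted, the largest pairwise distance is between the two
// ends, and the smallest is between some adjacent pair (assumes >= 2 values).
int64_t max_distance(std::vector<int64_t> v) {
    std::sort(v.begin(), v.end());
    return v.back() - v.front();
}
int64_t min_adjacent_distance(std::vector<int64_t> v) {
    std::sort(v.begin(), v.end());
    int64_t best = INT64_MAX;
    for (std::size_t i = 1; i < v.size(); ++i)
        best = std::min(best, v[i] - v[i - 1]);
    return best;
}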

Fast construction of an unordered_map with nested vectors and preallocation: std::unordered_map<int, vector<Thing *>>?

I want to create a map of int to vector of Thing*. I know that there will be 1-50 Things, no more. How can I allocate 50 at the start to speed up construction of the map?
I tried three methods but I'm still not sure whether they are fast enough. Can you suggest a better optimization?
I was using C++ 10 years ago and I am not sure if I'm doing it correctly. Can you help?
All optimization suggestions are welcome. The code is simplified from the real problem.
#include <iostream>
#include <vector>
#include <unordered_map>
#include <time.h>
class Thing {
};
int main()
{
clock_t start;
start = clock();
auto int_to_thing = std::unordered_map<int, std::vector<Thing *>>();
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 25; j++) {
int_to_thing[i].push_back(new Thing());
}
}
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 25; j++) {
int_to_thing[i].push_back(new Thing());
}
}
std::cout << (clock() - start) << std::endl;
start = clock();
int_to_thing = std::unordered_map<int, std::vector<Thing *>>();
for (int i = 0; i < 1000; i++) {
int_to_thing[i].reserve(50);
for (int j = 0; j < 25; j++) {
int_to_thing[i].push_back(new Thing());
}
}
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 25; j++) {
int_to_thing[i].push_back(new Thing());
}
}
std::cout << (clock() - start) << std::endl;
start = clock();
int_to_thing = std::unordered_map<int, std::vector<Thing *>>();
for (int i = 0; i < 1000; i++) {
auto it = int_to_thing.find(i);
if (it != int_to_thing.end()) {
auto v = std::vector<Thing *>(50);
auto pair = std::pair<int, std::vector<Thing *>>(i, v);
int_to_thing.insert(pair);
}
}
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 25; j++) {
int_to_thing[i].push_back(new Thing());
}
}
std::cout << (clock() - start) << std::endl;
return 0;
}
Are you concerned about the construction of the map (then see #ShadowRanger's comment) or the construction of the vectors?
I assume that there are 1..50 Things in a vector, NOT 1..50 vectors in a map.
Your code:
int_to_thing = std::unordered_map<int, std::vector<Thing *>>();
for (int i = 0; i < 1000; i++) {
int_to_thing[i].reserve(50);
is the best option. It constructs a map of vectors and, inside the loop, creates each vector and pre-allocates room for 50 elements.
Without that reserve() you would likely incur a couple of reallocations while pushing 50 elements into those vectors.
Using:
auto v = std::vector<Thing *>(50);
actually creates 50 elements in your vector and value-initializes them (here, 50 null pointers). This may or may not cost you extra. Specifically, it will be cheap with your current use of pointers, and expensive if you switch to storing the Thing objects themselves.
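To make the difference concrete, here is a minimal sketch (with a stand-in Thing, not the asker's exact code):
#include <vector>
struct Thing {};
int main() {
    std::vector<Thing*> a;
    a.reserve(50);               // size 0, capacity >= 50: nothing constructed yet
    std::vector<Thing*> b(50);   // size 50: fifty value-initialized null pointers
    a.push_back(new Thing());    // becomes a[0], size is now 1
    b.push_back(new Thing());    // becomes b[50], size is now 51; the 50 nulls stay
}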
If you are unsure that something is fast enough then you are not measuring performance and this is prima facie evidence that you don’t care one iota about it. If you don’t measure it then you cannot claim anything about it. Measure it first before you do anything else. Otherwise you’ll waste everyone’s time. You work on an assumption that such preallocations will help. I have an inkling that they won’t help at all since you make so few of them, and you’re just wasting your time. Again: if you are serious about performance, you stop now, get measurements in place, and come back with some numbers to talk over. And don’t measure debug builds - only release builds with full optimization turned on, including link time code generation (LTCG). If you don’t optimize, you don’t care about performance either. Period. Full stop. Those are the rules.
Yes, you have code that times stuff but that's not what measurements are about. They need to happen in the context of your use of the data, so that you can see what relative overhead you have. If the task takes an hour and you spend a second doing this "unoptimally", then there's no point in optimizing that first - you've got bigger fish to fry. And besides, in most contexts the code is cache-driven, i.e. data access patterns determine performance, so I don't believe you're doing anything useful at all at the moment. Such micro-optimizations are totally pointless. This code doesn't exist in a vacuum. If it did, you could just remove it and forget about it all, right?

What would be a better way to write a loop that loops by initial value

#include <iostream>
using namespace std;
int square(int x) {
int looptime = x;
int total = 0;
for (int i=0 ; i < looptime; ++i) {
total += looptime;
}
return total;
}
int main()
{
for (int i = 0; i < 100; ++i)
cout << i << '\t' << square(i) << '\n';
}
I am new to C++ and trying to learn by self-study by reading "Programming Principles and Practices". In this particular problem, I have to make a "square()" function. The catch is I have to use addition, not multiplication.
The above code works and returns correct values, but contrary to the mantra of readability, I find it hard to read and understand, and I'm the one who wrote it.
I need to pass the initial integer into the loop and have it loop that many times without affecting the original value. Is there a better way to write this?
There is no need for looptime in the function.
int square(int x) {
int total = 0;
for (int i=0 ; i < x; ++i) {
total += x;
}
return total;
}
In regards to why it was infinitely doubling, it is difficult to say without seeing the earlier code. The logical explanation would be that the loop condition in the for loop was flawed, so it never exited when expected.
In this portion, the logic likely would have been where i < looptime is now:
for (int i=0 ; i < looptime; ++i) {
total += looptime;
If that condition is always met, it would explain the symptom (it will remain in the loop, adding to it each cycle). If you can reproduce the earlier result and show the earlier code it would allow this theory to be proven out.

Almost same code running much slower

I am trying to solve this problem:
Given a string array words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share common letters. You may assume that each word will contain only lower case letters. If no such two words exist, return 0.
https://leetcode.com/problems/maximum-product-of-word-lengths/
You can create a bitmask of the characters in each word to check whether two words share any characters in common, and then compute the max product.
I have two methods that are almost identical, but the first passes the checks while the second is too slow. Can you see why?
class Solution {
public:
int maxProduct2(vector<string>& words) {
int len = words.size();
int *num = new int[len];
// compute the bit O(n)
for (int i = 0; i < len; i ++) {
int k = 0;
for (int j = 0; j < words[i].length(); j ++) {
k = k | (1 <<(char)(words[i].at(j)));
}
num[i] = k;
}
int c = 0;
// O(n^2)
for (int i = 0; i < len - 1; i ++) {
for (int j = i + 1; j < len; j ++) {
if ((num[i] & num[j]) == 0) { // if no common letters
int x = words[i].length() * words[j].length();
if (x > c) {
c = x;
}
}
}
}
delete []num;
return c;
}
int maxProduct(vector<string>& words) {
vector<int> bitmap(words.size());
for(int i=0;i<words.size();++i) {
int k = 0;
for(int j=0;j<words[i].length();++j) {
k |= 1 << (char)(words[i][j]);
}
bitmap[i] = k;
}
int maxProd = 0;
for(int i=0;i<words.size()-1;++i) {
for(int j=i+1;j<words.size();++j) {
if ( !(bitmap[i] & bitmap[j])) {
int x = words[i].length() * words[j].length();
if ( x > maxProd )
maxProd = x;
}
}
}
return maxProd;
}
};
Why is the second function (maxProduct) too slow for LeetCode?
Solution
The second method makes repeated calls to words.size(). If you save that in a variable, it works fine.
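For example, hoisting the size out of the loop headers looks roughly like this (a sketch of the change, not the full submission):
// Compare against a local the compiler can keep in a register instead of
// calling words.size() on every iteration of both loops.
const int len = static_cast<int>(words.size());
for (int i = 0; i < len - 1; ++i) {
    for (int j = i + 1; j < len; ++j) {
        // ... same body as before ...
    }
}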
Since my comment turned out to be correct, I'll turn it into an answer and try to explain what I think is happening.
I wrote some simple code to benchmark on my own machine with two solutions of two loops each. The only difference is that the call to words.size() is inside the loop versus outside the loop. The first solution takes approximately 13.87 seconds versus 16.65 seconds for the second solution. This isn't huge, but it's about 20% slower.
Even though vector::size() is a constant-time operation, that doesn't mean it's as fast as checking against a variable that's already sitting in a register. Constant time can still come with very different constants, and inside nested loops that adds up.
The other thing that could be happening (someone much smarter than me will probably chime in and let us know) is that you're hurting CPU optimizations like branch prediction and pipelining. Every time it gets to the end of the loop it has to stop, wait for the call to size() to return, and then check the loop variable against that return value. If the CPU can look ahead and guess that j is still going to be less than len because it hasn't seen len change (len isn't even inside the loop!), it can make a good branch prediction each time and not have to wait.

What is the overhead in splitting a for-loop into multiple for-loops, if the total work inside is the same? [duplicate]

This question already has answers here:
Why are elementwise additions much faster in separate loops than in a combined loop?
Performance of breaking apart one loop into two loops
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
// some code
// some more code
// even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
// some code
}
for (i = 0; i < exchanges; i++)
{
// some more code
}
for (i = 0; i < exchanges; i++)
{
// even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could measure up the performance hit in comparison to basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But, I understand (and wouldn't be surprised) if it's not that simple.
Many thanks, in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
data[c] *= 10;
data[c] += 7;
data[c] &= 15;
}
And this is almost stating the obvious, since you incur the loop overhead three times instead of once and make three passes over the entire array instead of one.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
The opposite is sometimes true, due to memory alignment and cache effects.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies with different processors.
Here's the test code:
#include <time.h>
#include <iostream>
using namespace std;
int main(){
int size = 10000;
int *data = new int[size];
clock_t start = clock();
for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
for (int c = 0; c < size; c++){
data[c] *= 10;
data[c] += 7;
data[c] &= 15;
}
#else
for (int c = 0; c < size; c++){
data[c] *= 10;
}
for (int c = 0; c < size; c++){
data[c] += 7;
}
for (int c = 0; c < size; c++){
data[c] &= 15;
}
#endif
}
clock_t end = clock();
cout << (double)(end - start) / CLOCKS_PER_SEC << endl;
system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.
Another significant performance optimization is to organize the data so it fits in the processor's cache. For example, you could have inner loops that process one cache-sized block of data while an outer loop steps through the blocks, loading new data into the cache.
These optimizations should only be applied after the program runs correctly and robustly and the environment demands more performance. The environment is defined by observers (animation / movies), users (waiting for a response) or hardware (performing operations before a critical time event). Optimizing for any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
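As a rough sketch of that blocking idea (the block size is illustrative and untuned; the three operations are borrowed from the test code earlier in this thread):
#include <algorithm>
#include <cstddef>
#include <vector>
void process(std::vector<int>& data) {
    // Walk the array in cache-sized chunks so that, even with the work split
    // into three loops, each loop re-reads data that is still hot in cache.
    const std::size_t blockSize = 4096;   // illustrative; tune for your cache
    for (std::size_t begin = 0; begin < data.size(); begin += blockSize) {
        const std::size_t end = std::min(begin + blockSize, data.size());
        for (std::size_t i = begin; i < end; ++i) data[i] *= 10;
        for (std::size_t i = begin; i < end; ++i) data[i] += 7;
        for (std::size_t i = begin; i < end; ++i) data[i] &= 15;
    }
}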
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>
const int iterations = 100;
namespace
{
const int exchanges = 200;
template<typename TTest>
void Test(const std::string &name, TTest &&test)
{
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::duration<float, std::milli> ms;
std::array<float, iterations> timings;
for (auto i = 0; i != iterations; ++i)
{
auto t0 = Clock::now();
test();
timings[i] = ms(Clock::now() - t0).count();
}
auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;
std::cout << "Average time, " << name << ": " << avg << std::endl;
}
}
int main()
{
Test("single loop",
[]()
{
for (auto i = 0; i < exchanges; ++i)
{
// some code
// some more code
// even more code
}
});
Test("separated loops",
[]()
{
for (auto i = 0; i < exchanges; ++i)
{
// some code
}
for (auto i = 0; i < exchanges; ++i)
{
// some more code
}
for (auto i = 0; i < exchanges; ++i)
{
// even more code
}
});
}
The thing is quite simple: the first code is like taking a single lap on a race track, while the other is like running a full 3-lap race, and three laps take more time than one. However, if the loops are doing something that needs to happen in sequence and they depend on each other, then the second version is what you need. For example, if the first loop does some calculations and the second loop does some work with those results, then the two loops have to run one after the other; otherwise they don't.