Difference between code generated using a template function and a normal function - c++

I have a vector containing a large number of elements. Now I want to write a small function that counts the number of even or odd elements in the vector. Since performance is a major concern, I don't want to put an if statement inside the loop. So I wrote two small functions like:
long long countOdd(const std::vector<int>& v)
{
long long count = 0;
const int size = v.size();
for(int i = 0; i < size; ++i)
{
if(v[i] & 1)
{
++count;
}
}
return count;
}
long long countEven(const std::vector<int>& v)
{
long long count = 0;
const int size = v.size();
for(int i = 0; i < size; ++i)
{
if(0 == (v[i] & 1))
{
++count;
}
}
return count;
}
My question is: can I get the same result by writing a single template function like this?
template <bool countEven>
long long countTemplate(const std::vector<int>& v1)
{
long long count = 0;
const int size = v1.size();
for(int i = 0; i < size; ++i)
{
if(countEven)
{
if(v1[i] & 1)
{
++count;
}
}
else if(0 == (v1[i] & 1))
{
++count;
}
}
return count;
}
And using it like this:
int main()
{
if(somecondition)
{
countTemplate<true>(vec); //Count even
}
else
{
countTemplate<false>(vec); //Count odd
}
}
Will the code generated for the template and non-template versions be the same, or will there be some additional instructions emitted?
Note that the counting of numbers is just for illustration, so please don't suggest other methods of counting.
EDIT:
OK, I agree that it may not make much sense from a performance point of view. But at least from a maintainability point of view I would like to have only one function to maintain instead of two.

The templated version may, and very probably will, be optimized by the compiler when it sees that a certain branch in the code is never reached. The countTemplate code, for instance, will have the countEven template argument fixed to true, so the other branch will be cut away.
(sorry, I can't help suggesting another counting method)
In this particular case, you could use count_if on your vector:
struct odd { bool operator()( int i )const { return i&1; } };
size_t nbOdd = std::count_if( vec.begin(), vec.end(), odd() );
This can also be optimized, and it is much shorter to write :) The standard library developers have given possible optimizations much thought, so it's better to use the library when you can instead of writing your own counting for-loop.
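For completeness, the same count_if call can be written with a C++11 lambda instead of a hand-written functor (a sketch; countOddLambda is a hypothetical wrapper name, not from the answer above):

```cpp
#include <algorithm>
#include <vector>

// Counts the odd elements using std::count_if and a lambda predicate.
long long countOddLambda(const std::vector<int>& vec)
{
    return std::count_if(vec.begin(), vec.end(),
                         [](int i) { return (i & 1) != 0; });
}
```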

Your template version will generate code like this:
template <>
long long countTemplate<true>(const std::vector<int>& v1)
{
long long count = 0;
const int size = v1.size();
for(int i = 0; i < size; ++i)
{
if(true)
{
if(v1[i] & 1)
{
++count;
}
}
else if(0 == (v1[i] & 1))
{
++count;
}
}
return count;
}
template <>
long long countTemplate<false>(const std::vector<int>& v1)
{
long long count = 0;
const int size = v1.size();
for(int i = 0; i < size; ++i)
{
if(false)
{
if(v1[i] & 1)
{
++count;
}
}
else if(0 == (v1[i] & 1))
{
++count;
}
}
return count;
}
So if all optimizations are disabled, the if will in theory still be there. But even a very naive compiler will determine that you're testing a constant, and simply remove the if.
So in practice, no, there should be no difference in the generated code, and you can use the template version without worrying about this.
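As a side note, in C++17 this dead-branch elimination can be made explicit with if constexpr, which guarantees the untaken branch is discarded at instantiation time. This sketch is not part of the original question; the even/odd tests are written out so the template parameter's name matches its behavior:

```cpp
#include <cstddef>
#include <vector>

// C++17 sketch: the discarded branch is dropped at compile time,
// so no runtime test on countEven can survive into the generated code.
template <bool countEven>
long long countTemplate17(const std::vector<int>& v)
{
    long long count = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        if constexpr (countEven)
        {
            if ((v[i] & 1) == 0) ++count;  // even: lowest bit clear
        }
        else
        {
            if ((v[i] & 1) != 0) ++count;  // odd: lowest bit set
        }
    }
    return count;
}
```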

I expect that a good compiler will cut the redundant code in your template, since countEven is a compile-time constant and such an optimization is very simple to perform during template instantiation.
Anyway, it seems pretty strange: you wrote a template but do "dynamic switching" inside it.
Maybe try something like this:
struct CountEven {};
struct CountOdd {};
inline void CountNum(int num, long long& count, const CountEven&)
{
    if (0 == (num & 1))
    {
        ++count;
    }
}
inline void CountNum(int num, long long& count, const CountOdd&)
{
    if (num & 1)
    {
        ++count;
    }
}
template <class T>
long long countTemplate(const std::vector<int>& v1)
{
    long long count = 0;
    const int size = v1.size();
    for (int i = 0; i < size; ++i)
    {
        CountNum(v1[i], count, T());
    }
    return count;
}
It will select the necessary CountNum() overload at compile time:
int main()
{
if(somecondition)
{
countTemplate<CountEven>(vec); //Count even
}
else
{
countTemplate<CountOdd>(vec); //Count odd
}
}
The code is a bit messy, but I think you get the idea.

This will depend on how smart the compiler's optimizer is. The compiler might be able to see that the if-statement is redundant, that only one branch of it is ever executed, and optimize the whole thing. The best way to check is to look at the generated assembly - this code will not produce much machine code.

The first thing that comes to my mind are the two optimization "rules":
Don't optimize prematurely.
Don't do it yet.
The point is that sometimes we worry about a performance bottleneck that will never matter in practice. There are studies saying that 20 percent of the code is responsible for 80 percent of the software's execution time. Of course this doesn't mean you should pessimize prematurely, but I don't think that's your case.
In general, you should do this kind of optimization only after you have actually run a profiler on your program and identified the real bottlenecks.
Regarding your function versions, as others have said, this depends on your compiler. Just remember that with the template approach you won't be able to switch calls at runtime (a template is a compile-time tool).
A final note: long long is not standard C++ (yet).

If you care about optimization issues try to make it like the following:
template <bool countEven>
long long countTemplate(const std::vector<int>& v1)
{
    long long count = 0;
    const int size = v1.size();
    for ( int i = 0; i < size; ++i ) {
        // According to C++ Standard 4.5/4:
        // An rvalue of type bool can be converted to an rvalue of type int,
        // with false becoming zero and true becoming one.
        // Note the parentheses: == and != bind tighter than &.
        if ( (v1[i] & 1) != countEven ) ++count;
    }
    return count;
}
I believe that the code above will compile to the same code as the version without templates.

Use the STL, Luke :-) This is even used as an example in the reference documentation:
bool isOdd(int i)
{
    return i % 2 != 0; // also correct for negative values, unlike i % 2 == 1
}
bool isEven(int i)
{
    return i % 2 == 0;
}
std::vector<int>::size_type count = 0;
if(somecondition)
{
count = std::count_if(vec.begin(), vec.end(), isEven);
}
else
{
count = std::count_if(vec.begin(), vec.end(), isOdd);
}

In general, the outcome will be much the same. You are describing an O(n) iteration over the linear memory of the vector.
If you had a vector of pointers, the performance would suddenly be much worse because the locality of reference would be lost.
However, the more general point is that even netbook CPUs can do gazillions of operations per second. Looping over your array is most unlikely to be performance-critical code.
You should write for readability, then profile your code, and consider doing more involved hand-tweaked things when the profiling highlights the root cause of any performance issue you have.
And performance gains typically come from algorithmic changes; if you kept count of the number of odds as you added and removed elements from the vector, for example, it would be O(1) to retrieve...
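The incremental-count idea above can be sketched as a small wrapper type (the class and method names here are illustrative, not from the original post):

```cpp
#include <cstddef>
#include <vector>

// Keeps the odd count up to date on every insertion/removal,
// so querying either count is O(1) instead of an O(n) scan.
class OddCountingVector
{
public:
    void push_back(int v)
    {
        data_.push_back(v);
        oddCount_ += static_cast<std::size_t>(v & 1);
    }
    void pop_back()
    {
        oddCount_ -= static_cast<std::size_t>(data_.back() & 1);
        data_.pop_back();
    }
    std::size_t oddCount() const { return oddCount_; }
    std::size_t evenCount() const { return data_.size() - oddCount_; }
private:
    std::vector<int> data_;
    std::size_t oddCount_ = 0;
};
```

The trade-off is a little bookkeeping on every mutation in exchange for constant-time queries.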

I see that you're using long long for the counter, which probably means you expect a huge number of elements in the vector. In that case, I would definitely go for the template implementation (because of code readability) and just move that if condition outside the for loop.
If we assume that the compiler makes no optimizations whatsoever, you would have one condition check per iteration and possibly more than 2 billion iterations through the vector. But since the condition would be if (true) or if (false), branch prediction would work perfectly and the cost of the check would be negligible.
I'm pretty sure that all compilers on the market have this optimization, but I would quote my favorites when it comes to performance: "Premature optimization is the root of all evil" and "There are only three rules of optimization: measure, measure, and measure".

If you absolutely, absurdly care about fast code:
(a clever compiler, or one otherwise hinted at using directives or intrinsics, could do this in parallel using SIMD; CUDA and OpenCL would of course eat this for breakfast!)
int count_odd(const int* array, size_t len)
{
    int count = 0;
    const int* const sentinel = array + len;
    while (array < sentinel)
        count += (*array++ & 1);
    return count;
}
int count_even(const int* array, size_t len)
{
    return (int)len - count_odd(array, len);
}


Fastest way to check if array is equal to? [duplicate]

This question already has answers here:
How to check if all the values of an array are equal to 0?
I am writing a game simulation that tests whether any piece is on the board. If no piece is, I would like the AI to place a piece on the board; for this I created a bool function to test whether all the pieces are set to 0, which means they have yet to enter the board. The current function works, but I feel there is a much simpler way to do this:
bool checkPiece(int a[])
{
int n = 0;
bool e = true;
while (e == true && n < 4)
{
if (a[n] == 0 )
{
n++;
}
else
{
return false;
}
}
return true;
}
I'd use the standard library, something on this general order:
bool checkPiece(int const *a) {
return std::all_of(a, a+4, [](int i) { return i == 0; });
}
If you really wanted to do the job on your own, perhaps something on this order:
bool checkPiece(int const *a) {
for (int i=0; i<4; i++)
if (a[i] != 0)
return false;
return true;
}
Most of the time, you'd also rather pass something collection-like, such as a std::array or std::vector (by const reference), or something range-like such as a gsl::span, rather than a pointer. This would (for one obvious example) make it trivial to get the size of what was passed instead of blindly assuming it was 4 items.
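That suggestion might look like this as a sketch (allZero is a hypothetical name; the template accepts anything with begin/end, including raw arrays, so the size comes from the argument):

```cpp
#include <algorithm>
#include <iterator>

// Accepting a container-like range lets the size come from the
// argument itself instead of a hard-coded 4.
template <typename Range>
bool allZero(const Range& r)
{
    return std::all_of(std::begin(r), std::end(r),
                       [](int v) { return v == 0; });
}
```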
This is basically all you need:
for (size_t n = 0; n < 4; ++n) {
if (a[n]) return false;
}
return true;
You don't need e, and when iterating over an array from beginning to end (or returning early), a for loop is easier to read and write than a while. You could use an algorithm, but I doubt it would make your code more readable here. And you should avoid magic numbers like the plague (what if you ever change the size of the array to something other than 4?).
You can solve this with little code using std::count.
bool checkPiece(int const *a) {
return std::count(a, a+4, 0) == 4;
}
First, don't use magic numbers instead of passing array sizes. If you change the size of your board you'll have to find every 4 in your program and decide whether it's actually the number 4 or the size of the array. Pass the size to your function.
Second, give the function a name that describes what it does. checkPiece doesn't tell me anything.
And when you've made those changes, use standard algorithms:
bool noPieces(int const *a, int size) {
    return std::all_of(a, a + size, [](int pc) { return pc == 0; });
}

What is wrong with the logic in my program?

I've been tasked to create a function that counts the number of occurrences of a value in an array, however I am not getting the correct result. This is the function I wrote; I left out the rest of the program, as that works.
int countOccurences(int b[], int size, int x)
{
int occ = x;
for(int i = 0; i < size; i++)
{
if(b[i] == occ)
occ++;
}
cout << occ << endl;
return occ;
}
If occ is meant to be the number of occurrences, it should be initialised to zero rather than x.
And the comparison should be between b[i] and x, not b[i] and occ.
And, as an aside (not affecting your actual logic), it's also very unusual to print out the return value in a utility function that is obviously meant to simply return the count; it may be that you have that in there just for debugging purposes.
And you should both ensure your indentation and use of braces is consistent between your for and your if - it will make your code easier to maintain.
That's all totally aside from the fact that C++ provides a std::count() function in <algorithm> that will work this out for you without your having to write a function to do it (although it may be that this is an educational exercise whose intent is to learn how to code things like this rather than to let library functions do the heavy lifting for you).
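The std::count alternative just mentioned might look like this as a sketch (the wrapper name mirrors the question's function and is otherwise hypothetical):

```cpp
#include <algorithm>

// Counts occurrences of x in b[0..size) using the standard algorithm.
int countOccurrences(const int* b, int size, int x)
{
    return static_cast<int>(std::count(b, b + size, x));
}
```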
int countOccurences(int b[], const unsigned int size, const int x)
{
int occ = 0;
for(unsigned int i = 0; i < size; i++)
{
if(b[i] == x)
{
occ++;
}
}
std::cout << occ << std::endl;
return occ;
}
occ should start at zero
You should compare b[i] to x
Array indices should be unsigned
Why not be const-correct?
using namespace std; is bad practice

C++11 vector<bool> performance issue (with code example)

I notice that vector<bool> is much slower than a bool array when running the following code.
int main()
{
int count = 0;
int n = 1500000;
// slower with c++ vector<bool>
/*vector<bool> isPrime;
isPrime.reserve(n);
isPrime.assign(n, true);
*/
// faster with bool array
bool* isPrime = new bool[n];
for (int i = 0; i < n; ++i)
isPrime[i] = true;
for (int i = 2; i< n; ++i) {
if (isPrime[i])
count++;
for (int j =2; i*j < n; ++j )
isPrime[i*j] = false;
}
cout << count << endl;
return 0;
}
Is there something I can do to make vector<bool> faster? By the way, both std::vector::push_back and std::vector::emplace_back are even slower than std::vector::assign.
std::vector<bool> can have various performance issues (e.g. take a look at https://isocpp.org/blog/2012/11/on-vectorbool).
In general you can:
use std::vector<std::uint8_t> instead of std::vector<bool> (also give std::valarray<bool> a try).
This requires more memory and is less cache-friendly, but there isn't any overhead (in the form of bit manipulation) to access a single value, so there are situations in which it works better (after all, it's just like your array of bool but without the nuisance of manual memory management)
use std::bitset if you know at compile time how large your boolean array is going to be (or if you can at least establish a reasonable upper bound)
if Boost is an option try boost::dynamic_bitset (the size can be specified at runtime)
But for speed optimizations you have to test...
With your specific example I can confirm a performance difference only when optimizations are turned off (of course this isn't the way to go).
Some tests with g++ v4.8.3 and clang++ v3.4.5 on an Intel Xeon system (-O3 optimization level) give a different picture:
                time (ms)
               G++     CLANG++
array of bool  3103    3010
vector<bool>   2835    2420    // not bad!
vector<char>   3136    3031    // same as array of bool
bitset         2742    2388    // marginally better
(time elapsed for 100 runs of the code in the answer)
std::vector<bool> doesn't look so bad (source code here).
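As a concrete sketch of the std::bitset option (usable only when the size is a compile-time constant; the 1,000,000 bound below is illustrative and is not the exact setup timed above):

```cpp
#include <bitset>
#include <cstddef>

// Sieve of Eratosthenes over a fixed-size std::bitset.
// kN is an illustrative compile-time bound.
constexpr std::size_t kN = 1000000;

std::size_t countPrimesBelowN()
{
    static std::bitset<kN> isPrime;  // static: too large for the stack
    isPrime.set();                   // mark everything prime...
    isPrime[0] = isPrime[1] = false; // ...except 0 and 1
    for (std::size_t i = 2; i * i < kN; ++i)
        if (isPrime[i])
            for (std::size_t j = i * i; j < kN; j += i)
                isPrime[j] = false;
    return isPrime.count();
}
```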
vector<bool> has a template specialization and may be implemented as a bit array to save space. Extracting and saving a bit, and converting it from/to bool, may cause the performance drop you are observing. If you use std::vector::push_back, you are resizing the vector, which will cause even worse performance. The next performance killer may be assign (worst-case complexity: linear in the first argument); instead use operator[] (complexity: constant).
On the other hand, bool[] is guaranteed to be an array of bool.
And you should resize to n instead of n-1 to avoid undefined behaviour.
vector<bool> can be high performance, but isn't required to be. For vector<bool> to be efficient, it needs to operate on many bools at a time (e.g. isPrime.assign(n, true)), and the implementor has had to put loving care into it. Indexing individual bools in a vector<bool> is slow.
Here is a prime finder that I wrote a while back using vector<bool> and clang + libc++ (the libc++ part is important):
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
std::vector<bool>
init_primes()
{
std::vector<bool> primes(0x80000000, true);
primes[0] = false;
primes[1] = false;
const auto pb = primes.begin();
const auto pe = primes.end();
const auto sz = primes.size();
size_t i = 2;
while (true)
{
size_t j = i*i;
if (j >= sz)
break;
do
{
primes[j] = false;
j += i;
} while (j < sz);
i = std::find(pb + (i+1), pe, true) - pb;
}
return primes;
}
int
main()
{
using namespace std::chrono;
using dsec = duration<double>;
auto t0 = steady_clock::now();
auto p = init_primes();
auto t1 = steady_clock::now();
std::cout << dsec(t1-t0).count() << "\n";
}
This executes for me in about 28s (-O3). When I change it to return a vector<char> instead, the execution time goes up to about 44s.
If you run this using some other std::lib, you probably won't see this trend. On libc++, algorithms such as std::find have been optimized to search a word of bits at a time, instead of a bit at a time.
See http://howardhinnant.github.io/onvectorbool.html for more details on what std algorithms could be optimized by your vendor.

Prime Sieve class or overloaded function?

Currently I have two functions:
One takes the number of primes to generate.
The second takes the upper limit of primes to generate.
They are coded (In C++) as such:
prime_list erato_sieve(ul_it upper_limit)
{
prime_list primes;
if (upper_limit < 2)
return primes;
primes.push_back(2); // Initialize Array, and add 2 since its unique.
for (uit i = 3; i <= upper_limit; i += 2) // Only count odd numbers
{
flag is_prime = true;
for (uit j = 0; j < primes.size(); ++j)
{
if ((i % primes[j]) == 0)
{
is_prime = false;
break;
}
}
if (is_prime)
{
primes.push_back(i);
}
}
return primes;
}
And:
prime_list erato_sieve_num(ul_it MAX)
{
prime_list primes;
if (MAX == 0)
return primes;
primes.push_back(2); // Initialize Array, and add 2 since its unique.
uit i = 3;
while (primes.size() < MAX) // Only count odd numbers
{
flag is_prime = true;
for (uit j = 0; j < primes.size(); ++j)
{
if ((i % primes[j]) == 0)
{
is_prime = false;
break;
}
}
if (is_prime)
{
primes.push_back(i);
}
++i;
}
return primes;
}
Where the following types are defined:
typedef bool flag;
typedef unsigned int uit;
typedef unsigned long int ul_it;
typedef unsigned long long int ull_it;
typedef long long int ll_it;
typedef long double ld;
typedef std::vector<ull_it> prime_list;
(Feel free to use them if you like, or not. A find-replace will take care of that. I use them to make the code read more how I think)
I am trying to make these into one overloaded "function", but the two have argument lists of similar types. I'm worried that the choice between them would come down to type alone, which would lead to hard-to-debug problems.
My second option would be to create a class, but I'm quite embarrassed to say... I've never used classes before. At all. So I have no idea how to do it, and the documentation is a little... sparse?
Anyway, if someone would mind helping me out a little bit, it would be greatly appreciated. Documentation is always helpful, and any pointers are welcome as well.
EDIT
As I said, my second option is a class. I'm just not entirely sure how to make a class that combines these two.
Never give the same name to functions with different semantics; overloading is not meant for that. These two both take an integer, so if you could overload them, how would you tell which function is called by erato_sieve(5)?
Give them different names, e.g. erato_sieve_up_to and erato_sieve_count.
Well, if you still want to make things worse (please don't), you can overload them (please don't), just make them expect different types of arguments. For example, wrap an integer into a class and pass that class, something like
class CountWrapper {
public:
    explicit CountWrapper(int n) : n_(n) {} // explicit, so a plain int won't convert silently
    operator int() const { return n_; }
private:
    int n_;
};
prime_list erato_sieve(const CountWrapper& MAX) {
    // function's body stays the same
}
And call it like
my_list = erato_sieve(CountWrapper(5));
But once again: please don't!
To group the functions, you can define them as static methods of a class:
class PrimeGenerator {
public:
static prime_list EratoSieveUpTo(ul_it upper_limit) {
// body
}
static prime_list EratoSieveAmount(ul_it MAX) {
// body
}
};
and call the functions like
list1 = PrimeGenerator::EratoSieveUpTo(5);
list2 = PrimeGenerator::EratoSieveAmount(10);
If you want to create overloaded functions, you need a different argument list for each function definition. In case the actually used arguments are of the same type, the following trick can be used:
typedef struct {} flag_type_1;
typedef struct {} flag_type_2;
...
typedef struct {} flag_type_n;
prime_list erato_sieve(ul_it boundary, flag_type_1) { ... }
prime_list erato_sieve(ul_it boundary, flag_type_2) { ... }
...
prime_list erato_sieve(ul_it boundary, flag_type_n) { ... }
The idea is that each typedef-ed structure has a different type signature. This creates a completely unrelated argument list for each function overload. Also, since the types are dummy placeholders, you don't care about their content; that's why you only need to include the type in the argument list of each function definition.
I picked this up a while back from Channel 9. Pretty neat trick.
This isn't a direct answer to your question, but it will help answer your question.
You appear to be attempting to implement the Sieve of Eratosthenes. The basic algorithm for that sieve is below:
1) Create a list of numbers from 2 to N (N is the maximum value you are looking for)
2) Start at 2, and eliminate all other even numbers (they are non-prime) less than or equal to N
3) Move to the next non-eliminated number.
4) Eliminate all multiples of that number less than or equal to N.
5) Repeat steps 3 and 4 until you reach the square root of N.
Translating that into C++ code, it would look something like this (not optimized):
std::vector<unsigned int> sieve_of_eratosthenes(unsigned int maximum)
{
std::vector<unsigned int> results; // this is your result set
std::vector<bool> tests(maximum + 1); // this will be your "number list"
// initialize the tests vector
for (unsigned int i = 0; i <= maximum; ++i)
{
if (i == 0 || i == 1)
tests[i] = false;
else
tests[i] = true;
}
// eliminate all even numbers but 2
for (unsigned int i = 4; i <= maximum; i += 2)
{
tests[i] = false;
}
// start with 3 and go to root of maximum
unsigned int i = 3;
while (i * i <= maximum)
{
for (unsigned int j = i + i; j <= maximum; j += i)
{
tests[j] = false;
}
// find the next non-eliminated value
unsigned int k = i + 1;
while (!tests[k])
{
k++;
}
i = k;
}
// create your results list
for (unsigned int j = 0; j <= maximum; ++j)
{
if (tests[j])
{
results.push_back(j);
}
}
return results;
}
Since the sieve requires a maximum value, you do not want to provide a number of primes for this algorithm. There are other prime generating algorithms that do that, but the Sieve of Eratosthenes does not.
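That said, a count-based front end can be layered on top of a sieve by estimating an upper bound for the n-th prime and retrying if the estimate falls short. This is a sketch; first_n_primes and the bound formula (n * (log n + log log n), valid for n >= 6) are illustrative additions, not part of the answer above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the first `count` primes by sieving up to an estimated
// bound, doubling the bound if the estimate turns out too small.
std::vector<unsigned int> first_n_primes(std::size_t count)
{
    std::vector<unsigned int> primes;
    if (count == 0) return primes;
    double c = static_cast<double>(count);
    // Rough upper bound on the count-th prime (valid for count >= 6);
    // fall back to a small constant for tiny counts.
    std::size_t bound = (count < 6) ? 16
        : static_cast<std::size_t>(c * (std::log(c) + std::log(std::log(c)))) + 1;
    for (;;) {
        std::vector<bool> sieve(bound + 1, true);
        primes.clear();
        for (std::size_t i = 2; i <= bound; ++i) {
            if (!sieve[i]) continue;
            primes.push_back(static_cast<unsigned int>(i));
            if (primes.size() == count) return primes;
            for (std::size_t j = i * i; j <= bound; j += i)
                sieve[j] = false;
        }
        bound *= 2;  // estimate was too low; retry with a larger bound
    }
}
```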

Branch Prediction: Writing Code to Understand it; Getting Weird Results

I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0s and 1s arranged in different orders (i.e. all 0s, repeating 0-1, all random), and iterates through the array, branching based on whether the current element is 0 or 1 and doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.
(X-axis is amount of time-wasting work, Y-axis is time-to-run)
Does anyone understand this behavior? The code I'm running is below:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;
int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
int* zeroNums = new int[pipelineLen];
int* oneNums = new int[pipelineLen];
for(int i = 0; i < pipelineLen; ++i)
zeroNums[i] = oneNums[i] = 0;
chrono::time_point<chrono::system_clock> start, end;
start = chrono::system_clock::now();
for(int i = 0; i < s_iArrayLen; ++i){
if(vals[i] == 0){
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
else{
for(int i = 0; i < pipelineLen; ++i)
++oneNums[i];
}
}
end = chrono::system_clock::now();
int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();
//This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
for(int i = 0; i < pipelineLen - 1; ++i)
if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
return -1;
delete[] zeroNums;
delete[] oneNums;
return elapsedMicroseconds;
}
struct TestMethod{
string name;
void (*func)(int, int&);
int* results;
TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};
int main(){
srand( (unsigned int)time(nullptr) );
vector<TestMethod> testMethods;
testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));
int* vals = new int[s_iArrayLen];
for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
int resultsSum = 0;
for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
//Generate a new array...
for(int i = 0; i < s_iArrayLen; ++i)
testMethods[currentMethod].func(i, vals[i]);
//And record how long it takes
resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
}
testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
}
}
cout << "\t";
for(int i = 0; i < s_iMaxPipelineLen; ++i){
cout << i << "\t";
}
cout << "\n";
for (int i = 0; i < (int)testMethods.size(); ++i){
cout << testMethods[i].name.c_str() << "\t";
for(int j = 0; j < s_iMaxPipelineLen; ++j){
cout << testMethods[i].results[j] << "\t";
}
cout << "\n";
}
int end;
cin >> end;
delete[] vals;
}
Pastebin link: http://pastebin.com/F0JAu3uw
I think you may be measuring the cache/memory performance more than the branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, etc.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
OK, a quick and dirty test I knocked up in my tea break, which tried to mirror your own test method but without thrashing the cache, looks like this:
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
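One way to take the data-dependent branch out of the measured loop entirely is to index by the comparison result instead of branching on it. This is my own sketch, simplified to plain counters, not something from the answer above:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Counts zeros and non-zeros with no conditional branch in the loop:
// the comparison result (0 or 1) indexes the counter array directly.
std::pair<long, long> countBuckets(const std::vector<int>& vals)
{
    long counts[2] = {0, 0};
    for (std::size_t i = 0; i < vals.size(); ++i)
        ++counts[vals[i] != 0];
    return std::make_pair(counts[0], counts[1]);
}
```

With the branch gone, any remaining timing differences between input patterns would point at memory effects rather than misprediction.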
In addition to what JasonD pointed out, I would also like to note that there is a condition inside the for loop, which may affect branch prediction:
if(vals[i] == 0)
{
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
i < pipelineLen; is a condition just like your ifs. Of course the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure if this can explain wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
So your loops are not only measuring memory throughput, they are also affecting the BTB.
If you have passed 0-1 pattern in your list but then executed a for loop with pipelineLen = 2 your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0 and then it will start to overlap, so this can indeed explain wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than a literal explanation. Your CPU may have a much more sophisticated branch prediction architecture.