Is there any way to make C++ code run faster? I'm trying to optimize the slowest parts of my code, such as this:
void removeTrail(char floor[][SIZEX], int trail[][SIZEX])
{
    for (int y = 1; y < SIZEY - 1; y++)
        for (int x = 1; x < SIZEX - 1; x++)
        {
            if (trail[y][x] <= SLIMELIFE && trail[y][x] > 0)
            {
                trail[y][x]--;
                if (trail[y][x] == 0)
                    floor[y][x] = NONE;
            }
        }
}
Most of the guides I have found online are for more complex C++.
It really depends on what kind of optimization you are seeking. It seems to me that you are talking about a more "low-level" optimization, which can be achieved, in combination with compiler flags, by techniques such as changing the order of nested loops, changing where you place your if statements, deciding between recursive and iterative approaches, and so on.
However, the most effective optimizations are those that target the algorithms, meaning you change the complexity of your routines and thus often cut execution time by orders of magnitude. This would be the case, for example, when you decide to implement Quicksort instead of Selection Sort: an optimization from an O(n^2) to an O(n lg n) algorithm will hardly be beaten by any micro-optimization.
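To make the scale concrete: for n = 1,000,000 elements, an O(n^2) sort performs on the order of 10^12 operations, while an O(n lg n) sort performs on the order of 2x10^7. A trivial sketch of what "use the better algorithm" often means in practice:

#include <algorithm>
#include <vector>

// std::sort is O(n lg n); replacing a hand-rolled O(n^2) selection sort
// with this is an algorithmic optimization, not a micro-optimization.
void sortValues(std::vector<int>& values) {
    std::sort(values.begin(), values.end());
}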
In this particular case, I see that you are trying to "remove" elements from the matrix when they reach a certain value. Depending on how those values change, simply detecting when a value hits zero and adding it to a removal queue right there, instead of always scanning the whole matrix, might do it:
trail[y][x]--; // In some part of your code, this happens
if (trail[y][x] == 0) { // add for removal
    removalQueueY[yQueueTail++] = y;
    removalQueueX[xQueueTail++] = x;
}

// Then, instead of checking for removal as you currently do:
while (yQueueHead < yQueueTail) {
    // Remove the current element and advance the heads
    floor[removalQueueY[yQueueHead]][removalQueueX[xQueueHead]] = NONE;
    yQueueHead++, xQueueHead++;
}
Depending on how those values change (if it is not a simple trail[y][x]--), another data structure might prove more useful: a heap, an std::set, or an std::priority_queue, among other possibilities. It all comes down to which operations your algorithm must support, and which data structures let you execute those operations as efficiently as possible (in memory and execution time, depending on your priorities and needs).
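A minimal sketch of the removal-queue idea with a growable container (SIZEX, NONE, and the two arrays are the ones from the question; the function names are illustrative):

#include <utility>
#include <vector>

std::vector<std::pair<int,int>> removalQueue; // (y, x) cells whose trail just hit 0

// Wherever the decrement happens in your code:
void decrementCell(int trail[][SIZEX], int y, int x) {
    if (--trail[y][x] == 0)
        removalQueue.push_back(std::make_pair(y, x));
}

// Then, instead of scanning the whole matrix every frame:
void flushRemovals(char floor[][SIZEX]) {
    for (const auto& cell : removalQueue)
        floor[cell.first][cell.second] = NONE;
    removalQueue.clear();
}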
The first thing to do is to turn on compiler optimization. The most powerful optimization I know of is profile-guided optimization. For gcc:
1) g++ -fprofile-generate .... -o my_program
2) run my_program (typical load)
3) g++ -fprofile-use -O3 ... -o optimized_program
With a profile, -O3 does make sense.
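If you use clang instead, the analogous flow is roughly this (clang's instrumentation-based PGO):

1) clang++ -O2 -fprofile-instr-generate ... -o my_program
2) LLVM_PROFILE_FILE=my_program.profraw ./my_program (typical load)
3) llvm-profdata merge -output=my_program.profdata my_program.profraw
4) clang++ -O3 -fprofile-instr-use=my_program.profdata ... -o optimized_program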
The next thing is to perform algorithmic optimization, as in Renato_Ferreira's answer. If that doesn't work for your situation, you can improve your performance by a factor of 2-8 using vectorization. Your code looks vectorizable:
#include <cassert>
#include <cstdint> // for int32_t
#include <emmintrin.h>
#include <iostream>
#define SIZEX 100 // SIZEX % 4 == 0
#define SIZEY 100
#define SLIMELIFE 100
#define NONE 0xFF
void removeTrail(char floor[][SIZEX], int trail[][SIZEX]) {
    // check that trail is 16-byte aligned
    assert((((size_t)(&trail[0][0])) & (size_t)0xF) == 0);
    static const int lower_a[] = {0, 0, 0, 0};
    static const int sub_a[]   = {1, 1, 1, 1};
    static const int floor_a[] = {1, 1, 1, 1}; // cells equal to 1 hit zero after the decrement
    // x < SLIMELIFE + 1 is the SSE2 equivalent of the original x <= SLIMELIFE
    static const int upper_a[] = {SLIMELIFE + 1, SLIMELIFE + 1, SLIMELIFE + 1, SLIMELIFE + 1};
    __m128i lower_v = *(__m128i*) lower_a;
    __m128i upper_v = *(__m128i*) upper_a;
    __m128i sub_v   = *(__m128i*) sub_a;
    __m128i floor_v = *(__m128i*) floor_a;
    for (int i = 0; i < SIZEY; i++) {
        for (int j = 0; j < SIZEX; j += 4) { // only for SIZEX % 4 == 0
            __m128i x = *(__m128i*)(&trail[i][j]);
            __m128i floor_mask = _mm_cmpeq_epi32(x, floor_v);     // 32-bit
            floor_mask = _mm_packs_epi32(floor_mask, floor_mask); // now 16-bit
            floor_mask = _mm_packs_epi16(floor_mask, floor_mask); // now 8-bit
            int32_t fl_mask[4];
            *(__m128i*)fl_mask = floor_mask;
            *(int32_t*)(&floor[i][j]) |= fl_mask[0];
            __m128i less_mask  = _mm_cmplt_epi32(lower_v, x);     // 0 < x
            __m128i upper_mask = _mm_cmplt_epi32(x, upper_v);     // x <= SLIMELIFE
            __m128i mask = _mm_and_si128(less_mask, upper_mask);
            *(__m128i*)(&trail[i][j]) = _mm_sub_epi32(x, _mm_and_si128(mask, sub_v));
        }
    }
}
int main()
{
    int T[SIZEY][SIZEX];
    char F[SIZEY][SIZEX];
    for (int i = 0; i < SIZEY; i++) {
        for (int j = 0; j < SIZEX; j++) {
            F[i][j] = 0x0;
            T[i][j] = j - 10;
        }
    }
    removeTrail(F, T);
    for (int j = 0; j < SIZEX; j++) {
        std::cout << (int) F[2][j] << " " << T[2][j] << '\n';
    }
    return 0;
}
It looks like it does what it is supposed to do: no ifs, and four values per iteration. It works only for NONE = 0xFF; it could be done for other values, but that is more difficult.
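An alternative that avoids hand-written intrinsics entirely: restate the scalar loop branchlessly and let the compiler auto-vectorize it (g++ -O3 typically handles loops of this shape). A sketch, again assuming NONE == 0xFF, SLIMELIFE >= 1, and the question's names, with the same full-matrix bounds as the vector version:

void removeTrailBranchless(char floor[][SIZEX], int trail[][SIZEX]) {
    for (int y = 0; y < SIZEY; y++) {
        for (int x = 0; x < SIZEX; x++) {
            int t = trail[y][x];
            int active = (t > 0) & (t <= SLIMELIFE); // 1 when in range, else 0
            trail[y][x] = t - active;
            // -(t == 1) is all-ones exactly when the decrement hits zero,
            // and OR-ing all-ones into floor yields NONE (0xFF)
            floor[y][x] |= (char)-(t == 1);
        }
    }
}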
Related
I am learning to program with AVX. So, I wrote a simple program to multiply 4x4 matrices. While with no compiler optimizations the AVX version is slightly faster than the non-AVX version, with O3 optimization the non-AVX version becomes almost twice as fast as the AVX version. Any tips on how I can improve the performance of the AVX version? The full code follows.
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#define MAT_SIZE 4
#define USE_AVX
double A[MAT_SIZE][MAT_SIZE];
double B[MAT_SIZE][MAT_SIZE];
double C[MAT_SIZE][MAT_SIZE];
union {
double m[4][4];
__m256d row[4];
} matB;
void init_matrices()
{
for(int i = 0; i < MAT_SIZE; i++)
for(int j = 0; j < MAT_SIZE; j++)
{
A[i][j] = (float)(i+j);
B[i][j] = (float)(i+j+1);
matB.m[i][j] = B[i][j];
}
}
void print_result()
{
for(int i = 0; i < MAT_SIZE; i++)
{
for(int j = 0; j < MAT_SIZE; j++)
{
printf("%.1f\t", C[i][j]);
}
printf("\n");
}
}
void withoutAVX()
{
for(int row = 0; row < MAT_SIZE; row++)
for(int col = 0; col < MAT_SIZE; col++)
{
double sum = 0; // accumulate in double: A and B hold doubles
for(int e = 0; e < MAT_SIZE; e++)
sum += A[row][e] * B[e][col];
C[row][col] = sum;
}
}
void withAVX()
{
for(int row = 0; row < 4; row++)
{
//calculate_resultant_row(row);
const double* rowA = (const double*)&A[row];
__m256d* pr = (__m256d*)(&C[row]);
*pr = _mm256_mul_pd(_mm256_broadcast_sd(&rowA[0]), matB.row[0]);
for(int i = 1; i < 4; i++)
*pr = _mm256_add_pd(*pr, _mm256_mul_pd(_mm256_broadcast_sd(&rowA[i]),
matB.row[i]));
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int main()
{
init_matrices();
// start timer
unsigned long long cycles = rdtsc();
#ifdef USE_AVX
withAVX();
#else
withoutAVX();
#endif
// stop timer
cycles = rdtsc() - cycles;
printf("\nTotal time elapsed : %ld\n\n", cycles);
print_result();
return 0;
}
It's hard to be sure without knowing exactly what compiler and system you are using; you need to check the assembly of the generated code. Below are merely some possible reasons.
The compiler probably generated extra loads/stores. These will cost.
The innermost loop broadcasts elements from A, so you have extra loads. Optimal code would require only 8 loads (4 each for A and B) and 4 stores back to C. However, your code leads to at least 16 extra loads because of your use of broadcastsd. These will cost you as much as the computation itself, and probably more.
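One concrete change along these lines: accumulate the result row in a register and store to C once per row, instead of writing through *pr on every step. A sketch reusing the question's A, C, and matB (the function name is made up):

void withAVX_accum() {
    for (int row = 0; row < 4; row++) {
        const double* rowA = (const double*)&A[row];
        __m256d acc = _mm256_mul_pd(_mm256_broadcast_sd(&rowA[0]), matB.row[0]);
        for (int i = 1; i < 4; i++)
            acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_broadcast_sd(&rowA[i]), matB.row[i]));
        _mm256_storeu_pd(&C[row][0], acc); // single store per output row
    }
}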
Edit (too long for comments)
There are situations where the compiler won't be able to do smart optimization, or where it is sometimes "too clever" for its own good. Recently I even needed to use assembly to avoid a compiler optimization that actually led to bad code!
That said, if what you need is performance and you don't really care how you get there, I would suggest you first look for good libraries. For example, Eigen would fit the linear algebra in this example perfectly.
If you do want to learn SIMD programming, I suggest you start with simpler cases, such as adding two vectors. Most likely you will find that the compiler generates better vectorized binaries than your first few attempts, but simpler cases are more straightforward, so you will see where you need improvement more easily. In the process of attempting to produce code as good as or better than what the compiler generates, you will learn everything you need to write optimal code, and eventually you will be able to provide optimal implementations for code the compiler cannot optimize.
One thing to keep in mind is that the lower level you go, the less the compiler can do for you. You get more control over what binaries are generated, but it is also your responsibility to make them optimal. This advice is pretty vague; sorry I cannot be of more help.
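For reference, the Eigen route mentioned above reduces this example to a one-liner (assuming Eigen is available; Eigen::Matrix4d is its fixed-size 4x4 double matrix type):

#include <Eigen/Dense>

// Eigen generates vectorized code for fixed-size products like this.
Eigen::Matrix4d multiply(const Eigen::Matrix4d& a, const Eigen::Matrix4d& b) {
    return a * b;
}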
I am trying to solve this problem:
Given a string array words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share common letters. You may assume that each word will contain only lower case letters. If no such two words exist, return 0.
https://leetcode.com/problems/maximum-product-of-word-lengths/
You can create a bitmap of the characters in each word to check whether two words share characters, and then compute the max product.
I have two methods that are almost identical, but the first passes the checks while the second is too slow. Can you see why?
class Solution {
public:
int maxProduct2(vector<string>& words) {
int len = words.size();
int *num = new int[len];
// compute the bit O(n)
for (int i = 0; i < len; i ++) {
int k = 0;
for (int j = 0; j < words[i].length(); j ++) {
k = k | (1 << (words[i].at(j) - 'a')); // map 'a'..'z' to bits 0..25; shifting by the raw char code (>= 32) would be undefined
}
num[i] = k;
}
int c = 0;
// O(n^2)
for (int i = 0; i < len - 1; i ++) {
for (int j = i + 1; j < len; j ++) {
if ((num[i] & num[j]) == 0) { // if no common letters
int x = words[i].length() * words[j].length();
if (x > c) {
c = x;
}
}
}
}
delete []num;
return c;
}
int maxProduct(vector<string>& words) {
vector<int> bitmap(words.size());
for(int i=0;i<words.size();++i) {
int k = 0;
for(int j=0;j<words[i].length();++j) {
k |= 1 << (words[i][j] - 'a'); // same mapping fix as above
}
bitmap[i] = k;
}
int maxProd = 0;
for(int i=0;i<words.size()-1;++i) {
for(int j=i+1;j<words.size();++j) {
if ( !(bitmap[i] & bitmap[j])) {
int x = words[i].length() * words[j].length();
if ( x > maxProd )
maxProd = x;
}
}
}
return maxProd;
}
};
Why is the second function (maxProduct) too slow for LeetCode?
Solution
The second method makes repeated calls to words.size() in the loop conditions. If you save that in a variable, it works fine.
Since my comment turned out to be correct I'll turn my comment into an answer and try to explain what I think is happening.
I wrote some simple code to benchmark on my own machine, with two solutions of two loops each. The only difference is whether the call to words.size() is inside or outside the loop. The first solution takes approximately 13.87 seconds versus 16.65 seconds for the second. This isn't huge, but it's about 20% slower.
Even though vector.size() is a constant-time operation, that doesn't mean it's as fast as checking against a variable that's already in a register. Constant time can still have large variances. Inside nested loops, that adds up.
The other thing that could be happening (someone much smarter than me will probably chime in and let us know) is that you're hurting CPU optimizations like branching and pipelining. Every time it gets to the end of the loop, it has to stop, wait for the call to size() to return, and then check the loop variable against that return value. If the CPU can look ahead and guess that j is still going to be less than len, because it hasn't seen len change (len isn't even inside the loop!), it can make a good branch prediction each time and not have to wait.
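A minimal sketch of the fixed second method; the only change that matters here is hoisting words.size() into a local:

#include <algorithm>
#include <string>
#include <vector>
using namespace std;

int maxProductFixed(vector<string>& words) {
    const int len = (int)words.size(); // hoisted: the loop bound is now a plain register compare
    vector<int> bitmap(len);
    for (int i = 0; i < len; ++i) {
        int k = 0;
        for (size_t j = 0; j < words[i].length(); ++j)
            k |= 1 << (words[i][j] - 'a');
        bitmap[i] = k;
    }
    int maxProd = 0;
    for (int i = 0; i < len - 1; ++i)
        for (int j = i + 1; j < len; ++j)
            if (!(bitmap[i] & bitmap[j]))
                maxProd = max(maxProd, (int)(words[i].length() * words[j].length()));
    return maxProd;
}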
According to Improved Bulk-Loading Algorithms for Quadtrees, sorting the coordinates in z-order before insert will result in speedup for QuadTree batch insertion.
I need z-order implementation in C++.
I have x,y coordinates, both doubles. The solution for Z-order curves on Wikipedia is kind of unclear to me.
EDIT - assumptions
The coordinates I have are in google coordinates and are floating point numbers.
In the system we currently develop, we assume that any bulk (batch) to be inserted fits in RAM. We don't anticipate the need of external sort operations with swapping between disk and memory.
EDIT 2
Regarding the fact that Z-order works for integers only: I think the trick is to multiply by factors of 10 until all the data are integers. Once I have that, what is the way to perform z-order on the points?
This is untested code, you should double check if it works.
Also, this code is almost certainly not portable, and might even be undefined behavior. It's certainly implementation defined behavior at the very least, but it's probably unspecified... I'd have to more carefully read the rules regarding reinterpret_cast to and from char* to know for sure.
#include <cstdint>
#include <exception>
#include <vector>

uint64_t reinterpretDoubleAsUInt(double d) {
    int const doubleSize = sizeof(double);
    char* array = reinterpret_cast<char*>(&d);
    uint64_t result = 0;
    for (auto i = 0; i < doubleSize; ++i) {
        // go through unsigned char first: plain char may be signed, and
        // sign extension would corrupt the high bits of result
        result |= (uint64_t)(unsigned char)array[i] << (8*i);
    }
    return result;
}
bool lessThanZOrderDouble(std::vector<double> const& a, std::vector<double> const& b) {
    uint64_t j = 0;
    uint64_t x = 0;
    if (a.size() != b.size() || a.size() == 0) {
        throw std::exception();
    }
    int dimensions = a.size();
    for (auto i = 0; i < dimensions; ++i) {
        auto y = reinterpretDoubleAsUInt(a[i]) ^ reinterpretDoubleAsUInt(b[i]);
        if (x < y && x < (x ^ y)) {
            j = i;
            x = y;
        }
    }
    return (a[j] - b[j]) > 0;
}
int main() {
// blank for compilation's sake
}
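If the aliasing worry above matters to you, a memcpy-based version of the reinterpret step is well defined in both C and C++ (compilers turn it into a single move); a sketch:

#include <cstdint>
#include <cstring>

// Well-defined alternative to reinterpretDoubleAsUInt above.
uint64_t doubleBitsViaMemcpy(double d) {
    uint64_t bits;
    static_assert(sizeof bits == sizeof d, "double must be 64 bits");
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}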
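To answer the EDIT 2 part directly: for non-negative integer coordinates, the usual z-order key is computed by interleaving the bits of x and y, and sorting points by that key gives the z-order traversal. A standard mask-and-shift sketch:

#include <cstdint>

// Spread the 32 bits of v so they occupy the even bit positions of a uint64_t.
static uint64_t spreadBits(uint64_t v) {
    v &= 0xFFFFFFFFull;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFull;
    v = (v | (v << 8))  & 0x00FF00FF00FF00FFull;
    v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0Full;
    v = (v | (v << 2))  & 0x3333333333333333ull;
    v = (v | (v << 1))  & 0x5555555555555555ull;
    return v;
}

// Morton key: bits of x on even positions, bits of y on odd positions.
// Usage: sort points by morton2D(p.x, p.y) to get z-order.
uint64_t morton2D(uint32_t x, uint32_t y) {
    return spreadBits(x) | (spreadBits(y) << 1);
}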
I would like to optimize this simple loop:
unsigned int i;
while (j-- != 0) { // j is an unsigned int with a start value of about N = 36,000,000
    float sub = 0;
    i = 1;
    unsigned int c = j + s[1];
    while (c < N) {
        sub += d[i][j] * x[c]; // d[][] and x[] are arrays of float
        i++;
        c = j + s[i];          // s[] is an array of unsigned int with 6 entries
    }
    x[j] -= sub; // only one memory write per j
}
The loop has an execution time of about one second on a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive: each x[j] depends on values of x computed in earlier iterations.
Any suggestions?
I think you may want to transpose the matrix d -- that is, store it in such a way that you can exchange the indices -- making i the outer index:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
//We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triagular matrix described by the rest of the diagonals.
//Let D Lt x = y. Then, first solve L y = b.
float *y = new float[n];
float **d = IncompleteCholeskyFactorization->Diagonals;
unsigned int *s = IncompleteCholeskyFactorization->StartRows;
unsigned int M = IncompleteCholeskyFactorization->m;
unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i, j;
for(j = 0; j != N; j++){
float sub = 0;
for(i = 1; i != M; i++){
int c = (int)j - (int)s[i];
if(c < 0) break;
if(c==j) {
sub += d[i][c]*b[c];
} else {
sub += d[i][c]*y[c];
}
}
y[j] = b[j] - sub;
}
//Now, solve x from D Lt x = y -> Lt x = D^-1 y
// Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] = y[j]/d[0][j];
}
while(j-- != 0){
float sub = 0;
for(i = 1; i != M; i++){
if(j + s[i] >= N) break;
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
delete[] y;
}
Because of the comment about the parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99; you might want to use a define for it, and maybe the library already has one).
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it; b is never used once x or y has been set. That would also mean you can get rid of the branch in the first loop, which runs ~N*M times. Use memcpy at the start if you must have two parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
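To make the flattening point concrete (illustrative only, not a drop-in change to the library):

#include <vector>

// Flat storage: one contiguous M*N block indexed with a multiply-add,
// instead of loading a row pointer and then the value (d[i][c]).
float diagAt(const std::vector<float>& dFlat, unsigned N, unsigned i, unsigned c) {
    return dFlat[i * N + c];
}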
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
// comments removed so that suggestions are more visible. Don't remove them in the real code!
// these definitions got long. Feel free to remove const; it does nothing for the optimiser
const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
const unsigned int M = IncompleteCholeskyFactorization->m;
const unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i;
unsigned int j;
for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
float sub = 0;
for(i = 1; i < M && j >= s[i]; i++){
const unsigned int c = j - s[i];
sub += d[i][c]*x[c];
}
x[j] -= sub;
}
// Consider using processor-specific optimisations for this
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] /= d[0][j];
}
for( j = N; (j --) > 0; ){ // changed for clarity
float sub = 0;
for(i = 1; i < M && j + s[i] < N; i++){
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
}
Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).
I want to give an answer to my own question: the bad performance was caused by cache conflict misses, due to the fact that (at least) Win7 aligns big memory blocks to the same boundary. In my case, all the buffers had the same alignment (bufferaddress % 4096 was the same for all buffers), so they fell into the same cache set of the L1 cache. I changed the memory allocation to align the buffers to different boundaries to avoid cache conflict misses and got a speedup of a factor of 2. Thanks for all the answers, especially the answers from Dave!
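A sketch of that allocation change (illustrative; assumes 64-byte cache lines and the 4096-byte conflict described above, with names made up for this example):

#include <cstdlib>

// Stagger each buffer's start within the 4 KiB page so different buffers
// map to different L1 cache sets. The caller frees 'base', not the
// returned pointer.
float* allocStaggered(std::size_t count, std::size_t bufferIndex, void** base) {
    std::size_t offset = (bufferIndex * 64) % 4096; // one cache line per buffer
    char* raw = static_cast<char*>(std::malloc(count * sizeof(float) + 4096));
    *base = raw;
    return reinterpret_cast<float*>(raw + offset);
}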
I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all rand), and iterates through the array branching based on if the current index is 0 or 1, doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as the amount of time-wasting work increased, the difference in time-to-run between arrays increased, a lot.
(Chart: X-axis is the amount of time-wasting work; Y-axis is time-to-run.)
Does anyone understand this behavior? Here is the code I'm running:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;
int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
int* zeroNums = new int[pipelineLen];
int* oneNums = new int[pipelineLen];
for(int i = 0; i < pipelineLen; ++i)
zeroNums[i] = oneNums[i] = 0;
chrono::time_point<chrono::system_clock> start, end;
start = chrono::system_clock::now();
for(int i = 0; i < s_iArrayLen; ++i){
if(vals[i] == 0){
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
else{
for(int i = 0; i < pipelineLen; ++i)
++oneNums[i];
}
}
end = chrono::system_clock::now();
int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();
//This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
for(int i = 0; i < pipelineLen - 1; ++i)
if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
return -1;
delete[] zeroNums;
delete[] oneNums;
return elapsedMicroseconds;
}
struct TestMethod{
string name;
void (*func)(int, int&);
int* results;
TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};
int main(){
srand( (unsigned int)time(nullptr) );
vector<TestMethod> testMethods;
testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));
int* vals = new int[s_iArrayLen];
for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
int resultsSum = 0;
for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
//Generate a new array...
for(int i = 0; i < s_iArrayLen; ++i)
testMethods[currentMethod].func(i, vals[i]);
//And record how long it takes
resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
}
testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
}
}
cout << "\t";
for(int i = 0; i < s_iMaxPipelineLen; ++i){
cout << i << "\t";
}
cout << "\n";
for (int i = 0; i < (int)testMethods.size(); ++i){
cout << testMethods[i].name.c_str() << "\t";
for(int j = 0; j < s_iMaxPipelineLen; ++j){
cout << testMethods[i].results[j] << "\t";
}
cout << "\n";
}
int end;
cin >> end;
delete[] vals;
}
Pastebin link: http://pastebin.com/F0JAu3uw
I think you may be measuring cache/memory performance more than branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, and so on.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
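Concretely, that could look like the following sketch (reusing vals, s_iArrayLen, and pipelineLen from the question); the branch structure is unchanged, but the growing arrays are gone:

// volatile keeps the compiler from deleting the "work" loops, and two
// scalars replace the ever-growing zeroNums/oneNums arrays.
volatile int zeroSink = 0, oneSink = 0;
for (int i = 0; i < s_iArrayLen; ++i) {
    if (vals[i] == 0) {
        for (int k = 0; k < pipelineLen; ++k)
            ++zeroSink;
    } else {
        for (int k = 0; k < pipelineLen; ++k)
            ++oneSink;
    }
}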
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
OK, here's a quick and dirty test I knocked up in my tea break, which tried to mirror your own test method but without thrashing the cache (chart not reproduced here).
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And here's my final test: I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc. (chart not reproduced here).
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
In addition to what JasonD pointed out, I would also like to note that there are conditions inside the for loop which may affect branch prediction:
if (vals[i] == 0)
{
    for (int i = 0; i < pipelineLen; ++i)
        ++zeroNums[i];
}
i < pipelineLen; is a condition much like your ifs. Of course the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure whether this can explain the wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
So your loops are not only measuring memory throughput, they are also affecting the BTB.
If you passed the 0-1 pattern in your list but then executed a for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than a literal explanation; your CPU may have a much more sophisticated branch prediction architecture.