I have this function (RotateSlownessTop), which is called about 800 times to compute the corresponding values. The calculation is slow; is there a way to make it faster?
The number of elements in X/Y is 7202 (a fairly large set).
I ran a performance analysis, and the screenshot is attached.
void RotateSlownessTop(vector <double> &XR1, vector <double> &YR1, float theta = 0.0)
Matrix2d a;
a(0,0) = cos(theta);
a(0,1) = -sin(theta);
a(1, 0) = sin(theta);
a(1, 1) = cos(theta);
vector <double> XR2(7202), YR2(7202);
for (size_t i = 0; i < X.size(); ++i)
XR2[i] = (a(0, 0)*X[i] + a(0, 1)*Y[i]);
YR2[i] = (a(1, 0)*X[i] + a(1, 1)*Y[i]);
size_t i = 0;
size_t j = 0;
while (i < YR2.size())
if (i > 0)
if ((XR2[i]>0) && (XR2[i-1]<0))
j = i;
if (YR2[i] > (-1e-10) && YR2[i]<0.0)
YR2[i] = 0.0;
if (YR2[i] < (1e-10) && YR2[i]>0.0)
YR2[i] = -YR2[i];
if ( YR2[i]<0.0)
YR2.erase(YR2.begin() + i);
XR2.erase(XR2.begin() + i);
size_t k = 0;
while (j < YR2.size())
YR1[k] = (YR2[j]);
XR1[k] = (XR2[j]);
YR2.erase(YR2.begin() + j);
XR2.erase(XR2.begin() + j);
size_t l = 0;
for (; k < XR1.size(); ++k)
XR1[k] = XR2[l];
YR1[k] = YR2[l];
Edit 1: I have updated the code by replacing every push_back() with operator[], since I read somewhere that this is much faster.
However, the whole program is still slow. Any suggestions are appreciated.
If the size is large, you can speed up push_back() by pre-allocating the space needed. Add this before the loop:
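A minimal sketch of that pre-allocation, assuming the loop appends at most one rotated point per input element. The function name RotateKeepNonNegative and the "keep if y >= 0" filter are illustrative assumptions, not the asker's exact logic. reserve() allocates the capacity once so that push_back() never reallocates, and filtering while copying avoids repeated erase() calls, each of which shifts all later elements:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: rotate (X[i], Y[i]) by theta and keep only the points
// whose rotated y coordinate is non-negative.
void RotateKeepNonNegative(const std::vector<double> &X,
                           const std::vector<double> &Y,
                           double theta,
                           std::vector<double> &XR,
                           std::vector<double> &YR)
{
    const double c = std::cos(theta);
    const double s = std::sin(theta);

    XR.clear();
    YR.clear();
    // Pre-allocate once: at most X.size() elements will be appended.
    XR.reserve(X.size());
    YR.reserve(X.size());

    for (std::size_t i = 0; i < X.size(); ++i) {
        const double xr = c * X[i] - s * Y[i];
        const double yr = s * X[i] + c * Y[i];
        if (yr >= 0.0) {  // filter while copying instead of calling erase()
            XR.push_back(xr);
            YR.push_back(yr);
        }
    }
}
```

Because the output vectors are cleared rather than reconstructed, calling this 800 times with the same XR and YR reuses the memory allocated on the first call.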
I've recently been implementing a neural network based on http://neuralnetworksanddeeplearning.com/. I wrote the whole backprop and SGD algorithm almost the same way as the author of the book. The problem is that while he gets around 90% accuracy after one epoch, I get 30% after 5 epochs, even though I use the same hyperparameters. Do you have any idea what the cause might be?
Here is my repository.
Here is the part with the backprop and SGD algorithm, implemented in Network.cpp:
void Network::Train(MatrixD_Array& TrainingData, MatrixD_Array& TrainingLabels, int BatchSize,int epochs, double LearningRate)
assert(TrainingData.size() == TrainingLabels.size() && CostFunc != nullptr && CostFuncDer != nullptr && LearningRate > 0);
std::vector<long unsigned int > indexes;
for (int i = 0; i < TrainingData.size(); i++) indexes.push_back(i);
std::random_device rd;
std::mt19937 g(rd());
std::vector<Matrix<double>> NablaWeights;
std::vector<Matrix<double>> NablaBiases;
for (int i = 0; i < Layers.size(); i++)
NablaWeights[i] = Matrix<double>(Layers[i].GetInDim(), Layers[i].GetOutDim());
NablaBiases[i] = Matrix<double>(1, Layers[i].GetOutDim());
//---- Epoch iterating
for (int i = 0; i < epochs; i++)
cout << "Epoch number: " << i << endl;
shuffle(indexes.begin(), indexes.end(), g);
// Batch iterating
for (int batch = 0; batch < TrainingData.size(); batch = batch + BatchSize)
for (int i = 0; i < Layers.size(); i++)
int i = 0;
while( i < BatchSize && (i+batch)< TrainingData.size())
std::vector<Matrix<double>> ActivationOutput;
std::vector<Matrix<double>> Z_Output;
ActivationOutput.resize(Layers.size() + 1);
ActivationOutput[0] = TrainingData[indexes[i + batch]];
int index = 0;
// Pushing values through
for (auto layer : Layers)
Z_Output[index] = layer.Mul(ActivationOutput[index]);
ActivationOutput[index + 1] = layer.ApplyActivation(Z_Output[index]);
// ---- Calculating Nabla that will later be divided by the batch size element-wise
auto DeltaNabla = BackPropagation(ActivationOutput, Z_Output, TrainingLabels[indexes[i + batch]]);
for (int i = 0; i < Layers.size(); i++)
NablaWeights[i] = NablaWeights[i] + DeltaNabla.first[i];
NablaBiases[i] = NablaBiases[i] + DeltaNabla.second[i];
for (int g = 0; g < Layers.size(); g++)
Layers[g].Weights = Layers[g].Weights - NablaWeights[g] * LearningRate;
Layers[g].Biases = Layers[g].Biases - NablaBiases[g] * LearningRate;
// std::transform(NablaWeights.begin(), NablaWeights.end(), NablaWeights.begin(),[BatchSize, LearningRate](Matrix<double>& Weight) {return Weight * (LearningRate / BatchSize);});
//std::transform(NablaBiases.begin(), NablaBiases.end(), NablaBiases.begin(), [BatchSize, LearningRate](Matrix<double>& Bias) {return Bias * (LearningRate / BatchSize); });
std::pair<MatrixD_Array, MatrixD_Array> Network::BackPropagation( MatrixD_Array& ActivationOutput, MatrixD_Array& Z_Output,Matrix<double>& label)
MatrixD_Array NablaWeight;
MatrixD_Array NablaBias;
auto zs = Layers[Layers.size() - 1].ActivationPrime(Z_Output[Z_Output.size() - 1]);
Matrix<double> Delta_L = Hadamard(CostFuncDer(ActivationOutput[ActivationOutput.size() - 1],label), zs);
NablaWeight[Layers.size() - 1] = Delta_L * ActivationOutput[ActivationOutput.size() - 2].Transpose();
NablaBias[Layers.size() - 1] = Delta_L;
for (int j = 2; j <= Layers.size() ; j++)
auto sp = Layers[Layers.size() - j].ActivationPrime(Z_Output[Layers.size() -j]);
Delta_L = Hadamard(Layers[Layers.size() - j+1 ].Weights.Transpose() * Delta_L, sp);
NablaWeight[Layers.size() - j] = Delta_L * ActivationOutput[ActivationOutput.size() -j-1].Transpose();
NablaBias[Layers.size() - j] = Delta_L;
return make_pair(NablaWeight, NablaBias);
It turned out that the MNIST loader didn't work correctly.
I am struggling a bit with a function. The result is wrong if I try to parallelize the outer loop with
#pragma omp parallel reduction(+:det).
Can someone show me how to fix it and explain why it fails?
// template<class T> using vector2D = std::vector<std::vector<T>>;
float Det(vector2DF &a, int n)
{
    vector2DF m(n - 1, vector1DF(n - 1, 0));
    if (n == 1) return a[0][0];
    if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
    float det = 0;
    for (int i = 0; i < n; i++)
    {
        int l = 0;
        #pragma omp parallel for private(l)
        for (int j = 1; j < n; j++)
        {
            l = 0;
            for (int k = 0; k < n; k++)
            {
                if (k == i) continue;
                m[j - 1][l] = a[j][k];
                l++;
            }
        }
        det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
    }
    return det;
}
If you parallelize the outer loop, there is a race condition on this line:
m[j - 1][l] = a[j][k];
Also, you likely want a parallel for reduction instead of just a parallel reduction: without the for, every thread executes the whole loop instead of sharing its iterations.
The issue is that m is shared, even though that isn't necessary, given that it is completely overwritten in the inner loop. Always declare variables as locally as possible; this avoids issues with wrongly shared variables, e.g.:
float Det(vector2DF &a, int n)
{
    if (n == 1) return a[0][0];
    if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
    float det = 0;
    #pragma omp parallel for reduction(+:det)
    for (int i = 0; i < n; i++)
    {
        // m is declared inside the loop body, so it is private to each iteration
        vector2DF m(n - 1, vector1DF(n - 1, 0));
        for (int j = 1; j < n; j++)
        {
            int l = 0;
            for (int k = 0; k < n; k++)
            {
                if (k == i) continue;
                m[j - 1][l] = a[j][k];
                l++;
            }
        }
        det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
    }
    return det;
}
Now that is correct, but since m can be expensive to allocate, performance could benefit from not allocating it in each and every iteration. This can be done by splitting the parallel and for directives, as follows:
float Det(vector2DF &a, int n)
{
    if (n == 1) return a[0][0];
    if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
    float det = 0;
    #pragma omp parallel reduction(+:det)
    {
        // one scratch matrix per thread, allocated once per parallel region
        vector2DF m(n - 1, vector1DF(n - 1, 0));
        #pragma omp for
        for (int i = 0; i < n; i++)
        {
            for (int j = 1; j < n; j++)
            {
                int l = 0;
                for (int k = 0; k < n; k++)
                {
                    if (k == i) continue;
                    m[j - 1][l] = a[j][k];
                    l++;
                }
            }
            det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
        }
    }
    return det;
}
Now you could also just declare m as firstprivate, but that would assume that the copy constructor makes a completely independent deep copy, and would thus make the code more difficult to reason about.
Please be aware that you should always include expected output, actual output and a minimal complete and verifiable example.
I have a problem with my knapsack algorithm. To be honest, I have no idea what is wrong. When I run the program once, everything works fine, but when I run it in a loop (for testing), I get a lot of problems.
For example:
Weight/Val in file : 100
max knapsack capacity: 1000
First iteration:
Max profit: 2597
The resulting weight: 994/1000
That is fine, but now another iteration.
Second iteration:
Max profit: 2538
The resulting weight: 1004/1000 <- and here is my problem: it is over my max capacity.
The 3rd and 4th were okay, then the 5th was wrong (1355/1000), and so on.
The function where the problem might be:
void intoKnapsack(int k, float actual_profit, float actual_weight)
if (actual_weight + weight[k] <= cap)
tmp[k] = 1;
if (k <= number_items)
intoKnapsack(k + 1, actual_profit + value[k], actual_weight + weight[k]);
if (((actual_profit + value[k]) > final_profit) && (k == number_items))
final_profit = actual_profit + value[k];
final_weight = actual_weight + weight[k];
for (j = 0; j <= k; j++)
knap[j] = tmp[j];
else if ((bound(actual_profit, actual_weight, k) >= final_profit))
tmp[k] = 0;
if (k <= number_items)
intoKnapsack(k + 1, actual_profit, actual_weight);
if ((actual_profit > final_profit) && (k == number_items))
final_profit = actual_profit;
final_weight = actual_weight;
for (j = 0; j <= k; j++)
knap[j] = tmp[j];
Can someone help with my problem?
OK, so when I read the same N only once (like 100 in the example above), it works fine, but when I do it in a loop:
srand((unsigned int) time(NULL));
algorytm a;
fstream wynik;
wynik.open("result.txt",ios::out | ios::app);
for(int i=0; i<how_test; i++){ //how many tests
write(how_n); //how many n in my file, and create file
a.read(); //read from file (n, and weight / val)
a.sort(); //I sort it
a.intoKnapsack(0, 0.0, 0.0); //my function above, so I give here a 3x to do it properly over and over in loop
get_time(); //stop time
wynik<<get_time()<<" s."<<endl; //just for
}
So when I manually call, for example, write(50) and then write(51) in the same program, and so on, it works well, but when I call write(50) and then write(50) again, the algorithm gives wrong results.
Maybe the sort does not work when it runs in the next loop iteration before the knapsack has been cleared, but on the other hand I need to sort first.
Here is my sort function:
void algorytm::sort() {
int a;
int b;
float c;
for (i = 0; i < number_items; i++)
factor[i] = (float) val[i] / (float) weight[i]; //to sort from best to worst
for (i = 0; i < number_items - 1; i++) {
for (j = i + 1; j < number_items; j++) {
if (factor[i] < factor[j]) {
c = factor[i]; //
factor[i] = factor[j];
factor[j] = c;
a = val[i]; //
val[i] = val[j];
val[j] = a;
b = weight[i]; //
weight[i] = weight[j];
weight[j] = b;
}
}
}
}
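Since the results differ only when the solver runs repeatedly, one thing worth checking (an assumption, the post does not confirm it) is whether final_profit, final_weight, knap and tmp are reset before each run, and whether k can reach number_items, which would read one element past the end of weight and value. A self-contained sketch of a recursive 0/1 knapsack search that resets its state on every call; the KnapsackSolver type and its members are hypothetical, not the asker's class:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical solver: plain include/exclude recursion with a capacity check.
struct KnapsackSolver {
    std::vector<int> weight, value;
    int cap = 0;
    int best_profit = 0;
    int best_weight = 0;

    // Resets the result state, then searches from item 0.
    int solve()
    {
        best_profit = 0;  // without this, a previous run's result leaks in
        best_weight = 0;
        search(0, 0, 0);
        return best_profit;
    }

private:
    void search(std::size_t k, int profit, int w)
    {
        if (k == weight.size()) {  // all items decided; the arrays are not indexed here
            if (profit > best_profit) {
                best_profit = profit;
                best_weight = w;
            }
            return;
        }
        if (w + weight[k] <= cap)  // take item k only if it still fits
            search(k + 1, profit + value[k], w + weight[k]);
        search(k + 1, profit, w);  // skip item k
    }
};
```

Calling solve() twice on the same data should give the same profit and a weight that never exceeds cap; if the real program fails that check, leftover state between iterations is a likely suspect.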
Context: Multichannel real time digital audio processing.
Access pattern: "Column-major", like so:
for (int sample = 0; sample < size; ++sample)
for (int channel = 0; channel < size; ++channel)
auto data = arr[channel][sample];
// do some computations
I'm seeking advice on how to make life easier for the CPU and memory in general. I realize interleaving the data would be better, but that is not possible.
My theory is that as long as you access memory sequentially for a while, the CPU will prefetch it. Will this hold for N (channel) buffers? What about the size of the buffers; are there any "breaking points"?
Will it be very beneficial to have the channels in contiguous memory (increasing locality), or does that only hold for very small buffers (say, the size of a cache line)? We could be talking about buffers more than 100 kB apart.
I guess there is also a point where the time spent on the computational part makes memory optimizations negligible?
Is this a case where manual prefetching makes sense?
I could test and profile my own system, but I only have that one system, so any design choices I make may only benefit that particular system. Any knowledge on these matters is appreciated: links, literature, platform-specific knowledge, etc.
Let me know if the question is too vague; I primarily thought it would be nice to have some wiki-ish experience and info on this area.
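To make the "channels in contiguous memory" option concrete, here is a minimal sketch of a planar multichannel buffer backed by a single allocation, so that channel c starts exactly frames samples after channel c - 1. PlanarBuffer, SumAll and their member names are hypothetical; whether this layout actually beats separately allocated channels is exactly what would have to be measured:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical planar buffer: all channels live in one contiguous block.
class PlanarBuffer {
public:
    PlanarBuffer(std::size_t channels, std::size_t frames)
        : channels_(channels), frames_(frames), data_(channels * frames, 0.0f) {}

    // Pointer to the first sample of one channel.
    float *channel(std::size_t c) { return data_.data() + c * frames_; }

    std::size_t channels() const { return channels_; }
    std::size_t frames() const { return frames_; }

private:
    std::size_t channels_;
    std::size_t frames_;
    std::vector<float> data_;
};

// Same "sample outer, channel inner" access pattern as in the question.
float SumAll(PlanarBuffer &buf)
{
    float sum = 0.0f;
    for (std::size_t sample = 0; sample < buf.frames(); ++sample)
        for (std::size_t ch = 0; ch < buf.channels(); ++ch)
            sum += buf.channel(ch)[sample];
    return sum;
}
```

The processing code only ever sees per-channel pointers, so the same loop can be run against separately allocated channels for a direct comparison.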
I created a program that tests the three cases I mentioned (distant, adjacent and contiguous, in supposedly increasing order of performance) on small and big data sets. Maybe people will run it and report anomalies.
#include <iostream>
#include <chrono>
#include <numeric> // std::accumulate
const int b = 196000;
const int s = 64 / sizeof(float);
const int extra_it = 16;
float sbuf1[s];
float bbuf1[b];
int main()
float sbuf2[s];
float bbuf2[b];
float * sbuf3 = new float[s];
float * bbuf3 = new float[b];
float * sbuf4 = new float[s * 3];
float * bbuf4 = new float[b * 3];
float use = 0;
while (1)
using namespace std;
int c;
bool sorb;
cout << "small or big test (0/1)? ";
if (!(cin >> sorb))
return -1;
cout << endl << "test distant buffers (0), contiguous access (1) or adjacent access (2)? ";
if (!(cin >> c))
return -1;
auto t = std::chrono::high_resolution_clock::now();
if (c == 0)
// "worst case scenario", 3 distant buffers constantly touched
if (sorb)
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s; ++i)
sbuf1[i] = k; // static memory
sbuf2[i] = k; // stack memory
sbuf3[i] = k; // heap memory
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b; ++i)
bbuf1[i] = k; // static memory
bbuf2[i] = k; // stack memory
bbuf3[i] = k; // heap memory
else if (c == 1)
// "best case scenario", only contiguous memory touched, interleaved
if (sorb)
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s * 3; i += 3)
sbuf4[i] = k;
sbuf4[i + 1] = k;
sbuf4[i + 2] = k;
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b * 3; i += 3)
bbuf4[i] = k;
bbuf4[i + 1] = k;
bbuf4[i + 2] = k;
else if (c == 2)
// "compromise", adjacent memory buffers touched
if (sorb)
auto b1 = sbuf4;
auto b2 = sbuf4 + s;
auto b3 = sbuf4 + s * 2;
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s; ++i)
b1[i] = k;
b2[i] = k;
b3[i] = k;
auto b1 = bbuf4;
auto b2 = bbuf4 + b;
auto b3 = bbuf4 + b * 2;
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b; ++i)
b1[i] = k;
b2[i] = k;
b3[i] = k;
cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - t).count() << " ms" << endl;
// basically just touching the buffers, avoiding clever optimizations
use += std::accumulate(sbuf1, sbuf1 + s, 0);
use += std::accumulate(sbuf2, sbuf2 + s, 0);
use += std::accumulate(sbuf3, sbuf3 + s, 0);
use += std::accumulate(sbuf4, sbuf4 + s * 3, 0);
use -= std::accumulate(bbuf1, bbuf1 + b, 0);
use -= std::accumulate(bbuf2, bbuf2 + b, 0);
use -= std::accumulate(bbuf3, bbuf3 + b, 0);
use -= std::accumulate(bbuf4, bbuf4 + b * 3, 0);
std::cout << use;
On my Intel i7-3740QM, surprisingly, the distant buffers consistently outperform the more locality-friendly tests. It is close, however.
Integer Range = 1;
for(Integer k = -Range; k <= Range; ++k)
for(Integer j = -Range; j <= Range; ++j)
for(Integer i = -Range; i <= Range; ++i)
+ k);
if(MCID < 0 || MCID >= c_CellNum)
unsigned int TriangleNum = c_daCell[MCID].m_TriangleNum;
for(unsigned int l = 0; l < TriangleNum; ++l)
if( TriangleID >= 0 && TriangleID < c_TriangleNum && TriangleID
!= NearestID)// No need to calculate again for the same triangle
CDistance Distance ;
Distance.Magnitude = CalcDistance(&c_daTriangles[TriangleID], &TargetPosition,
if(Distance.Magnitude < NearestDistance.Magnitude)
NearestDistance = Distance;
NearestID = TriangleID;
c_daSTLDistance[ID] = NearestDistance;
c_daSTLID[ID] = NearestID;
GetCellID is the function that returns the cell ID in the variable CID, given CIDX, CIDY and CIDZ, the cell's position along the three axes.
The code above is a function that calculates the distance (the STL distance) between a point and the triangles of the STL. The code runs fine, but it is too slow because it has a large number of nested loops. My concern now is optimizing these loops. Is there any technique for optimizing the loops in this code?
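One loop-level technique is to clamp the neighbor range to the grid once, rather than testing every candidate cell ID inside the triple loop. The sketch below assumes a regular grid of nx * ny * nz cells with flat IDs x + nx * (y + ny * z); that layout and the NeighborCellIDs helper are assumptions, since GetCellID itself is not shown:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical grid layout: flat cell ID = x + nx * (y + ny * z).
std::vector<int> NeighborCellIDs(int cx, int cy, int cz,
                                 int nx, int ny, int nz, int range = 1)
{
    // Clamp each axis once, so no per-cell validity test is needed in the loops.
    const int x0 = std::max(cx - range, 0), x1 = std::min(cx + range, nx - 1);
    const int y0 = std::max(cy - range, 0), y1 = std::min(cy + range, ny - 1);
    const int z0 = std::max(cz - range, 0), z1 = std::min(cz + range, nz - 1);

    std::vector<int> ids;
    ids.reserve((2 * range + 1) * (2 * range + 1) * (2 * range + 1));
    for (int z = z0; z <= z1; ++z)
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)  // x innermost: consecutive IDs
                ids.push_back(x + nx * (y + ny * z));
    return ids;
}
```

Beyond restructuring the loops, comparing squared distances and taking the square root only for the final nearest triangle is another cheap saving, provided the distance routine can expose the squared value.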