Program compiled with VS2012 runs slower than on VS2010 - C++

I have a simple piece of code that generates random graphs by considering every possible edge and randomly deciding which ones to write to a file, based on a parameter p: the probability that each edge is added. The algorithm looks like this:
for (unsigned int i = 0; i < nodeCount; ++i)
{
    for (unsigned int j = i+1; j < nodeCount; ++j)
    {
        uniform_int_distribution<> d(0, 999999);
        int randNum = d(randEngine);
        if (randNum < p * 1000000)
        {
            // code that adds the edge to the file
        }
    }
}
I used to compile this code with VS2010, and creating graphs with 50000 nodes and p = 0.00002 took ~26 seconds. I have now upgraded to VS2012, recompiled, and I get ~74 seconds - almost triple the time!
I isolated the part that runs slower: it seems to be the random number generation (I commented out the code that is not shown above, which writes to the file, and the same times were recorded).
The project settings are exactly the same (except for the platform toolset - v100 vs. v110). Any ideas why the random number generation has become so much slower in VS2012? Or is it me doing something wrong?
Either way - what can I do, short of going back to VS2010?
Thanks.
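One workaround worth trying before rolling back: since each edge is an independent probability-p trial, the distribution can be constructed once, outside the loops, and std::bernoulli_distribution expresses the test directly. A minimal sketch (reusing nodeCount, p and randEngine from the question; the mt19937 engine is an assumption, since the question does not show randEngine's type):

#include <random>

std::mt19937 randEngine(std::random_device{}());
std::bernoulli_distribution edgeTrial(p);  // true with probability p

for (unsigned int i = 0; i < nodeCount; ++i)
{
    for (unsigned int j = i + 1; j < nodeCount; ++j)
    {
        if (edgeTrial(randEngine))
        {
            // code that adds the edge to the file
        }
    }
}

This avoids re-creating the distribution object on every inner iteration, which is exactly where standard library implementations are free to differ in cost.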

Related

Why does position of code affect performance in C++?

I am running a performance test and found that changing the order of the code makes it faster without changing the result.
Execution time is measured with the chrono library.
vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

for (i = 0; i < matrix_size; ++i) {
    for (j = 0; j < matrix_size; ++j) {
        // Part II: ________________________________________
        float sum2 = 0;
        for (k = 0; k <= (i-1); ++k) {
            float sum2_temp = L[i][k] * U[k][j];
            sum2 += sum2_temp;
        }
        // Part I: _____________________________________________
        float sum1 = 0;
        for (k = 0; k <= (j-1); ++k) {
            float sum1_temp = L[i][k] * U[k][j];
            sum1 += sum1_temp;
        }
        //__________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}
I compile with g++ -O3 (GCC 7.4.0, Intel i5, Windows 10).
I changed the order of Part I and Part II and got a faster result when Part II executes before Part I. What's going on?
This is the link to the whole program.
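For reference, a minimal std::chrono timing harness of the kind the question describes (illustrative only, not the linked program's exact code):

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();

    // ... the factorisation loops from the question go here ...

    auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}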
I would try running both versions with perf stat -d <app> and see where the difference in performance counters is.
When benchmarking, you may also want to pin the CPU frequency so it doesn't affect your scores.
Aligning loops on a 32-byte boundary often increases performance by 8-30%. See Causes of Performance Instability due to Code Placement in X86 - Zia Ansari, Intel for more details.
Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.
Running perf stat -ddd while playing around with the provided program shows that the major difference between the two versions lies mainly in prefetching.
part II -> part I and part I -> part II (original program)
73,069,502 L1-dcache-prefetch-misses
part II -> part I and part II -> part I (only the efficient version)
31,719,117 L1-dcache-prefetch-misses
part I -> part II and part I -> part II (only the less efficient version)
114,520,949 L1-dcache-prefetch-misses
N.B.: according to Compiler Explorer, the assembly for part II -> part I is very similar to that for part I -> part II.
My guess is that, on the first iterations over i, part II does almost nothing, while the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the subsequent iterations over i.
The faster version performs similarly to what you get when you move the loops inside the if (i > j):
if (i > j) {
    float sum1 = 0;
    for (std::size_t k = 0; k < j; ++k) {
        sum1 += L[i][k] * U[k][j];
    }
    L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
}
if (i <= j) {
    float sum2 = 0;
    for (std::size_t k = 0; k < i; ++k) {
        sum2 += L[i][k] * U[k][j];
    }
    U[i][j] = matrix_positive_definite[i][j] - sum2;
}
So I would assume that in one case the compiler is able to do that transformation itself. It only happens at -O3 for me (1950X, MSYS2/GCC 8.3.0, Win10).
I don't know which optimization this is exactly, or what conditions have to be met for it to apply. It's none of the options explicitly listed for -O3 (-O2 plus all of them is not enough). Apparently it already stops doing it when std::size_t instead of int is used for the loop counters.

My C++ program gets slower as computation proceeds

I wrote a neural network program in C++ to test something, and I found that my program gets slower as the computation proceeds. Since I had never seen this kind of phenomenon before, I checked possible causes: the memory used by the program did not change as it slowed down, and RAM and CPU status were fine while it ran.
Fortunately, the previous version of the program did not have this problem, so I was finally able to narrow it down to a single statement that makes the program slow. The program does not get slower when I use this statement:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
However, the program gets slower and slower as soon as I replace the above statement with:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
To solve this problem, I did my best to find and remove the cause, but I failed... Below is the simplified code structure. In case the problem is not related to this local statement, I have uploaded my code to Google Drive; the URL is at the end of this question.
MLP.h
class MLP
{
private:
    ...
    double lambda;
    double ***w;
    double ***dw;
    neuron **hidden;
    ...
MLP.cpp
...
for (k = n_depth - 1; k > 0; k--)
{
    if (k == n_depth - 1)
        ...
    else
    {
        ...
        for (j = 1; n_neuron > j; j++)
        {
            for (i = 0; n_neuron > i; i++)
            {
                //dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
                dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
            }
        }
    }
}
...
Full source code: https://drive.google.com/open?id=1A8Uw0hNDADp3-3VWAgO4eTtj4sVk_LZh
I'm not sure exactly why it gets slower and slower, but I do see where you can gain some performance.
Two and higher dimensional arrays are still stored in one dimensional memory. This means (for C/C++ arrays) array[i][j] and array[i][j+1] are adjacent to each other, whereas array[i][j] and array[i+1][j] may be arbitrarily far apart.

Accessing data in a more-or-less sequential fashion, as stored in physical memory, can dramatically speed up your code (sometimes by an order of magnitude, or more)!

When modern CPUs load data from main memory into processor cache, they fetch more than a single value. Instead they fetch a block of memory containing the requested data and adjacent data (a cache line). This means after array[i][j] is in the CPU cache, array[i][j+1] has a good chance of already being in cache, whereas array[i+1][j] is likely to still be in main memory.
Source: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
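To illustrate the quoted point with a toy example (not taken from the question's code), the same sum can differ hugely in speed depending only on traversal order:

#include <vector>

// rowMajor walks memory sequentially (a[i][j], a[i][j+1], ...);
// colMajor jumps a whole row ahead on every access, defeating the cache.
double rowMajor(const std::vector< std::vector<double> >& a)
{
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            s += a[i][j];
    return s;
}

double colMajor(const std::vector< std::vector<double> >& a)
{
    double s = 0;
    for (std::size_t j = 0; j < a.at(0).size(); ++j)
        for (std::size_t i = 0; i < a.size(); ++i)
            s += a[i][j];
    return s;
}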
With your current code, w[k][i][j] is read, and on the next iteration of the inner loop w[k][i+1][j] is read, which may be arbitrarily far away in memory. You should interchange the i and j loops so that w is read in sequential order:
for (i = 0; n_neuron > i; ++i)
{
    for (j = 1; n_neuron > j; ++j)
    {
        dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
    }
}
Also note that ++x can be slightly faster than x++ for class types (such as iterators), since x++ has to create a temporary holding the old value of x as the expression result. For built-in types the compiler optimizes this away when the value is unused, but preferring ++x is a cheap habit to keep.

How to parallelise a 2D time evolution

I have C++ code that performs a time evolution of four variables living on a 2D spatial grid. To save some time, I tried to parallelise it with OpenMP, but I just cannot get it to work: no matter how many cores I use, the runtime stays basically the same or increases. (The code does use 24 cores, or however many I specify, so compilation is not the problem.)
I have the feeling that the runtime of one individual time step is too short, and the overhead of spawning threads kills the potential speed-up.
The layout of my code is:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    ...
    // perform time step
    // (1) calculate righthand-side of ODE:
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < ntheta; j++) {
            rhs[0][i][j] = A0[i][j] + B0[i][j] + ...;
            rhs[1][i][j] = A1[i][j] + B1[i][j] + ...;
            rhs[2][i][j] = A2[i][j] + B2[i][j] + ...;
            rhs[3][i][j] = A3[i][j] + B3[i][j] + ...;
        }
    }
    // (2) perform Euler step (or Runge-Kutta, ...)
    for (int d = 0; d < 4; d++) {
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
            }
        }
    }
}
I thought this code should be fairly easy to parallelise... I put "#pragma omp parallel for" in front of loops (1) and (2), and I also specified the number of threads (e.g. 4 for loop (2), since there are four variables), but there is simply no speed-up whatsoever.
I have found that OpenMP is fairly smart about when to create/destroy threads, i.e. it realises that threads will be required again soon and only puts them to sleep to save overhead.
I think one "problem" is that my time step is coded in a subroutine (I'm using RK4 instead of Euler), and the computation of the righthand side is in yet another subroutine called by the time_step() function. So I believe that, because of this, OpenMP cannot see that the threads should be kept alive for longer, and hence the threads are created and destroyed at every time step.
Would it be helpful to put a "#pragma omp parallel" in front of the time loop, so that the threads are created at the very beginning, and then do the actual work-sharing for the righthand side (1) and the Euler step (2)? But how do I do that?
I have found numerous examples of how to parallelise nested for loops, but none of them dealt with a setup where the inner loops have been moved out to separate modules. Would this be an obstacle to parallelising?
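A minimal sketch of that structure, based on the loops from the question (whether it pays off depends on the OpenMP runtime; note that a #pragma omp for placed inside a function called from the parallel region is also legal OpenMP, so the subroutine layout is not by itself an obstacle):

#pragma omp parallel   // create the thread team once, outside the time loop
{
    for (int t = 0; t < max_time_steps; t++) {
        #pragma omp single
        {
            // book-keeping, done by one thread while the others wait
        }

        // (1) righthand side, work-shared across the existing team
        #pragma omp for
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                rhs[0][i][j] = A0[i][j] + B0[i][j];
                // ... rhs[1][i][j] to rhs[3][i][j] likewise ...
            }
        }

        // (2) Euler step
        #pragma omp for
        for (int d = 0; d < 4; d++) {
            for (int i = 0; i < nr; i++) {
                for (int j = 0; j < ntheta; j++) {
                    next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
                }
            }
        }
    }
}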
I have now removed the d loops (by writing the four components out explicitly) and collapsed the i and j loops (by running over the entire 2D array with a single index).
The code looks like:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    ...
    // perform time step
    // (1) calculate righthand-side of ODE:
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        rhs[0][0][i] = A0[0][i] + B0[0][i] + ...;
        rhs[1][0][i] = A1[0][i] + B1[0][i] + ...;
        rhs[2][0][i] = A2[0][i] + B2[0][i] + ...;
        rhs[3][0][i] = A3[0][i] + B3[0][i] + ...;
    }
    // (2) perform Euler step (or Runge-Kutta, ...)
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        next[0][0][i] = current[0][0][i] + time_step * rhs[0][0][i];
        next[1][0][i] = current[1][0][i] + time_step * rhs[1][0][i];
        next[2][0][i] = current[2][0][i] + time_step * rhs[2][0][i];
        next[3][0][i] = current[3][0][i] + time_step * rhs[3][0][i];
    }
}
The size nr*ntheta is 400*40 = 16000, and I take max_time_steps = 1000 time steps. Still, the parallelisation does not result in a speed-up:
Runtime without OpenMP (result of time on the command line):
real 0m23.597s
user 0m23.496s
sys 0m0.076s
Runtime with OpenMP (24 cores)
real 0m23.162s
user 7m47.026s
sys 0m0.905s
I do not understand what's happening here.
One peculiarity that I don't show in my code snippet above is that my variables are not actually doubles, but a self-defined struct of two doubles representing the real and imaginary parts. But I think this should not make a difference.
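As an aside, instead of flattening the arrays by hand, OpenMP can fuse perfectly nested loops itself with the collapse clause (OpenMP 3.0 or later); a sketch based on the original Euler-step loop:

// Fuses the d, i and j loops into a single iteration space of
// 4*nr*ntheta work items for the runtime to distribute.
#pragma omp parallel for collapse(3)
for (int d = 0; d < 4; d++) {
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < ntheta; j++) {
            next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
        }
    }
}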
Just wanted to report some success after leaving the parallelisation alone for a while. The code evolved for a year, and now I have come back to parallelisation. This time, I can say that OpenMP does its job and reduces the required walltime.
While the code evolved overall, the particular loop I've shown above did not really change; merely two things did: a) the resolution is higher, so it covers about 10 times as many points, and b) the number of calculations per loop iteration is also about 10-fold (maybe even more).
My only explanation for why it works now, and didn't a little over a year ago, is that when I last tried to parallelise the code it wasn't computationally expensive enough, and the speed-up was killed by the OpenMP overhead. One single loop now requires about 200-300 ms, whereas back then it must have been in the single-digit milliseconds.
I can see such an effect when comparing gcc and the Intel compiler (which do very different jobs when vectorising):
a) Using gcc, one loop needs about 300 ms without OpenMP, and on two cores only 52% of that time is required --> near-perfect scaling.
b) Using icpc, one loop needs about 160 ms without OpenMP, and on two cores it needs 60% of the time --> good scaling, but about 20% less effective.
When going to more than two cores, the speed-up is not large enough to make it worthwhile.

First random number is always smaller than rest

I happened to notice that in C++ the first random number returned by the std rand() function is most of the time significantly smaller than the second one. With the Qt implementation, the first one is nearly always several orders of magnitude smaller.
qsrand(QTime::currentTime().msec());
qDebug() << "qt1: " << qrand();
qDebug() << "qt2: " << qrand();
srand((unsigned int) time(0));
std::cout << "std1: " << rand() << std::endl;
std::cout << "std2: " << rand() << std::endl;
output:
qt1: 7109361
qt2: 1375429742
std1: 871649082
std2: 1820164987
Is this intended, due to an error in seeding, or a bug?
Also, while the qrand() output varies strongly, the first rand() output seems to change linearly with time. I just wonder why.
I'm not sure this can be classified as a bug, but it has an explanation. Let's examine the situation:
Look at rand's implementation. You'll see it's just a calculation using the last generated value.
You're seeding with QTime::currentTime().msec(), which is by nature bounded to the small range 0..999, whereas qsrand accepts a uint, i.e. the range 0..4294967295.
By combining those two factors, you have a pattern.
Just out of curiosity: try seeding with QTime::currentTime().msec() + 100000000
Now the first value will probably be bigger than the second most of the time.
I wouldn't worry too much. This "pattern" seems to happen only on the first two generated values. After that, everything seems to go back to normal.
EDIT:
To make things clearer, try running the code below. It compares the first two generated values to see which one is smaller, using every possible millisecond value (range: 0..999) as the seed:
int totalCalls, leftIsSmaller = 0;
for (totalCalls = 0; totalCalls < 1000; totalCalls++)
{
    qsrand(totalCalls);
    int first = qrand();   // take the two values in a defined order
    int second = qrand();  // (operand evaluation order of '<' is unspecified)
    if (first < second)
        leftIsSmaller++;
}
qDebug() << (100.0 * leftIsSmaller) / totalCalls;
It will print 94.8, which means 94.8% of the time the first value will be smaller than the second.
Conclusion: when using the current millisecond to seed, you'll see that pattern in the first two values. I did some tests here, and the pattern seems to disappear after the second value is generated. My advice: find a "good" value to call qsrand with (it should obviously be called only once, at the beginning of your program). A good value should span the whole range of the uint type. Take a look at this other question for some ideas:
Recommended way to initialize srand?
Also, take a look at this:
PCG: A Family of Better Random Number Generators
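For instance, a seed that spans the whole uint range can be derived from the epoch milliseconds; a sketch, assuming Qt 4.7 or later for QDateTime::currentMSecsSinceEpoch():

#include <QDateTime>

// Illustrative only: the truncated epoch milliseconds vary across all
// 32 bits of uint, unlike QTime::currentTime().msec() (0..999).
uint seed = static_cast<uint>(QDateTime::currentMSecsSinceEpoch());
qsrand(seed);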
Neither current Qt nor the C standard run-time has a quality randomizer, and your test shows it. Qt appears to delegate to the C run-time for this (easy to check in the sources). If C++11 is available in your project, use this much better and far more reliable method:
#include <random>
#include <chrono>

// Seed from the system clock, then draw from a uniform distribution.
auto seed = std::chrono::system_clock::now().time_since_epoch().count();
std::default_random_engine generator(seed);
std::uniform_int_distribution<unsigned int> distribution;
unsigned int randomUint = distribution(generator);
There is a good video that covers the topic. As noted by commenter user2357112, different random engines can be combined with different distributions, but for my specific use the above worked really well.
Keeping in mind that making judgments about statistical phenomena based on a small number of samples can be misleading, I decided to run a small experiment. I ran the following code:
#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    int i = 0;
    int j = 0;
    while (i < RAND_MAX)
    {
        srand(time(NULL));  // reseeds with the same value until the clock ticks over
        int r1 = rand();
        int r2 = rand();
        if (r1 < r2)
            ++j;
        ++i;
        if (i % 10000 == 0) {
            printf("%g\n", (float)j / (float)i);
        }
    }
}
which basically prints the percentage of runs in which the first generated number was smaller than the second. The plot of that ratio (not reproduced here) approaches 0.5 after fewer than 50 actual new seeds.
As suggested in a comment, we can modify the code to use consecutive seeds on every iteration and speed up the convergence:
#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    int i = 0;
    int j = 0;
    int t = time(NULL);
    while (i < RAND_MAX)
    {
        srand(t);  // a fresh, consecutive seed on every iteration
        int r1 = rand();
        int r2 = rand();
        if (r1 < r2)
            ++j;
        ++i;
        if (i % 10000 == 0) {
            printf("%g\n", (float)j / (float)i);
        }
        ++t;
    }
}
The resulting ratio (again, plot not reproduced here) also stays pretty close to 0.5.
While rand is certainly not the best pseudo-random number generator, the claim that it often generates a smaller number on the first call does not seem to be warranted.

Is continue instant?

In the following two code snippets, is there actually any difference in compilation or execution speed?
for (int i = 0; i < 50; i++)
{
    if (i % 3 == 0)
        continue;
    printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
    if (i % 3 != 0)
        printf("Yay");
}
Personally, in situations where there is a lot more than a print statement, I've been using the first method to reduce the indentation of the contained code. I've been wondering for a while, so I figured it was about time to ask whether it actually has any effect beyond the visual one.
Reply to Alf (I couldn't get code working in the comments...):
More accurate to my usage is something along the lines of a "handleObjectMovement" function, which would include
for each object
    if object position is static
        continue
    deal with velocity and jazz
compared with
for each object
    if object position is not static
        deal with velocity and jazz
The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.
If you know which branch of the condition has the higher probability, you can use the GCC likely/unlikely macro idiom, for example:
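These are the usual definitions, built on GCC's __builtin_expect (a common idiom, not part of the standard library):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

for (int i = 0; i < 50; i++)
{
    if (unlikely(i % 3 == 0))  // hint: this branch is taken only 1 time in 3
        continue;
    printf("Yay");
}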
How about getting rid of the check altogether?
for (int t = 0; t < 33; t++)
{
    // maps t = 0, 1, 2, ... to i = 1, 2, 4, 5, 7, 8, ..., skipping every multiple of 3
    int i = t + (t >> 1) + 1;
    printf("%d\n", i);
}