Why does the position of code affect performance in C++?

I am running a performance test, and found out that changing the order of the code makes it faster without compromising the result.
Performance is measured by execution time using the chrono library.
vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

for (i = 0; i < matrix_size; ++i) {
    for (j = 0; j < matrix_size; ++j) {
        //Part II : ________________________________________
        float sum2 = 0;
        for (k = 0; k <= (i-1); ++k) {
            float sum2_temp = L[i][k] * U[k][j];
            sum2 += sum2_temp;
        }
        //Part I : _____________________________________________
        float sum1 = 0;
        for (k = 0; k <= (j-1); ++k) {
            float sum1_temp = L[i][k] * U[k][j];
            sum1 += sum1_temp;
        }
        //__________________________________________
        if (i > j) {
            L[i][j] = (matrix_positive_definite[i][j] - sum1) / U[j][j];
        }
        else {
            U[i][j] = matrix_positive_definite[i][j] - sum2;
        }
    }
}
I compile with g++ -O3 (GCC 7.4.0 on an Intel i5, Windows 10).
I changed the order of Part I and Part II and got a faster result when Part II executes before Part I. What's going on?
This is the link to the whole program.
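For context, the timing harness is presumably something along these lines (a sketch: matrix_size, the clock choice and the output format are my guesses; the real harness is in the linked program):

#include <chrono>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    const int matrix_size = 400;  // hypothetical size; the real value is in the linked program
    vector< vector<float> > U(matrix_size, vector<float>(matrix_size, 14));
    vector< vector<float> > L(matrix_size, vector<float>(matrix_size, 12));
    vector< vector<float> > matrix_positive_definite(matrix_size, vector<float>(matrix_size, 23));

    auto start = chrono::high_resolution_clock::now();
    // ... the i/j/k loop nest from the question goes here, with Part I and
    // Part II in one order or the other ...
    auto stop = chrono::high_resolution_clock::now();

    cout << chrono::duration_cast<chrono::milliseconds>(stop - start).count()
         << " ms\n";
}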

I would try running both versions with perf stat -d <app> and see where the performance counters differ.
When benchmarking you may like to fix the CPU frequency, so it doesn't affect your scores.
Aligning loops on a 32-byte boundary often increases performance by 8-30%. See Causes of Performance Instability due to Code Placement in X86 - Zia Ansari, Intel for more details.
Try compiling your code with -O3 -falign-loops=32 -falign-functions=32 -march=native -mtune=native.

Running perf stat -ddd while playing around with the provided program shows that the major difference between the two versions is mainly in prefetching.
part II -> part I and part I -> part II (original program):
73,069,502 L1-dcache-prefetch-misses
part II -> part I and part II -> part I (only the efficient version):
31,719,117 L1-dcache-prefetch-misses
part I -> part II and part I -> part II (only the less efficient version):
114,520,949 L1-dcache-prefetch-misses
NB: according to Compiler Explorer, the generated code for part II -> part I is very similar to that for part I -> part II.
I guess that, on the first iterations over i, part II does almost nothing, while the iterations over j make part I access U[k][j] in a pattern that eases prefetching for the next iterations over i.

The faster version performs similarly to what you get when you move the loops inside the if (i > j):
if (i > j) {
    float sum1 = 0;
    for (std::size_t k = 0; k < j; ++k) {
        sum1 += L[i][k] * U[k][j];
    }
    L[i][j] = matrix_positive_definite[i][j] - sum1;
    L[i][j] /= U[j][j];
}
if (i <= j) {
    float sum2 = 0;
    for (std::size_t k = 0; k < i; ++k) {
        sum2 += L[i][k] * U[k][j];
    }
    U[i][j] = matrix_positive_definite[i][j] - sum2;
}
So I would assume that in one case the compiler is able to do that transformation itself. It only happens at -O3 for me (1950X, msys2/GCC 8.3.0, Win10).
I don't know which optimization this is exactly, or what conditions have to be met for it to apply. It's none of the options explicitly listed for -O3 (-O2 plus all of them is not enough). Apparently it already stops doing it when std::size_t instead of int is used for the loop counters.


Is branch prediction still significantly speeding up array processing? [closed]

I was reading an interesting post about "Why is it faster to process a sorted array than an unsorted array?" and saw a comment made by @mp31415 that said:
Just for the record. On Windows / VS2017 / i7-6700K 4GHz there is NO difference between two versions. It takes 0.6s for both cases. If number of iterations in the external loop is increased 10 times the execution time increases 10 times too to 6s in both cases
So I tried it on an online C/C++ compiler (with, I suppose, a modern server architecture), and I get ~1.9 s for the sorted case and ~1.85 s for the unsorted one: not much of a difference, but repeatable.
So I wonder whether this is still true for modern architectures.
The question was from 2012, not so far from now...
Or where am I wrong?
Question precision for reopening:
Please forget about me adding the C code as an example. That was a terrible mistake. Not only was the code erroneous, posting it misled people who were focusing on the code itself rather than on the question.
When I first tried the C++ code used in the link above, I got only a 2% difference (1.9 s vs 1.85 s).
My first question and intent was about the previous post, its C++ code and the comment of @mp31415.
@rustyx made an interesting comment, and I wondered if it could explain what I observed:
Interestingly, a debug build exhibits 400% difference between sorted/unsorted, and a release build at most 5% difference (i7-7700).
In other words, my question is:
Why did the C++ code in the previous post not perform as well as claimed by the previous OP?
More precisely:
Could the timing difference between the release build and the debug build explain it?
You're a victim of the as-if rule:
... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...
Consider the function under test ...
const size_t arraySize = 32768;
int *data;

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
Looking at the generated assembly (VS 2017, x86_64, /O2), the machine does not execute your loops; instead, it executes a similar program that does this:
long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}
Observe how the optimizer reversed the order of the loops and defeated your benchmark.
Obviously the latter version is much more branch-predictor-friendly.
You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:
long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15];  // <== dependency!
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
Now this version again exhibits a massive difference between sorted and unsorted data. On my system (i7-7700), 1.6 s vs 11 s (or 700%).
Conclusion: the branch predictor is more important than ever these days, as we face unprecedented pipeline depths and levels of instruction-level parallelism.
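As an aside (my illustration, not part of the answer above): one common way to take the branch predictor out of the picture is to make the accumulation branchless, for example by using the comparison result arithmetically. A sketch, reusing the data/arraySize declarations from the snippets above:

long long test_branchless()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        for (size_t c = 0; c < arraySize; ++c)
        {
            // The comparison yields 0 or 1 and is folded into the arithmetic,
            // so compilers typically emit a conditional move or vectorized code
            // instead of a data-dependent branch.
            sum += data[c] * (data[c] >= 128);
        }
    }
    return sum;
}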

Adding a print statement speeds up code by an order of magnitude

I've hit a piece of extremely bizarre performance behavior in a piece of C/C++ code, as suggested in the title, which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp], gg[pp];
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = gg[ii] = 1.0;
    }

    for(int it = 0; it < 1000; it++) {
        cdbl dual[pp];
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = 0.0;
        }
        for(int h1 = 0; h1 < pp; h1++) {
            for(int h2 = 0; h2 < pp; h2++) {
                cdbl avg_right = 0.0;
                for(int xx = 0; xx < pp; xx++) {
                    int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
                }
                avg_right /= static_cast<cdbl>(pp);
                for(int xx = 0; xx < pp; xx++) {
                    int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
                }
            }
        }
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
        }
        for(int ii = 0; ii < pp; ii++) {
            gg[ii] = dual[ii];
        }
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
        printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
    }
    printf("%.15lf\n", gg[0].real());
    return 0;
}
Here are the results of running this on my system:
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me@mine $ time ./test.elf > /dev/null
real    0m7.329s
user    0m7.328s
sys     0m0.000s
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me@mine $ time ./test.elf > /dev/null
real    0m0.492s
user    0m0.490s
sys     0m0.001s
me@mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g., printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what on earth could be causing this. Worse, it actually affects my life because I'm running related code a lot, and going around inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp];
    cdbl blah = 0.0;
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = 1.0;
    }

    for(int it = 0; it < 1000; it++) {
        cdbl xx = 0.0;
        for(int kk = 0; kk < 100; kk++) {
            for(int ii = 0; ii < pp; ii++) {
                for(int jj = 0; jj < pp; jj++) {
                    xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
                }
            }
        }
        blah += xx;
        printf("%.15lf\n", blah.real());
    }
    printf("%.15lf\n", blah.real());
    return 0;
}
I could make it even smaller, but already the machine code is manageable. If I change the five bytes of the binary corresponding to the callq instruction for that first printf to 0x90 (NOP), the execution goes from fast to slow.
The compiled code is very heavy with calls to the function __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions, so it's a difference in instructions per cycle (about 0.16 vs about 2.8).
Also, compiling -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function (including some foolish ones I made up that do nothing) for the first time puts execution into the slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, crucially, the first time __muldc3() is called we go into the slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo; // 1
    char str[10];
    sprintf(str, "%c\n", 'c');
    //fflush(stdout); // 2

    for(int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once in each direction per call of __muldc3 (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf, among other things, it zeroes out the upper bits of the YMM registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
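Concretely, that replacement in the small example above would look something like this (a sketch of the placement; compilation flags as before):

#include <complex>
using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo;   // first __muldc3 call: goes through the PLT and leaves the upper YMM bits dirty
    asm("vzeroupper");  // explicitly zero the upper 128 bits, standing in for the sprintf call
    for (int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}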
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.

Program compiled with vs2012 runs slower than on vs2010

I have a simple piece of code that generates random graphs by considering all possible edges and randomly selecting which ones to add to a file (based on a parameter p - the chance for each edge to be added). The algorithm looks like this:
for (unsigned int i = 0; i < nodeCount; ++i)
{
    for (unsigned int j = i+1; j < nodeCount; ++j)
    {
        uniform_int_distribution<> d(0, 999999);
        int randNum = d(randEngine);
        if (randNum < p * 1000000)
        {
            //code that adds the edge to the file
        }
    }
}
I used to compile this code with VS2010, and when creating graphs with 50000 nodes and p = 0.00002 it took ~26 seconds. I have now upgraded to VS2012, compiled, and I get ~74 seconds! Almost triple the time!
I isolated the part that runs slower: it seems to be the random number generation (I commented out the code not included above, which writes to the file, and the same times were recorded).
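The isolated timing looks roughly like this (a sketch with illustrative names; the engine type and the exact output are assumptions, since they are not shown above):

#include <chrono>
#include <iostream>
#include <random>

int main()
{
    std::mt19937 randEngine;                 // assumed engine; the question doesn't say which one
    const unsigned int nodeCount = 50000;
    const double p = 0.00002;
    long long edges = 0;

    auto start = std::chrono::steady_clock::now();
    for (unsigned int i = 0; i < nodeCount; ++i)
    {
        for (unsigned int j = i + 1; j < nodeCount; ++j)
        {
            std::uniform_int_distribution<> d(0, 999999);  // constructed per draw, as in the original loop
            if (d(randEngine) < p * 1000000)
                ++edges;                                   // count instead of writing to the file
        }
    }
    auto stop = std::chrono::steady_clock::now();

    std::cout << edges << " edges, "
              << std::chrono::duration_cast<std::chrono::seconds>(stop - start).count()
              << " s\n";
}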
The project definitions are exactly the same (except for the platform toolset: vs100 vs. vs110). Any ideas why the random number generator has become so much worse in VS2012? Or is it me that's doing something wrong?
Either way, what can I do except go back to VS2010?
Thanks.

How fast is D compared to C++?

I like some features of D, but would be interested to know whether they come with a runtime penalty.
To compare, I implemented a simple program that computes scalar products of many short vectors both in C++ and in D. The result is surprising:
D: 18.9 s [see below for final runtime]
C++: 3.8 s
Is C++ really almost five times as fast, or did I make a mistake in the D program?
I compiled C++ with g++ -O3 (gcc-snapshot 2011-02-19) and D with dmd -O (dmd 2.052) on a moderately recent Linux desktop. The results are reproducible over several runs and the standard deviations are negligible.
Here the C++ program:
#include <iostream>
#include <random>
#include <chrono>
#include <string>
#include <vector>
#include <array>
typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
long tm = std::chrono::duration_cast<millisecs>( std::chrono::system_clock::now() - time).count();
time = std::chrono::system_clock::now();
return tm;
}
const long N = 20000;
const int size = 10;
typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;
inline value_type scalar_product(const vector_t& x, const vector_t& y) {
value_type res = 0;
size_type siz = x.size();
for (size_type i = 0; i < siz; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = std::chrono::system_clock::now();
// 1. allocate and fill randomly many short vectors
vector_t* xs = new vector_t [N];
for (int i = 0; i < N; ++i) {
xs[i] = vector_t(size);
}
std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;
std::mt19937 rnd_engine;
std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
for (int i = 0; i < N; ++i)
for (int j = 0; j < size; ++j)
xs[i][j] = runif_gen(rnd_engine);
std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;
// 2. compute all pairwise scalar products:
time_since(tm_before);
result_type avg = 0;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg = avg / N*N;
auto time = time_since(tm_before);
std::cout << "result: " << avg << std::endl;
std::cout << "time: " << time << " ms" << std::endl;
}
And here the D version:
import std.stdio;
import std.datetime;
import std.random;

const long N = 20000;
const int size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;

value_type scalar_product(const ref vector_t x, const ref vector_t y) {
    value_type res = 0;
    size_type siz = x.length;
    for (size_type i = 0; i < siz; ++i)
        res += x[i] * y[i];
    return res;
}

int main() {
    auto tm_before = Clock.currTime();

    // 1. allocate and fill randomly many short vectors
    vector_t[] xs;
    xs.length = N;
    for (int i = 0; i < N; ++i) {
        xs[i].length = size;
    }
    writefln("allocation: %i ", (Clock.currTime() - tm_before));
    tm_before = Clock.currTime();

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < size; ++j)
            xs[i][j] = uniform(-1000, 1000);
    writefln("random: %i ", (Clock.currTime() - tm_before));
    tm_before = Clock.currTime();

    // 2. compute all pairwise scalar products:
    result_type avg = cast(result_type) 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            avg += scalar_product(xs[i], xs[j]);
    avg = avg / N*N;
    writefln("result: %d", avg);
    auto time = Clock.currTime() - tm_before;
    writefln("scalar products: %i ", time);
    return 0;
}
To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:
-O -inline -release -noboundscheck
EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The commandline I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).
Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.
One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.
I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.
This is a very instructive thread, thanks for all the work to the OP and helpers.
One note - this test is not assessing the general question of abstraction/feature penalty or even that of backend quality. It focuses on virtually one optimization (loop optimization). I think it's fair to say that gcc's backend is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.
Definitely seems like a quality-of-implementation issue.
I ran some tests with the OP's code and made some changes. I actually got D going faster for LDC/clang++, operating on the assumption that arrays must be allocated dynamically (xs and associated scalars). See below for some numbers.
Questions for the OP
Is it intentional that the same seed be used for each iteration of C++, while not so for D?
Setup
I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.
After this, I made the following changes:
Used uninitializedArray to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.
Factored out printing code and replaced writefln with writeln
Changed imports to be selective
Used pow operator (^^) instead of manual multiplication for final step of calculating average
Removed the size_type and replaced appropriately with the new index_type alias
...thus resulting in scalar2.d (pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;

alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length); // Make index integrals portable - Linux is ulong, Win8.1 is uint

immutable long N = 20000;
immutable int size = 10;

value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
    value_type res = 0;
    for (index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
    return res;
}

int main() {
    auto tm_before = Clock.currTime;
    auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
    }

    // 1. allocate and fill randomly many short vectors
    vector_t[] xs = uninitializedArray!(vector_t[])(N); // Avoid default inits of inner arrays
    for (index_type i = 0; i < N; ++i)
        xs[i] = uninitializedArray!(vector_t)(size); // Avoid more default inits of values
    countElapsed("allocation");

    for (index_type i = 0; i < N; ++i)
        for (index_type j = 0; j < size; ++j)
            xs[i][j] = uniform(-1000, 1000);
    countElapsed("random");

    // 2. compute all pairwise scalar products:
    result_type avg = 0;
    for (index_type i = 0; i < N; ++i)
        for (index_type j = 0; j < N; ++j)
            avg += scalar_product(xs[i], xs[j]);
    avg /= N ^^ 2; // Replace manual multiplication with pow operator
    writeln("result: ", avg);
    countElapsed("scalar products");

    return 0;
}
After testing scalar2.d (which prioritized optimization for speed), out of curiosity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;

alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length); // Make index integrals portable - Linux is ulong, Win8.1 is uint

immutable long N = 20000;
immutable int size = 10;

// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
    value_type res = 0;
    for (index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
    return res;
}

int main() {
    auto tm_before = Clock.currTime;
    auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
    }

    // 1. allocate and fill randomly many short vectors
    vector_t[] xs = uninitializedArray!(vector_t[])(N); // Avoid default inits of inner arrays
    foreach (ref x; xs)
        x = uninitializedArray!(vector_t)(size); // Avoid more default inits of values
    countElapsed("allocation");

    foreach (ref x; xs)
        foreach (ref val; x)
            val = uniform(-1000, 1000);
    countElapsed("random");

    // 2. compute all pairwise scalar products:
    result_type avg = 0;
    foreach (const ref x; xs)
        foreach (const ref y; xs)
            avg += scalar_product(x, y);
    avg /= N ^^ 2; // Replace manual multiplication with pow operator
    writeln("result: ", avg);
    countElapsed("scalar products");

    return 0;
}
I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:
clang 3.6.0-3
ldc 1:0.15.1-4
dtools 2.067.0-2
I used the following commands to compile each:
C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>
Results
The results of each version of the source (from a screenshot of the raw console output) are as follows:
scalar.cpp (original C++):
allocation: 2 ms
random generation: 12 ms
result: 29248300000
time: 2582 ms
C++ sets the standard at 2582 ms.
scalar.d (modified OP source):
allocation: 5 ms, 293 μs, and 5 hnsecs
random: 10 ms, 866 μs, and 4 hnsecs
result: 53237080000
scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs
This ran for ~2957 ms. Slower than the C++ implementation, but not too much.
scalar2.d (index/length type change and uninitializedArray optimization):
allocation: 2 ms, 464 μs, and 2 hnsecs
random: 5 ms, 792 μs, and 6 hnsecs
result: 59
scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
In other words, ~1860 ms. So far this is in the lead.
scalar3.d (foreaches):
allocation: 2 ms, 911 μs, and 3 hnsecs
random: 7 ms, 567 μs, and 8 hnsecs
result: 189
scalar products: 2 secs, 182 ms, and 366 μs
~2182 ms is slower than scalar2.d, but faster than the C++ version.
Conclusion
With the correct optimizations, the D implementation actually went faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems only to be based on limitations of current implementations.
dmd is the reference implementation of the language and thus most work is put into the frontend to fix bugs rather than optimizing the backend.
"in" is faster in your case cause you are using dynamic arrays which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only the contents).
Vectors are usually implemented with structs where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.
Note that 64-bit can also make a difference. I once missed that on x64 gcc compiles 64-bit code while dmd still defaults to 32-bit (this will change when the 64-bit codegen matures). There was a remarkable speedup with "dmd -m64 ...".
Whether C++ or D is faster is likely to be highly dependent on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed, or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.
However, there are a few cases where D stands a good chance of beating C++ for speed. The main one which comes to mind would be string processing. Thanks to D's array slicing capabilities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one that's currently being worked on for Phobos has been completed). So, ultimately whether D or C++ is going to be faster is going to be very dependent on what you're doing.
Now, I am suprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results and would likely be a closer comparison of the language itself (rather than the backend) given that it's gcc-based. But it wouldn't surprise me at all if there are a number of things which could be done to speed up the code that dmd generates. I don't think that there's much question that gcc is more mature than dmd at this point. And code optimizations are one of the prime fruits of code maturity.
Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think that a comprehensive set of benchmarks would be required to really test how well the two presently compare however.
You can write C code in D, so as far as which is faster, it will depend on a lot of things:
What compiler you use
What features you use
How aggressively you optimize
Differences in the first aren't fair to drag in. The second might give C++ an advantage as it, if anything, has fewer heavy features. The third is the fun one: D code is in some ways easier to optimize because in general it is easier to understand. It also has the ability to do a large degree of generative programming, allowing verbose and repetitive but fast code to be written in shorter forms.
Seems like a quality of implementation issue. For example, here's what I've been testing with:
import std.datetime, std.stdio, std.random;

version = ManualInline;

immutable N = 20000;
immutable Size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_type;

result_type scalar_product(in vector_type x, in vector_type y)
in
{
    assert(x.length == y.length);
}
body
{
    result_type result = 0;
    foreach (i; 0 .. x.length)
        result += x[i] * y[i];
    return result;
}

void main()
{
    auto startTime = Clock.currTime();

    // 1. allocate vectors
    vector_type[] vectors = new vector_type[N];
    foreach (ref vec; vectors)
        vec = new value_type[Size];

    auto time = Clock.currTime() - startTime;
    writefln("allocation: %s ", time);
    startTime = Clock.currTime();

    // 2. randomize vectors
    foreach (ref vec; vectors)
        foreach (ref e; vec)
            e = uniform(-1000, 1000);

    time = Clock.currTime() - startTime;
    writefln("random: %s ", time);
    startTime = Clock.currTime();

    // 3. compute all pairwise scalar products
    result_type avg = 0;
    foreach (vecA; vectors)
        foreach (vecB; vectors)
        {
            version (ManualInline)
            {
                result_type result = 0;
                foreach (i; 0 .. vecA.length)
                    result += vecA[i] * vecB[i];
                avg += result;
            }
            else
            {
                avg += scalar_product(vecA, vecB);
            }
        }
    avg = avg / (N * N);

    time = Clock.currTime() - startTime;
    writefln("scalar products: %s ", time);
    writefln("result: %s", avg);
}
With ManualInline defined I get 28 seconds, but without it I get 32. So the compiler isn't even inlining this simple function, which I think it clearly should be.
(My command line is dmd -O -noboundscheck -inline -release ....)

Optimize indexed array summation

I have the following C++ code:
const int N = 1000000;
int id[N];      // value can range from 0 to 9
float value[N];

// load id and value from an external source...

int size[10] = { 0 };
float sum[10] = { 0 };

for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0]  += value[i*2+0];
    sum2[id1] += value[i*2+1];
}
// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}
// add the partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i]  += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are calling id[i] twice in your loop. You could store it in a variable, or a register int if you wanted to.
register int index;
for (int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register.
Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (the /Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point); then it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads: i.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, you start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing (see the sketch at the end of this answer).
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
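A sketch of the double-buffered loading described above (load_chunk and process_chunk are hypothetical stand-ins for the real file I/O and the summation loop; std::async is just one way to overlap them):

#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for the external data source; returns false once exhausted.
static int chunks_left = 8;
bool load_chunk(std::vector<int>& id, std::vector<float>& value)
{
    if (chunks_left-- <= 0) return false;
    id.assign(1000000, 3);        // dummy data standing in for the real load
    value.assign(1000000, 1.0f);
    return true;
}

// The summation loop from the question, applied to one chunk.
void process_chunk(const std::vector<int>& id, const std::vector<float>& value,
                   int size[10], float sum[10])
{
    for (std::size_t i = 0; i < id.size(); ++i)
    {
        ++size[id[i]];
        sum[id[i]] += value[i];
    }
}

int main()
{
    int   size[10] = { 0 };
    float sum[10]  = { 0 };

    std::vector<int>   id[2];
    std::vector<float> value[2];

    int cur = 0;
    bool have_data = load_chunk(id[cur], value[cur]);      // load buffer 0
    while (have_data)
    {
        const int nxt = 1 - cur;
        // start loading the next buffer while the current one is processed
        auto next_ready = std::async(std::launch::async, [&, nxt] {
            return load_chunk(id[nxt], value[nxt]);
        });
        process_chunk(id[cur], value[cur], size, sum);
        have_data = next_ready.get();
        cur = nxt;
    }
}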
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
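If the intrinsic is available, a rough sketch of how it might be dropped into the loop from the question (the prefetch distance of 64 elements and the T0 hint are guesses that would need tuning per target CPU):

#include <xmmintrin.h>

void sum_by_id(const int* id, const float* value, int N,
               int size[10], float sum[10])
{
    const int dist = 64;  // prefetch distance in elements; tune per CPU
    for (int i = 0; i < N; ++i)
    {
        if (i + dist < N)
        {
            // Hint the hardware to pull the data we'll need a few dozen iterations from now.
            _mm_prefetch(reinterpret_cast<const char*>(&id[i + dist]), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(&value[i + dist]), _MM_HINT_T0);
        }
        ++size[id[i]];
        sum[id[i]] += value[i];
    }
}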
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size,sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC). However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];
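For completeness, the pre-sorting step that the snippet above relies on could be a simple counting sort by id (a sketch; sorted_value is a hypothetical scratch buffer, rebuilt whenever id[] changes):

void presort_by_id(const int* id, const float* value, int N,
                   int size[10], float* sorted_value)
{
    for (int j = 0; j < 10; ++j)
        size[j] = 0;
    for (int i = 0; i < N; ++i)
        ++size[id[i]];                        // per-id counts, same as the original loop

    int offset[10];                           // start of each id's run in sorted_value
    offset[0] = 0;
    for (int j = 1; j < 10; ++j)
        offset[j] = offset[j - 1] + size[j - 1];

    for (int i = 0; i < N; ++i)
        sorted_value[offset[id[i]]++] = value[i];
    // The summation loop above can then walk the sorted array sequentially
    // (using size[] to delimit each id's run) and never touch id[] again.
}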