I have a for loop with a very large number of iterations, and I'd like to make it possible to customize some of its parts. The code looks like the following:
void expensive_loop( void (*do_true)(int), void (*do_false)(int)){
for(int i=0; i<VeryLargeN; i++){
int element=elements[i];
// long computation that produces a boolean condition
if (condition){
do_true(element);
}else{
do_false(element);
}
}
}
Now, the problem is that every time do_true and do_false are called, there is function-call overhead (pushing and popping the stack) that ruins the high performance of the code.
To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. That would make the code impossible to maintain.
So, how can the internal part of an iteration be made customizable while still maintaining high performance?
Note that the function accepts pointers to functions, so those get called through a pointer. The optimizer may inline those calls through the function pointers if the definitions of expensive_loop and those functions are available and the compiler inlining limits have not been breached.
Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like standard algorithms do. This way the compiler may have more optimization opportunities. E.g.:
template<class DoTrue, class DoFalse>
void expensive_loop(DoTrue do_true, DoFalse do_false) {
// Original function body here.
}
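For example, a call site could pass lambdas (a minimal sketch; the lambda bodies below are placeholders, not your real do_true/do_false logic). Each distinct pair of callable types gets its own instantiation, which the compiler can then inline into:
expensive_loop(
    [](int element) { /* placeholder for the "true" case */ },
    [](int element) { /* placeholder for the "false" case */ });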
There is the -Winline compiler switch for g++:
-Winline
Warn if a function can not be inlined and it was declared as inline. Even with this option, the compiler will not warn about failures to inline functions declared in system headers.
The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.
It probably does not warn about a function not being inlined when it is called through a pointer.
The problem is that the function address (what actually gets stored in do_true and do_false) is not resolved until link time, where there are not many opportunities for optimization.
If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at compile time.
struct function_one {
void operator()( int element ) {
}
};
extern int elements[];
extern bool condition();
template < typename DoTrue, typename DoFalse >
void expensive_loop(){
DoTrue do_true;
DoFalse do_false;
for(int i=0; i<50; i++){
int element=elements[i];
// long computation that produces a boolean condition
if (condition()){
do_true(element); // call DoTrue's operator()
}else{
do_false(element); // call DoFalse's operator()
}
}
}
int main( int argc, char* argv[] ) {
expensive_loop<function_one,function_one>();
return 0;
}
The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.
For the example shown, note how the functor's call operator is empty.
The compiler just strips away the function call and leaves the loop:
main:
push rbx
mov ebx, 50
.L2:
call condition()
sub ebx, 1
jne .L2
xor eax, eax
pop rbx
ret
See example in https://godbolt.org/g/hV52Nn
Using function pointers as in your example, the function calls may not be inlined. This is the assembler produced for main and expensive_loop in a program where expensive_loop
// File A.cpp
void foo( int arg );
void bar( int arg );
extern bool condition();
extern int elements[];
void expensive_loop( void (*do_true)(int), void (*do_false)(int)){
for(int i=0; i<50; i++){
int element=elements[i];
// long computation that produces a boolean condition
if (condition()){
do_true(element);
}else{
do_false(element);
}
}
}
int main( int argc, char* argv[] ) {
expensive_loop( foo, bar );
return 0;
}
and the functions passed by argument
// File B.cpp
int elements[50];
bool condition() {
return elements[0] == 1;
}
void foo( int arg ) {
elements[0] = arg % 3;
}
void bar( int arg ) {
elements[1] = (arg != 0) ? 1234 % arg : 0;
}
are defined in different translation units.
0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
400620: 41 55 push %r13
400622: 49 89 fd mov %rdi,%r13
400625: 41 54 push %r12
400627: 49 89 f4 mov %rsi,%r12
40062a: 55 push %rbp
40062b: 53 push %rbx
40062c: bb 60 10 60 00 mov $0x601060,%ebx
400631: 48 83 ec 08 sub $0x8,%rsp
400635: eb 19 jmp 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
400637: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40063e: 00 00
400640: 48 83 c3 04 add $0x4,%rbx
400644: 41 ff d5 callq *%r13
400647: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx
40064e: 74 1d je 40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
400650: 8b 2b mov (%rbx),%ebp
400652: e8 79 ff ff ff callq 4005d0 <condition()>
400657: 84 c0 test %al,%al
400659: 89 ef mov %ebp,%edi
40065b: 75 e3 jne 400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
40065d: 48 83 c3 04 add $0x4,%rbx
400661: 41 ff d4 callq *%r12
400664: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx
40066b: 75 e3 jne 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
40066d: 48 83 c4 08 add $0x8,%rsp
400671: 5b pop %rbx
400672: 5d pop %rbp
400673: 41 5c pop %r12
400675: 41 5d pop %r13
400677: c3 retq
400678: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40067f: 00
You can see how the calls are still performed even when using -O3 optimization level:
400644: 41 ff d5 callq *%r13
Not a Duplicate. Please read Full question.
#include<iostream>
using namespace std;
int main()
{
const int a = 5;
const int *ptr1 = &a;
int *ptr = (int *)ptr1;
*ptr = 10;
cout<<ptr<<" = "<<*ptr<<endl;
cout<<ptr1<<" = "<<*ptr1<<endl;
cout<<&a<<" = "<<a;
return 0;
}
Output:
0x7ffe13455fb4 = 10
0x7ffe13455fb4 = 10
0x7ffe13455fb4 = 5
How is this possible?
You shouldn't rely on undefined behaviour. Look what the compiler does with your code, particularly the last part:
cout<<&a<<" = "<<a;
b6: 48 8d 45 ac lea -0x54(%rbp),%rax
ba: 48 89 c2 mov %rax,%rdx
bd: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # c4 <main+0xc4>
c4: e8 00 00 00 00 callq c9 <main+0xc9>
c9: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # d0 <main+0xd0>
d0: 48 89 c1 mov %rax,%rcx
d3: e8 00 00 00 00 callq d8 <main+0xd8>
d8: ba 05 00 00 00 mov $0x5,%edx <=== direct insert of 5 in the register to display 5
dd: 48 89 c1 mov %rax,%rcx
e0: e8 00 00 00 00 callq e5 <main+0xe5>
return 0;
e5: b8 00 00 00 00 mov $0x0,%eax
ea: 90 nop
eb: 48 83 c4 48 add $0x48,%rsp
ef: 5b pop %rbx
f0: 5d pop %rbp
f1: c3 retq
When the compiler sees a constant expression, it can decide (implementation-dependent) to replace it with the actual value.
In that particular case, g++ did that without even -O1 option!
When you invoke undefined behavior anything is possible.
In this case, you are casting the constness away with this line:
int *ptr = (int *)ptr1;
And you're lucky enough that there is an address on the stack to be changed, which explains why the first two prints output a 10.
The third print outputs a 5 because the compiler optimized it by hardcoding a 5 making the assumption that a wouldn't be changed.
It is certainly undefined behavior, but I am a strong proponent of understanding the symptoms of undefined behavior for the benefit of spotting it. The observed results can be explained in the following manner:
const int a = 5
defines an integer constant. The compiler now assumes that the value will never be modified for the duration of the whole function, so when it sees
cout<<&a<<" = "<<a;
it doesn't generate code to reload the current value of a; instead it just uses the number it was initialized with, which is much faster than loading from memory.
This is a very common optimization technique - when a certain condition can only happen when the program exhibits undefined behavior, optimizers assume that condition never happens.
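To make the boundary concrete, here is a small sketch of my own (not from the question): casting away const is only well-defined when the pointed-to object was not itself declared const.
#include <iostream>

int main()
{
    int b = 5;                        // not declared const
    const int *cp = &b;
    *const_cast<int *>(cp) = 10;      // OK: the underlying object is modifiable
    std::cout << b << std::endl;      // prints 10

    const int a = 5;                  // declared const
    const int *cpa = &a;
    // *const_cast<int *>(cpa) = 10;  // undefined behavior: a really is const
    std::cout << a << std::endl;
    return 0;
}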
This page recommends "loop unrolling" as an optimization:
Loop overhead can be reduced by reducing the number of iterations and
replicating the body of the loop.
Example:
In the code fragment below, the body of the loop can be replicated
once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++)
g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2)
{
g ();
g ();
}
With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops (it's not enabled in either -O2 or -O3). I've inspected the assembly to see if there's a significant difference.
g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o
Version 1:
0: ba 64 00 00 00 mov $0x64,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
# ... etc ...
a1: 83 c1 01 add $0x1,%ecx
a4: 83 ea 0a sub $0xa,%edx
a7: 89 0d 00 00 00 00 mov %ecx,0x0(%rip) # ad <main+0xad>
ad: 0f 85 55 ff ff ff jne 8 <main+0x8>
b3: 31 c0 xor %eax,%eax
b5: c3 retq
Version 2:
0: ba 32 00 00 00 mov $0x32,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
11: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 17 <main+0x17>
17: 8b 0d 00 00 00 00 mov 0x0(%rip),%ecx # 1d <main+0x1d>
1d: 83 c1 01 add $0x1,%ecx
# ... etc ...
143: 83 c7 01 add $0x1,%edi
146: 83 ea 0a sub $0xa,%edx
149: 89 3d 00 00 00 00 mov %edi,0x0(%rip) # 14f <main+0x14f>
14f: 0f 85 b3 fe ff ff jne 8 <main+0x8>
155: 31 c0 xor %eax,%eax
157: c3 retq
Version 2 produces more iterations. What am I missing?
Yes, there are cases where loop unrolling will make the code more efficient.
The theory is to reduce loop overhead (branching to the top of the loop and incrementing the loop counter).
Most processors hate branch instructions. They love data processing instructions. For every iteration, there is a minimum of one branch instruction. By "duplicating" a set of code, the number of branches is reduced and the number of data processing instructions between branches is increased.
Many modern compilers have optimization settings to perform loop unrolling.
It doesn’t produce more iterations; you’ll notice that the loop that calls g() twice runs half as many times. (What if you have to call g() an odd number of times? Look up Duff’s Device.)
In your listings, you'll notice that the assembly-language instruction jne 8 <main+0x8> appears once in both. This tells the processor to go back to the start of the loop. In the original loop, this instruction will run 99 times. In the unrolled loop, it will only run 49 times. Imagine if the body of the loop is something very short, just one or two instructions. These jumps might be a third or even half of the instructions in the most performance-critical part of your program! (And there is even a useful loop with zero instructions: BogoMIPS. But the article about optimizing that was a joke.)
So, unrolling the loop trades code size for speed, right? Not so fast. Maybe you've made your unrolled loop so big that the code at the top of the loop is no longer in the cache, and the CPU needs to fetch it. In the real world, the only way to know if it helps is to profile your program.
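As an aside on the odd-iteration-count question mentioned above, the usual plain-C++ alternative to Duff's Device is a separate remainder loop. A minimal sketch of mine (g and the call count n are assumed to be provided elsewhere):
void g();   // defined elsewhere

void call_n_times(int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {   // main loop, unrolled by 4
        g(); g(); g(); g();
    }
    for (; i < n; i++)             // remainder loop: up to 3 leftover calls
        g();
}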
Consider the following code:
typedef void (*Fn)();
volatile long sum = 0;
inline void accu() {
sum+=4;
}
static const Fn map[4] = {&accu, &accu, &accu, &accu};
int main(int argc, char** argv) {
static const long N = 10000000L;
if (argc == 1)
{
for (long i = 0; i < N; i++)
{
accu();
accu();
accu();
accu();
}
}
else
{
for (long i = 0; i < N; i++)
{
for (int j = 0; j < 4; j++)
(*map[j])();
}
}
}
When I compiled it with:
g++ -O3 test.cpp
I'm expecting the first branch to run faster because the compiler could inline the calls to accu, while the second branch cannot be inlined because accu is called through a function pointer stored in an array.
But the results surprised me:
time ./a.out
real 0m0.108s
user 0m0.104s
sys 0m0.000s
time ./a.out 1
real 0m0.095s
user 0m0.088s
sys 0m0.004s
I don't understand why, so I did an objdump:
objdump -DStTrR a.out > a.s
and the disassembly doesn't seem to explain the performance result I got:
8048300 <main>:
8048300: 55 push %ebp
8048301: 89 e5 mov %esp,%ebp
8048303: 53 push %ebx
8048304: bb 80 96 98 00 mov $0x989680,%ebx
8048309: 83 e4 f0 and $0xfffffff0,%esp
804830c: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048310: 74 27 je 8048339 <main+0x39>
8048312: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
8048318: e8 23 01 00 00 call 8048440 <_Z4accuv>
804831d: e8 1e 01 00 00 call 8048440 <_Z4accuv>
8048322: e8 19 01 00 00 call 8048440 <_Z4accuv>
8048327: e8 14 01 00 00 call 8048440 <_Z4accuv>
804832c: 83 eb 01 sub $0x1,%ebx
804832f: 90 nop
8048330: 75 e6 jne 8048318 <main+0x18>
8048332: 31 c0 xor %eax,%eax
8048334: 8b 5d fc mov -0x4(%ebp),%ebx
8048337: c9 leave
8048338: c3 ret
8048339: b8 80 96 98 00 mov $0x989680,%eax
804833e: 66 90 xchg %ax,%ax
8048340: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048346: 83 c2 04 add $0x4,%edx
8048349: 89 15 18 a0 04 08 mov %edx,0x804a018
804834f: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048355: 83 c2 04 add $0x4,%edx
8048358: 89 15 18 a0 04 08 mov %edx,0x804a018
804835e: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048364: 83 c2 04 add $0x4,%edx
8048367: 89 15 18 a0 04 08 mov %edx,0x804a018
804836d: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048373: 83 c2 04 add $0x4,%edx
8048376: 83 e8 01 sub $0x1,%eax
8048379: 89 15 18 a0 04 08 mov %edx,0x804a018
804837f: 75 bf jne 8048340 <main+0x40>
8048381: eb af jmp 8048332 <main+0x32>
8048383: 90 nop
...
8048440 <_Z4accuv>:
8048440: a1 18 a0 04 08 mov 0x804a018,%eax
8048445: 83 c0 04 add $0x4,%eax
8048448: a3 18 a0 04 08 mov %eax,0x804a018
804844d: c3 ret
804844e: 90 nop
804844f: 90 nop
It seems the direct call branch is definitely doing less than the function pointer branch.
But why does the function pointer branch run faster than the direct call?
And note that I did not rely only on time for the measurement; I also used clock_gettime and got similar results.
It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are known at compile time. So the compiler can substitute direct calls for the indirect function calls (and it does so). In theory it could go further and inline them (and in this case we would have two identical branches). But this particular compiler is not smart enough to do so.
As a result, the first branch is optimized "better". But with one exception: the compiler is not allowed to optimize away accesses to the volatile variable sum. As you can see from the disassembled code, this produces store instructions immediately followed by load instructions that depend on them:
mov %edx,0x804a018
mov 0x804a018,%edx
Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:
... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
The second branch avoids this problem because of additional call/return instructions between store and load. So it performs better.
Similar improvements may be done for the first branch if we add some (not very expensive) calculations in-between:
long x1 = 0;
for (long i = 0; i < N; i++)
{
x1 ^= i<<8;
accu();
x1 ^= i<<1;
accu();
x1 ^= i<<2;
accu();
x1 ^= i<<4;
accu();
}
sum += x1;
I am taking a look at large matrix multiplication and ran the following experiment to form a baseline test:
Randomly generate two 4096x4096 matrices X, Y from std normal (0 mean, 1 stddev).
Z = X*Y
Sum elements of Z (to make sure they are accessed) and output.
Here is the naïve C++ implementation:
#include <iostream>
#include <algorithm>
#include <random>
using namespace std;
int main()
{
constexpr size_t dim = 4096;
float* x = new float[dim*dim];
float* y = new float[dim*dim];
float* z = new float[dim*dim];
random_device rd;
mt19937 gen(rd());
normal_distribution<float> dist(0, 1);
for (size_t i = 0; i < dim*dim; i++)
{
x[i] = dist(gen);
y[i] = dist(gen);
}
for (size_t row = 0; row < dim; row++)
for (size_t col = 0; col < dim; col++)
{
float acc = 0;
for (size_t k = 0; k < dim; k++)
acc += x[row*dim + k] * y[k*dim + col];
z[row*dim + col] = acc;
}
float t = 0;
for (size_t i = 0; i < dim*dim; i++)
t += z[i];
cout << t << endl;
delete[] x;
delete[] y;
delete[] z;
}
Compile and run:
$ g++ -std=gnu++11 -O3 test.cpp
$ time ./a.out
Here is the Octave/matlab implementation:
X = stdnormal_rnd(4096, 4096);
Y = stdnormal_rnd(4096, 4096);
Z = X*Y;
sum(sum(Z))
Run:
$ time octave < test.octave
Octave under the hood is using BLAS (I assume the sgemm function)
The hardware is i7 3930X on Linux x86-64 with 24 GB of ram. BLAS appears to be using two cores. Perhaps a hyperthreaded pair?
I found that the C++ version compiled with GCC 4.7 on -O3 took 9 minutes to execute:
real 9m2.126s
user 9m0.302s
sys 0m0.052s
The octave version took 6 seconds:
real 0m5.985s
user 0m10.881s
sys 0m0.144s
I understand that BLAS is optimized to all hell, and the naïve algorithm is totally ignoring caches and so on, but seriously -- 90 times?
Can anyone explain this difference? What exactly is the architecture of the BLAS implementation? I see it is using Fortran, but what is happening at the CPU level? What algorithm is it using? How is it using the CPU caches? What x86-64 machine instructions does it call? (Is it using advanced CPU features like AVX?) Where does it get this extra speed from?
Which key optimizations to the C++ algorithm could get it on par with the BLAS version?
I ran octave under gdb and stopped it halfway through the computation a few times. It had started a second thread, and here are the stacks (it looked similar at all stops):
(gdb) thread 1
#0 0x00007ffff6e17148 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007ffff1626721 in ATL_join_tree () from /usr/lib/libblas.so.3
#2 0x00007ffff1626702 in ATL_join_tree () from /usr/lib/libblas.so.3
#3 0x00007ffff15ae357 in ATL_dptgemm () from /usr/lib/libblas.so.3
#4 0x00007ffff1384b59 in atl_f77wrap_dgemm_ () from /usr/lib/libblas.so.3
#5 0x00007ffff193effa in dgemm_ () from /usr/lib/libblas.so.3
#6 0x00007ffff6049727 in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#7 0x00007ffff6049954 in operator*(Matrix const&, Matrix const&) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#8 0x00007ffff7839e4e in ?? () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#9 0x00007ffff765a93a in do_binary_op(octave_value::binary_op, octave_value const&, octave_value const&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#10 0x00007ffff76c4190 in tree_binary_expression::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#11 0x00007ffff76c33a5 in tree_simple_assignment::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#12 0x00007ffff76d0864 in tree_evaluator::visit_statement(tree_statement&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#13 0x00007ffff76cffae in tree_evaluator::visit_statement_list(tree_statement_list&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#14 0x00007ffff757f6d4 in main_loop() () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#15 0x00007ffff7527abf in octave_main () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
(gdb) thread 2
#0 0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
(gdb) bt
#0 0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
#1 0x00007ffff15a5fd7 in ATL_dmmIJK2 () from /usr/lib/libblas.so.3
#2 0x00007ffff15a6ae4 in ATL_dmmIJK () from /usr/lib/libblas.so.3
#3 0x00007ffff1518e65 in ATL_dgemm () from /usr/lib/libblas.so.3
#4 0x00007ffff15adf7a in ATL_dptgemm0 () from /usr/lib/libblas.so.3
#5 0x00007ffff6e15e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007ffff6b41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()
It is calling BLAS gemm as expected.
The first thread appears to be joining the second, so I am not sure whether these two threads account for the 200% CPU usage observed or not.
Which library is ATL_dgemm libblas.so.3 and where is its code?
$ ls -al /usr/lib/libblas.so.3
/usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
$ ls -al /etc/alternatives/libblas.so.3
/etc/alternatives/libblas.so.3 -> /usr/lib/atlas-base/atlas/libblas.so.3
$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3
/usr/lib/atlas-base/atlas/libblas.so.3 -> libblas.so.3.0
$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3.0
/usr/lib/atlas-base/atlas/libblas.so.3.0
$ dpkg -S /usr/lib/atlas-base/atlas/libblas.so.3.0
libatlas3-base: /usr/lib/atlas-base/atlas/libblas.so.3.0
$ apt-get source libatlas3-base
It is ATLAS 3.8.4
Here are the optimizations I later implemented:
Using a tiled approach where I preload 64x64 blocks of X, Y and Z into separate arrays.
Changing the calculation of each block so that the inner loop looks like this:
for (size_t tcol = 0; tcol < block_width; tcol++)
bufz[trow][tcol] += B * bufy[tk][tcol];
This allows GCC to autovectorize to SIMD instructions and also allows for instruction level parallelism (I think).
Turning on -march=corei7-avx. This gains 30% extra speed but is cheating because I think the BLAS library is prebuilt.
Here is the code:
#include <iostream>
#include <algorithm>
#include <random>
using namespace std;
constexpr size_t dim = 4096;
constexpr size_t block_width = 64;
constexpr size_t num_blocks = dim / block_width;
double X[dim][dim], Y[dim][dim], Z[dim][dim];
double bufx[block_width][block_width];
double bufy[block_width][block_width];
double bufz[block_width][block_width];
void calc_block()
{
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tk = 0; tk < block_width; tk++)
{
double B = bufx[trow][tk];
for (size_t tcol = 0; tcol < block_width; tcol++)
bufz[trow][tcol] += B * bufy[tk][tcol];
}
}
int main()
{
random_device rd;
mt19937 gen(rd());
normal_distribution<double> dist(0, 1);
for (size_t row = 0; row < dim; row++)
for (size_t col = 0; col < dim; col++)
{
X[row][col] = dist(gen);
Y[row][col] = dist(gen);
Z[row][col] = 0;
}
for (size_t block_row = 0; block_row < num_blocks; block_row++)
for (size_t block_col = 0; block_col < num_blocks; block_col++)
{
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tcol = 0; tcol < block_width; tcol++)
bufz[trow][tcol] = 0;
for (size_t block_k = 0; block_k < num_blocks; block_k++)
{
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tcol = 0; tcol < block_width; tcol++)
{
bufx[trow][tcol] = X[block_row*block_width + trow][block_k*block_width + tcol];
bufy[trow][tcol] = Y[block_k*block_width + trow][block_col*block_width + tcol];
}
calc_block();
}
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tcol = 0; tcol < block_width; tcol++)
Z[block_row*block_width + trow][block_col*block_width + tcol] = bufz[trow][tcol];
}
double t = 0;
for (size_t row = 0; row < dim; row++)
for (size_t col = 0; col < dim; col++)
t += Z[row][col];
cout << t << endl;
}
All the action is in the calc_block function - over 90% of the time is spent in it.
The new time is:
real 0m17.370s
user 0m17.213s
sys 0m0.092s
Which is much closer.
The disassembly of the calc_block function is as follows:
0000000000401460 <_Z10calc_blockv>:
401460: b8 e0 21 60 00 mov $0x6021e0,%eax
401465: 41 b8 e0 23 61 00 mov $0x6123e0,%r8d
40146b: 31 ff xor %edi,%edi
40146d: 49 29 c0 sub %rax,%r8
401470: 49 8d 34 00 lea (%r8,%rax,1),%rsi
401474: 48 89 f9 mov %rdi,%rcx
401477: ba e0 a1 60 00 mov $0x60a1e0,%edx
40147c: 48 c1 e1 09 shl $0x9,%rcx
401480: 48 81 c1 e0 21 61 00 add $0x6121e0,%rcx
401487: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40148e: 00 00
401490: c4 e2 7d 19 01 vbroadcastsd (%rcx),%ymm0
401495: 48 83 c1 08 add $0x8,%rcx
401499: c5 fd 59 0a vmulpd (%rdx),%ymm0,%ymm1
40149d: c5 f5 58 08 vaddpd (%rax),%ymm1,%ymm1
4014a1: c5 fd 29 08 vmovapd %ymm1,(%rax)
4014a5: c5 fd 59 4a 20 vmulpd 0x20(%rdx),%ymm0,%ymm1
4014aa: c5 f5 58 48 20 vaddpd 0x20(%rax),%ymm1,%ymm1
4014af: c5 fd 29 48 20 vmovapd %ymm1,0x20(%rax)
4014b4: c5 fd 59 4a 40 vmulpd 0x40(%rdx),%ymm0,%ymm1
4014b9: c5 f5 58 48 40 vaddpd 0x40(%rax),%ymm1,%ymm1
4014be: c5 fd 29 48 40 vmovapd %ymm1,0x40(%rax)
4014c3: c5 fd 59 4a 60 vmulpd 0x60(%rdx),%ymm0,%ymm1
4014c8: c5 f5 58 48 60 vaddpd 0x60(%rax),%ymm1,%ymm1
4014cd: c5 fd 29 48 60 vmovapd %ymm1,0x60(%rax)
4014d2: c5 fd 59 8a 80 00 00 vmulpd 0x80(%rdx),%ymm0,%ymm1
4014d9: 00
4014da: c5 f5 58 88 80 00 00 vaddpd 0x80(%rax),%ymm1,%ymm1
4014e1: 00
4014e2: c5 fd 29 88 80 00 00 vmovapd %ymm1,0x80(%rax)
4014e9: 00
4014ea: c5 fd 59 8a a0 00 00 vmulpd 0xa0(%rdx),%ymm0,%ymm1
4014f1: 00
4014f2: c5 f5 58 88 a0 00 00 vaddpd 0xa0(%rax),%ymm1,%ymm1
4014f9: 00
4014fa: c5 fd 29 88 a0 00 00 vmovapd %ymm1,0xa0(%rax)
401501: 00
401502: c5 fd 59 8a c0 00 00 vmulpd 0xc0(%rdx),%ymm0,%ymm1
401509: 00
40150a: c5 f5 58 88 c0 00 00 vaddpd 0xc0(%rax),%ymm1,%ymm1
401511: 00
401512: c5 fd 29 88 c0 00 00 vmovapd %ymm1,0xc0(%rax)
401519: 00
40151a: c5 fd 59 8a e0 00 00 vmulpd 0xe0(%rdx),%ymm0,%ymm1
401521: 00
401522: c5 f5 58 88 e0 00 00 vaddpd 0xe0(%rax),%ymm1,%ymm1
401529: 00
40152a: c5 fd 29 88 e0 00 00 vmovapd %ymm1,0xe0(%rax)
401531: 00
401532: c5 fd 59 8a 00 01 00 vmulpd 0x100(%rdx),%ymm0,%ymm1
401539: 00
40153a: c5 f5 58 88 00 01 00 vaddpd 0x100(%rax),%ymm1,%ymm1
401541: 00
401542: c5 fd 29 88 00 01 00 vmovapd %ymm1,0x100(%rax)
401549: 00
40154a: c5 fd 59 8a 20 01 00 vmulpd 0x120(%rdx),%ymm0,%ymm1
401551: 00
401552: c5 f5 58 88 20 01 00 vaddpd 0x120(%rax),%ymm1,%ymm1
401559: 00
40155a: c5 fd 29 88 20 01 00 vmovapd %ymm1,0x120(%rax)
401561: 00
401562: c5 fd 59 8a 40 01 00 vmulpd 0x140(%rdx),%ymm0,%ymm1
401569: 00
40156a: c5 f5 58 88 40 01 00 vaddpd 0x140(%rax),%ymm1,%ymm1
401571: 00
401572: c5 fd 29 88 40 01 00 vmovapd %ymm1,0x140(%rax)
401579: 00
40157a: c5 fd 59 8a 60 01 00 vmulpd 0x160(%rdx),%ymm0,%ymm1
401581: 00
401582: c5 f5 58 88 60 01 00 vaddpd 0x160(%rax),%ymm1,%ymm1
401589: 00
40158a: c5 fd 29 88 60 01 00 vmovapd %ymm1,0x160(%rax)
401591: 00
401592: c5 fd 59 8a 80 01 00 vmulpd 0x180(%rdx),%ymm0,%ymm1
401599: 00
40159a: c5 f5 58 88 80 01 00 vaddpd 0x180(%rax),%ymm1,%ymm1
4015a1: 00
4015a2: c5 fd 29 88 80 01 00 vmovapd %ymm1,0x180(%rax)
4015a9: 00
4015aa: c5 fd 59 8a a0 01 00 vmulpd 0x1a0(%rdx),%ymm0,%ymm1
4015b1: 00
4015b2: c5 f5 58 88 a0 01 00 vaddpd 0x1a0(%rax),%ymm1,%ymm1
4015b9: 00
4015ba: c5 fd 29 88 a0 01 00 vmovapd %ymm1,0x1a0(%rax)
4015c1: 00
4015c2: c5 fd 59 8a c0 01 00 vmulpd 0x1c0(%rdx),%ymm0,%ymm1
4015c9: 00
4015ca: c5 f5 58 88 c0 01 00 vaddpd 0x1c0(%rax),%ymm1,%ymm1
4015d1: 00
4015d2: c5 fd 29 88 c0 01 00 vmovapd %ymm1,0x1c0(%rax)
4015d9: 00
4015da: c5 fd 59 82 e0 01 00 vmulpd 0x1e0(%rdx),%ymm0,%ymm0
4015e1: 00
4015e2: c5 fd 58 80 e0 01 00 vaddpd 0x1e0(%rax),%ymm0,%ymm0
4015e9: 00
4015ea: 48 81 c2 00 02 00 00 add $0x200,%rdx
4015f1: 48 39 ce cmp %rcx,%rsi
4015f4: c5 fd 29 80 e0 01 00 vmovapd %ymm0,0x1e0(%rax)
4015fb: 00
4015fc: 0f 85 8e fe ff ff jne 401490 <_Z10calc_blockv+0x30>
401602: 48 83 c7 01 add $0x1,%rdi
401606: 48 05 00 02 00 00 add $0x200,%rax
40160c: 48 83 ff 40 cmp $0x40,%rdi
401610: 0f 85 5a fe ff ff jne 401470 <_Z10calc_blockv+0x10>
401616: c5 f8 77 vzeroupper
401619: c3 retq
40161a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
Here are three factors responsible for the performance difference between your code and BLAS (plus a note on Strassen’s algorithm).
In your inner loop, on k, you have y[k*dim + col]. Because of the way memory cache is arranged, consecutive values of k with the same dim and col map to the same cache set. The way cache is structured, each memory address has one cache set where its contents must be held while it is in cache. Each cache set has several lines (four is a typical number), and each of those lines can hold any of the memory addresses that map to that particular cache set.
Because your inner loop iterates through y in this way, each time it uses an element from y, it must load the memory for that element into the same set as the previous iteration did. This forces one of the previous cache lines in the set to be evicted. Then, in the next iteration of the col loop, all of the elements of y have been evicted from cache, so they must be reloaded again.
Thus, every time your loop loads an element of y, it must be loaded from memory, which takes many CPU cycles.
High-performance code avoids this in two ways. One, it divides the work into smaller blocks. The rows and the columns are partitioned into smaller sizes, and processed with shorter loops that are able to use all the elements in a cache line and to use each element several times before they go on to the next block. Thus, most of the references to elements of x and elements of y come from cache, often in a single processor cycle. Two, in some situations, the code will copy data out of a column of a matrix (which thrashes cache due to the geometry) into a row of a temporary buffer (which avoids thrashing). This again allows most of the memory references to be served from cache instead of from memory.
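A minimal sketch of that second idea (my illustration, not part of the answer's code): gather one column of y into a contiguous scratch buffer, then compute against that buffer.
#include <cstddef>
#include <vector>

// Copy column `col` of the dim-by-dim row-major matrix y into a contiguous
// buffer; the inner product can then read ycol[k] sequentially instead of
// striding through y by dim elements on every step.
std::vector<float> gather_column(const float* y, std::size_t dim, std::size_t col)
{
    std::vector<float> ycol(dim);
    for (std::size_t k = 0; k < dim; k++)
        ycol[k] = y[k*dim + col];
    return ycol;
}
With col as the outer loop, ycol can be reused for every row, so the strided traversal of y happens only once per column rather than once per (row, col) pair, and the inner loop becomes acc += x[row*dim + k] * ycol[k].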
Another factor is the use of Single Instruction Multiple Data (SIMD) features. Many modern processors have instructions that load multiple elements (four float elements is typical, but some now do eight) in one instruction, store multiple elements, add multiple elements (e.g., for each of these four, add it to the corresponding one of those four), multiply multiple elements, and so on. Simply using such instructions immediately makes your code four times faster, provided you are able to arrange your work to use those instructions.
These instructions are not directly accessible in standard C. Some optimizers now try to use such instructions when they can, but this optimization is difficult, and it is not common to gain much benefit from it. Many compilers provide extensions to the language that give access to these instructions. Personally, I usually prefer to write in assembly language to use SIMD.
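As an example of such a language extension, here is a small sketch of mine using the AVX intrinsics from <immintrin.h> (not the answer's code; it mirrors the vector multiply/add pattern visible in the disassembly above, but for float, and needs -mavx or -march=corei7-avx):
#include <immintrin.h>

// Multiply 8 consecutive floats of y by the scalar b and accumulate them into
// 8 consecutive floats of z, using one vector multiply and one vector add.
void axpy8(float* z, const float* y, float b)
{
    __m256 vb = _mm256_set1_ps(b);                    // broadcast the scalar
    __m256 vy = _mm256_loadu_ps(y);                   // load 8 elements of y
    __m256 vz = _mm256_loadu_ps(z);                   // load 8 elements of z
    vz = _mm256_add_ps(vz, _mm256_mul_ps(vb, vy));    // z += b * y
    _mm256_storeu_ps(z, vz);                          // store back 8 results
}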
Another factor is using instruction-level parallel execution features on a processor. Observe that in your inner loop, acc is updated. The next iteration cannot add to acc until the previous iteration has finished updating acc. High-performance code will instead keep multiple sums running in parallel (even multiple SIMD sums). The result of this will be that while the addition for one sum is executing, the addition for another sum will be started. It is common on today’s processors to support four or more floating-point operations in progress at a time. As written, your code cannot do this at all. Some compilers will try to optimize the code by rearranging loops, but this requires the compiler to be able to see that iterations of a particular loop are independent from each other or can be commuted with another loop, et cetera.
It is quite feasible that using cache effectively provides a factor of ten performance improvement, SIMD provides another four, and instruction-level parallelism provides another four, giving 160 altogether.
Here is a very crude estimate of the effect of Strassen's algorithm, based on this Wikipedia page. The Wikipedia page says Strassen is slightly better than direct multiplication around n = 100. This suggests the ratio of the constant factors of the execution times is 100^3 / 100^2.807 ≈ 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. However, simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096 ((4096/100)^(3-2.807) ≈ 2.05). Again, that is just a ballpark estimate.
As for the later optimizations, consider this code in the inner loop:
bufz[trow][tcol] += B * bufy[tk][tcol];
One potential issue with this is that bufz could, in general, overlap bufy. Since you use global definitions for bufz and bufy, the compiler likely knows they do not overlap in this case. However, if you move this code into a subroutine that is passed bufz and bufy as parameters, and especially if you compile that subroutine in a separate source file, then the compiler is less likely to know that bufz and bufy do not overlap. In that case, the compiler cannot vectorize or otherwise reorder the code, because the bufz[trow][tcol] in this iteration might be the same as bufy[tk][tcol] in another iteration.
Even if the compiler can see that the subroutine is called with different bufz and bufy in the current source module, if the routine has extern linkage (the default), then the compiler has to allow for the routine to be called from an external module, so it must generate code that works correctly if bufz and bufy overlap. (One way the compiler can deal with this is to generate two versions of the routine, one to be called from external modules and one to be called from the module currently being compiled. Whether it does that depends on your compiler, the optimization switches, et cetera.) If you declare the routine as static, then the compiler knows it cannot be called from an external module (unless you take its address and there is a possibility the address is passed outside of the current module).
Another potential issue is that, even if the compiler vectorizes this code, it does not necessarily generate the best code for the processor you execute on. Looking at the generated assembly code, it appears the compiler is using only %ymm1 repeatedly. Over and over again, it multiplies a value from memory into %ymm1, adds a value from memory to %ymm1, and stores a value from %ymm1 to memory. There are two problems with this.
One, you do not want these partial sums stored to memory frequently. You want many additions accumulated into a register, and the register will be written to memory only infrequently. Convincing the compiler to do this likely requires rewriting the code to be explicit about keeping partial sums in temporary objects and writing them to memory after a loop has completed.
Two, these instructions are nominally serially dependent. The add cannot start until the multiply completes, and the store cannot write to memory until the add completes. The Core i7 has great capabilities for out-of-order execution. So, while it has that add waiting to start execution, it looks at the multiply later in the instruction stream and starts it. (Even though that multiply also uses %ymm1, the processor remaps the registers on the fly, so that it uses a different internal register to do this multiply.) Even though your code is filled with consecutive dependencies, the processor tries to execute several instructions at once. However, a number of things can interfere with this. You can run out of the internal registers the processor uses for renaming. The memory addresses you use might run into false conflicts. (The processor looks at a dozen or so of the low bits of memory addresses to see if the address might be the same as another one that it is trying to load or store from an earlier instruction. If the bits are equal, the processor has to delay the current load or store until it can verify the entire address is different. This delay can bollux up more than just the current load or store.) So, it is better to have instructions that are overtly independent.
That is one more reason I prefer to write high-performance code in assembly. To do it in C, you have to convince the compiler to give you instructions like this, by doing things such as writing some of your own SIMD code (using the language extensions for them) and manually unrolling loops (writing out multiple iterations).
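For the naïve inner loop from the question, keeping several partial sums and unrolling by hand might look roughly like this (a sketch that assumes dim is a multiple of 4; the function name multiply_row_col is made up for illustration, and reassociating floating-point additions can change the result slightly):
#include <cstddef>

// Compute z[row][col] with four independent accumulators so the additions do
// not form one long serial dependency chain, and the partial sums stay in
// registers until the single store at the end.
void multiply_row_col(const float* x, const float* y, float* z,
                      std::size_t dim, std::size_t row, std::size_t col)
{
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (std::size_t k = 0; k < dim; k += 4)          // assumes dim % 4 == 0
    {
        acc0 += x[row*dim + k + 0] * y[(k + 0)*dim + col];
        acc1 += x[row*dim + k + 1] * y[(k + 1)*dim + col];
        acc2 += x[row*dim + k + 2] * y[(k + 2)*dim + col];
        acc3 += x[row*dim + k + 3] * y[(k + 3)*dim + col];
    }
    z[row*dim + col] = (acc0 + acc1) + (acc2 + acc3); // one store per element
}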
When copying into and out of buffers, there might be similar issues. However, you report 90% of the time is spent in calc_block, so I have not looked at this closely.
Strassen's algorithm has two advantages over the naïve algorithm:
Better time complexity in terms of number of operations, as other answers correctly point out;
It is a cache-oblivious algorithm. The difference in the number of cache misses is on the order of B*M^(1/2), where B is the cache line size and M is the cache size.
I think that the second point accounts for a lot of the slowdown you are experiencing. If you are running your application under Linux, I suggest you run it with the perf tool, which tells you how many cache misses the program is experiencing.
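For example, something along these lines (exact event names can vary between kernels and CPUs):
$ perf stat -e cache-references,cache-misses ./a.out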
I don't know how reliable the information is, but Wikipedia says that BLAS uses Strassen's algorithm for big matrices. And yours are big indeed. That is around O(n^2.807), which is better than the O(n^3) of your naïve algorithm.
This is a quite complex topic, and well answered by Eric in the post above. I just want to point to a useful reference in this direction, page 84:
http://www.rrze.fau.de/dienste/arbeiten-rechnen/hpc/HPC4SE/
which suggests applying "loop unroll and jam" on top of blocking.
Can anyone explain this difference?
A general explanation is that the ratio of the number of operations to the amount of data is O(N^3)/O(N^2). Thus matrix-matrix multiplication is a cache-bound algorithm, which means that you don't suffer from the common memory-bandwidth bottleneck for large matrix sizes.
You can get up to 90% of the peak performance of your CPU if the code is well-optimized. So the optimization potential, elaborated by Eric, is tremendous, as you observed. Actually, it would be very interesting to see the best-performing code and to compile your final program with another compiler (Intel usually brags about being the best).
About half of the difference is accounted for in algorithmic improvement. (4096*4096)^3 is the complexity of your algorithm, or 4.7x10^21, and (4096*4096)^2.807 is 1x10^20. That's a difference of about 47x.
The other 2x will be accounted for by more intelligent use of the cache, SSE instructions, and other such low-level optimizations.
Edit: I lie, n is width, not width^2. The algorithm would only actually account for about 4x, so there's still about another 22x to go. Threads, cache, and SSE instructions may well account for such things.