I'm trying to use Intel intrinsics to beat the compiler optimized code. Sometimes I can do it, other times I can't.
I guess the question is: why can I sometimes beat the compiler, but other times not? I got a time of 0.006 seconds for operator+= below using Intel intrinsics (vs 0.009 when using bare C++), but a time of 0.07 s for operator+ using intrinsics, while bare C++ was only 0.03 s.
#include <windows.h>
#include <stdio.h>
#include <intrin.h>
class Timer
{
LARGE_INTEGER startTime ;
double fFreq ;
public:
Timer() {
LARGE_INTEGER freq ;
QueryPerformanceFrequency( &freq ) ;
fFreq = (double)freq.QuadPart ;
reset();
}
void reset() { QueryPerformanceCounter( &startTime ) ; }
double getTime() {
LARGE_INTEGER endTime ;
QueryPerformanceCounter( &endTime ) ;
return ( endTime.QuadPart - startTime.QuadPart ) / fFreq ; // as double
}
} ;
inline float randFloat(){
return (float)rand()/RAND_MAX ;
}
// Use my optimized code,
#define OPTIMIZED_PLUS_EQUALS
#define OPTIMIZED_PLUS
union Vector
{
struct { float x,y,z,w ; } ;
__m128 reg ;
Vector():x(0.f),y(0.f),z(0.f),w(0.f) {}
Vector( float ix, float iy, float iz, float iw ):x(ix),y(iy),z(iz),w(iw) {}
//Vector( __m128 val ):x(val.m128_f32[0]),y(val.m128_f32[1]),z(val.m128_f32[2]),w(val.m128_f32[3]) {}
Vector( __m128 val ):reg( val ) {} // 2x speed, above
inline Vector& operator+=( const Vector& o ) {
#ifdef OPTIMIZED_PLUS_EQUALS
// YES! I beat it! Using this intrinsic is faster than just C++.
reg = _mm_add_ps( reg, o.reg ) ;
#else
x+=o.x, y+=o.y, z+=o.z, w+=o.w ;
#endif
return *this ;
}
inline Vector operator+( const Vector& o )
{
#ifdef OPTIMIZED_PLUS
// This is slower
return Vector( _mm_add_ps( reg, o.reg ) ) ;
#else
return Vector( x+o.x, y+o.y, z+o.z, w+o.w ) ;
#endif
}
static Vector random(){
return Vector( randFloat(), randFloat(), randFloat(), randFloat() ) ;
}
void print() {
printf( "%.2f %.2f %.2f\n", x,y,z,w ) ;
}
} ;
int runs = 8000000 ;
Vector sum ;
// OPTIMIZED_PLUS_EQUALS (intrinsics) runs FASTER 0.006 intrinsics, vs 0.009 (std C++)
void test1(){
for( int i = 0 ; i < runs ; i++ )
sum += Vector(1.f,0.25f,0.5f,0.5f) ;//Vector::random() ;
}
// OPTIMIZED* runs SLOWER (0.03 for reg.C++, vs 0.07 for intrinsics)
void test2(){
float j = 27.f ;
for( int i = 0 ; i < runs ; i++ )
{
sum += Vector( j*i, i, i/j, i ) + Vector( i, 2*i*j, 3*i*j*j, 4*i ) ;
}
}
int main()
{
Timer timer ;
//test1() ;
test2() ;
printf( "Time: %f\n", timer.getTime() ) ;
sum.print() ;
}
Edit
Why am I doing this? The VS 2012 profiler is telling me my vector arithmetic operations could use some tuning.
As noted by Mysticial, the union hack is the most likely culprit in test2. It forces the data to go through the L1 cache, which, while fast, has some latency that is much more than the 2-cycle gain the vector code offers (see below).
But also consider that the CPU can run multiple instructions out of order and in parallel (it is superscalar). For example, Sandy Bridge has 6 execution units, p0-p5; floating-point multiplication/division runs on p0, and floating-point addition and integer multiplication run on p1. Also, division takes 3-4 times more cycles than multiplication/addition, and is not pipelined (i.e. the execution unit cannot start another instruction while a division is in progress). So in test2, while the vector code is waiting for the expensive division and some multiplications to finish on unit p0, the scalar code can be performing the extra 2 add instructions on p1, which most likely obliterates any advantage of the vector instructions.
test1 is different: the constant vector can be kept in an xmm register, and in that case the loop contains only add instructions. But the vector code is not 3x faster as might be expected. The reason is pipelining: each add instruction has a latency of 3 cycles, but the CPU can start a new one every cycle when they are independent of each other, which is the case for the per-component scalar additions. Therefore the vector code executes one add instruction per loop iteration with a 3-cycle latency, while the scalar code executes 3 add instructions, taking only 5 cycles (one started per cycle, and the third has latency 3: 2 + 3 = 5).
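For illustration, here is a sketch (not the asker's original code) of a Vector that keeps its data only as an __m128 and extracts lanes on demand, so values are not bounced between the scalar x/y/z/w members and the SSE register through memory:
#include <xmmintrin.h>
struct Vec4
{
__m128 reg ;
Vec4() : reg( _mm_setzero_ps() ) {}
explicit Vec4( __m128 r ) : reg( r ) {}
Vec4( float x, float y, float z, float w ) : reg( _mm_set_ps( w, z, y, x ) ) {} // _mm_set_ps takes its arguments high-to-low
Vec4& operator+=( const Vec4& o ) { reg = _mm_add_ps( reg, o.reg ) ; return *this ; }
Vec4 operator+( const Vec4& o ) const { return Vec4( _mm_add_ps( reg, o.reg ) ) ; }
float x() const { return _mm_cvtss_f32( reg ) ; } // read a single lane only when actually needed
} ;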
A very good resource on CPU architectures and optimization is http://www.agner.org/optimize/
Related
I needed a way of initializing a scalar value given either a single float, or three floating point values (corresponding to RGB). So I just threw together a very simple struct:
struct Mono {
float value;
Mono(){
this->value = 0;
}
Mono(float value) {
this->value = value;
};
Mono(float red, float green, float blue){
this->value = (red+green+blue)/3;
};
};
// Multiplication operator overloads:
Mono operator*( Mono const& lhs, Mono const& rhs){
return Mono(lhs.value*rhs.value);
};
Mono operator*( float const& lhs, Mono const& rhs){
return Mono(lhs*rhs.value);
};
Mono operator*( Mono const& lhs, float const& rhs){
return Mono(lhs.value*rhs);
};
This worked as expected, but then I wanted to benchmark to see if this wrapper is going to impact performance at all, so I wrote the following benchmark test where I simply multiplied a float by the struct 100,000,000 times, and multiplied a float by a float 100,000,000 times:
#include <vector>
#include <chrono>
#include <iostream>
using namespace std::chrono;
int main() {
size_t N = 100000000;
std::vector<float> inputs(N);
std::vector<Mono> outputs_c(N);
std::vector<float> outputs_f(N);
Mono color(3.24);
float color_f = 3.24;
for (size_t i = 0; i < N; i++){
inputs[i] = i;
};
auto start_c = high_resolution_clock::now();
for (size_t i = 0; i < N; i++){
outputs_c[i] = color*inputs[i];
}
auto stop_c = high_resolution_clock::now();
auto duration_c = duration_cast<microseconds>(stop_c - start_c);
std::cout << "Mono*float duration: " << duration_c.count() << "\n";
auto start_f = high_resolution_clock::now();
for (size_t i = 0; i < N; i++){
outputs_f[i] = color_f*inputs[i];
}
auto stop_f = high_resolution_clock::now();
auto duration_f = duration_cast<microseconds>(stop_f - start_f);
std::cout << "float*float duration: " << duration_f.count() << "\n";
return 0;
}
When I compile it without any optimizations: g++ test.cpp, it prints the following times (in microseconds) very reliably:
Mono*float duration: 841122
float*float duration: 656197
So the Mono*float is clearly slower in that case. But then if I turn on optimizations (g++ test.cpp -O3), it prints the following times (in microseconds) very reliably:
Mono*float duration: 75494
float*float duration: 86176
I'm assuming that something is getting optimized weirdly here and it is NOT actually faster to wrap a float in a struct like this... but I'm struggling to see what is going wrong with my test.
On my system (i7-6700k with GCC 12.2.1), whichever loop I do second runs slower, and the asm for the two loops is identical.
Perhaps because cache is still partly primed from the inputs[i] = i; init loop when the first loop runs. (See https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ re: Intel's adaptive replacement policy for L3 which might explain some but not all of the entries surviving that big init loop. 100000000 floats is 400 MB per array, and my CPU has 8 MiB of L3 cache.)
So, as expected from the low computational intensity (one vector math instruction per 16 bytes loaded + stored), it's just a cache / memory-bandwidth benchmark, since you used one huge array instead of repeated passes over a smaller array. It has nothing to do with whether you have a bare float or a struct { float; }.
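To make that concrete, here is a sketch (mine, not the original benchmark) of how one might time the multiply itself rather than DRAM bandwidth: make repeated passes over an array small enough to stay in L1d/L2.
#include <vector>
#include <chrono>
#include <cstdio>
#include <cstddef>
int main() {
    const std::size_t N = 4096;            // ~16 KiB of floats, fits in L1d on typical CPUs
    const int passes = 100000;
    std::vector<float> in(N), out(N);
    for (std::size_t i = 0; i < N; i++) in[i] = (float)i;
    const float k = 3.24f;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int p = 0; p < passes; p++)
        for (std::size_t i = 0; i < N; i++)
            out[i] = k * in[i];             // same work per element, but the data stays hot in cache
    auto t1 = std::chrono::high_resolution_clock::now();
    std::printf("%lld us total\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    std::printf("%f\n", out[N - 1]);        // use a result; a smart compiler may still collapse the repeated passes, see DoNotOptimize below
    return 0;
}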
As expected, both loops compile to the same asm - https://godbolt.org/z/7eTh4ojYf - doing a movups load, mulps to multiply 4 floats, and a movups unaligned store. For some reason, GCC reloads the vector constant of 3.24 instead of hoisting it out of the loop, so it's doing 2 loads and 1 store per multiply. Cache misses on the big arrays should give plenty of time for out-of-order exec to do those extra loads from the same .rodata address that hit in L1d cache every time.
I tried How can I mitigate the impact of the Intel jcc erratum on gcc? but it didn't make a difference; still about the same performance delta with -Wa,-mbranches-within-32B-boundaries, so as expected it's not a front-end bottleneck; IPC is plenty low. Maybe some quirk of cache.
On my system (Linux 6.1.8 on i7-6700k at 3.9GHz, compiled with GCC 12.2.1 -O3 without -march=native or -ffast-math), your whole program spends nearly half its time in the kernel's page-fault handler (perf stat vs. perf stat --all-user cycle counts). That's not great if you're not trying to benchmark memory allocation and TLB misses.
But that's total time; you do touch the input and output arrays before the loop (std::vector<float> outputs_c(N); allocates and zeros space for N elements, same for your custom struct with a constructor.) There shouldn't be page faults inside your timed regions, only potentially TLB misses. And of course lots of cache misses.
BTW, clang correctly optimizes away all the loops, because none of the results are ever used. benchmark::DoNotOptimize(outputs_c[argc]) might help with that. Or some manual use of asm with dummy memory inputs / outputs to force the compiler to materialize arrays in memory and forget their contents.
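For reference, the escape/clobber trick looks roughly like this (a sketch of the idiom, close to what Google Benchmark does, assuming GCC/Clang extended inline asm):
// Tell the optimizer the value is "used", without emitting any real instructions.
template <class T>
inline void DoNotOptimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");   // empty asm: value as input, memory clobber
}
// usage after a timed loop:
//   DoNotOptimize(outputs_f[0]);   // now the loop filling outputs_f cannot be deleted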
See also Idiomatic way of performance evaluation?
I am trying to modify a vector in place using multiple threads. It is a very simple operation— subtracting 1 from each index, but speed is highest priority here since both the vector size and number of times I need to do this operation can be quite large (10k elements, 500 increments). Right now I have loops of the sort:
#include<vector>
using namespace std;
int main() {
vector<int> my_vec(10000);
fill(my_vec.begin(), my_vec.end(), 10);
for (int i = my_vec.size(); i--; ) {
my_vec[i] -= 1;
}
}
I am coming back to C/C++ after several years working primarily in R, where splitting this embarrassingly parallel loop across multiple threads is trivial (i.e., each of n threads operates over a portion of the indices, and the results are then concatenated).
How can I best do this in C++ so that it a) avoids copying the entire vector, and b) is not ultimately slower than the original loop?
As I expected, this is purely a memory I/O problem for the given size of your vector.
I took your initial example and built an AVX2-enabled version, and it does not fare much better than a simple loop - it might be that the simple loop gets auto-vectorized with AVX too, by the way.
The reverse loop:
for (int i = my_vec.size(); i--;) {
my_vec[i] -= 1;
}
The forward loop:
for ( int i = 0; i<my_vec.size(); ++i ) {
my_vec[i] -= 1;
}
The AVX2 unaligned loop:
__m256i* ptr = (__m256i*)my_vec.data();
constexpr size_t per_block = sizeof(__m256i)/sizeof(int);
size_t num_blocks = my_vec.size() / per_block;
size_t remaining = my_vec.size() % per_block;
__m256i ones = _mm256_set1_epi32( 1 );
for ( size_t j=0; j<num_blocks; ++j, ++ptr ) {
__m256i val = _mm256_lddqu_si256( ptr );
val = _mm256_sub_epi32( val, ones );
_mm256_storeu_si256( ptr, val );
}
The AVX2 aligned loop:
__m256i* ptr = (__m256i*)my_vec.data();
constexpr size_t per_block = sizeof(__m256i)/sizeof(int);
size_t num_blocks = my_vec.size() / per_block;
size_t remaining = my_vec.size() % per_block;
__m256i ones = _mm256_set1_epi32( 1 );
for ( size_t j=0; j<num_blocks; ++j, ++ptr ) {
__m256i val = _mm256_load_si256( ptr );
val = _mm256_sub_epi32( val, ones );
_mm256_store_si256( ptr, val );
}
The tests all run in pretty much the same elapsed-time range:
Test:reverse Elapsed:0.295494 ticks/int
Test:forward Elapsed:0.313866 ticks/int
Test:avx2forward Elapsed:0.367432 ticks/int
Test:avx2aligned Elapsed:0.298912 ticks/int
The entire test is here: https://godbolt.org/z/MWjrorncs
As for why this is a memory-limited problem: your array (10k ints, roughly 40 KB) fits entirely in the per-core caches of any CPU sold today. Since the L1 cache is local to a core, adding threads will only make things worse, because it increases contention for memory between the threads. You can see, for example, that the main solution for high-throughput FizzBuzz on Code Golf is single-threaded for this exact reason.
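For completeness, a sketch (mine, not part of the benchmark above) of what "each thread operates over a portion of the indices" looks like in C++ with OpenMP (compile with -fopenmp); no copying or concatenation is needed because the threads write disjoint ranges of the same vector. As explained above, for a 10k-element vector this is unlikely to beat the plain loop.
#include <vector>
void decrement_all(std::vector<int>& v) {
    // static scheduling hands each thread one contiguous chunk of indices
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)v.size(); ++i)
        v[i] -= 1;
}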
I am trying to find the data race in my code but I just can't seem to grasp why it happens. The data in the threads is used read-only and the only variable that is written to is protected by a critical region.
I tried using Intel Inspector, but I am compiling with g++ 9.3.0 and apparently even the 2021 version can't deal with the OpenMP implementation it produces. The release notes do not explicitly state it as an exception, as they did for older versions, but there is a warning about false positives because it is not supported. It also always shows a data race for the pragma statements, which isn't helpful at all.
My current suspects are either Eigen or the fact that I use a reference to a std::vector. I compile Eigen itself with EIGEN_DONT_PARALLELIZE so as not to mess with nested parallelism, although I don't think I use anything that would trigger it anyway.
Edit:
I am not sure if it is really a "data race" (or a wrong memory access?), but the example produces non-deterministic output, in that the result differs for the same input. When this happens, the loop in main breaks. With more than one thread this happens early (usually after 5-12 iterations). If I run it with only one thread, or compile without OpenMP, I have to end the example program manually.
Minimal (not) working example below.
#include <Eigen/Dense>
#include <vector>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_set_num_threads(number)
#endif
typedef Eigen::Matrix<double, 9, 1> Vector9d;
typedef std::vector<Vector9d, Eigen::aligned_allocator<Vector9d>> Vector9dList;
Vector9d derivPath(const Vector9dList& pathPositions, int index){
int n = pathPositions.size()-1;
if(index >= 0 && index < n+1){
// path is one point, no derivative possible
if(n == 0){
return Vector9d::Zero();
}
else if(index == n){
return Vector9d::Zero();
}
// path is a line, derivative is in the direction of start to end
else {
return n * (pathPositions[index+1] - pathPositions[index]);
}
}
else{
return Vector9d::Zero();
}
}
// ********************************
// data race occurs here somewhere
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
#pragma omp parallel default(none) shared(pathPositions, err, n)
{
double err_private = 0;
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
// when I replace this with pathPositions[i][0] the loop in the main doesn't break
// (or at least I always had to manually end the program)
// but it does break if I use derivX_i[0];
double err_i = derivX_i.norm();
err_private = err_private + err_i;
}
#pragma omp critical
{
err += err_private;
}
}
err = err / static_cast<double>(n);
return err;
}
// ***************************************
int main(int argc, char **argv){
// setup data
int n = 100;
Vector9dList pathPositions;
pathPositions.reserve(n+1);
double a = 5.0;
double b = 1.0;
double c = 1.0;
Eigen::Vector3d f, u;
f << 0, 0, -1;//-p;
u << 0, 1, 0;
for(int i = 0; i<n+1; ++i){
double t = static_cast<double>(i)/static_cast<double>(n);
Eigen::Vector3d p;
double x = 2*t*a - a;
double z = -b/(a*a) * x*x + b + c;
p << x, 0, z;
Vector9d cam;
cam << p, f, u;
pathPositions.push_back(cam);
}
omp_set_num_threads(8);
//reference value
double pe = errorFunc(pathPositions);
int i = 0;
do{
double pe_i = errorFunc(pathPositions);
// there is a data race
if(std::abs(pe-pe_i) > std::numeric_limits<double>::epsilon()){
std::cout << "Difference detected at iteration " << i << " diff:" << std::abs(pe-pe_i);
break;
}
i++;
}
while(true);
}
Output for running the example multiple times
Difference detected at iteration 13 diff:1.77636e-15
Difference detected at iteration 1 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 7 diff:1.77636e-15
Difference detected at iteration 8 diff:1.77636e-15
Difference detected at iteration 6 diff:1.77636e-15
As you can see, the difference is minor but present, and it does not always happen in the same iteration, which makes it non-deterministic. There is no output if I run it single-threaded; I usually end the program after letting it run for a couple of minutes. Therefore, it has to have something to do with the parallelization.
I know I could use a reduction in this case but in the original code in my project I have to compute other things in the parallel region as well and I wanted to keep the minimal example as close to the original structure as possible.
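For reference, the reduction version mentioned here would look something like this (a sketch based on the code above; note that an OpenMP reduction also combines the per-thread partial sums in an unspecified order, so it would not make the last bits reproducible either):
double errorFuncReduced(const Vector9dList& pathPositions){
    int n = pathPositions.size()-1;
    double err = 0.0;
    // each thread accumulates privately; OpenMP combines the partial sums at the end
    #pragma omp parallel for schedule(static) reduction(+:err)
    for(int i = 0; i < n+1; ++i){
        err += derivPath(pathPositions, i).norm();
    }
    return err / static_cast<double>(n);
}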
I use OpenMP in other parts of my program too, where I am not sure whether I have a data race as well, but the structure is similar (except that I use #pragma omp parallel for and the collapse clause). I have some variable or vector I write to, but it is always either inside a critical region or each thread only writes to its own subset of the vector. Data that is used by multiple threads is always read-only. The read-only data is always a std::vector, a reference to a std::vector, or a numerical type like int or double. The vectors always contain an Eigen type or double.
There are no race conditions. You are observing a natural consequence of the fact that finite-precision floating-point addition is not associative: (A + B) + C is not always the same as A + (B + C) when A, B, and C are floating-point numbers, due to rounding errors. 1.77636E-15 x 100 (the absolute error when commenting out err = err / static_cast<double>(n);) in binary is:
0 | 01010101 | 00000000000000000001100
S exponent mantissa
As you can see, the error is in the least significant bits of the mantissa, hinting at it being the result of accumulation of rounding errors.
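A tiny standalone demonstration of that non-associativity (my example, not taken from the code above):
#include <cstdio>
int main() {
    double a = 1.0, b = 1e100, c = -1e100;
    // prints "0 vs 1": the 1.0 is lost when it is added to 1e100 first
    std::printf("%g vs %g\n", (a + b) + c, a + (b + c));
    return 0;
}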
The problem occurs here:
#pragma omp parallel default(none) shared(pathPositions, err, n)
{
...
#pragma omp critical
{
err += err_private;
}
}
The final value of err depends on the order in which the different threads arrive at the critical section and their contributions get added, which is why sometimes you see discrepancy right away and sometimes it takes a couple of iterations.
To demonstrate that it is not an OpenMP problem per se, simply modify the function to read:
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
std::vector<double> errs(n+1);
#pragma omp parallel default(none) shared(pathPositions, errs, n)
{
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
errs[i] = derivX_i.norm();
}
}
for (int i = 0; i < n+1; ++i)
err += errs[i];
err = err / static_cast<double>(n);
return err;
}
This removes the dependency on how the sub-sums are computed and added together and the return value will always be the same no matter the number of OpenMP threads.
Another version only fixes the order in which err_private are reduced into err:
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
std::vector<double> errs(omp_get_max_threads());
int nthreads;
#pragma omp parallel default(none) shared(pathPositions, errs, n, nthreads)
{
#pragma omp master
nthreads = omp_get_num_threads();
double err_private = 0;
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
double err_i = derivX_i.norm();
err_private = err_private + err_i;
}
errs[omp_get_thread_num()] = err_private;
}
for (int i = 0; i < nthreads; i++)
err += errs[i];
err = err / static_cast<double>(n);
return err;
}
Again, this code produces the same result each and every time as long as the number of threads is kept constant. The value may differ slightly (in the LSBs) with different number of threads.
You can't easily get around such discrepancies; you can only learn to live with them and take precautions to minimise their influence on the rest of the computation. In fact, you are really lucky to stumble upon this in 2021, a year in the post-x87 era, when virtually all commodity FPUs use 64-bit IEEE 754 operands, and not in the 1990s, when x87 FPUs used 80-bit operands and the result of a repeated accumulation would depend on whether you kept the value in an FPU register the whole time or periodically stored it to and loaded it back from memory, which rounds the 80-bit representation down to 64 bits.
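One common precaution when long running sums matter (a sketch, not part of the code above) is compensated (Kahan) summation, which carries the lost low-order bits along:
#include <vector>
double kahan_sum(const std::vector<double>& xs) {
    double sum = 0.0, c = 0.0;          // c accumulates the running compensation
    for (double x : xs) {
        double y = x - c;
        double t = sum + y;
        c = (t - sum) - y;              // the low-order bits that were lost in (sum + y)
        sum = t;
    }
    return sum;
}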
In the meantime, Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" is mandatory reading for anyone dealing with math on digital computers.
P.S. Although it is 2021 and we have been living in the post-x87 era for 21 years (it started when the Pentium 4 introduced the SSE2 instruction set back in 2000), if your CPU is an x86 one, you can still partake in the x87 madness. Just compile your code with -mfpmath=387 :)
I am trying to understand how the auto-parallelization works to speed up the execution of a program I am writing. I have created a simpler example:
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;
using namespace std::chrono;
class matrix
{
public:
matrix(int size, double value)
{
A.resize(size, vector<double>(size, value));
B.resize(size, vector<double>(size, value));
};
void prodScal(double valore)
{
for (int m = 0; m < A.size(); m++)
for (int n = 0; n < A.size(); n++)
{
B[m][n] = A[m][n] * valore;
};
};
double elemento(int riga, int column) { return B[riga][column]; }
protected:
vector<vector<double>> A, B;
};
void main()
{
matrix* M;
M = new matrix(1000, 174.9);
high_resolution_clock::time_point t1 = high_resolution_clock::now();
#pragma loop(hint_parallel(4))
for (int i = 0; i < 1000; i++)
M->prodScal(567.3);
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "execution time [ms]: " << duration << endl;
}
When I try to compile this code using cl main.cpp /O2 /Qpar /Qpar-report:2, I get the following message:
c:\users\utente\documents\visual studio 2017\projects\parallel\parallel\main.cpp(39) : info C5012: loop not parallelized due to reason '500'
c:\users\utente\documents\visual studio 2017\projects\parallel\parallel\main.cpp(39) : info C5012: loop not parallelized due to reason '500'
c:\users\utente\documents\visual studio 2017\projects\parallel\parallel\main.cpp(38) : info C5012: loop not parallelized due to reason '1000'
Can you help me with the correct way to parallelize this loop?
Thanks.
Auto-parallelisation efforts (or beliefs?) are a rather double-edged sword:
A machine can "guess" an intent only to a certain degree (and gives up whenever that intent does not fit its pre-wired set of transformation strategies), so do not expect any bright tricks across the large space of possible approaches. Marketing people will beat all their drums and blow all their whistles to sell auto-"thinking" PRODUCTs, but the reality is different. Even the best of the best admit that top performance comes from instruction-level profiling, and they sometimes even avoid superscalar, pipelined processor tricks so as to gain the last few nanoseconds otherwise lost at the very last level of the CPU uop instruction flow. So better never expect such expertise to appear just by using a #pragma in the belief that the "machine" will invent the smartest way ahead.
So, test it ( always and thoroughly ):
An attempt to "parallelise" an outermost for(){...} is not the best step to start with. Both performance-wise and resources-wise. Let's tackle the case from a different side, the calculation itself:
#include <iostream> // https://stackoverflow.com/questions/48033769/auto-parallelization-with-vs
#include <vector>
#include <chrono> // g++ FLAGS.ADD: -std=c++11
#include <omp.h> // g++ FLAGS.ADD: -fopenmp -lm
#define OMP_NUM_OF_THREADS 4
using namespace std;
using namespace std::chrono;
class matrix {
public:
matrix( int size, double value ) {
A.resize( size, vector<double>( size, value ) );
B.resize( size, vector<double>( size, value ) );
}
void prodScal( double aScalarVALORE ) {
// #pragma loop( hint_parallel(4) ) // matrix_(hint_parallel(4)).cpp:18:0: warning: ignoring #pragma loop [-Wunknown-pragmas]
#pragma omp parallel num_threads( OMP_NUM_OF_THREADS ) // _____ YET, AGNOSTIC TO ANY BETTER CACHE-LINE RE-USE POLICY
for ( unsigned int m = 0; m < A.size(); m++ )
for ( unsigned int n = 0; n < A.size(); n++ )
B[m][n] = A[m][n] * aScalarVALORE;
}
double elemento( int riga, int column ) { return B[riga][column]; }
protected:
vector<vector<double>> A, B;
};
int main() { // matrix_(hint_parallel(4)).cpp:31:11: error: ‘::main’ must return ‘int’
matrix* M;
M = new matrix( 1000, 174.9 );
high_resolution_clock::time_point t1 = high_resolution_clock::now();
// *******************
// DEFINITELY NOT HERE
// *******************
// #pragma loop(hint_parallel(4)) // JUST A TEST EXECUTION, NOT ANY PARALLELISATION BENEFIT FOR A PROCESS-PER-SE PERFORMANCE
for ( int i = 0; i < 1000; i++ )
M->prodScal( 567.3 );
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>( t2 - t1 ).count();
cout << "execution time [ms]: " << duration << endl;
/*
* execution time [ms]: 21601
------------------
(program exited with code: 0)
* */
return 0;
}
Once you have working code, performance tweaking to gain the maximum is the next hurdle.
Better stepping through a for(){...} can dramatically improve the total cost of MEM-fetches (paying ~ +100 [ns] for each non-cached reference) vs. CACHE re-use (paying just ~ +1.5 [ns] for any cache hit).
It depends on the global size of the matrices, on the L3, L2 and L1 cache sizes and on the cache-line lengths / associativity, not to mention additional performance skew if the code is run on a virtual device.
The static sizing and an approximate NUMA topology can be depicted using lstopo (or lscpu in the absence of the smart hwloc service).
Here you can read the cache capacities that can hold the matrix cells for any potential speedup from smart re-use (obeying the cache-line striding of the for(){...} indexes).
The best performance can be gained from tuning the for()-loop stepping, ideally close to the ILP level the CPU hardware makes available (using the additional degree of parallelism possible from CPU instruction-level parallelism, permitted for co-executed micro-instruction chains; see Intel's CPU publications for details), and best tested on the target platform (cross-compilation cannot deliver such optimisation without performance benchmarks on the target CPU architecture, ideally run in vivo on the target platform).
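To make the cache-reuse point concrete, a sketch (my illustration, with assumed names) of a contiguous-storage variant of the matrix: a single flat std::vector<double> keeps rows adjacent in memory, so the inner loop walks whole cache lines sequentially, and an OpenMP split over rows gives each thread its own contiguous block.
#include <vector>
#include <cstddef>
struct flat_matrix {
    std::size_t n;
    std::vector<double> A, B;
    flat_matrix( std::size_t size, double value )
        : n( size ), A( size * size, value ), B( size * size, value ) {}
    void prodScal( double valore ) {
        #pragma omp parallel for schedule(static)      // one contiguous band of rows per thread
        for ( long long m = 0; m < (long long)n; ++m )
            for ( std::size_t k = 0; k < n; ++k )
                B[m * n + k] = A[m * n + k] * valore;  // unit-stride access, cache-line friendly
    }
};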
Details are way beyond the limited scope of this format here on StackOverflow, but if you are interested in performance tuning you will find plenty of sources, and your own hands-on experimentation will govern your further steps. To give a sense of the potential: careful parallel code design took a large-matrix linear-algebra project processing a few [TB] of matrix data from ~126 hours down to a few minutes (not counting the loading phase that gets the matrix data into RAM), so doing the design "right" is indeed worth it.
For even higher performance, one also has to keep the O/S from evicting the expensively pre-fetched data, so ultimate performance requires even more effort than just relying on an automated "auto-parallelisation" toy.
Epilogue:
If still in doubt: if that were indeed possible, why would HPC centres still take care of and nourish HPC experts to design ultimately performant code, if the "auto-parallelisation" toys could do it better, or at least as well as, these expert nerdy geeks? They would not, if they indeed could.
UPDATED - Check Below
Will keep this as short as possible. Happy to add any more details if required.
I have some sse code for normalising a vector. I'm using QueryPerformanceCounter() (wrapped in a helper struct) to measure performance.
If I measure like this
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_sse);
NormaliseSSE( vectors_sse+j);
}
The results I get are often slower than just doing a standard normalise with 4 doubles representing a vector (testing in the same configuration).
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_dbl);
NormaliseDBL( vectors_dbl+j);
}
However, timing just the entirety of the loop like this
{
Timer t(norm_sse);
for( int j = 0; j < NUM_VECTORS; ++j ){
NormaliseSSE( vectors_sse+j );
}
}
shows the SSE code to be an order of magnitude faster, but doesn't really affect the measurements for the double version.
I've done a fair bit of experimentation and searching, and can't seem to find a reasonable answer as to why.
For example, I know there can be penalties when casting the results to float, but none of that is going on here.
Can anyone offer any insight? What is it about calling QueryPerformanceCounter between each normalise that slows the SIMD code down so much?
Thanks for reading :)
More details below:
Both normalise methods are inlined (verified in disassembly)
Running in release
32 bit compilation
Simple Vector struct
__declspec(align(16)) struct FVECTOR{
typedef float REAL;
union{
struct { REAL x, y, z, w; };
__m128 Vec;
};
};
Code to Normalise SSE:
__m128 Vec = _v->Vec;
__m128 sqr = _mm_mul_ps( Vec, Vec ); // Vec * Vec
__m128 yxwz = _mm_shuffle_ps( sqr, sqr , 0x4e ); // swap the two 64-bit halves: (z*z, w*w, x*x, y*y)
__m128 addOne = _mm_add_ps( sqr, yxwz ); // pairwise sums, e.g. lane 0 = x*x + z*z
__m128 swapPairs = _mm_shuffle_ps( addOne, addOne , 0x11 ); // the complementary partial sum for each lane
__m128 addTwo = _mm_add_ps( addOne, swapPairs ); // every lane now holds x*x + y*y + z*z + w*w
__m128 invSqrOne = _mm_rsqrt_ps( addTwo ); // approximate 1/sqrt(squared length)
_v->Vec = _mm_mul_ps( invSqrOne, Vec );
Code to normalise doubles
double len_recip = 1./sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
v->x *= len_recip;
v->y *= len_recip;
v->z *= len_recip;
Helper struct
struct Timer{
Timer( LARGE_INTEGER & a_Storage ): Storage( a_Storage ){
QueryPerformanceCounter( &PStart );
}
~Timer(){
LARGE_INTEGER PEnd;
QueryPerformanceCounter( &PEnd );
Storage.QuadPart += ( PEnd.QuadPart - PStart.QuadPart );
}
LARGE_INTEGER& Storage;
LARGE_INTEGER PStart;
};
Update
So thanks to John's comments, I think I've managed to confirm that it is QueryPerformanceCounter that's doing bad things to my SIMD code.
I added a new timer struct that uses RDTSC directly, and it seems to give results consistent with what I would expect. The result is still far slower than timing the entire loop rather than each iteration separately, but I expect that's because getting the RDTSC value involves flushing the instruction pipeline (see http://www.strchr.com/performance_measurements_with_rdtsc for more info).
struct PreciseTimer{
PreciseTimer( LARGE_INTEGER& a_Storage ) : Storage(a_Storage){
StartVal.QuadPart = GetRDTSC();
}
~PreciseTimer(){
Storage.QuadPart += ( GetRDTSC() - StartVal.QuadPart );
}
unsigned __int64 inline GetRDTSC() {
unsigned int lo, hi;
__asm {
; Flush the pipeline
xor eax, eax
CPUID
; Get RDTSC counter in edx:eax
RDTSC
mov DWORD PTR [hi], edx
mov DWORD PTR [lo], eax
}
return ( (unsigned __int64)hi << 32 ) | lo; // cast before shifting, otherwise the 32-bit shift discards the high half
}
LARGE_INTEGER StartVal;
LARGE_INTEGER& Storage;
};
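A lighter-weight alternative to the inline asm would be a sketch like this (assumes MSVC on x86/x64, where both intrinsics are declared in <intrin.h>):
#include <intrin.h>
unsigned __int64 GetRDTSC_Intrinsic() {
    int regs[4];
    __cpuid( regs, 0 );   // serialising instruction, same role as the CPUID in the asm above
    return __rdtsc();     // full 64-bit timestamp counter, no manual edx:eax packing needed
}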
When it's only the SSE code running the loop, the processor should be able to keep its pipelines full and executing a huge number of SIMD instructions per unit time. When you add the timer code within the loop, now there's a whole bunch of non-SIMD instructions, possibly less predictable, between each of the easy-to-optimize operations. It's likely that the QueryPerformanceCounter call is either expensive enough to make the data manipulation part insignificant, or the nature of the code it executes wreaks havoc with the processor's ability to keep executing instructions at the maximum rate (possibly due to cache evictions or branches that are not well-predicted).
You might try commenting out the actual calls to QPC in your Timer class and see how it performs--this may help you discover if it's the construction and destruction of the Timer objects that is the problem, or the QPC calls. Likewise, try just calling QPC directly in the loop instead of making a Timer and see how that compares.
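To gauge the cost of the timing call itself, a quick sketch (same Windows/QPC assumptions as above) that times a batch of back-to-back calls:
#include <windows.h>
#include <stdio.h>
int main() {
    const int N = 1000000;
    LARGE_INTEGER freq, start, end, tmp;
    QueryPerformanceFrequency( &freq );
    QueryPerformanceCounter( &start );
    for ( int i = 0; i < N; ++i )
        QueryPerformanceCounter( &tmp );   // the call under test
    QueryPerformanceCounter( &end );
    printf( "~%.1f ns per QueryPerformanceCounter call\n",
            1e9 * ( end.QuadPart - start.QuadPart ) / ( (double)freq.QuadPart * N ) );
    return 0;
}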
QPC is a kernel function, and calling it causes a context switch, which is inherently far more expensive and destructive than any equivalent user-mode function call, and will definitely annihilate the processor's ability to process at its normal speed. In addition to that, remember that QPC/QPF are abstractions and require their own processing, which likely involves the use of SSE itself.