Efficiency of STL algorithms with fixed size arrays - c++

In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers

double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
    sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known at compile time and all operations can be mapped directly to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirections and, even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where that region lies. While this is certainly possible in principle, I wonder whether the typical C++ compiler will actually do it.
So my question is: Do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?

As suggested, I did some measurements. For the code below, compiled with VS2013 for x64 in release mode and executed on a Win8.1 machine with an i7-2640M, the algorithm version is consistently slower by about 20% (15.6-15.7s vs 12.9-13.1s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: Using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or whether it is specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM

using namespace std;
using namespace std::chrono;

static const size_t N = 10000000; //size of the arrays
static const size_t REPS = 1000;  //number of repetitions

array<double, N> a1;
array<double, N> a2;

int main(){
    srand(10);
    for (size_t i = 0; i < N; ++i) {
        a1[i] = static_cast<double>(rand())*0.01;
        a2[i] = static_cast<double>(rand())*0.01;
    }
    double res = 0.0;
    auto start = high_resolution_clock::now();
    for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
        res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
        for (size_t t = 0; t < N; ++t) {
            res += a1[t] * a2[t];
        }
#endif
    }
    auto end = high_resolution_clock::now();
    std::cout << res << " "; // <-- necessary, so that loop isn't optimized away
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms" << std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5s vs 14s).
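One follow-up experiment that was not run above, but that could help narrow down the cause, is to hand std::inner_product raw pointers instead of std::array iterators. This is a hypothetical variant, not something measured here: if it matches the manual loop on VS2013, the overhead lies in how the iterator type is handled rather than in the algorithm itself.
// Hypothetical pointer-based variant (not measured above): a drop-in
// replacement for the USE_STD_ALGORITHM branch of the benchmark; a1, a2,
// N and res are the names from the benchmark code.
#ifdef USE_STD_ALGORITHM
        res = std::inner_product(a1.data(), a1.data() + N, a2.data(), res);
#endif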

Related

How to improve the speed of merkle root calculation in C++?

I am trying to optimise the merkle root calculation as much as possible. So far, I implemented it in Python which resulted in this question and the suggestion to rewrite it in C++.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>
std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
    unsigned char inp[64];
    int j=0;
    for (int i=0; i<32; i++)
    {
        inp[j] = a[i];
        j++;
    }
    for (int i=0; i<32; i++)
    {
        inp[j] = b[i];
        j++;
    }
    const EVP_MD *md_algo = EVP_sha256();
    unsigned int md_len = EVP_MD_size(md_algo);
    std::vector<unsigned char> out( md_len );
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
    std::vector<std::vector<unsigned char> > out;
    int len = inp_list.size();
    if (len == 1)
    {
        out.push_back(inp_list[0]);
        return out;
    }
    for (int i=0; i<len-1; i+=2)
    {
        out.push_back(
            double_sha256(inp_list[i], inp_list[i+1])
        );
    }
    if (len % 2 == 1)
    {
        out.push_back(
            double_sha256(inp_list[len-1], inp_list[len-1])
        );
    }
    return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");
    std::vector<std::vector<unsigned char> > txids;
    std::string line;
    int count = 0;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        std::vector<unsigned char> buf2;
        for (int i=31; i>=0; i--)
        {
            buf2.push_back(
                buf[i]
            );
        }
        txids.push_back(
            buf2
        );
        count++;
    }
    infile.close();
    std::cout << count << std::endl;
    std::vector<std::vector<unsigned char> > merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    std::vector<unsigned char> out0 = merkle_root_hash[0];
    std::vector<unsigned char> out;
    for (int i=31; i>=0; i--)
    {
        out.push_back(
            out0[i]
        );
    }
    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');
    return 0;
}
However, the performance is worse compared to the Python implementation (~4s):
$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e
real 0m9.245s
user 0m9.235s
sys 0m0.008s
The complete implementation and the input file are available here: test.cpp and txids.txt.
How can I improve the performance? Are the compiler optimizations enabled by default? Are there faster sha256 libraries than openssl available?
There are plenty of things you can do to optimize the code.
Here is the list of the important points:
compiler optimizations need to be enabled (using -O3 in GCC);
std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is 32), one can even define a new Hash type for clarity;
parameters should be passed by reference (C++ passes parameters by copy by default);
the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
OPENSSL_free must be called to release the memory allocated by OPENSSL_hexstr2buf;
push_back should be avoided when the size is a constant known at compile time;
using std::copy is often faster (and cleaner) than a manual copy;
std::reverse is often faster (and cleaner) than a manual loop;
the size of a hash is expected to be 32, but one can check that with assertions to be sure;
count is not needed, as it is just the size of the txids vector.
Here is the resulting code:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>

using Hash = std::array<unsigned char, 32>;

Hash double_sha256(const Hash& a, const Hash& b)
{
    assert(a.size() == 32 && b.size() == 32);
    unsigned char inp[64];
    std::copy(a.begin(), a.end(), inp);
    std::copy(b.begin(), b.end(), inp+32);
    const EVP_MD *md_algo = EVP_sha256();
    assert(EVP_MD_size(md_algo) == 32);
    unsigned int md_len = 32;
    Hash out;
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
    std::vector<Hash> out;
    int len = inp_list.size();
    out.reserve(len/2+2);
    if (len == 1)
    {
        out.push_back(inp_list[0]);
        return out;
    }
    for (int i=0; i<len-1; i+=2)
    {
        out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
    }
    if (len % 2 == 1)
    {
        out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
    }
    return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");
    std::vector<Hash> txids;
    std::string line;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        Hash buf2;
        std::copy(buf, buf+32, buf2.begin());
        std::reverse(buf2.begin(), buf2.end());
        txids.push_back(buf2);
        OPENSSL_free(buf);
    }
    infile.close();
    std::cout << txids.size() << std::endl;
    std::vector<Hash> merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    Hash out0 = merkle_root_hash[0];
    Hash out = out0;
    std::reverse(out.begin(), out.end());
    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');
    return 0;
}
On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.
This implementation spends >98% of its time in EVP_Digest. As a result, if you want faster code, you could try to find a faster hashing library, although OpenSSL should already be pretty fast. The current code already manages to compute 1.7 million hashes per second sequentially on a mainstream CPU, which is quite good. Alternatively, you can also parallelize the program using OpenMP (this is roughly 5 times faster on my 6-core machine).
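For the OpenMP route, here is a minimal sketch of one way to do it (my assumption, not the code that was actually benchmarked above): pre-size the output vector so that each pair is hashed into its own slot, because calling push_back from several threads is not safe. It reuses the Hash type and double_sha256 from the code above and needs -fopenmp:
// Hypothetical OpenMP variant of calculate_merkle_root: out is pre-sized so
// that each loop iteration writes to its own element instead of push_back.
std::vector<Hash> calculate_merkle_root_omp(const std::vector<Hash>& inp_list)
{
    int len = inp_list.size();
    if (len == 1)
        return { inp_list[0] };
    std::vector<Hash> out((len + 1) / 2);
    #pragma omp parallel for
    for (int i = 0; i < len / 2; i++)
        out[i] = double_sha256(inp_list[2 * i], inp_list[2 * i + 1]);
    if (len % 2 == 1)
        out[len / 2] = double_sha256(inp_list[len - 1], inp_list[len - 1]);
    return calculate_merkle_root_omp(out);
}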
I decided to implement the Merkle root and SHA-256 computation from scratch, with a full SHA-256 implementation, using a SIMD (Single Instruction, Multiple Data) approach, as known for SSE2, AVX2, and AVX512.
My code below for the AVX2 case is about 3.5x faster than the OpenSSL version, and 7.3x faster than Python's hashlib implementation.
Here I provide the C++ implementation; I also made a Python implementation with the same speed (because it uses the C++ code at its core). For the Python implementation, see the related post. The Python implementation is definitely easier to use than the C++ one.
My code is quite complex, both because it contains a full SHA-256 implementation and because it has a class for abstracting any SIMD operations, plus many tests.
First I provide timings, made on Google Colab because they have a quite advanced AVX2 processor there:
MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms
Ossl is the OpenSSL implementation being tested; the rest are my implementations. AVX512 gives an even bigger speedup; it is not tested here because Colab has no AVX512 support. The actual speedup depends on the processor's capabilities.
Compilation is tested both on Windows (MSVC) and Linux (Clang), using the following commands:
Windows with OpenSSL support: cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib (provide your directory with the installed OpenSSL). If OpenSSL support is not needed, use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1. Here too, instead of AVX2 you may use SSE2 or AVX512. Windows OpenSSL can be downloaded from here.
Linux Clang compilation is done through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed, and otherwise through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe. As you can see, the recent clang-12 is used; to install it, run bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here). The Linux version automatically detects the current CPU architecture and uses the best SIMD instruction set.
My code needs C++20 standard support, as it uses some advanced features that make the implementation easier.
I implemented OpenSSL support in my library only to compare timings and show that my AVX2 version is 3-3.5x faster.
I am also providing timings made on GodBolt, but only as an example of AVX-512 usage, since GodBolt CPUs have advanced AVX-512 support. Don't use GodBolt to actually measure timings, because the timings there jump up and down by up to 5x, apparently because the operating system evicts active processes. I am also providing a GodBolt playground link (it may have slightly outdated code; use the newest link to the code at the bottom of my post):
MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms
Examples of how to use my code can be seen inside the Test() function, which exercises all the functionality of my library. The code is a bit dirty because I didn't want to spend much time creating a beautiful library; rather, I wanted to make a proof of concept showing that a SIMD-based implementation can be considerably faster than the OpenSSL version.
If you really want to use my SIMD-based version instead of OpenSSL, care a lot about speed, and have questions about how to use it, please ask me in the comments or in chat.
I also didn't bother implementing a multi-core/multi-threaded version; I think it is obvious how to do that, and you should be able to implement it without difficulty.
I am providing an external link to the code below, because the code is around 51 KB in size, which exceeds the 30 KB of text allowed in a StackOverflow post.
sha256_simd.cpp

How to measure the execution time of C math.h library functions?

Using the time.h header, I'm getting the execution time of sqrt() as 2 nanoseconds (with the gcc command in a Linux terminal) and 44 nanoseconds (with the g++ command in an Ubuntu terminal). Can anyone tell me another method to measure the execution time of the math.h library functions?
Below is the code:
#include <time.h>
#include <stdio.h>
#include <math.h>

int main()
{
    time_t begin,end; // time_t is a datatype to store time values.
    time (&begin); // note time before execution
    for(int i=0;i<1000000000;i++) //using for loop till 10^9 times to make the execution time in nanoseconds
    {
        cbrt(9999999); // calling the cube root function from math library
    }
    time (&end); // note time after execution
    double difference = difftime (end,begin);
    printf ("time taken for function() %.2lf in Nanoseconds.\n", difference );
    printf(" cube root is :%f \t",cbrt(9999999));
    return 0;
}
OUTPUT:
by using **gcc**: time taken for function() 2.00 seconds.
cube root is :215.443462
by using **g++**: time taken for function() 44.00 in Nanoseconds.
cube root is:215.443462
Linux terminal result
Give or take the length of the prompt:
$ g++ t1.c
$ ./a.out
time taken for function() 44.00 in Nanoseconds.
cube root is :215.443462
$ gcc t1.c
$ ./a.out
time taken for function() 2.00 in Nanoseconds.
cube root is :215.443462
$
how to measure the execution time of c math.h library functions?
C compilers are often allowed to analyze well-known standard library functions and replace fixed code like cbrt(9999999); with 215.443462.... Further, since dropping the call in the loop does not affect the observable behavior of the code, the loop may be optimized out.
Use of volatile prevents much of this, as the compiler cannot assume there is no impact when the function call is replaced or removed.
for(int i=0;i<1000000000;i++) {
    // cbrt(9999999);
    volatile double x = 9999999.0;
    volatile double y = cbrt(x);
}
The granularity of time() is often only 1 second, and if the billion loops only result in a few seconds, consider more loops.
Code could use something like the below to factor out the loop overhead.
time_t begin,middle,end;
time (&begin);
for(int i=0;i<1000000000;i++) {
    volatile double x = 9999999.0;
    volatile double y = x;
}
time (&middle);
for(int i=0;i<1000000000;i++) {
    volatile double x = 9999999.0;
    volatile double y = cbrt(x);
}
time (&end);
double difference = difftime(end,middle) - difftime(middle,begin);
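If sub-second granularity is the sticking point and building as C++ is acceptable, another option (an illustration of mine, not code from either answer) is to time the loop with std::chrono::steady_clock and accumulate the results so the calls cannot be dropped:
// Hypothetical C++ variant: nanosecond-resolution timing via std::chrono,
// with the results accumulated so the compiler cannot remove the calls.
#include <chrono>
#include <cmath>
#include <cstdio>

int main(void)
{
    const long reps = 1000000000L;
    double sum = 0.0;
    auto begin = std::chrono::steady_clock::now();
    for (long i = 0; i < reps; i++)
        sum += std::cbrt((double)i);
    auto end = std::chrono::steady_clock::now();
    double total_ns = std::chrono::duration<double, std::nano>(end - begin).count();
    printf("cbrt(): about %.2f ns per call (sum = %f)\n", total_ns / reps, sum);
    return 0;
}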
Timing code is an art, and one part of the art is making sure that the compiler doesn't optimize your code away. For standard library functions, the compiler may well be aware of what the function does and be able to evaluate it as a constant at compile time. In your example, the call cbrt(9999999); gives two opportunities for optimization. First, the value from cbrt() can be evaluated at compile time because the argument is a constant. Second, the return value is not used, and the standard function has no side effects, so the compiler can drop the call altogether. You can avoid those problems by capturing the result (for example, by evaluating the sum of the cube roots from 0 to one billion (minus one)) and printing that value after the timing code.
tm97.c
When I compiled your code, shorn of comments, I got:
$ cat tm97.c
#include <time.h>
#include <stdio.h>
#include <math.h>

int main(void)
{
    time_t begin, end;
    time(&begin);
    for (int i = 0; i < 1000000000; i++)
    {
        cbrt(9999999);
    }
    time(&end);
    double difference = difftime(end, begin);
    printf("time taken for function() %.2lf in Nanoseconds.\n", difference );
    printf(" cube root is :%f \t", cbrt(9999999));
    return 0;
}
$ make tm97
gcc -O3 -g -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm97.c -o tm97 -L../lib -lsoq
tm97.c: In function ‘main’:
tm97.c:11:9: error: statement with no effect [-Werror=unused-value]
   11 |         cbrt(9999999);
      |         ^~~~
cc1: all warnings being treated as errors
rmk: error code 1
$
I'm using GCC 9.3.0 on a 2017 MacBook Pro running macOS Mojave 10.14.6 with XCode 11.3.1 (11C504). XCode 11.4 requires Catalina 10.15.2, but work hasn't got around to organizing support for that yet. Interestingly, when the same code is compiled by g++, it compiles without warnings (errors):
$ ln -s tm97.c tm89.cpp
$ make tm89 SXXFLAGS=-std=c++17 CXX=g++
g++ -O3 -g -I../inc -std=c++17 -Wall -Wextra -Werror -L../lib tm89.cpp -lsoq -o tm89
$
I routinely use some timing code that is available in my SOQ (Stack Overflow Questions) repository on GitHub as files timer.c and timer.h in the src/libsoq sub-directory. The code is only compiled as C code in my library, so I created a simple wrapper header, timer2.h, so that the programs below could use #include "timer2.h" and it would work OK with both C and C++ compilations:
#ifndef TIMER2_H_INCLUDED
#define TIMER2_H_INCLUDED
#ifdef __cplusplus
extern "C" {
#endif
#include "timer.h"
#ifdef __cplusplus
}
#endif
#endif /* TIMER2_H_INCLUDED */
tm29.cpp and tm31.c
This code uses the sqrt() function for testing. It accumulates the sum of the square roots. It uses the timing code from timer.h/timer.c around your timing code — type Clock and functions clk_init(), clk_start(), clk_stop(), and clk_elapsed_us() to evaluate the elapsed time in microseconds between when the clock was started and last stopped.
The source code can be compiled by either a C compiler or a C++ compiler.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += sqrt(i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for sqrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of square roots from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm41.c and tm43.cpp
This code is almost identical to the previous code, but the tested function is the cbrt() (cube root) function.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += cbrt(i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for cbrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of cube roots from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm59.c and tm61.c
This code uses fabs() instead of either sqrt() or cbrt(). It's still a function call, but it might be inlined. It invokes the conversion from int to double explicitly; without that cast, GCC complains that it should be using the integer abs() function instead.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += fabs((double)i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for fabs() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of absolute values from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm73.cpp
This file uses the original code with my timing wrapper code too. The C version doesn't compile — the C++ version does:
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"

int main(void)
{
    time_t begin, end;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (int i = 0; i < 1000000000; i++)
    {
        cbrt(9999999);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for cbrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Cube root is: %f\n", cbrt(9999999));
    return 0;
}
Timing
Using a command timecmd, which reports the start and stop times and the PID of programs, together with the timing code built into the various programs (it's a variant on the theme of the time command), I got the following results. (rmk is just an alternative implementation of make.)
$ for prog in tm29 tm31 tm41 tm43 tm59 tm61 tm73
> do rmk $prog && timecmd -ur -- $prog
> done
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm29.cpp -o tm29 -L../lib -lsoq
2020-03-28 08:47:50.040227 [PID 19076] tm29
Time taken for sqrt() is 1.00 nanoseconds (1.700296 ns).
Sum of square roots from 0 to 1000000000 is: 21081851051977.781250
2020-03-28 08:47:51.747494 [PID 19076; status 0x0000] - 1.707267s - tm29
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm31.c -o tm31 -L../lib -lsoq
2020-03-28 08:47:52.056021 [PID 19088] tm31
Time taken for sqrt() is 1.00 nanoseconds (1.679867 ns).
Sum of square roots from 0 to 1000000000 is: 21081851051977.781250
2020-03-28 08:47:53.742383 [PID 19088; status 0x0000] - 1.686362s - tm31
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm41.c -o tm41 -L../lib -lsoq
2020-03-28 08:47:53.908285 [PID 19099] tm41
Time taken for cbrt() is 7.00 nanoseconds (6.697999 ns).
Sum of cube roots from 0 to 1000000000 is: 749999999499.628418
2020-03-28 08:48:00.613357 [PID 19099; status 0x0000] - 6.705072s - tm41
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm43.cpp -o tm43 -L../lib -lsoq
2020-03-28 08:48:00.817975 [PID 19110] tm43
Time taken for cbrt() is 7.00 nanoseconds (6.614539 ns).
Sum of cube roots from 0 to 1000000000 is: 749999999499.628418
2020-03-28 08:48:07.438298 [PID 19110; status 0x0000] - 6.620323s - tm43
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm59.c -o tm59 -L../lib -lsoq
2020-03-28 08:48:07.598344 [PID 19121] tm59
Time taken for fabs() is 1.00 nanoseconds (1.114822 ns).
Sum of absolute values from 0 to 1000000000 is: 499999999067108992.000000
2020-03-28 08:48:08.718672 [PID 19121; status 0x0000] - 1.120328s - tm59
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm61.cpp -o tm61 -L../lib -lsoq
2020-03-28 08:48:08.918745 [PID 19132] tm61
Time taken for fabs() is 2.00 nanoseconds (1.117780 ns).
Sum of absolute values from 0 to 1000000000 is: 499999999067108992.000000
2020-03-28 08:48:10.042134 [PID 19132; status 0x0000] - 1.123389s - tm61
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm73.cpp -o tm73 -L../lib -lsoq
2020-03-28 08:48:10.236899 [PID 19143] tm73
Time taken for cbrt() is 0.00 nanoseconds (0.000004 ns).
Cube root is: 215.443462
2020-03-28 08:48:10.242322 [PID 19143; status 0x0000] - 0.005423s - tm73
$
I've run the programs many times; the times above are representative of what I got each time. There are a number of conclusions that can be drawn:
sqrt() (1.7 ns) is quicker than cbrt() (6.7 ns).
fabs() (1.1 ns) is quicker than sqrt() (1.7 ns).
However, fabs() gives a moderate approximation to the time taken with loop overhead and conversion from int to double.
When the result of cbrt() is not used, the compiler eliminates the loop.
When compiled with the C++ compiler, the code from the question has the loop removed altogether, leaving only the calls to time() to be measured. The result printed by clk_elapsed_us() is the time taken to execute the code between clk_start() and clk_stop() in seconds with microsecond resolution; 0.000004 is 4 microseconds of elapsed time. The value is marked in ns because when the loop executes one billion times, the elapsed time in seconds also represents the time in nanoseconds for one iteration of the loop (there are a billion nanoseconds in a second).
The times reported by timecmd are consistent with the times reported by the programs. There is the overhead of starting the process (fork() and exec()) and the I/O in the process that is included in the times reported by timecmd.
Although not shown, the timings with clang and clang++ (instead of GCC 9.3.0) are very comparable, though the cbrt() code takes about 7.5 ns per iteration instead of 6.7 ns. The timing differences for the others are basically noise.
The number suffixes are all 2-digit primes. They have no other significance except to keep the different programs separate.
As Jonathan Leffler commented, the compiler can optimize your C/C++ code. If the C code just loops from 0 to 1000 without doing anything with the counter i (that is, without printing it or using the intermediate values in any other operation, indexing, etc.), the compiler may not even generate the assembly code that corresponds to that loop. Possible arithmetic operations may even be pre-computed. For the code below:
int foo(int x) {
    return x * 5;
}

int main() {
    int x = 3;
    int y = foo(x);
    ...
    ...
}
it is not surprising for the compiler to generate just two lines of assembly code for the function foo (the compiler may even bypass calling foo and generate an inline instruction):
mov $15, %eax
; compiler will not bother multiplying 5 by 3
; but just move the pre-computed '15' to register
ret
; and then return

SSE runs slow after using AVX [duplicate]

This question already has answers here:
Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
(2 answers)
Closed last year.
I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application using GCC, with runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example:
g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx
When I first launch the program, I find that the SSE2 routines perform as normal, with a nice speed boost over the non-SSE routines (around 100% faster). After I run any AVX routine, the exact same SSE2 routine now runs much slower.
Could someone please explain what the cause of this may be?
Before the AVX routine runs, all the tests are around 80-130% faster than FPU math, as can be seen here; after the AVX routine runs, the SSE routines are much slower.
If I skip the AVX test routines I never see this performance loss.
Here is my SSE2 routine
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
    static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
    static __m128 mul = _mm_set_ps1(ratio);

    unsigned int i;
    for (i = 0; i < samples - 3; i += 4, in += 4, out += 4)
    {
        __m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
        out[0] = ((int16_t*)&con)[0];
        out[1] = ((int16_t*)&con)[2];
        out[2] = ((int16_t*)&con)[4];
        out[3] = ((int16_t*)&con)[6];
    }

    for (; i < samples; ++i, ++in, ++out)
        *out = (int16_t)lrint(*in * ratio);
}
And the AVX version of the same.
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
    static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
    static __m256 mul = _mm256_set1_ps(ratio);

    unsigned int i;
    for (i = 0; i < samples - 7; i += 8, in += 8, out += 8)
    {
        __m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
        out[0] = ((int16_t*)&con)[0];
        out[1] = ((int16_t*)&con)[2];
        out[2] = ((int16_t*)&con)[4];
        out[3] = ((int16_t*)&con)[6];
        out[4] = ((int16_t*)&con)[8];
        out[5] = ((int16_t*)&con)[10];
        out[6] = ((int16_t*)&con)[12];
        out[7] = ((int16_t*)&con)[14];
    }

    for(; i < samples; ++i, ++in, ++out)
        *out = (int16_t)lrint(*in * ratio);
}
I have also run this through valgrind which detects no errors.
Mixing AVX code and legacy SSE code incurs a performance penalty. The most reasonable solution is to execute the VZEROALL instruction after an AVX segment of code, especially just before executing SSE code.
As per Intel's diagram, the penalty when transitioning into or out of state C (legacy SSE with the upper half of the AVX registers saved) is on the order of 100 clock cycles. The other transitions cost only about one cycle.
References:
Intel: Avoiding AVX-SSE Transition Penalties
Intel® AVX State Transitions: Migrating SSE Code to AVX
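For illustration, here is a hypothetical sketch (not code from the question or the linked articles) of where the zeroing belongs: at the end of the AVX section, before any legacy SSE code runs. _mm256_zeroupper() emits VZEROUPPER; the VZEROALL instruction mentioned above is available as _mm256_zeroall().
#include <immintrin.h>

// Hypothetical AVX helper: once the 256-bit work is done, clear the upper
// halves of the YMM registers so that legacy SSE code executed afterwards
// does not pay the state-transition penalty.
void scale_floats_avx(float *data, unsigned int n, float scale)
{
    const __m256 mul = _mm256_set1_ps(scale);
    unsigned int i;
    for (i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), mul));
    for (; i < n; ++i)
        data[i] *= scale; // scalar tail
    _mm256_zeroupper();   // leave the AVX upper state clean before SSE code runs
}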

g++ -O3 optimizes better than -O2 with all extra optimizations added [duplicate]

This question already has answers here:
What's the difference between -O3 and (-O2 + flags that man gcc says -O3 adds to -O2)?
(2 answers)
Closed 8 years ago.
Here's the function I'm looking at:
template <uint8_t Size>
inline uint64_t parseUnsigned( const char (&buf)[Size] )
{
    uint64_t val = 0;
    for (uint8_t i = 0; i < Size; ++i)
        if (buf[i] != ' ')
            val = (val * 10) + (buf[i] - '0');
    return val;
}
I have a test harness which passes in all possible numbers with Size=5, left-padded with spaces. I'm using GCC 4.7.2. When I run the program under callgrind after compiling with -O3 I get:
I refs: 7,154,919
When I compile with -O2 I get:
I refs: 9,001,570
OK, so -O3 improves the performance (and I confirmed that some of the improvement comes from the above function, not just the test harness). But I don't want to completely switch from -O2 to -O3, I want to find out which specific option(s) to add. So I consult man g++ to get the list of options it says are added by -O3:
-fgcse-after-reload [enabled]
-finline-functions [enabled]
-fipa-cp-clone [enabled]
-fpredictive-commoning [enabled]
-ftree-loop-distribute-patterns [enabled]
-ftree-vectorize [enabled]
-funswitch-loops [enabled]
So I compile again with -O2 followed by all of the above options. But this gives me even worse performance than plain -O2:
I refs: 9,546,017
I discovered that adding -ftree-vectorize to -O2 is responsible for this performance degradation. But I can't figure out how to match the -O3 performance with any combination of options. How can I do this?
In case you want to try it yourself, here's the test harness (put the above parseUnsigned() definition under the #includes):
#include <cmath>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

template <uint8_t Size>
inline void increment( char (&buf)[Size] )
{
    for (uint8_t i = Size - 1; i < 255; --i)
    {
        if (buf[i] == ' ')
        {
            buf[i] = '1';
            break;
        }
        ++buf[i];
        if (buf[i] > '9')
            buf[i] -= 10;
        else
            break;
    }
}

int main()
{
    char str[5];
    memset(str, ' ', sizeof(str));
    unsigned max = std::pow(10, sizeof(str));
    for (unsigned ii = 0; ii < max; ++ii)
    {
        uint64_t result = parseUnsigned(str);
        if (result != ii)
        {
            printf("parseUnsigned(%*s) from %u: %lu\n", sizeof(str), str, ii, result);
            abort();
        }
        increment(str);
    }
}
A very similar question was already answered here: https://stackoverflow.com/a/6454659/483486
I've copied the relevant text underneath.
UPDATE: There are entries about this in the GCC WIKI:
"Is -O1 (-O2,-O3 or -Os) equivalent to individual -foptimization options?"
No. First, individual optimization options (-f*) do not enable optimization, an option -Os or -Ox with x > 0 is required. Second, the -Ox flags enable many optimizations that are not controlled by any individual -f* option. There are no plans to add individual options for controlling all these optimizations.
"What specific flags are enabled by -O1 (-O2, -O3 or -Os)?"
Varies by platform and GCC version. You can get GCC to tell you what flags it enables by doing this:
touch empty.c
gcc -O1 -S -fverbose-asm empty.c
cat empty.s

Vectors and Arrays in C++

The performance difference between C++ vectors and plain arrays has been extensively discussed, for example here and here. Discussions usually conclude that vectors and arrays are similar in terms of performance when accessed with the [] operator and the compiler is able to inline functions. That is what I expected, but I came across a case where it seems not to be true. The functionality of the lines below is quite simple: a 3D volume is taken and swept, and some kind of small 3D mask is applied a certain number of times. Depending on the VERSION macro, the volumes will be declared as vectors and accessed through the at operator (VERSION=2), declared as vectors and accessed via [] (VERSION=1), or declared as plain arrays (VERSION=0).
#include <vector>

#define NX 100
#define NY 100
#define NZ 100
#define H  1
#define C0 1.5f
#define C1 0.25f
#define T  3000

#if !defined(VERSION) || VERSION > 2 || VERSION < 0
#error "Bad version"
#endif

#if VERSION == 2
#define AT(_a_,_b_) (_a_.at(_b_))
typedef std::vector<float> Field;
#endif

#if VERSION == 1
#define AT(_a_,_b_) (_a_[_b_])
typedef std::vector<float> Field;
#endif

#if VERSION == 0
#define AT(_a_,_b_) (_a_[_b_])
typedef float* Field;
#endif

#include <iostream>
#include <omp.h>

int main(void) {
#if VERSION != 0
    Field img(NX*NY*NY);
#else
    Field img = new float[NX*NY*NY];
#endif
    double end, begin;
    begin = omp_get_wtime();

    const int csize = NZ;
    const int psize = NZ * NX;
    for(int t = 0; t < T; t++ ) {
        /* Swap the 3D volume and apply the "blurring" coefficients */
        #pragma omp parallel for
        for(int j = H; j < NY-H; j++ ) {
            for( int i = H; i < NX-H; i++ ) {
                for( int k = H; k < NZ-H; k++ ) {
                    int eindex = k+i*NZ+j*NX*NZ;
                    AT(img,eindex) = C0 * AT(img,eindex) +
                        C1 * (AT(img,eindex - csize) +
                              AT(img,eindex + csize) +
                              AT(img,eindex - psize) +
                              AT(img,eindex + psize) );
                }
            }
        }
    }
    end = omp_get_wtime();
    std::cout << "Elapsed " << (end-begin) << " s." << std::endl;

    /* Access img field so we force it to be deleted after accounting time */
#define WHATEVER 12.f
    if( img[ NZ ] == WHATEVER ) {
        std::cout << "Whatever" << std::endl;
    }

#if VERSION == 0
    delete[] img;
#endif
}
One would expect the code to perform the same with VERSION=1 and VERSION=0, but the output is as follows:
VERSION 2 : Elapsed 6.94905 s.
VERSION 1 : Elapsed 4.08626 s
VERSION 0 : Elapsed 1.97576 s.
If I compile without OMP (I've got only two cores), I get similar results:
VERSION 2 : Elapsed 10.9895 s.
VERSION 1 : Elapsed 7.14674 s
VERSION 0 : Elapsed 3.25336 s.
I always compile with GCC 4.6.3 and the compilation options -fopenmp -finline-functions -O3 (of course I remove -fopenmp when I compile without OpenMP). Is there something I'm doing wrong, for example when compiling? Or should we really expect this difference between vectors and arrays?
PS: I cannot use std::array because the compiler I depend on doesn't support the C++11 standard. With ICC 13.1.2 I get similar behavior.
I tried your code and used chrono to count the time. I compiled with clang (version 3.5) and libc++:
clang++ test.cc -std=c++1y -stdlib=libc++ -lc++abi -finline-functions -O3
The result is exactly the same for VERSION 0 and VERSION 1; there's no big difference. They are both 3.4 seconds on average (I use a virtual machine, so it is slower).
Then I tried g++ (version 4.8.1):
g++ test.cc -std=c++1y -finline-functions -O3
The result shows that, for VERSION 0, it is 4.4 seconds (roughly), and for VERSION 1, it is 5.2 seconds (roughly).
I then tried clang++ with libstdc++:
clang++ test.cc -std=c++11 -finline-functions -O3
Voila, the result is back to 3.4 seconds again.
So, it's purely an optimization "bug" of g++.
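If the remaining gap matters in practice, one common workaround (a hypothetical sketch built around the question's loop, not code from either answer; the function name and signature are mine) is to take the vector's raw storage pointer once and index through it, so the inner loop sees the same float* access pattern as the VERSION=0 build:
#include <vector>

// Hypothetical workaround: hoist the vector's storage pointer out of the loop
// and index through it, mimicking the plain-array (VERSION=0) code path.
void blur_step(std::vector<float>& img, int nx, int ny, int nz,
               int h, float c0, float c1)
{
    float* p = &img[0]; // raw view of the vector's storage (pre-C++11 friendly)
    const int csize = nz;
    const int psize = nz * nx;
    #pragma omp parallel for
    for (int j = h; j < ny - h; j++)
        for (int i = h; i < nx - h; i++)
            for (int k = h; k < nz - h; k++) {
                int e = k + i * nz + j * nx * nz;
                p[e] = c0 * p[e] +
                       c1 * (p[e - csize] + p[e + csize] +
                             p[e - psize] + p[e + psize]);
            }
}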