The performance difference between C++ vectors and plain arrays has been extensively discussed, for example here and here. Discussions usually conclude that vectors and arrays are similar in terms of performance when accessed with the [] operator and the compiler is allowed to inline functions. That is what I expected, but I came across a case where it seems not to be true. The functionality of the code below is quite simple: a 3D volume is swept and some kind of small 3D "blurring" mask is applied to it a certain number of times. Depending on the VERSION macro, the volume is declared as a vector and accessed through the at() operator (VERSION=2), declared as a vector and accessed via [] (VERSION=1), or declared as a plain array (VERSION=0).
#include <vector>
#define NX 100
#define NY 100
#define NZ 100
#define H 1
#define C0 1.5f
#define C1 0.25f
#define T 3000
#if !defined(VERSION) || VERSION > 2 || VERSION < 0
#error "Bad version"
#endif
#if VERSION == 2
#define AT(_a_,_b_) (_a_.at(_b_))
typedef std::vector<float> Field;
#endif
#if VERSION == 1
#define AT(_a_,_b_) (_a_[_b_])
typedef std::vector<float> Field;
#endif
#if VERSION == 0
#define AT(_a_,_b_) (_a_[_b_])
typedef float* Field;
#endif
#include <iostream>
#include <omp.h>
int main(void) {
#if VERSION != 0
Field img(NX*NY*NZ);
#else
Field img = new float[NX*NY*NZ]; /* note: unlike the vector versions, this buffer is not zero-initialized */
#endif
double end, begin;
begin = omp_get_wtime();
const int csize = NZ;
const int psize = NZ * NX;
for(int t = 0; t < T; t++ ) {
/* Sweep the 3D volume and apply the "blurring" coefficients */
#pragma omp parallel for
for(int j = H; j < NY-H; j++ ) {
for( int i = H; i < NX-H; i++ ) {
for( int k = H; k < NZ-H; k++ ) {
int eindex = k+i*NZ+j*NX*NZ;
AT(img,eindex) = C0 * AT(img,eindex) +
C1 * (AT(img,eindex - csize) +
AT(img,eindex + csize) +
AT(img,eindex - psize) +
AT(img,eindex + psize) );
}
}
}
}
end = omp_get_wtime();
std::cout << "Elapsed "<< (end-begin) <<" s." << std::endl;
/* Access img so we force it to be deleted only after accounting the time */
#define WHATEVER 12.f
if( img[ NZ ] == WHATEVER ) {
std::cout << "Whatever" << std::endl;
}
#if VERSION == 0
delete[] img;
#endif
}
One would expect the code to perform the same with VERSION=1 and VERSION=0, but the output is as follows:
VERSION 2 : Elapsed 6.94905 s.
VERSION 1 : Elapsed 4.08626 s.
VERSION 0 : Elapsed 1.97576 s.
If I compile without OpenMP (I've got only two cores), I get a similar pattern:
VERSION 2 : Elapsed 10.9895 s.
VERSION 1 : Elapsed 7.14674 s.
VERSION 0 : Elapsed 3.25336 s.
I always compile with GCC 4.6.3 and the compilation options -fopenmp -finline-functions -O3 (of course I remove -fopenmp when compiling without OpenMP). Is there something I am doing wrong, for example when compiling? Or should we really expect this difference between vectors and arrays?
PS: I cannot use std::array because the compiler I depend on doesn't support the C++11 standard. With ICC 13.1.2 I get similar behavior.
I tried your code and used chrono to measure the time.
I compiled with clang (version 3.5) and libc++:
clang++ test.cc -std=c++1y -stdlib=libc++ -lc++abi -finline-functions -O3
The result is exactly the same for VERSION 0 and VERSION 1; there is no big difference. Both average 3.4 seconds (I use a virtual machine, so everything is slower).
Then I tried g++ (version 4.8.1):
g++ test.cc -std=c++1y -finline-functions -O3
The results show that VERSION 0 takes roughly 4.4 seconds and VERSION 1 roughly 5.2 seconds.
I then tried clang++ with libstdc++:
clang++ test.cc -std=c++11 -finline-functions -O3
Voila, the result is back to 3.4 seconds again.
So it's purely an optimization "bug" of g++.
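If you are stuck on that g++, one common workaround is to take a raw pointer to the vector's storage before the hot loop, so indexing compiles exactly like the plain-array version even when operator[] is not fully inlined. A minimal sketch, assuming the same macros and constants as in the question (not tested on GCC 4.6.3):
std::vector<float> img(NX * NY * NZ);
float *p = &img[0];  /* vector::data() needs C++11; &img[0] works everywhere */
const int csize = NZ;
const int psize = NZ * NX;
for (int t = 0; t < T; t++) {
    #pragma omp parallel for
    for (int j = H; j < NY - H; j++)
        for (int i = H; i < NX - H; i++)
            for (int k = H; k < NZ - H; k++) {
                int eindex = k + i * NZ + j * NX * NZ;
                /* same kernel as VERSION 1, but through the raw pointer */
                p[eindex] = C0 * p[eindex] +
                            C1 * (p[eindex - csize] + p[eindex + csize] +
                                  p[eindex - psize] + p[eindex + psize]);
            }
}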
Related
I am trying to understand/test OpenMP with GPU offload. However, I am confused because some examples/info (1, 2, 3) on the internet are analogous or similar to mine, but my example does not work as I think it should. I am using g++ 9.4 on Ubuntu 20.04 LTS and have also installed gcc-9-offload-nvptx.
My example, which does not work but is similar to this one:
#include <iostream>
#include <vector>
int main(int argc, char *argv[]) {
typedef double myfloat;
if (argc != 2) exit(1);
size_t size = atoi(argv[1]);
printf("Size: %zu\n", size);
std::vector<myfloat> data_1(size, 2);
myfloat *data1_ptr = data_1.data();
myfloat sum = -1;
#pragma omp target map(tofrom:sum) map(from: data1_ptr[0:size])
#pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
for (size_t i = 0; i < size; ++i) {
for (size_t j = 0; j < size; ++j) {
myfloat term1 = data1_ptr[i] * i;
sum += term1 / (1 + term1 * term1 * term1);
}
}
printf("sum: %.2f\n", sum);
return 0;
}
When I compile it with g++ main.cpp -o test -fopenmp -fcf-protection=none -fno-stack-protector I get the following:
stack_example.cpp: In function ‘main._omp_fn.0.hsa.0’:
cc1plus: warning: could not emit HSAIL for the function [-Whsa]
cc1plus: note: support for HSA does not implement non-gridified OpenMP parallel constructs.
It does compile, but when I run it with
./test 10000
the printed sum is still -1. I think the sum value computed on the GPU is not returned properly, but I explicitly map it, so shouldn't it be returned? Or what am I doing wrong?
EDIT 1
I was asked to modify my code because there was a historically grown, redundant for loop, and also because sum was initialized with -1. I fixed that and also compiled it with gcc-11, which did not emit a warning or note as gcc-9 did. However, the behavior is similar:
Size: 100
Number of devices: 2
sum: 0.00
I checked with nvtop: the GPU is used. Because there are two GPUs, I can even switch the device, and the switch is visible in nvtop.
Solution:
The fix is very easy and stupid. Changing
map(from: data1_ptr[0:size])
to
map(tofrom: data1_ptr[0:size])
did the trick.
Even though I am not writing to the array on the host afterwards, this turned out to be the problem: map(from:) only copies the array back from the device and never copies the initialized host data to it, so the kernel was reading uninitialized device memory.
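For reference, a sketch of the corrected offload region (identical to the code above except for the map clause):
/* map(tofrom:) copies the initialized host array to the device before the
   kernel runs and copies it back afterwards; map(from:) skips the initial copy. */
#pragma omp target map(tofrom: sum) map(tofrom: data1_ptr[0:size])
#pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
for (size_t i = 0; i < size; ++i) {
    for (size_t j = 0; j < size; ++j) {
        myfloat term1 = data1_ptr[i] * i;
        sum += term1 / (1 + term1 * term1 * term1);
    }
}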
I am trying to introduce OpenMP into my C++ code to improve performance, using the simple case shown below:
#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>
using std::cout;
using std::endl;
#define NUM 100000
int main()
{
double data[NUM] __attribute__ ((aligned (128)));
#ifdef _OPENMP
auto t1 = omp_get_wtime();
#else
auto t1 = std::chrono::steady_clock::now();
#endif
for(long int k=0; k<100000; ++k)
{
#pragma omp parallel for schedule(static, 16) num_threads(4)
for(long int i=0; i<NUM; ++i)
{
data[i] = cos(sin(i*i+ k*k));
}
}
#ifdef _OPENMP
auto t2 = omp_get_wtime();
auto duration = t2 - t1;
cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
#else
auto t2 = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
#endif
double tempsum = 0.;
for(long int i=0; i<NUM; ++i)
{
int nextind = (i == 0 ? 0 : i-1);
tempsum += i + sin(data[i]) + cos(data[nextind]);
}
cout<<"Raw data sum: "<<tempsum<<endl;
return 0;
}
The test accesses a tightly looped array (NUM = 100000 doubles) and changes its elements in either a parallel or a non-parallel way.
Build as
g++ -o test test.cpp
or
g++ -o test test.cpp -fopenmp
The program reported results as:
No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09
OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09
Intel 10th CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.
I suspect that false sharing in the cache lines leads to the bad performance of the OpenMP version of the code.
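One quick, hedged way to test that suspicion is to drop the chunk size on the inner loop above: schedule(static) with no chunk gives each thread a single contiguous block, so neighbouring threads can share at most one cache line at a block boundary.
#pragma omp parallel for schedule(static) num_threads(4)
for(long int i=0; i<NUM; ++i)
{
    data[i] = cos(sin(i*i+ k*k));  // same body as above; only the schedule changed
}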
I have updated the test results after increasing the loop size; the OpenMP version now runs faster, as expected.
Thank you!
Since you're writing C++, use the C++ random number generator, which is threadsafe, unlike the C legacy one you're using.
Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
You should touch all your data once before you do the timed loop. That way you ensure that pages are instantiated and that data is in or out of cache, depending on its size.
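A minimal warm-up sketch for the code above, using the same scheduling so the first touch matches the timed loop's thread-to-page mapping:
#pragma omp parallel for schedule(static, 16) num_threads(4)
for(long int i=0; i<NUM; ++i)
{
    data[i] = 0.0;  // instantiate pages before the timer starts
}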
Your loop is pretty short.
rand() is not thread-safe (see here). Use an array of C++ random-number generators instead, one for each thread. See std::uniform_int_distribution for details.
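A self-contained sketch of that idea (the seeds and distribution bounds are arbitrary placeholders):
#include <omp.h>
#include <random>
#include <vector>
int main()
{
    // One independently seeded engine per thread, so no state is shared.
    std::vector<std::mt19937> engines;
    for (int t = 0; t < omp_get_max_threads(); ++t)
        engines.emplace_back(1234 + t);  // distinct seed per thread
    std::vector<int> data(100000);
    #pragma omp parallel
    {
        std::mt19937& eng = engines[omp_get_thread_num()];
        std::uniform_int_distribution<int> dist(0, 99);
        #pragma omp for
        for (long i = 0; i < (long)data.size(); ++i)
            data[i] = dist(eng);
    }
    return 0;
}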
You can drop #ifdef _OPENMP variations in your code. In a Bash terminal, you can call your application as OMP_NUM_THREADS=1 test. See here for details.
So you can remove num_threads(4) as well because you can explicitly specify the amount of parallelism.
Use Google Benchmark or command-line parameters so you can parameterize the number of threads and array size.
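A hypothetical Google Benchmark harness for that (BM_Fill and the size range are made up for illustration; the thread count comes from OMP_NUM_THREADS as described above):
#include <benchmark/benchmark.h>
#include <cmath>
#include <vector>
static void BM_Fill(benchmark::State& state)
{
    std::vector<double> data(state.range(0));  // array size is a benchmark parameter
    for (auto _ : state) {
        #pragma omp parallel for
        for (long i = 0; i < (long)data.size(); ++i)
            data[i] = std::cos(std::sin((double)i * i));
        benchmark::DoNotOptimize(data.data());
    }
}
BENCHMARK(BM_Fill)->Range(1 << 10, 1 << 20);
BENCHMARK_MAIN();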
From here, I expect you will see:
The performance when you call OMP_NUM_THREADS=1 test is close to your non-OpenMP version.
The array of C++ RNG generators is faster than calling rand() from multiple threads.
The multi-threaded version is still slower than the single-threaded version when using a 10,000 element array.
I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, the code still runs on the default device, i.e. my CPU.
I have installed g++ 7.2.0 with CUDA support (it is on a cluster that I use). When I run the code below, it shows that it can see the 8 GPUs, but when I try to offload it reports that it is still running on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#define n 10000
#define m 10000
using namespace std;
int main()
{
double tol = 1E-10;
double err = 1;
size_t iter_max = 10;
size_t iter = 0;
bool notGPU[1] = {true};
// static: two 10000x10000 double arrays (~800 MB each) would overflow the
// stack as locals, and static storage also zero-initializes A before it is read.
static double Anew[n][m];
static double A[n][m];
int target[1];
target[0] = omp_get_initial_device();
cout << "Total Devices: " << omp_get_num_devices() << endl;
cout << "Target: " << target[0] << endl;
for (int iter = 0; iter < iter_max; iter++){
#pragma omp target
{
err = 0.0;
#pragma omp parallel for reduction(max:err)
for (int j = 1; j < n-1; ++j){
target[0] = omp_is_initial_device();
for (int i = 1; i < m-1; i++){
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
err = fmax(err, fabs(Anew[j][i] - A[j][i]));
}
}
}
}
if (target[0]){
cout << "not on GPU" << endl;
} else{
cout << "On GPU" << endl;}
return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018, it just didn't work. At that time, target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked on Nvidia's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
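A minimal sketch of that suggestion, assuming device 0 is one of the GPUs; omp_is_initial_device() reports where the region actually ran:
#include <omp.h>
#include <cstdio>
int main()
{
    int on_host = 1;
    #pragma omp target device(0) map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();  // 0 when running on the GPU
    }
    std::printf(on_host ? "not on GPU\n" : "On GPU\n");
    return 0;
}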
After some performance experiments, it seemed that using char16_t arrays can sometimes boost performance by up to 40-50%, yet using std::u16string without any copying or allocation should be just as fast as C arrays. However, benchmarks show the opposite.
Here is the code I've written for benchmark (it uses Google Benchmark lib):
#include "benchmark/benchmark.h"
#include <string>
static std::u16string str;
static char16_t *str2;
static void BM_Strings(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str[i]);
}
}
}
static void BM_CharArray(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str2[i]);
}
}
}
BENCHMARK(BM_Strings);
BENCHMARK(BM_CharArray);
static void init(){
str = u"Various applications of randomness have led to the development of several different methods ";
str2 = (char16_t *) str.c_str();
}
int main(int argc, char** argv) {
init();
::benchmark::Initialize(&argc, argv);
::benchmark::RunSpecifiedBenchmarks();
}
It shows the following result:
Run on (8 X 2200 MHz CPU s)
2017-07-11 23:05:57
Benchmark Time CPU Iterations
---------------------------------------------------
BM_Strings 1832 ns 1830 ns 365938
BM_CharArray 928 ns 926 ns 712577
I'm using clang (Apple LLVM version 8.1.0 (clang-802.0.42)) on a Mac. With optimizations turned on, the gap is smaller but still noticeable:
Benchmark Time CPU Iterations
---------------------------------------------------
BM_Strings 242 ns 241 ns 2906615
BM_CharArray 161 ns 161 ns 4552165
Can someone explain what's going on here and why there is a difference?
Update (mixing the order and adding a few warm-up steps):
Benchmark Time CPU Iterations
---------------------------------------------------
BM_CharArray 670 ns 665 ns 903168
BM_Strings 856 ns 854 ns 817776
BM_CharArray 166 ns 166 ns 4369997
BM_Strings 225 ns 225 ns 3149521
For reference, these are the compile flags I'm using:
/usr/bin/clang++ -I{some includes here} -O3 -std=c++14 -stdlib=libc++ -Wall -Wextra -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -O3 -fsanitize=address -Werror -o CMakeFiles/BenchmarkString.dir/BenchmarkString.cpp.o -c test/benchmarks/BenchmarkString.cpp
Because of the way libc++ implements the small string optimization, on every dereference it needs to check whether the string contents are stored in the string object itself or on the heap. Because the indexing is wrapped in benchmark::DoNotOptimize, it needs to perform this check every time a character is accessed. When accessing the string data via a pointer the data is always external, and so requires no check.
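A hedged way to confirm this with the benchmark above: read the same string through a pointer fetched once per iteration, so the inline-buffer-vs-heap branch is taken once at data() rather than on every access (BM_StringsViaPointer is a name invented for this sketch):
static void BM_StringsViaPointer(benchmark::State &state) {
    while (state.KeepRunning()) {
        const char16_t *p = str.data();  // SSO check happens once, here
        for (size_t i = 0, n = str.size(); i < n; i++) {
            benchmark::DoNotOptimize(p[i]);
        }
    }
}
BENCHMARK(BM_StringsViaPointer);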
Interestingly, I am unable to reproduce your results; I can barely detect a difference between the two.
The (incomplete) code I used is shown here:
// Note: hol::StdTimer, hol::random_number and RESULT are helpers from the
// author's own utility library; standard includes added for completeness.
#include <algorithm>
#include <iterator>
#include <string>
hol::StdTimer timer;
using index_type = std::size_t;
index_type const N = 100'000'000;
index_type const SIZE = 1024;
static std::u16string s16;
static char16_t const* p16;
int main(int, char** argv)
{
std::generate_n(std::back_inserter(s16), SIZE,
[]{ return (char)hol::random_number((int)'A', (int)'Z'); });
p16 = s16.c_str();
unsigned sum;
{
sum = 0;
timer.start();
for(index_type n = 0; n < N; ++n)
for(index_type i = 0; i < SIZE; ++i)
sum += s16[i];
timer.stop();
RESULT("string", sum, timer);
}
{
sum = 0;
timer.start();
for(std::size_t n = 0; n < N; ++n)
for(std::size_t i = 0; i < SIZE; ++i)
sum += p16[i];
timer.stop();
RESULT("array ", sum, timer);
}
}
Output:
string: (670240768) 17.575232 secs
array : (670240768) 17.546145 secs
Compiler:
GCC 7.1
g++ -std=c++14 -march=native -O3 -D NDEBUG
With a plain char16_t array you access memory directly, while with string you go through the overloaded operator[]:
reference
operator[](size_type __pos)
{
#ifdef _GLIBCXX_DEBUG_PEDANTIC
__glibcxx_check_subscript(__pos);
#else
// as an extension v3 allows s[s.size()] when s is non-const.
_GLIBCXX_DEBUG_VERIFY(__pos <= this->size(),
_M_message(__gnu_debug::__msg_subscript_oob)
._M_sequence(*this, "this")
._M_integer(__pos, "__pos")
._M_integer(this->size(), "size"));
#endif
return _M_base()[__pos];
}
and _M_base() is:
_Base& _M_base() { return *this; }
Now, my guesses are that either:
_M_base() does not get inlined, and then you take a performance hit because every read involves an extra operation to fetch the function address,
or
one of those subscript checks happens.
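Note that the quoted operator[] is libstdc++'s debug-mode wrapper; those subscript checks are only compiled in when debug mode is requested. So one way to test the second guess (assuming a g++/libstdc++ build, with bench.cpp standing in for the benchmark source) is to compare the two configurations directly:
g++ -std=c++14 -O3 bench.cpp                      # default build, no debug checks
g++ -std=c++14 -O3 -D_GLIBCXX_DEBUG bench.cpp     # debug-mode checks enabled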
In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error-free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known at compile time, and all operations can be mapped directly to native processor instructions, the compiler should easily be able to generate "optimal" machine code from this (loop unrolling, vectorization, FMA instructions, ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirection, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where that region lies. While this is certainly possible in principle, I wonder whether the typical C++ compiler will actually do it.
So my question is: Do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?
As suggested, I did some measurements. For the code below, compiled with VS2013 for x64 in release mode and executed on a Win8.1 machine with an i7-2640M, the algorithm version is consistently slower by about 20% (15.6-15.7 s vs. 12.9-13.1 s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or specific to my platform, compiler, and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>
#define USE_STD_ALGORITHM
using namespace std;
using namespace std::chrono;
static const size_t N = 10000000; //size of the arrays
static const size_t REPS = 1000; //number of repetitions
array<double, N> a1;
array<double, N> a2;
int main(){
srand(10);
for (size_t i = 0; i < N; ++i) {
a1[i] = static_cast<double>(rand())*0.01;
a2[i] = static_cast<double>(rand())*0.01;
}
double res = 0.0;
auto start=high_resolution_clock::now();
for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
for (size_t t = 0; t < N; ++t) {
res+= a1[t] * a2[t];
}
#endif
}
auto end = high_resolution_clock::now();
std::cout << res << " "; // <-- necessary, so that loop isn't optimized away
std::cout << duration_cast<milliseconds>(end - start).count() <<" ms"<< std::endl;
}
/*
 * Update: Results (Ubuntu 14.04, Haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5 s vs 14 s).