I wrote a test to measure the cost of C++ exceptions with threads.
#include <cstdlib>
#include <iostream>
#include <vector>
#include <thread>
static const int N = 100000;
static void doSomething(int& n)
{
--n;
throw 1;
}
static void throwManyManyTimes()
{
int n = N;
while (n)
{
try
{
doSomething(n);
}
catch (int n)
{
switch (n)
{
case 1:
continue;
default:
std::cout << "error" << std::endl;
std::exit(EXIT_FAILURE);
}
}
}
}
int main(void)
{
int nCPUs = std::thread::hardware_concurrency();
std::vector<std::thread> threads(nCPUs);
for (int i = 0; i < nCPUs; ++i)
{
threads[i] = std::thread(throwManyManyTimes);
}
for (int i = 0; i < nCPUs; ++i)
{
threads[i].join();
}
return EXIT_SUCCESS;
}
Here's the C version that I initially wrote for fun.
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <glib.h>
#define N 100000
static GPrivate jumpBuffer;
static void doSomething(volatile int *pn)
{
jmp_buf *pjb = g_private_get(&jumpBuffer);
--*pn;
longjmp(*pjb, 1);
}
static void *throwManyManyTimes(void *p)
{
jmp_buf jb;
volatile int n = N;
(void)p;
g_private_set(&jumpBuffer, &jb);
while (n)
{
switch (setjmp(jb))
{
case 0:
doSomething(&n);
case 1:
continue;
default:
printf("error\n");
exit(EXIT_FAILURE);
}
}
return NULL;
}
int main(void)
{
int nCPUs = g_get_num_processors();
GThread *threads[nCPUs];
int i;
for (i = 0; i < nCPUs; ++i)
{
threads[i] = g_thread_new(NULL, throwManyManyTimes, NULL);
}
for (i = 0; i < nCPUs; ++i)
{
g_thread_join(threads[i]);
}
return EXIT_SUCCESS;
}
The C++ version runs much more slowly than the C version.
$ g++ -O3 -g -std=c++11 test.cpp -o cpp-test -pthread
$ gcc -O3 -g -std=c89 test.c -o c-test `pkg-config glib-2.0 --cflags --libs`
$ time ./cpp-test
real 0m1.089s
user 0m2.345s
sys 0m1.637s
$ time ./c-test
real 0m0.024s
user 0m0.067s
sys 0m0.000s
So I ran the callgrind profiler.
For cpp-test, __cxa_throw was called exactly 400,000 times with self-cost of 8,000,032.
For c-test, __longjmp_chk was called exactly 400,000 times with self-cost of 5,600,000.
The whole cost of cpp-test is 4,048,441,756.
The whole cost of c-test is 60,417,722.
I guess something much more than simply saving the state of the jump-point and later resuming is done with C++ exceptions. I couldn't test with larger N because the callgrind profiler will run forever for the C++ test.
What is the extra cost involved in C++ exceptions making it many times slower than the setjmp/longjmp pair at least in this example?
This is by design.
C++ exceptions are expected to be exceptional in nature and are optimized accordingly. The program is compiled to be most efficient when an exception does not happen.
You can verify this by commenting out the exception from your tests.
In C++:
//throw 1;
$ g++ -O3 -g -std=c++11 test.cpp -o cpp-test -pthread
$ time ./cpp-test
real 0m0.003s
user 0m0.004s
sys 0m0.000s
In C:
/*longjmp(*pjb, 1);*/
$ gcc -O3 -g -std=c89 test.c -o c-test `pkg-config glib-2.0 --cflags --libs`
$ time ./c-test
real 0m0.008s
user 0m0.012s
sys 0m0.004s
What is the extra cost involved in C++ exceptions making it many times slower than the setjmp/longjmp pair at least in this example?
g++ implements zero-cost model exceptions, which have no effective overhead* when an exception is not thrown. Machine code is produced as if there were no try/catch block.
The price of this zero-overhead model is paid when an exception is thrown: a table lookup must be performed on the program counter to find the code that performs stack unwinding. This moves the entire cost of the try/catch implementation into the code performing the throw.
Your extra cost is a table lookup.
*Some minor timing voodoo may occur, as the presence of a PC lookup table may affect memory layout, which may affect CPU cache misses.
Related
I have a C++ program that uses OpenMP, and I need to port it into a DLL so I can call it from Python. It returns an array of double values, which are calculated using a lot of for loops with OpenMP pragmas. I was doubtful that this would work, so I started with a small test program that calculates the value of Pi in a loop with different precision values; that way I could measure performance and check that OpenMP works properly. The plain (non-OpenMP) implementation works fine from both Python and C++. The OpenMP variant, however, gives a runtime error in Python (exception: access violation writing 0x000000000000A6C8) and crashes without an error in C++. The OpenMP variant also works fine if it is built as a regular executable rather than a DLL. The DLL is built with a makefile. The app that uses the DLL is built into an executable with g++ with no flags (its source code is in UnitMain.cpp). All the relevant code and the Makefile are below (I left out some files and functions for brevity).
UPD: I tried Microsoft compiler and it works, also I tested a linux dynamic library on WSL/g++ and it also works. Looks like it is Windows gcc specific, I'll try another version of gcc (btw my current version is this):
Thread model: posix gcc version 8.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
UnitFunctions.cpp
#include "UnitFunctions.h"
#include <omp.h>
#include <stdio.h>
#include <string.h>
typedef long long int64_t;
double pi(int64_t n) {
double sum = 0.0;
int64_t sign = 1;
for (int64_t i = 0; i < n; ++i) {
sum += sign/(2.0*i+1.0);
sign *= -1;
}
return 4.0*sum;
}
void calcPiOmp(double* arr, int N) {
int64_t base = 10e5;
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
arr[i] = pi(base+i);
}
}
UnitMain.cpp
#include <windows.h>
#include <iostream>
using namespace std;
struct DllHandle
{
DllHandle(const char * const filename)
: h(LoadLibrary(filename)) {}
~DllHandle() { if (h) FreeLibrary(h); }
const HINSTANCE Get() const { return h; }
private:
HINSTANCE h;
};
int main()
{
const DllHandle h("Functions.DLL");
if (!h.Get())
{
MessageBox(0,"Could not load DLL","UnitCallDll",MB_OK);
return 1;
}
typedef const void (*calcPiOmp_t) (double*, int);
const auto calcPiOmp = reinterpret_cast<calcPiOmp_t>(GetProcAddress(h.Get(), "calcPiOmp"));
double arr[80];
calcPiOmp(arr, 80);
cout << arr[0] << endl;
return 0;
}
Makefile
all: UnitEntryPoint.o UnitFunctions.o
g++ -m64 -fopenmp -s -o Functions.dll UnitEntryPoint.o UnitFunctions.o
UnitEntryPoint.o: UnitEntryPoint.cpp
g++ -m64 -fopenmp -c UnitEntryPoint.cpp
UnitFunctions.o: UnitFunctions.cpp
g++ -m64 -fopenmp -c UnitFunctions.cpp
A Python script
import numpy as np
import ctypes as ct
cpp_fun = ct.CDLL('./Functions.dll')
cpp_fun.calcPiNaive.argtypes = [np.ctypeslib.ndpointer(), ct.c_int]
cpp_fun.calcPiOmp.argtypes = [np.ctypeslib.ndpointer(), ct.c_int]
arrOmp = np.zeros(N).astype('float64')
cpp_fun.calcPiOmp(arrOmp, N)
Note: This is not a duplicate of existing std::ios::sync_with_stdio(false) questions. I have gone through all of them and yet I am unable to make cout behave as fast as printf. Example code and evidence shown below.
I have three source code files:
// ex1.cpp
#include <cstdio>
#include <chrono>
int main()
{
auto t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10000000; i++) {
printf("%d\n", i);
}
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
fprintf(stderr, "%lld\n", duration.count());
}
// ex2.cpp
#include <iostream>
#include <chrono>
int main()
{
auto t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10000000; i++) {
std::cout << i << '\n';
}
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
std::cerr << duration.count() << '\n';
}
// ex3.cpp
#include <iostream>
#include <chrono>
int main()
{
std::ios::sync_with_stdio(false);
std::cin.tie(nullptr);
auto t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10000000; i++) {
std::cout << i << '\n';
}
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
std::cerr << duration.count() << '\n';
}
I am not going to ever mix cstdio and iostream in my code, so the hacks used in ex3.cpp are okay for me.
I compile them with clang++ on macOS with a solid-state drive.
clang++ -std=c++11 -O2 -Wall -Wextra -pedantic ex1.cpp -o ex1
clang++ -std=c++11 -O2 -Wall -Wextra -pedantic ex2.cpp -o ex2
clang++ -std=c++11 -O2 -Wall -Wextra -pedantic ex3.cpp -o ex3
Now I run them and time them.
$ time ./ex1 > out.txt
1282
real 0m1.294s
user 0m1.217s
sys 0m0.071s
$ time ./ex1 > out.txt
1299
real 0m1.333s
user 0m1.221s
sys 0m0.072s
$ time ./ex1 > out.txt
1277
real 0m1.295s
user 0m1.214s
sys 0m0.070s
$ time ./ex2 > out.txt
3102
real 0m3.371s
user 0m3.037s
sys 0m0.075s
$ time ./ex2 > out.txt
3153
real 0m3.164s
user 0m3.073s
sys 0m0.075s
$ time ./ex2 > out.txt
3136
real 0m3.150s
user 0m3.051s
sys 0m0.077s
$ time ./ex3 > out.txt
3118
real 0m3.513s
user 0m3.045s
sys 0m0.080s
$ time ./ex3 > out.txt
3113
real 0m3.124s
user 0m3.042s
sys 0m0.077s
$ time ./ex3 > out.txt
3095
real 0m3.107s
user 0m3.029s
sys 0m0.073s
The results are quite similar even if I redirect the output to /dev/null. The results are quite similar with -O3 optimization level too.
Why are both ex3 and ex2 slower than ex1? Is it possible to use std::cout in any way that gives speed comparable to printf?
I am using the C++ standard library in some C++ code and this makefile:
CC=g++
CXXFLAGS=-Wall -Werror -ggdb3 -std=c++11 -pedantic $(OTHERFLAGS)
cpp_sort: cpp_sort.o
g++ -o $@ $(CXXFLAGS) $^
clean:
rm -rf *.o cpp_sort *~
The source code:
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
void get_input(vector<int>& items, int size) {
for (int i = 0; i < size; ++i) {
int element;
cin >> element;
items.push_back(element);
}
}
void cpp_sort(vector<int>& items) {
sort(items.begin(), items.end());
}
void print_array(vector<int>& items) {
for (auto& item : items) {
cout << item << ' ';
}
cout << endl;
}
int main() {
int size;
cin >> size;
vector<int> items;
items.reserve(size);
get_input(items, size);
cpp_sort(items);
print_array(items);
}
I call make like this:
make OTHERFLAGS=-pg
run the program (where large.txt is a long list of integers):
./cpp_sort <large.txt
and view the profiling information:
gprof ./cpp_sort
Which is fine and it does work, but the calling of my functions is obscured by all the C++ standard library function calls. Is there a way to exclude the standard library internal function calls?
This code here runs fine on -O but fails to exit on -O2 and -Os.
#include <iostream>
int main() {
int ctr = 2000000000;
while (ctr++ > 0) {
if (ctr % 100000000 == 0) {
std::cout << ctr << '\n';
}
}
return 0;
}
I know that it has something to do with integer overflow, but I thought that was defined behavior. In case it may be relevant, I'm compiling on a Linux virtual machine on a Windows 64-bit computer.
EDIT: Signed integer overflow is undefined behavior, not defined behavior. So then what optimization or combination of optimizations causes the problem? The question is: "Why does the code work fine on -O but fail on -O2 and -Os?"
Previously I posted a question which was a bit complicated in some sense. In this question, I made a simple example to generate the same problem with Armadillo library which is supposed to be very fast.
Consider the following code with two main functions, only one of which is enabled at a time:
#include <armadillo>
const int runmax=50000000;
arma::vec::fixed<3u> x,y;
double a,b;
void init()
{
x<<0.2<<arma::endr<<-0.3<<arma::endr<<0.1;
y<<1<<arma::endr<<1<<arma::endr<<1;
a=0.3;
b=a*a;
}
template<unsigned N>
void scaledsum(double a,arma::vec::fixed<N> &x,double b,arma::vec::fixed<N> &y)
{
for(int i=0;i<N;i++)
{
y(i)=a*x(i)+b*y(i);
}
}
void main1()
{
for(int i=0;i<runmax;i++)
{
y=a*x+b*y;
}
}
void main2()
{
for(int i=0;i<runmax;i++)
{
scaledsum(a,x,b,y);
}
}
int main()
{
init();
//main1();
main2();
y.print();
return 0;
}
perf stat ./main1
perf stat ./main2
I expected main1 to run faster than main2, or at least very close to it, but main1 runs slower.
I do not understand these profiling results:
main1
0.209682235 seconds time elapsed
main2
0.121644777 seconds time elapsed
PS
Compilation command: g++ -std=c++11 -O3 -s -DNDEBUG test.cpp