This code here runs fine on -O but fails to exit on -O2 and -Os.
#include <iostream>
int main() {
int ctr = 2000000000;
while (ctr++ > 0) {
if (ctr % 100000000 == 0) {
std::cout << ctr << '\n';
}
}
return 0;
}
I know that it has something to do with integer overflow, but I thought that was defined behavior. In case it may be relevant, I'm compiling on a Linux virtual machine on a Windows 64-bit computer.
EDIT: Signed integer overflow is undefined behavior, not defined behavior. So which optimization or combination of optimizations causes the problem? The question is: "Why does the code work fine on -O but fail on -O2 and -Os?"
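Because signed overflow is undefined, an optimizer is allowed to assume that ctr++ never wraps. For a counter that starts positive, the test ctr++ > 0 can then be treated as always true, which is consistent with the loop never exiting at higher optimization levels. Here is a minimal sketch of one way to give the loop a well-defined exit, assuming the intent is simply to stop at the largest int value. The helper name count_until_max and its std::ostream parameter are made up for illustration.

```cpp
#include <climits>
#include <iostream>

// Hypothetical helper: counts upward from start, printing every multiple of
// 100000000, and stops at INT_MAX instead of overflowing.
// Returns the number of increments performed.
long long count_until_max(int start, std::ostream& out) {
    long long increments = 0;
    int ctr = start;
    while (ctr < INT_MAX) {  // tested before ++ctr, so ++ctr cannot overflow
        ++ctr;
        ++increments;
        if (ctr % 100000000 == 0) {
            out << ctr << '\n';
        }
    }
    return increments;
}
```

Calling count_until_max(2000000000, std::cout) terminates at every optimization level because no step of the loop relies on wraparound. If wraparound is actually wanted, an unsigned counter is another option, since unsigned arithmetic is defined to wrap.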
Command line I think should work: g++ -std=c++17 -O3 -Wl,-z,stack-size=1000000000 C.cc && ./a.out < C.2
Instead of working, it segfaults. -fsanitize=address identifies the problem as a stack overflow.
I found https://github.com/microsoft/WSL/issues/633 which suggests sudo prlimit --stack=unlimited --pid $$; ulimit -s unlimited which "works"; ulimit -s now prints "unlimited". But I still get the segfault. I expected the linker options above to work, but they don't. I also tried using setrlimit within the C++ code directly:
struct rlimit rl;
assert(0 == getrlimit(RLIMIT_STACK, &rl));
rl.rlim_cur = 3LL*1024*1024*1024;
assert(0 == setrlimit(RLIMIT_STACK, &rl));
assert(0 == getrlimit(RLIMIT_STACK, &rl));
cerr << rl.rlim_cur << endl;
The last line prints 3GB, but I still get the stack overflow.
The function that overflows the stack is a recursive tree DFS with the following signature: ll root(ll x, const vector<vector<pll>>& E, vector<ll>& P, vector<ll>& SZ) {. The depth of the tree is <= 800,000, so I estimate the total required stack size at 8*800,000*4 bytes, roughly 26 MB. So it should be possible to get this to run.
I couldn't think of anything else to try. Any ideas? I'm on WSL version 1 in Ubuntu.
=========================================
Here is a simple repro. Save the following code as ex.cc:
#include <iostream>
#include <sys/resource.h>
#include <cassert>
using namespace std;
using ll = int64_t;
ll f(ll n) {
if(n==0) { return 0LL; }
return n + f(n-1);
}
int main() {
struct rlimit rl;
assert(0 == getrlimit(RLIMIT_STACK, &rl));
rl.rlim_cur = 3LL*1024*1024*1024;
assert(0 == setrlimit(RLIMIT_STACK, &rl));
assert(0 == getrlimit(RLIMIT_STACK, &rl));
cerr << rl.rlim_cur << endl;
cout << f(1e6) << endl;
}
g++ -std=c++17 -Wl,-z,stack-size=1000000000 ex.cc -fsanitize=address && ./a.out gives a stack overflow (also without -fsanitize=address). For me, it fails roughly 262,000 stack frames in.
Upgrading to WSL2 fixed this: https://learn.microsoft.com/en-us/windows/wsl/install-win10
Either the setrlimit or sudo prlimit --stack=unlimited --pid $$; ulimit -s unlimited works (the linker option doesn't seem to matter).
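If changing the environment is not an option, the recursion itself can be replaced with an explicit stack on the heap, which sidesteps the stack limit entirely. The following is only a sketch under assumptions: the question does not show the body of root, so this assumes it roots the tree at x, stores each vertex's parent in P and subtree size in SZ, and that E[v] holds (neighbor, weight) pairs. The name root_iterative is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using ll = std::int64_t;
using pll = std::pair<ll, ll>;

// Hypothetical iterative stand-in for root(): roots the tree at x, fills
// P (parent, -1 for the root) and SZ (subtree size), and returns SZ[x].
// Assumes E describes an undirected tree as (neighbor, weight) pairs.
ll root_iterative(ll x, const std::vector<std::vector<pll>>& E,
                  std::vector<ll>& P, std::vector<ll>& SZ) {
    assert(P.size() == E.size() && SZ.size() == E.size());
    std::vector<ll> order;     // vertices in the order they are visited
    std::vector<ll> stack{x};  // explicit DFS stack, lives on the heap
    P[x] = -1;
    while (!stack.empty()) {
        const ll v = stack.back();
        stack.pop_back();
        order.push_back(v);
        for (const pll& e : E[v]) {
            const ll u = e.first;
            if (u != P[v]) {   // skip the edge back to the parent
                P[u] = v;
                stack.push_back(u);
            }
        }
    }
    // Every vertex appears after its parent in order, so walking the order
    // backwards finishes each subtree before adding it to its parent.
    for (ll v : order) SZ[v] = 1;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if (P[*it] >= 0) SZ[P[*it]] += SZ[*it];
    }
    return SZ[x];
}
```

With this shape, a path-like tree of depth 800,000 needs a few vectors of 800,000 entries rather than 800,000 stack frames, so the result no longer depends on ulimit or linker options.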
I have a C++ program that uses OpenMP, and I need to port it into a DLL so I can call it from Python. It returns an array of double values, which are calculated in many for loops with OpenMP pragmas. I doubted it would work, so I started with a small test program that calculates Pi in a loop at different precision values; that way I could measure performance and confirm that OpenMP works properly. The plain (non-OpenMP) implementation works fine from both Python and C++, but the OpenMP variant gives a runtime error in Python (exception: access violation writing 0x000000000000A6C8) and crashes without an error in C++. The OpenMP variant also works fine when it is built as a regular executable rather than a DLL. The DLL is built with a makefile. The app that uses the DLL is built into an executable with g++ with no flags (its source code is in UnitMain.cpp). All the relevant code and the Makefile are below (I omitted some files and functions for brevity).
UPD: I tried the Microsoft compiler and it works; I also tested a Linux dynamic library on WSL/g++ and that works too. It looks specific to gcc on Windows, so I'll try another version of gcc (my current version is this):
Thread model: posix gcc version 8.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
UnitFunctions.cpp
#include "UnitFunctions.h"
#include <omp.h>
#include <stdio.h>
#include <string.h>
typedef long long int64_t;
double pi(int64_t n) {
double sum = 0.0;
int64_t sign = 1;
for (int64_t i = 0; i < n; ++i) {
sum += sign/(2.0*i+1.0);
sign *= -1;
}
return 4.0*sum;
}
void calcPiOmp(double* arr, int N) {
int64_t base = 10e5;
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
arr[i] = pi(base+i);
}
}
UnitMain.cpp
#include <windows.h>
#include <iostream>
using namespace std;
struct DllHandle
{
DllHandle(const char * const filename)
: h(LoadLibrary(filename)) {}
~DllHandle() { if (h) FreeLibrary(h); }
const HINSTANCE Get() const { return h; }
private:
HINSTANCE h;
};
int main()
{
const DllHandle h("Functions.DLL");
if (!h.Get())
{
MessageBox(0,"Could not load DLL","UnitCallDll",MB_OK);
return 1;
}
typedef const void (*calcPiOmp_t) (double*, int);
const auto calcPiOmp = reinterpret_cast<calcPiOmp_t>(GetProcAddress(h.Get(), "calcPiOmp"));
double arr[80];
calcPiOmp(arr, 80);
cout << arr[0] << endl;
return 0;
}
Makefile
all: UnitEntryPoint.o UnitFunctions.o
g++ -m64 -fopenmp -s -o Functions.dll UnitEntryPoint.o UnitFunctions.o
UnitEntryPoint.o: UnitEntryPoint.cpp
g++ -m64 -fopenmp -c UnitEntryPoint.cpp
UnitFunctions.o: UnitFunctions.cpp
g++ -m64 -fopenmp -c UnitFunctions.cpp
A Python script
import numpy as np
import ctypes as ct
cpp_fun = ct.CDLL('./Functions.dll')
cpp_fun.calcPiNaive.argtypes = [np.ctypeslib.ndpointer(), ct.c_int]
cpp_fun.calcPiOmp.argtypes = [np.ctypeslib.ndpointer(), ct.c_int]
arrOmp = np.zeros(N).astype('float64')
cpp_fun.calcPiOmp(arrOmp, N)
I wrote a test to measure the cost of C++ exceptions with threads.
#include <cstdlib>
#include <iostream>
#include <vector>
#include <thread>
static const int N = 100000;
static void doSomething(int& n)
{
--n;
throw 1;
}
static void throwManyManyTimes()
{
int n = N;
while (n)
{
try
{
doSomething(n);
}
catch (int n)
{
switch (n)
{
case 1:
continue;
default:
std::cout << "error" << std::endl;
std::exit(EXIT_FAILURE);
}
}
}
}
int main(void)
{
int nCPUs = std::thread::hardware_concurrency();
std::vector<std::thread> threads(nCPUs);
for (int i = 0; i < nCPUs; ++i)
{
threads[i] = std::thread(throwManyManyTimes);
}
for (int i = 0; i < nCPUs; ++i)
{
threads[i].join();
}
return EXIT_SUCCESS;
}
Here's the C version that I initially wrote for fun.
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <glib.h>
#define N 100000
static GPrivate jumpBuffer;
static void doSomething(volatile int *pn)
{
jmp_buf *pjb = g_private_get(&jumpBuffer);
--*pn;
longjmp(*pjb, 1);
}
static void *throwManyManyTimes(void *p)
{
jmp_buf jb;
volatile int n = N;
(void)p;
g_private_set(&jumpBuffer, &jb);
while (n)
{
switch (setjmp(jb))
{
case 0:
doSomething(&n);
case 1:
continue;
default:
printf("error\n");
exit(EXIT_FAILURE);
}
}
return NULL;
}
int main(void)
{
int nCPUs = g_get_num_processors();
GThread *threads[nCPUs];
int i;
for (i = 0; i < nCPUs; ++i)
{
threads[i] = g_thread_new(NULL, throwManyManyTimes, NULL);
}
for (i = 0; i < nCPUs; ++i)
{
g_thread_join(threads[i]);
}
return EXIT_SUCCESS;
}
The C++ version runs very slow compared to the C version.
$ g++ -O3 -g -std=c++11 test.cpp -o cpp-test -pthread
$ gcc -O3 -g -std=c89 test.c -o c-test `pkg-config glib-2.0 --cflags --libs`
$ time ./cpp-test
real 0m1.089s
user 0m2.345s
sys 0m1.637s
$ time ./c-test
real 0m0.024s
user 0m0.067s
sys 0m0.000s
So I ran the callgrind profiler.
For cpp-test, __cxa_throw was called exactly 400,000 times with self-cost of 8,000,032.
For c-test, __longjmp_chk was called exactly 400,000 times with self-cost of 5,600,000.
The whole cost of cpp-test is 4,048,441,756.
The whole cost of c-test is 60,417,722.
I guess something much more than simply saving the state of the jump-point and later resuming is done with C++ exceptions. I couldn't test with larger N because the callgrind profiler will run forever for the C++ test.
What is the extra cost involved in C++ exceptions making it many times slower than the setjmp/longjmp pair at least in this example?
This is by design.
C++ exceptions are expected to be exceptional in nature and are optimized thusly. The program is compiled to be most efficient when an exception does not happen.
You can verify this by commenting out the exception from your tests.
In C++:
//throw 1;
$ g++ -O3 -g -std=c++11 test.cpp -o cpp-test -pthread
$ time ./cpp-test
real 0m0.003s
user 0m0.004s
sys 0m0.000s
In C:
/*longjmp(*pjb, 1);*/
$ gcc -O3 -g -std=c89 test.c -o c-test `pkg-config glib-2.0 --cflags --libs`
$ time ./c-test
real 0m0.008s
user 0m0.012s
sys 0m0.004s
What is the extra cost involved in C++ exceptions making it many times slower than the setjmp/longjmp pair at least in this example?
g++ implements zero-cost model exceptions, which have no effective overhead* when an exception is not thrown. Machine code is produced as if there were no try/catch block.
The cost of this zero-overhead is that a table lookup must be performed on the program counter when an exception is thrown, to determine a jump to the appropriate code for performing stack unwinding. This puts the entire try/catch block implementation within the code performing a throw.
Your extra cost is a table lookup.
*Some minor timing voodoo may occur, as the presence of a PC lookup table may affect memory layout, which may affect CPU cache misses.
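To see the trade-off described above on your own machine, here is a minimal single-threaded sketch that runs the same loop twice: once reporting the "error" by throwing, and once through a return value. All names here (step_throwing, step_returning, handle_by_exception, handle_by_return_code, time_us) are made up for illustration, and the timings will vary by compiler and platform, so no expected numbers are given.

```cpp
#include <chrono>

// Reports the "error" by throwing, as in the C++ test above.
void step_throwing(int& n) {
    --n;
    throw 1;
}

// Reports the same "error" through the return value instead.
int step_returning(int& n) {
    --n;
    return 1;
}

// Counts the errors handled when each one is thrown and caught.
// iterations must be non-negative.
int handle_by_exception(int iterations) {
    int n = iterations;
    int handled = 0;
    while (n) {
        try {
            step_throwing(n);
        } catch (int code) {
            if (code == 1) ++handled;
        }
    }
    return handled;
}

// Counts the errors handled when each one is a plain return code.
// iterations must be non-negative.
int handle_by_return_code(int iterations) {
    int n = iterations;
    int handled = 0;
    while (n) {
        if (step_returning(n) == 1) ++handled;
    }
    return handled;
}

// Wall-clock time of one call to work(), in microseconds.
template <typename F>
long long time_us(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

For example, time_us([] { handle_by_exception(100000); }) and time_us([] { handle_by_return_code(100000); }) can be printed side by side. If an error path is taken on every iteration of a hot loop, as in this test, a return code avoids paying the unwinding cost each time.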
I am trying the new features of c++11 and I found an issue. This is my code:
#include <iostream>
#include <list>
#include <string>
using namespace std;
class A {
public:
int f (list<string> a, list<string> b={})
{
cout << a.size() << endl;
cout << b.size() << endl; // This line!!!
return 0;
}
};
int main ()
{
A a;
list<string> l{"hello","world"};
a.f(l);
return 0;
}
The execution gets stuck at the "This line!!!" line. I continued debugging, and it looks like the problem is here:
/** Returns the number of elements in the %list. */
size_type
size() const _GLIBCXX_NOEXCEPT
{ return std::distance(begin(), end()); }
I compile my program in this way:
g++ -std=c++11 -ggdb3 -fPIC -o test TestlistInit.cpp
I am using this version of g++:
g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
thanks in advance!!!
To find the cause, build with debug symbols and break on the first line of f. First check the contents of b, which look like this (your addresses will differ). Here I used the Code::Blocks "Watch" option.
b._M_impl._M_node._M_next = 0x7fffffffe370
b._M_impl._M_node._M_prev = 0x7fffffffe370
Then use the debugger's "Step into" option once you hit the b.size() line.
Eventually this takes you to stl_iterator_base_funcs.h.
At the start, we can see that first and last are the same:
__first._M_node = 0x7fffffffe370
__last._M_node = 0x7fffffffe370
while (__first != __last)
{
++__first;
++__n;
}
Stepping into ++__first we can see it does this in stl_list.h :
_Self&
operator++()
{
_M_node = _M_node->_M_next;
return *this;
}
_M_node and _M_node->_M_next are the same, so __first never advances, and .size() is caught in an endless loop.
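One thing worth trying is to avoid the braced default argument altogether. This is a sketch of a possible workaround, not a confirmed diagnosis: it assumes the malformed empty list comes from how this g++ version handles "= {}" as a default argument, which the debugging session above does not prove. The return type is changed to std::size_t only so the result is easy to check.

```cpp
#include <cstddef>
#include <list>
#include <string>

class A {
public:
    // The default argument is spelled as an explicitly constructed empty
    // list rather than "= {}".
    std::size_t f(std::list<std::string> a,
                  std::list<std::string> b = std::list<std::string>()) {
        return a.size() + b.size();
    }
};
```

With this spelling, a.f(l) for a two-element list l should return 2 instead of hanging in size(). Trying a newer g++ release is the other obvious experiment.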
When I compile this code (using the Mersenne Twister found here: http://www-personal.umich.edu/~wagnerr/MersenneTwister.html ):
#include <iostream>
#include <cmath>
#include "mtrand.h"
using namespace std;
double pythag(double x, double y) {
double derp=0;
derp=(x*x)+(y*y);
derp=sqrt(derp);
}
int main() {
double x=0;
double y=0;
double pi=0;
double hold1=0;
double hold2=0;
double hits=0;
MTRand mt;
mt.seed();
// cout.precision(10);
for(long i=1; i<=100000000000l; i++) {
x=abs(mt.rand());
y=abs(mt.rand());
if(pythag(x,y)<=1) {
hits++;
}
if(i%100000l==0) {
pi=(4*hits)/i;
cout << "\r" << i << " " << pi ;
}
}
cout <<"\n";
return 42;
}
with g++ ("g++ pi.cc -o pi")
and run the resulting application, I get the output I wanted: a running tally of pi calculated using the Monte Carlo method.
But when I compile with mingw g++ ("i686-pc-mingw32-g++ -static-libstdc++ -static-libgcc pi.cc -o pi.exe")
I always get a running tally of 0.
Any help is greatly appreciated.
Perhaps it's because you omitted the return statement:
double pythag(double x, double y) {
double derp=0;
derp=(x*x)+(y*y);
derp=sqrt(derp);
// You're missing this!!!
return derp;
}
I'd be surprised that you didn't get any warnings or errors on this.
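pythag() does not return anything, as Loki is trying to say without telling you the exact answer. That means the return value is not specified: flowing off the end of a non-void function is undefined behavior, so one compiler can happen to leave the right value in place while another gives you 0.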
Why do you return 42 in main()?! 8-)
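For comparison, here is a minimal sketch of the same Monte Carlo estimate with every code path returning a value. It swaps in std::mt19937 from the standard <random> header for the MTRand class from mtrand.h, so it is not the original program. The function name estimate_pi and its seed parameter are made up for illustration. Comparing x*x + y*y against 1 gives the same hit test as comparing the square root against 1, without calling sqrt.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: estimates pi from `samples` random points in the unit
// square, counting those inside the quarter circle of radius 1.
// samples must be greater than zero.
double estimate_pi(std::uint64_t samples, std::uint32_t seed) {
    assert(samples > 0);
    std::mt19937 gen(seed);  // standard Mersenne Twister, not MTRand
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y <= 1.0) {
            ++hits;
        }
    }
    return 4.0 * static_cast<double>(hits) / static_cast<double>(samples);
}
```

Compiling with -Wall makes g++ report a function like the original pythag() that reaches its closing brace without returning a value, which would have pointed at the problem directly.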