Force use of locks inside std::atomic during debugging with libstdc++ - c++

I've done a bit of googling and can't seem to turn up a GCC option or libstdc++ macro for this. Is it possible to force the use of locking internally on all the std::atomic template specializations? On some platforms some of the specializations are locking anyway, so it certainly seems like a feasible option.
In the past I've found the use of std::atomic to be very painful when debugging data-races with tools such as Valgrind (Helgrind or DRD) due to the enormous number of false positives. If use of atomics is pervasive enough, suppression files seem to not be a very scalable solution.

There is no way, AFAIK. GCC implements C++11 atomics via lock-free builtin functions (__atomic_fetch_add, __atomic_test_and_set, etc.). Depending on what is available in the machine definition, GCC may emit an efficient instruction sequence or, as a last resort, use a compare-and-swap loop. If nothing useful is available, GCC just emits calls to external functions with the same names and arguments.
http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins
PS. Actually, you may compile with -m32 -march=i386 and provide the required external functions yourself.
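For illustration, a minimal sketch of what one of those externally supplied functions could look like, following GCC's naming convention (__atomic_fetch_add_4 for the 4-byte variant); this is a toy with a single global lock, not a drop-in libatomic replacement, but the pthread mutex is exactly what would make the access visible to Helgrind/DRD:

    #include <pthread.h>
    #include <stdint.h>

    static pthread_mutex_t atomic_mutex = PTHREAD_MUTEX_INITIALIZER;

    // Lock-based fallback that GCC calls when it cannot emit the
    // operation inline.
    extern "C" uint32_t __atomic_fetch_add_4(volatile uint32_t* ptr,
                                             uint32_t val, int /*memorder*/) {
        pthread_mutex_lock(&atomic_mutex);
        uint32_t old = *ptr;        // read the previous value under the lock
        *ptr = old + val;           // store the incremented value
        pthread_mutex_unlock(&atomic_mutex);
        return old;                 // fetch_add returns the old value
    }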

Related

How to use -fsplit-stack on a large project

I recently posted a question about stack segmentation and boost coroutines, but it seems like the -fsplit-stack approach only works with source files that are compiled with that flag; the runtime breaks down when you call a function that has not been compiled with -fsplit-stack.
This implies that the runtime uses a function-local technique to detect when the current stack has been exceeded, and not a "guard page signal" trick, where the end of the stack always has a guard page which raises a signal on write or read, telling the runtime to allocate a new stack segment and branch to it.
Then what is the use of this flag? If linking to any other library that has not been built with it (even libstdc++ and libc) breaks the code, how is this something people use practically with big projects?
From reading the gcc wiki about split stacks, it seems like calling a non-split-stack function from a split-stack function results in the allocation of a 64KB stack frame. Good.
But it seems like calling a non-split-stack function through a function pointer has not yet been implemented to follow the above scheme.
What use is this flag then? If I proceed to call any virtual function, will my program break?
Further, from the answer below it seems like clang has not implemented split stacks?
You have to compile boost (at least boost.context and boost.coroutine) with segmented-stacks support AND your application.
compile boost (boost.context and boost.coroutine) with the b2 property segmented-stacks=on (enables special code inside boost.coroutine and boost.context).
your app has to be compiled with -DBOOST_USE_SEGMENTED_STACKS and -fsplit-stack (required by the boost.coroutine headers); a sketch of the full build invocation follows after this list.
see the boost.coroutine documentation
boost.coroutine contains an example that demonstrates segmented stacks (in the directory coroutine/example/asymmetric/, call b2 toolset=gcc segmented-stacks=on).
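Putting those steps together, the build could look roughly like this (file and library names are illustrative):

    # build the relevant boost libraries with segmented-stack support
    b2 toolset=gcc segmented-stacks=on --with-context --with-coroutine

    # build your application with the matching flags
    g++ -fsplit-stack -DBOOST_USE_SEGMENTED_STACKS app.cpp \
        -lboost_coroutine -lboost_context -o app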
Regarding your last question, the GCC Wiki states:
For calls from split-stack code to non-split-stack code, the linker will change the initial instructions in the split-stack (caller) function. This means that the linker will have to have special knowledge of the instructions that the compiler emits. The effect of the changes will be to increase the required framesize by a number large enough to reasonably work for a non-split-stack. This will be a target dependent number; the default will be something like 64K. Note that this large stack will be released when the split-stack function returns. Note that I'm disregarding the case of split-stack code in a shared library calling non-split-stack code in the main executable; that seems like an unlikely problem.
please note: while llvm supports segmented stacks, clang seems not to provide the __splitstack_<xyz> functions.
First, I'd say split-stack support is somewhat experimental in nature to begin with. It is not a widely supported thing, nor has a single implementation become accepted as the way to go. As such, part of the purpose of its existence in the compiler is to enable research through real use.
That said, one generally wants to use such a feature to enable lots of threads with small stacks, which can grow if they need to. In some applications, the code that runs in these threads can be tightly controlled, e.g. fairly specialized request handlers that do not call general-purpose libraries such as Boost. High-performance systems work often involves tightening down the constraints on what code is used in a given path, and this would be an example thereof. It certainly limits the applicability of the feature, but I wouldn't be surprised if someone is using it in production this way.
Note that similar issues exist with flags such as -fno-exceptions and -fno-rtti. Generally, C++ requires compiling everything that goes into an executable with a compatible set of flags. Sometimes one can mix and match, but it is often fragile. This is part of the motivation for building everything from source and for hermetic build tools like Bazel. Other languages have different approaches to non-source components, especially virtual-machine-based languages such as Java and the .NET family. In those worlds, things like split stacks would be decided at a lower level of compilation, and typically one would not have any control over or awareness of them at the source-code level.

Avoiding duplicate symbol when compiling to multiple instruction sets

I am in the process of using CPU dispatch based on processor features to switch implementations of a complicated numerical algorithm. I want to include the two versions I am compiling (an SSE2 and an SSE3 version, for argument's sake) in the same dynamic library.
The approach taken so far is to wrap all architecture-specific code into a namespace, e.g. namespace sse2 and namespace sse3, thus avoiding duplicate symbol names when linking the final dynamic library.
However, what happens if I use some code outside my control (e.g. a std::vector<int>) in both the sse2 and sse3 versions? As far as I can see, the std::vector implementation will be present in both the sse2 and sse3 object files, but could in theory contain different instructions depending on the optimizations performed by the compiler. When I link these object files into the dynamic library, one of them will be used, and I risk potentially trying to run an SSE3 instruction on a CPU only supporting SSE2.
Aside from compiling to two separate dynamic libraries, what can be done to get around this problem? I need a solution working with both Visual Studio and clang on Windows, Mac OS X, and Linux.
One approach would be to dispatch at the shared-library level instead of the object-file level. This would require compiling the entire library multiple times with different instruction set support, then dispatching to the appropriate shared library at runtime based on the CPU capabilities that you detect. I detail an approach that can work for this on OS X and Linux in this previous answer. I have not tried to implement this on Windows (yet), however.
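A minimal sketch of the runtime side of that idea on Linux/OS X, assuming the two builds are named libalgo_sse2.so and libalgo_sse3.so and export a common entry point run_algorithm (all of these names are hypothetical); __builtin_cpu_supports needs gcc >= 4.8 or a recent clang:

    #include <dlfcn.h>
    #include <cstdio>

    typedef void (*algo_fn)(const double* in, double* out, int n);

    int main() {
        // Choose the most capable build this CPU can execute.
        const char* lib = __builtin_cpu_supports("sse3")
                              ? "./libalgo_sse3.so" : "./libalgo_sse2.so";
        void* handle = dlopen(lib, RTLD_NOW);
        if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

        // Both libraries export the same symbol, so the caller is unchanged.
        algo_fn run = (algo_fn)dlsym(handle, "run_algorithm");
        if (run) { /* ... run(input, output, n) ... */ }
        dlclose(handle);
        return 0;
    }

On Windows the same structure works with LoadLibrary/GetProcAddress, though as noted above I have not tried it there.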
This scenario is fully supported in the language and should not require any explicit handling. Your dynamic-dispatch scenario doesn't have much to do with it: it is very typical for a project to instantiate std::vector in multiple translation units, and still ODR is not violated. In general, inline functions - and in particular template instantiations - are just not visible to the linker (i.e. they do not appear as 'External' in the obj-file tables).
If for some exotic reason you needed to explicitly control this linkage type, you'd have to resort to compiler-specific machinery. The MSVC mechanism is __declspec(selectany); gcc has some other devices. I don't know about clang - but the point is you'd be hard-pressed to come up with a reason to use any of them.

What exactly does `threading=multi` do when compiling boost?

I am not entirely sure what exactly the threading=multi flag does when building boost. The documentation says:
Causes the produced binaries to be thread-safe. This requires proper support in the source code itself.
which does not seem very specific. Does this mean that accesses to, for example, boost containers are guarded by mutexes/locks or similar? As the performance of my code is critical, I would like to minimize any unnecessary mutexes etc.
Some more details:
My code is a plug-in DLL which gets loaded into a multi-threaded third-party application. I statically link boost into the DLL (the plug-in is not allowed to have any other dependencies except standard Windows DLLs, so I am forced to do this).
Although the application is multi-threaded, most of the functions in my DLL are only ever called from a single thread, and therefore accesses to containers need not be guarded. I explicitly guard the remaining places in my code that can be called from multiple threads, using boost::mutex and friends.
I've tried building boost with both threading=multi and threading=single and both seem to work but I'd really like to know what I am doing here.
No, threading=multi doesn't mean that things like boost containers will suddenly become safe for concurrent access by multiple threads (that would be prohibitively expensive from a performance point of view).
Rather, what it means in theory is that boost will be compiled to be thread-aware. Basically this means that boost methods and classes will behave in a reasonable default way when accessed from multiple threads, much like classes in the std library. This means that you cannot access the same object from multiple threads, unless otherwise documented, but you can safely access different objects from multiple threads. This might seem obvious, even without explicit support, but any static state used by the library would break that guarantee if not protected. Using threading=multi guarantees that any such shared state is properly guarded by a mutex or some other mechanism.
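As a contrived illustration (the cache and function here are hypothetical, not actual Boost internals), this is the kind of hidden shared state a thread-aware build must guard:

    #include <boost/thread/mutex.hpp>
    #include <map>
    #include <string>

    // Hidden library-internal state: two threads using *different*
    // user-visible objects could still race on this shared cache.
    static std::map<std::string, int> g_cache;
    static boost::mutex g_cache_mutex;

    int cached_lookup(const std::string& key) {
        // A thread-aware build takes this lock; a single-threaded
        // build could legitimately omit it for speed.
        boost::mutex::scoped_lock lock(g_cache_mutex);
        return g_cache[key];
    }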
In the past, similar arguments selecting stdlib flavors were available for the C and C++ standard libraries supplied by compilers, although today mostly just the multithreaded versions are available.
There is likely little downside to compiling with threading=multi, given that only a limited amount of static state needs to be synchronized. Your comment that your library will mostly only be called by a single thread doesn't inspire a lot of confidence - after all, those are the kinds of latent bugs that will cause you to be woken up at 3 AM by your boss after a long night of drinking.
The example of boost's shared_ptr is informative. With threading=single, it is not even guaranteed that independent manipulation of two shared_ptr instances from multiple threads is safe. If they happen to point to the same object (or, in theory, under some exotic implementations, even if they don't), you will get undefined behavior, because shared state won't be manipulated with the proper protections.
With threading=multi, this won't happen. However, it is still not safe to access the same shared_ptr instance from multiple threads. That is, it doesn't give any thread-safety guarantees which aren't documented for the object in question - but it does give the "expected/reasonable/default" guarantee that independent objects are independent. There isn't a good name for this default level of thread-safety, but it is in fact what's generally offered by all standard libraries for multi-threaded languages today.
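A short sketch of the distinction, using boost::shared_ptr:

    #include <boost/make_shared.hpp>
    #include <boost/shared_ptr.hpp>
    #include <thread>

    int main() {
        boost::shared_ptr<int> p = boost::make_shared<int>(42);
        boost::shared_ptr<int> q = p;  // distinct instance, same control block

        // OK with thread-aware refcounting: each thread touches its own
        // shared_ptr instance, even though they share the pointee.
        std::thread t1([p] { boost::shared_ptr<int> r = p; });
        std::thread t2([q]() mutable { q.reset(); });
        t1.join();
        t2.join();

        // Still NOT OK, even with threading=multi: two threads reading and
        // writing the *same* instance p concurrently is a data race.
        return 0;
    }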
As a final point, it's worth noting that Boost.Thread is implicitly always compiled with threading=multi - since using boost's multithreaded classes is an implicit hint that multiple threads are present. Using Boost.Thread without multithreaded support would be nonsensical.
Now, all that said, the above is the theoretical idea behind compiling boost "with thread support" or "without thread support" which is the purpose of the threading= flag. In practice, since this flag was introduced, multithreading has become the default, and single threading the exception. Indeed, many compilers and linkers which defaulted to single threaded behavior, now default to multithreaded - or at least require only a single "hint" (e.g., the presence of -pthread on the command line) to flip to multithreaded.
Apart from that, there has also been a concerted effort to make the boost build be "smart" - it should flip to multithreaded mode when the environment favors it. That's pretty vague, but necessarily so. It gets as complicated as weakly linking the pthreads symbols, so that the decision to use MT or ST code is actually deferred to runtime: if pthreads is available at execution time, those symbols will be used; otherwise, weakly linked stubs - which do nothing at all - will be used.
The bottom line is that threading=multi is correct, and harmless, for your scenario, especially if you are producing a binary you'll distribute to other hosts. If you don't specify it, it is highly likely that things will work anyway, due to the build-time and even runtime heuristics, but you do run the chance of silently using empty stub methods, or otherwise using MT-unsafe code. There is little downside to using the right option - some of the gory details can also be found in the comments on this post, and in Igor's reply as well.
After some digging, it turns out that threading=single doesn't have as much effect as one would expect. In particular, it does not affect the BOOST_HAS_THREADS macro and thus doesn't configure libraries to assume a single-threaded environment.
With gcc, threading=multi just implies #define BOOST_HAS_PTHREADS, while with MSVC it doesn't produce any visible effect. In particular, _MT is defined in both threading=single and threading=multi modes.
Note, however, that one can explicitly configure Boost libraries for single-threaded mode by defining the appropriate macro, like BOOST_SP_DISABLE_THREADS or BOOST_ASIO_DISABLE_THREADS, or globally with BOOST_DISABLE_THREADS.
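For example, defining one of these before any Boost header is included (or on the compiler command line) forces the single-threaded code paths:

    // Globally, for all Boost libraries:
    #define BOOST_DISABLE_THREADS

    // Or per library, e.g. just for shared_ptr's reference counting:
    #define BOOST_SP_DISABLE_THREADS

    #include <boost/shared_ptr.hpp>  // must come after the defines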
Let's be concise. "Causes the produced binaries to be thread-safe" means that the Boost code is adapted so that different threads can use different Boost objects. In particular, the Boost code takes special care with the thread-safety of accesses to hidden global or static objects that might be used by the implementation of the Boost libraries. It does not automatically allow different threads to use the same Boost objects at the same time without protection (locks/mutexes/...).
Edit: Some Boost libraries may document extra thread-safety guarantees for specific functions or classes. For example asio::io_service, as suggested by Igor R. in a comment.
The documentation says everything that is needed: this option ensures thread safety. That is, when programming in a multithreaded environment, you need to ensure certain properties, like avoiding unsynchronized access to shared variables.
I think, enabling this option is the way to go.
For further reference: BOOST libraries in multithreading-aware mode

How do I ensure lrint is inlined in gcc?

After reading around the subject, there is overwhelming evidence from numerous sources that using standard C or C++ casts to convert from floating point to integer numbers on Intel is very slow. In order to meet the ANSI/ISO specification, Intel CPUs need to execute a large number of instructions, including those needed to switch the rounding mode of the FPU hardware.
There are a number of workarounds described in various documents, but the cleanest and most portable seems to be the lrint() call added to the C99 and C++0x standards. Many documents say that a compiler should inline-expand these functions when optimization is enabled, leading to code which is faster than a conventional cast or a function call.
I even found references to gcc feature-tracking bugs about adding this inline expansion to the gcc optimizer, but in my own performance tests I have been unable to get it to work. All my attempts show lrint performance to be much slower than a simple C or C++ style cast. Examining the assembly output of the compiler, and disassembling the compiled objects, always shows an explicit call to an external lrint() or lrintf() function.
The gcc versions I have been working with are 4.4.3 and 4.6.1, and I have tried a number of flag combinations on 32-bit and 64-bit x86 targets, including options to explicitly enable SSE.
How do I get gcc to inline expand lrint, and give me fast conversions?
The lrint() function may raise domain and range errors. One possible way the libc deals with such errors is by setting errno (see C99/C11 section 7.12.1). The overhead of this error checking can be quite significant, and in this particular case it seems to be enough for the optimizer to decide against inlining.
The gcc flag -fno-math-errno (which is part of -ffast-math) will disable these checks. It might be a good idea to look into -ffast-math if you do not rely on standards-compliant handling of floating-point semantics, in particular NaNs and infinities...
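For example, with a translation unit like the following one can compare the generated assembly with and without the flag (whether the call is actually inlined still depends on the target and gcc version, so do inspect the output):

    // Compile with: g++ -O2 -fno-math-errno -S convert.cpp -o -
    // and look for a cvtsd2si/fistpl instruction instead of "call lrint".
    #include <math.h>

    long convert(double x) {
        return lrint(x);  // rounds using the current FP rounding mode
    }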
Have you tried the -finline-functions flag to gcc?
You can also direct GCC to try to integrate all “simple enough” functions into their callers with the option -finline-functions.
see http://gcc.gnu.org/onlinedocs/gcc/Inline.html
This tells gcc to try to inline all functions, but not all of them will actually be inlined. The compiler uses some heuristics to determine whether a function is small enough to be inlined; recursive functions, for example, are also not going to be inlined here.

C++0x atomic template implementation

I know that a similar template exists in Intel's TBB; besides that, I can't find any implementation on Google or in the Boost library.
You can find a discussion about implementing this feature in Boost here: http://lists.boost.org/Archives/boost/2008/11/144803.php
> Can the N2427 - C++ Atomic Types and Operations be implemented
> without the help of the compiler?
No.
They don't need to be intrinsics: if you can write inline assembler (or separately-compiled assembler, for that matter) then you can write the operations themselves directly. You might even be able to use simple C++ (e.g. just plain assignment for load or store). The reason you need compiler support is to prevent inappropriate optimizations: atomic operations can't be optimized out, and generally must not be reordered before or after any other operations. This means that even non-atomic stores performed before an atomic store have to be complete, and can't be cached in a register (for example). Also, loads that occur after an atomic operation cannot be hoisted before the atomic op. On some compilers, just using inline assembler is enough. On others, calling an external function is enough. MSVC provides _ReadWriteBarrier() to provide the compiler ordering. Other compilers need other flags.
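To make the "compiler ordering" point concrete, here is a minimal sketch of a sequentially-consistent store on x86 using GCC-style inline assembly; the "memory" clobber is what forbids the compiler from caching values in registers or moving memory accesses across the operation:

    #include <stdint.h>

    // XCHG with a memory operand is implicitly locked on x86, which gives
    // the hardware ordering; the "memory" clobber gives the compiler
    // ordering discussed above.
    static inline void atomic_store_seq_cst(volatile int32_t* p, int32_t v) {
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r"(v), "+m"(*p)
                             :
                             : "memory");
    }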