Does std::future spin-wait?

Before explaining the question in more detail, I'll note that the answer is obviously implementation dependent, so I'm primarily asking about libstdc++, but I would also be interested in hearing about libc++. The operating system is Linux.
Calling wait() or get() on a std::future blocks until the result is set by an asynchronous operation -- either a std::promise, a std::packaged_task, or a std::async function. The availability of the result is communicated through a shared state, which is essentially an atomic variable: the future waits for the shared state to be marked ready by the promise (or async task). In libstdc++ this waiting and notification is implemented via a futex system call. Granted that futexes are highly performant, in situations where a future expects to wait only for an extremely brief period (on the order of single microseconds), it would seem that a performance gain could be made by spinning on the shared state for a short time before proceeding to wait on the futex.
I have not found any evidence of such spinning in the current implementation. I did, however, find a comment in atomic_futex.h at line 161, exactly where I would expect to find such spinning:
// TODO Spin-wait first.
So my question is rather the following: are there actually plans to implement a spin-wait, and if so, how will its duration be decided? Additionally, is this the kind of functionality that could eventually be specified through a policy on the future?

I will answer the question: Does std::future::get() perform a spin-wait?
The answer for all of C++ is: it is an implementation detail. A conforming standard library might spin or it might not (in the same vein, std::mutex::lock() is allowed to spin). Will there be a mechanism for specifying if and how to spin in the future? The places to look are std::experimental::future (slated for a future version of the Standard Library), boost::future (a proving ground for what might later go into the standard), and hpx::future (a performance-focused library with advanced facilities for future management). None of these has a mechanism for explicitly specifying spinning, nor, to my knowledge, has one been discussed in committee meeting minutes or on the ISO C++ mailing list. It is safe to say that something like a get_with_spins function is not in the pipeline.
To answer for libstdc++ (and libc++): they do not spin either. Aside from the TODO, which came from the original patch, there does not appear to be any plan to change this. I've searched the GCC mailing list for mentions of changing this behavior and found none. Doing a pre-sleep spin can hurt in the general case (if none of the get()s finds a value ready, you've wasted a lot of CPU cycles), so a change here could have negative impacts.
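For illustration, here is a minimal sketch of what such a spin-then-block wait could look like, written against C++20's std::atomic wait API rather than raw futex calls; the spin count is an arbitrary placeholder, not anything taken from libstdc++:
#include <atomic>
#include <thread>

// Sketch of the spin-then-block pattern the TODO hints at.
void wait_until_ready(std::atomic<bool>& ready)
{
    // Phase 1: spin briefly in user space, hoping the result arrives soon.
    for (int i = 0; i < 1000; ++i) {
        if (ready.load(std::memory_order_acquire))
            return;
        std::this_thread::yield();
    }
    // Phase 2: block; on Linux/libstdc++ atomic waits map onto futex.
    while (!ready.load(std::memory_order_acquire))
        ready.wait(false, std::memory_order_acquire);
}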
To summarize: Implementations do not appear to spin now and there appear to be no plans to change the behavior in the near future, but that can change at any moment.

Related

Deduce if a program is going to use threads

Thread-safe or thread-compatible code is good.
However there are cases in which one could implement things differently (more simply or more efficiently) if one knows that the program will not be using threads.
For example, I once heard that things like std::shared_ptr could use different implementations to optimize the non-threaded case (but I can't find a reference).
I think that historically std::string in some implementations could use copy-on-write in non-threaded code.
I am not in favor of or against these techniques, but I would like to know if there is a way (at least a nominal way) to determine at compile time whether the code is being compiled with the intention of using threads.
The closest I could get is to note that threaded code is usually (?) compiled with the -pthread (not -lpthread) compiler option.
(Not sure if that is a hard requirement or just recommended.)
In turn, -pthread defines some macros, like _REENTRANT or _THREAD_SAFE, at least in gcc and clang.
In some answers on SO, I also read that these macros are obsolete.
Are these macros the right way to determine whether the program is intended to be used with threads (e.g. threads launched from that same program)? Are there other mechanisms to detect this at compile time? How confident would the detection method be?
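For illustration, such a compile-time guess based on those macros might look like the following sketch; as the answers below argue, treat it as a heuristic rather than a guarantee:
// Heuristic only: on gcc and clang, -pthread defines _REENTRANT, but the
// absence of these macros does not prove the program is single-threaded.
#if defined(_REENTRANT) || defined(_THREAD_SAFE)
inline constexpr bool compiled_for_threads = true;
#else
inline constexpr bool compiled_for_threads = false;
#endif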
EDIT: since the question apparently can be applied in many contexts, let me give a concrete case:
I am writing a header-only library that uses another third-party library inside. I would like to know whether I should initialize that library to be thread-safe (or at least with a certain level of thread support). If I assume the maximum level of thread support but the user of the library does not use threads, then a cost is paid for nothing. Since the third-party library is an implementation detail, I thought I could make a decision about the level of thread safety requested based on a guess.
EDIT2 (2021): By chance I found the historical (but influential) library Blitz++, whose documentation says (emphasis mine):
8.1 Blitz++ and thread safety
To enable thread-safety in Blitz++, you need to do one of these things:
Compile with gcc -pthread, or CC -mt under Solaris. (These options define _REENTRANT, which tells Blitz++ to generate thread-safe code.)
Compile with -DBZ_THREADSAFE, or #define BZ_THREADSAFE before including any Blitz++ headers.
In threadsafe mode, Blitz++ array reference counts are safeguarded by a mutex. By default, pthread mutexes are used. If you would prefer a different mutex implementation, add the appropriate BZ_MUTEX macros to <blitz/blitz.h> and send them to blitz-dev#oonumerics.org for incorporation. Blitz++ does not do locking for every array element access; this would result in terrible performance. It is the job of the library user to ensure that appropriate synchronization is used.
So it seems that at some point _REENTRANT was used as a clue that multi-threaded code was intended.
Maybe it is too old a reference to take seriously, though.
I support the other answer in that thread-safety decisions ideally should not be made on a whole-program basis; rather, they should be made for specific areas.
Note that boost::shared_ptr has a thread-unsafe counterpart called boost::local_shared_ptr, and boost::intrusive_ptr has both safe and unsafe counter implementations.
Some libraries use the "null mutex" pattern: a mutex type that does nothing on lock/unlock. See boost or Intel TBB null_mutex, or ATL's CComFakeCriticalSection. The point is to substitute a real mutex in thread-safe code and a fake one in thread-unsafe code, as sketched below.
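A minimal sketch of the pattern (my illustration, not code from Boost, TBB, or ATL):
#include <mutex>

// A minimal null mutex: satisfies the BasicLockable requirements but
// does nothing, so locking compiles away entirely.
struct null_mutex {
    void lock() {}
    void unlock() {}
    bool try_lock() { return true; }
};

// The same component can be instantiated with a real or a fake mutex.
template <class Mutex>
class counter {
    Mutex mtx_;
    long value_ = 0;
public:
    void increment() {
        std::lock_guard<Mutex> lock(mtx_); // a no-op when Mutex = null_mutex
        ++value_;
    }
};

counter<std::mutex> shared_counter; // thread-safe
counter<null_mutex> local_counter;  // single-threaded, zero locking cost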
Even more, sometimes it may make sense to use the same objects in a thread-safe or a thread-unsafe way depending on the current phase of execution. There is also std::atomic_ref, which provides thread-safe access to an underlying object while still allowing plain, non-atomic access to it elsewhere.
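For example, a sketch using C++20's std::atomic_ref: the same plain int is accessed atomically during the parallel phase and with ordinary loads and stores in single-threaded phases:
#include <atomic>

int hits = 0; // a plain, non-atomic object

void concurrent_phase() {
    // During the multi-threaded phase, access goes through atomic_ref.
    std::atomic_ref<int> ref(hits);
    ref.fetch_add(1, std::memory_order_relaxed);
}

void single_threaded_phase() {
    // Once only one thread touches it, plain access is fine again.
    ++hits;
}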
I know a good example of a runtime switch between thread-safe and thread-unsafe behavior: see HeapCreate with HEAP_NO_SERIALIZE, and HeapAlloc with HEAP_NO_SERIALIZE.
I also know a questionable example of the same. Delphi recommends calling its BeginThread wrapper instead of the CreateThread API function. The wrapper sets a global variable telling the Delphi Memory Manager that from now on it should be thread-safe. I'm not sure whether this behavior is still in place, but it was there in Delphi 7.
Fun fact: in Windows 10, there are virtually no single-threaded programs. Before the first statement in main is executed, static DLL dependencies are loaded, and current Windows versions parallelize this DLL loading where possible by using a thread pool. Once the program is loaded, thread-pool threads wait for other tasks, which can be issued via Windows API calls or std::async. Sure, if the program itself does not use threads or TLS, it will not notice, but technically it is multi-threaded from the OS perspective.
How confident would the detection method be?
Not really. Even if you can unambiguously detect whether code is compiled to be used with multiple threads, not everything must be thread-safe.
Making everything thread-safe by default, even though it is only ever used by a single thread, would defeat the purpose of your approach. You need more fine-grained control to turn thread safety on and off if you do not want to pay for what you do not use.
If you have a class that has a thread-safe and a non-thread-safe version, then you could use a template parameter
template <bool isThreadSafe> class Foo;
and let the user decide on a case-by-case basis.
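A minimal sketch of that idea (the null_mutex helper and member names are mine), selecting the mutex type from the flag:
#include <mutex>
#include <type_traits>

struct null_mutex { // no-op mutex, as in the other answer
    void lock() {}
    void unlock() {}
};

template <bool isThreadSafe>
class Foo {
    // Pick a real mutex only when thread safety was requested.
    using mutex_type = std::conditional_t<isThreadSafe, std::mutex, null_mutex>;
    mutex_type mtx_;
    int state_ = 0;
public:
    void set(int v) {
        std::lock_guard<mutex_type> lock(mtx_);
        state_ = v;
    }
};

Foo<true>  safe_foo;  // locks a std::mutex
Foo<false> fast_foo;  // locking compiles to nothing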

Does Boost have support for Windows EnterCriticalSection API?

I know Boost has support for mutexes and lock_guard, which can be used to implement critical sections.
But Windows has a special API for critical sections (see EnterCriticalSection and LeaveCriticalSection) which is a LOT faster than a mutex (for rarely contended, short sections of code).
Hence my question: is it possible in Boost to take advantage of this API, falling back to a spinlock/mutex/futex-based implementation on other platforms?
The simple answer is no.
Here's some relevant background from an old mailing list thread:
BTW. I am agree that mutex is more universal solution from a performance point of view. But to be fair - CS are faster in simple design. I believe that possibility to support them should be at least taken in account.
This was the article that someone pointed me to. The conclusion was that CS are only faster if:
There are fewer than 8 threads total in the process.
You weren't running in the background.
You weren't on a dual-processor machine.
To me this means that simple testing yields good CS performance results, but any real-world program is better off with a full-blown mutex.
I'm not averse to supporting a CS implementation. However, I originally chose not to for the following reasons:
You get either construction and destruction hits from using a PIMPL idiom, or you must include Windows.h in the Boost.Threads headers, which I simply don't want to do. (This can be worked around by emulating a CS a la OPTEX from the MSDN.)
According to this research paper most programs won't benefit from a CS design.
It's trivial to code a (non-portable) critical_section class that follows the Mutex model if you truly can make use of this.
For now I think I've made the right choice, though down the road we may change the implementation to use a critical section or OPTEX.
Bill Kempf
Speaking as someone who helps out maintaining Boost.Thread, and as someone who failed to get an event object into Boost.Thread, I don't think critical sections have ever been added to Boost, nor would they be, for these reasons:
A Win32 critical section is trivially easy to build using a boost::atomic and a boost::condition_variable, so much so that it isn't really worth having an official one. Here is probably the most complex one you could imagine, but extremely configurable, including being constexpr-ready (don't ask!): https://github.com/ned14/boost.outcome/blob/master/include/boost/outcome/v1/spinlock.hpp#L331
You can build your own simply by modelling the (Basic)Lockable concept and using an atomic compare_exchange (non-x86/x64) or atomic exchange (x86/x64), then grabbing it with a lock_guard around the critical section (a minimal sketch appears after this list).
Some may object that a win32 critical section is not this. I am afraid it is: it simply spins on an atomic for a spin count, and then lazily tries to allocate a win32 event object which it then waits upon. Nothing special.
As much as you might think critical sections (really, user-mode mutexes) are better/faster/whatever, they probably are not as great as you might think. boost::mutex is a big, heavyweight thing on Windows, internally using a win32 semaphore as the kernel wait object because of the need to emulate thread cancellation and to behave well in general-purpose use. It's easy to write a concurrency structure which is faster than another for some single use case, but it is very, very hard to write a concurrency structure which is all of:
Faster than a standard implementation in the uncontended case.
Faster than a standard implementation in the lightly contended case.
Faster than a standard implementation in the heavily contended case.
Even if you manage all three of the above, that still isn't enough: you also need some guarantees on worst-case progression ordering, i.e. that certain patterns of locks, waits and unlocks produce predictable outcomes. This is why threading facilities can appear slow in narrow use cases, and why Boost.Thread, much like the STL, can appear much slower than hand-rolled locking code in, say, an uncontended use case.
Boost.Thread already does substantial work in user mode to avoid going to kernel sleep on Windows. On POSIX, the major pthreads implementations also do substantial work to avoid kernel sleeps, so Boost.Thread doesn't replicate that work there. In other words, critical sections don't gain you anything in terms of scaling to load, though inevitably Boost.Thread v4, especially on Windows, does a ton of work that a naive implementation does not (the planned rewrite of Boost.Thread is vastly more efficient on Windows, as it can assume Windows Vista or above).
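As a sketch of the "build your own" bullet above (my illustration, not Boost code): a minimal BasicLockable spinlock built on atomic exchange, usable with std::lock_guard:
#include <atomic>
#include <mutex>

// Test-and-test-and-set spinlock satisfying BasicLockable.
class spinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Spin on plain loads to avoid hammering the cache line.
            while (locked_.load(std::memory_order_relaxed)) { }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

void critical_work(spinlock& sl) {
    std::lock_guard<spinlock> guard(sl); // any BasicLockable type works here
    // ... protected section ...
}
A real win32 critical section additionally bounds the spin and lazily falls back to a kernel wait object, as described above.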
So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.
So I ended up using that:
#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>

using boost::asio::detail::mutex;
using boost::lock_guard;

int myFunc()
{
    static mutex mtx;            // initialized once, in a thread-safe way
    lock_guard<mutex> lock(mtx); // held for the rest of the function
    // ...
}

Is there any std::chrono thread safety guarantee even with multicore context?

First, I'm assuming that calling any function of std::chrono is guaranteed to be thread-safe (no undefined behaviour or race conditions or anything dangerous if called from different threads). Am I correct?
Next, for example, on Windows there is a well-known problem related to multi-core processors that forces some implementations of time-related systems to allow forcing time queries onto a specific core.
What I want to know is:
using std::chrono, is there any guarantee in the standard that this kind of problem shouldn't appear?
or is it implementation-defined?
or is there an explicit absence of a guarantee, implying that on Windows you'd better always get the time from the same core?
Yes, calls to some_clock::now() from different threads should be thread safe.
As regards the specific issue you mention with QueryPerformanceCounter, it is just that the Windows API exposes a hardware issue on some platforms. Other OSes may or may not expose this hardware issue to user code.
As far as the C++ standard is concerned, if the clock claims to be a "steady clock" then it must never go backwards, so if there are two reads on the same thread, the second must never return a value earlier than the first, even if the OS switches the thread to a different processor.
For non-steady clocks (such as std::chrono::system_clock on many systems), there is no guarantee about this, since an external agent could change the clock arbitrarily anyway.
With my implementation of the C++11 thread library (including the std::chrono stuff) the implementation takes care to ensure that the steady clocks are indeed steady. This does impose a cost over and above a raw call to QueryPerformanceCounter to ensure the synchronization, but no longer pins the thread to CPU 0 (which it used to do). I would expect other implementations to have workarounds for this issue too.
The requirements for a steady clock are in 20.11.3 [time.clock.req] (C++11 standard)
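To make the steady-clock guarantee concrete, a small sketch:
#include <cassert>
#include <chrono>

void check_monotonic() {
    using clock = std::chrono::steady_clock;
    auto t1 = clock::now();
    auto t2 = clock::now(); // same thread, possibly migrated to another core
    assert(t2 >= t1);       // steady_clock must never go backwards
    // No such assertion would be valid for system_clock, which can be
    // adjusted externally at any time.
}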
I honestly believe that this question is fully answered by the following statement: there is no guarantee that an implementation will not have platform-specific bugs. It's all supposed to work perfectly, but sometimes, for some reason or another, it doesn't. Nobody can promise you that it will do what you want, but it is supposed to work.

pthread-win32 extension sem_post_multiple

I am currently building a thin C++ wrapper around pthreads for in-house use. Windows as well as QNX are targeted, and fortunately the pthreads-win32 port seems to work very well, while QNX is conformant to POSIX for our practical purposes.
Now, while implementing semaphores, I hit the function
sem_post_multiple(sem_t*, int)
which is apparently only available in pthreads-win32 but is missing from QNX. As the name suggests, the function is supposed to increment the semaphore by the count given as the second argument. As far as I can tell, the function is part of neither POSIX 1b nor POSIX 1c.
Although there is currently no requirement for said function, I am still wondering why pthreads-win32 provides it and whether it might be of use. I could try to mimic it for QNX using something similar to the following:
#include <semaphore.h>

// Emulates sem_post_multiple by posting count times; note that, unlike
// the pthreads-win32 original, this loop is not atomic.
int sem_post_multiple_qnx(sem_t* sem, int count)
{
    for (; count > 0; --count)
    {
        if (sem_post(sem) != 0)
            return -1; // errno has been set by sem_post
    }
    return 0;
}
What I am asking for are suggestions/advice on how to proceed. If the consensus suggests implementing the function for QNX, I would also appreciate comments on whether the suggested code snippet is a viable solution.
Thanks in advance.
PS: I deliberately left out my fancy C++ class for clarity. For all the folks suggesting Boost to the rescue: it is not an option in my current project due to management reasons.
In any case, semaphores are an optional extension in POSIX. E.g. OS X doesn't seem to implement them fully. So if you are concerned with portability, you'd have to provide wrappers for the functionality you need anyhow.
Your approach of emulating an atomic increment by iterated sem_post certainly has downsides:
It might perform badly, whereas sem_t is usually used in performance-critical contexts.
The operation would not be atomic, so confusing things might happen before you finish the loop.
I would stick to just what is necessary and strictly POSIX-conforming. Beware that sem_timedwait is yet another optional part of the semaphore option.
Your proposed implementation of sem_post_multiple doesn't play nicely with sem_getvalue, since the real sem_post_multiple is an atomic increase, so it is not possible for a "simultaneous" call to sem_getvalue to return any of the intermediate values.
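A sketch of the observable difference (POSIX calls only; the scenario values are illustrative):
#include <cstdio>
#include <semaphore.h>

// Illustrative observer: under the loop-based emulation, a concurrent
// sem_getvalue may return an intermediate count (e.g. 3 during a "post
// 10"); with an atomic sem_post_multiple it could only see 0 or 10.
void observe(sem_t* sem)
{
    int value = 0;
    if (sem_getvalue(sem, &value) == 0)
        std::printf("semaphore count observed: %d\n", value);
}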
Personally I'd want to leave them both out: trying to add fundamental synchronization operations to a system which lacks them is a mug's game, and your wrapper might soon cease to be "thin". So don't get into it unless you have code that uses sem_post_multiple, that you absolutely have to port.
sem_post_multiple() is a non-standard helper function introduced by the win32-pthreads maintainers. Your implementation isn't the same as theirs because the multiple decrements aren't atomic. Whether or not this is a problem depends on the intended use. (Personally, I wouldn't try to implement this function unless/until the need arises.)
This is an interesting question. +1.
I agree with the current prevailing consensus here that it is probably not a good idea to implement that function. While your proposed implementation would probably work just fine in most situations, there are definitely conditions in which the results could be dramatically different due to the non-atomicity. The following is one (extremely) contrived situation:
Start thread A which calls sem_post_multiple( s, 10 )
Thread B waiting on s is released. Thread B kills thread A.
In the above unfriendly scenario, the atomic version would have incremented the semaphore by 10. With the non-atomic version, it may only be incremented once. This example is certainly not likely in the real world. For example, killing a thread is almost always a bad idea, not to mention the fact that it could leave the semaphore object in an invalid state. The Win32 implementation could leave a mutex lock on the semaphore - see this for why.
In the above unfriendly scenario, the atomic version would have incremented the semaphore by 10. With non-atomic version, it may only be incremented once. This example is certainly not likely in the real world. For example, the killing of a thread is almost always a bad idea not to mention the fact that it could leave the semaphore object in an invalid state. The Win32 implementation could leave a mutex lock on the semaphore - see this for why.