The title pretty much conveys all relevant information, but here's a minimal repro:
#include <atomic>
#include <cstdio>
#include <memory>

int main() {
    auto ptr = std::make_shared<int>(0);
    bool is_lockless = std::atomic_is_lock_free(&ptr);
    printf("shared_ptr is lockless: %d\n", is_lockless);
}
Compiling this with the following compiler options produces a lock-free shared_ptr implementation:
g++ -std=c++11 -march=native main.cpp
While this doesn't:
g++ -std=c++11 -march=native -pthread main.cpp
GCC version: 5.3.0 (on Linux, using libstdc++), tested on multiple machines that should have the necessary atomic instructions to make this work.
Is there any way to force the lock-free implementation (I'd need the lock-free version, regardless of performance)?
There are two separate things:
Manipulation of the reference counter in the control block (or equivalent thing) is typically implemented with lock-free atomics whenever possible. This is not what std::atomic_is_lock_free tells you.
libstdc++'s __shared_ptr is templated on the lock policy, so you can explicitly use
template<typename T>
using shared_ptr_unsynchronized = std::__shared_ptr<T, __gnu_cxx::_S_single>;
if you know what you are doing.
std::atomic_is_lock_free tells you whether the atomic access functions (std::atomic_{store, load, exchange, compare_exchange} etc.) on shared_ptr are lock-free. Those functions are used to concurrently access the same shared_ptr object, and typical implementations will use a mutex.
If you use shared_ptr in a threaded environment, you NEED locks [of some kind - they could be implemented as atomic increments and decrements, but there may be places where a "bigger" lock is required to ensure no races]. The lockless version only works when there is only one thread. If you are not using threads, don't link with -lpthread.
I'm sure there is some tricky way to convince the compiler that you are not REALLY using threads for your shared pointers, but you are REALLY in fragile territory if you do - what happens if a shared_ptr is passed to a thread? You may be able to guarantee that NOW, but someone will probably, accidentally or on purpose, introduce one into something that runs in a different thread, and it will all break.
Related
Does the following program expose a data race, or any other concurrency concern?
#include <cstdio>
#include <cstdlib>
#include <atomic>
#include <thread>

class C {
public:
    int i;
};

std::atomic<C *> c_ptr{};

int main(int, char **) {
    std::thread t([] {
        auto read_ptr = c_ptr.load(std::memory_order_relaxed);
        if (read_ptr) {
            // does the following dependent read race?
            printf("%d\n", read_ptr->i);
        }
    });
    c_ptr.store(new C{rand()}, std::memory_order_release);
    t.join();  // a joinable thread must be joined before main returns
    return 0;
}
Godbolt
I am interested in whether reads through pointers need load-acquire semantics when loading the pointer, or whether the dependent-read nature of reads through pointers makes that ordering unnecessary. If it matters, assume arm64, and please describe why it matters, if possible.
I have tried searching for discussions of dependent reads and haven't found any explicit recognition of their implicit load-reordering-barriers. It looks safe to me, but I don't trust my understanding enough to know it's safe.
Your code is not safe, and can break in practice with real compilers for DEC Alpha AXP (which can violate causality via tricky cache bank shenanigans IIRC).
As far as the ISO C++ standard guaranteeing anything in the C++ abstract machine, no, there's no guarantee because nothing creates a happens-before relationship between the init of the int and the read in the other thread.
But in practice C++ compilers implement release the same way regardless of context and without checking the whole program for the existence of a possible reader with at least consume ordering.
In some, but not all, concrete implementations into asm for real machines by real compilers, this will work - because compilers don't go looking for that UB and deliberately break the code with fancy inter-thread analysis of the only possible reads and writes of that atomic variable.
DEC Alpha could famously break this code, not guaranteeing dependency ordering in asm, so needing barriers for memory_order_consume, unlike all(?) other ISAs.
Given the current deprecated state of consume, the only way to get efficient asm on ISAs that have dependency ordering (i.e. not Alpha) but don't give you acquire for free (unlike x86) is to write code like this. The Linux kernel does this in practice for things like RCU.
That requires keeping it simple enough that compilers can't break the dependency ordering by e.g. proving that any non-NULL read_ptr would have a specific value, like the address of a static variable.
See also
What does memory_order_consume really do?
C++11: the difference between memory_order_relaxed and memory_order_consume
Memory order consume usage in C11 - more about the hardware mechanism / guarantee that consume is intended to expose to software. Out-of-order exec can only reorder independent work anyway, not start a load before the load address is known, so on most CPUs enforcing dependency ordering happens for free anyway: only a few models of DEC Alpha could violate causality and effectively load data from before it had the pointer that gave it the address.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0371r1.html - and other C++ wg21 documents linked from that about why consume is discouraged.
Here is a simple code snippet for demonstration.
Somebody told me that the double-checked lock is incorrect. Since the variable is non-volatile, the compiler is free to reorder the calls or optimize them away (for details, see codereview.stackexchange.com/a/266302/226000).
But I have really seen such a code snippet used in many projects. Could somebody shed some light on this matter? I googled and talked about it with my friends, but I still can't find the answer.
#include <iostream>
#include <mutex>
#include <fstream>

namespace DemoLogger
{
    // Note: the declarations must precede their use, and `extern static`
    // is ill-formed - `extern` alone is the correct specifier here.
    extern bool is_log_file_ready;
    extern std::mutex log_mutex;
    extern std::ofstream log_stream;

    void InitFd()
    {
        if (!is_log_file_ready)
        {
            std::lock_guard<std::mutex> guard(log_mutex);
            if (!is_log_file_ready)
            {
                log_stream.open("sdk.log", std::ofstream::out | std::ofstream::trunc);
                is_log_file_ready = true;
            }
        }
    }
}

//cpp
namespace DemoLogger
{
    bool is_log_file_ready{false};
    std::mutex log_mutex;
    std::ofstream log_stream;
}
UPDATE:
Thanks to all of you. There are better implementations of InitFd(), but it's only a simple demo; what I really want to know is whether there is any potential problem with the double-checked lock or not.
For the complete code snippet, see https://codereview.stackexchange.com/questions/266282/c-logger-by-template.
The double-checked lock is incorrect because is_log_file_ready is a plain bool, and this flag can be accessed by multiple threads, one of which is a writer - that is a data race.
The simple fix is to change the declaration:
std::atomic<bool> is_log_file_ready{false};
You can then further relax operations on is_log_file_ready:
void InitFd()
{
    if (!is_log_file_ready.load(std::memory_order_acquire))
    {
        std::lock_guard<std::mutex> guard(log_mutex);
        if (!is_log_file_ready.load(std::memory_order_relaxed))
        {
            log_stream.open("sdk.log", std::ofstream::out | std::ofstream::trunc);
            is_log_file_ready.store(true, std::memory_order_release);
        }
    }
}
But in general, double-checked locking should be avoided except in low-level implementations.
As suggested by Arthur P. Golubev, C++ offers primitives to do this, such as std::call_once.
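A sketch of the question's InitFd() rewritten with std::call_once (same names as the question; error handling still omitted):

```cpp
#include <fstream>
#include <mutex>

namespace DemoLogger
{
    std::once_flag log_init_flag;
    std::ofstream log_stream;

    void InitFd()
    {
        // The lambda runs exactly once, even if many threads race here;
        // later callers wait until it has completed, then skip it.
        std::call_once(log_init_flag, [] {
            log_stream.open("sdk.log", std::ofstream::out | std::ofstream::trunc);
        });
    }
}
```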
Update:
Here's an example that shows one of the problems a race can cause.
#include <thread>
#include <atomic>
#include <chrono>  // needed for the chrono literals

using namespace std::literals::chrono_literals;

int main()
{
    int flag {0}; // wrong!
    std::thread t{[&] { while (!flag); }};
    std::this_thread::sleep_for(20ms);
    flag = 1;
    t.join();
}
The sleep is there to give the thread some time to initialize.
This program should return immediately, but compiled with full optimization -O3, it probably doesn't. This is caused by a valid compiler transformation, that changes the while-loop into something like this:
if (flag) return; while(1);
And if flag is (still) zero, this will run forever (changing the flag type to std::atomic<int> will solve this).
This is only one of the effects of undefined behavior; the compiler does not even have to commit the change to flag to memory.
With a race, or with incorrectly set (or missing) barriers, operations can also be reordered, causing unwanted effects. These are less likely to occur on X86, since it is generally a more forgiving platform than weakly-ordered architectures (although reordering effects do exist on X86).
Somebody told me that the double-check lock is incorrect
It usually is.
IIRC double-checked locking originated in Java (whose more strongly-specified memory model made it viable).
From there it spread a plague of ill-informed and incorrect C++ code, presumably because it looks enough like Java to be vaguely plausible.
Since the variable is non-volatile
Double-checked locking cannot be made correct by using volatile for synchronization, because that's not what volatile is for.
Java is perhaps also the source of this misuse of volatile, since it means something entirely different there.
Thanks for linking to the review that suggested this, I'll go and downvote it.
But I really saw such a code snippet is used in many projects indeed. Could somebody shed some light on this matter?
As I say, it's a plague, or really I suppose a harmful meme in the original sense.
I googled and talked about it with my friends, but I still can't find out the answer.
... Is there any potential problem with double-check lock for C++?
There are nothing but problems with double-checked locking for C++. Almost nobody should ever use it. You should probably never copy code from anyone who does use it.
In preference order:
Just use a static local, which is even less effort and still guaranteed to be correct - in fact:
If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once).
Note: usual implementations of this feature use variants of the double-checked locking pattern, which reduces runtime overhead for already-initialized local statics to a single non-atomic boolean comparison.
so you can get correct double-checked locking for free.
Use std::call_once if you need more elaborate initialization and don't want to package it into a class
Use (if you must) double-checked locking with a std::atomic_flag or std::atomic_bool flag and never volatile.
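The first option above can be sketched like this (assuming the logger from the question; the file name is illustrative):

```cpp
#include <fstream>

// Thread-safe since C++11: the compiler itself emits the double-checked
// fast path for the one-time construction of `stream`.
std::ofstream &LogStream()
{
    static std::ofstream stream("sdk.log",
                                std::ofstream::out | std::ofstream::trunc);
    return stream;
}
```

Every caller gets the same, fully-constructed stream, with no flag or mutex in user code.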
There is nothing to optimize away here (no statements could be elided; see the details below), but there are the following issues:
It is possible that is_log_file_ready is set to true before log_stream has actually opened the file; another thread could then bypass the outer if block and start using the stream before std::ofstream::open has completed.
This could be solved with a std::atomic_thread_fence(std::memory_order_release); memory barrier before setting the flag to true.
Also, a compiler is forbidden to reorder accesses to volatile objects within a thread (https://en.cppreference.com/w/cpp/language/as_if), but, for this code specifically, the available operator << overloads and the write function of std::ofstream do not operate on volatile objects - it would not be possible to write to the stream if it were made volatile (and making only the flag volatile would not prevent the reordering).
Note that protecting the is_log_file_ready flag from the data race with C++ standard library means - most reasonably std::atomic/std::atomic_bool (see LWimsey's answer for sample code) - using std::memory_order_release or a stronger memory order would also make this reordering impossible.
Formally, an execution containing a data race is considered to cause undefined behaviour, which in this double-checked lock applies to the is_log_file_ready flag. In code conforming to the language standard, the flag must be protected from data races (most reasonably with std::atomic/std::atomic_bool).
In practice, though, if the compiler does not deliberately spoil your code under the pretext that anything is allowed once undefined behaviour occurs (some people wrongly treat undefined behaviour as a purely run-time matter unrelated to compilation, but the standard uses it to regulate compilation; see the details of compiling C++ code with a data race in https://stackoverflow.com/a/69062080/1790694), and if it implements bool reasonably, treating any non-zero value as true (which is natural, since it must convert arithmetic types, pointers and others to bool that way), then a partially-written flag can never cause a problem when read; so the single memory barrier std::atomic_thread_fence(std::memory_order_release); before setting the flag to true, preventing the reordering, would make your code work without problems.
At https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables you can read that, since C++11, implementations of static local variable initialization (which you should also consider for one-time actions in general; see the note below) usually use variants of the double-checked locking pattern, which reduces the runtime overhead for already-initialized local statics to a single non-atomic boolean comparison.
This is an example of exactly the environment-dependent safety of a non-atomic flag that I described above. But it should be understood that these solutions are environment-dependent: since they are part of the compiler implementations themselves, rather than of a program using the compilers, conformance to the standard is not a concern there.
To make your program conform to the language standard and be protected (as far as the standard is implemented) against compiler implementation liberties, you must protect the flag from data races; the most reasonable way would be to use std::atomic or std::atomic_bool.
Note that even without protecting the flag from data races:
Because of the mutex, no thread can miss updates made by another thread to the values (both the bool flag and the std::ofstream object). The mutex implements a memory barrier: if we don't see the update when checking the flag in the first condition, we get it once we reach the mutex, and so are guaranteed to see the updated value when checking the flag in the second condition.
Because the flag can potentially be accessed in unpredictable ways from other translation units, the compiler cannot elide the writes and reads of the flag under the as-if rule, even if the other code in the translation unit were so senseless (for example, setting the flag to true and only then starting the threads, with no reset to false reachable) that elision would otherwise be permitted.
For one-time actions in general, besides raw protection with flags and mutexes, consider using:
std::call_once (https://en.cppreference.com/w/cpp/thread/call_once);
a function that initializes a static local variable (https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables), if its lifetime suits, since its initialization is data-race safe (but be careful: the data-race safety of static local variable initialization is guaranteed only since C++11).
All the multi-threading functionality mentioned is available since C++11 (and since you are already using std::mutex, which also appeared then, that is your case).
Also, you should correctly handle the case where opening the file fails.
Also, everyone must protect the std::ofstream object from concurrent write operations on the stream.
Answering the additional question from the update: there are no problems with a properly implemented double-checked lock, and a proper implementation is possible in C++.
std::mutex is non-recursive, and violating that is UB. So anything is possible in theory (including behaving like std::recursive_mutex), but libc++ seems to work fine; this program outputs
bye
#include <iostream>
#include <mutex>

std::mutex m;

int main() {
    std::scoped_lock l1(m);
    std::scoped_lock l2(m);
    std::cout << "bye" << std::endl;
}
Is this an intentional design decision in libc++, or just an accident (for example, they could be using the same logic for mutex and recursive_mutex)?
libstdc++ hangs.
Note: I am aware that people should not rely on UB, so this is not about best practices; I am just curious about obscure implementation details.
It does not seem to be an intentional design decision. The libc++ implementation of std::mutex is just a wrapper around the platform's default POSIX mutex. Since that is also defined to have undefined behaviour when locked recursively, libc++ simply inherits whatever the platform's default POSIX mutex happens to do - in this case, allowing the recursive lock.
I get the opposite results: libc++ hangs and libstdc++ doesn't.
The reason is that if the file is not compiled with -pthread, threading support is disabled and std::mutex::lock/unlock become no-ops. Adding -pthread makes both of them deadlock as expected.
libc++ is built with threading support by default and doesn't require the -pthread flag, so its std::mutex::lock does actually acquire the lock, creating the deadlock.
On my macOS machine, atomic<T*> is lock-free.
#include <iostream>
#include <atomic>

int main() {
    std::cout << std::atomic<void*>().is_lock_free() << std::endl;
    return 0;
}
output: 1
I want to know: is atomic<T*> always lock-free?
Is there a reference that covers this?
The standard allows any atomic type (with the exception of std::atomic_flag) to be implemented with locks. Even if the platform would allow lock-free atomics for some type, the standard library developers might not have implemented that.
If you need to implement something differently when locks are used, this can be checked at compile time using ATOMIC_POINTER_LOCK_FREE macro.
No, it is not safe to assume that any particular platform's implementation of std::atomic is always lock free.
The standard specifies some marker macros, including ATOMIC_POINTER_LOCK_FREE, which indicates whether atomic pointers are never, sometimes, or always lock-free on the platform in question.
You can also get an answer from std::atomic<T *>::is_always_lock_free, for your particular T.1
Note 1: A given pointer type must be consistent, so the instance method std::atomic<T *>::is_lock_free() is redundant.
I have a Linux multithreaded application in C++.
In this application, class App has a Status variable:
class App {
    ...
    typedef enum { asStop=0, asStart, asRestart, asWork, asClose } TAppStatus;
    TAppStatus Status;
    ...
};
All threads frequently check Status by calling the GetStatus() function.
inline TAppStatus App::GetStatus() { return Status; }
Other functions of the application can assign different values to the Status variable by calling the SetStatus() function, and they do not use mutexes.
void App::SetStatus( TAppStatus aStatus ) { Status = aStatus; }
Edit: All threads use Status in switch operator:
switch ( App::GetStatus() ){ case asStop: ... case asStart: ... };
Is the assignment in this case an atomic operation?
Is this correct code?
Thanks.
There is no portable way to implement synchronized variables in C99 or C++03, and the pthread library does not provide one either. You can:
Use C++0x <atomic> header (or C1x <stdatomic.h>). Gcc does support it for C++ if given -std=c++0x or -std=gnu++0x option since version 4.4.
Use the Linux-specific <linux/atomic.h> (this is implementation used by kernel, but it should be usable from userland as well).
Use GCC-specific __sync_* builtin functions.
Use some other library that provides atomic operations like glib.
Use locks, but that's orders of magnitude slower compared to the fast operation itself.
Note: As Martinho pointed out, although they are called "atomic", for stores and loads the hard-to-get but necessary property here is not atomicity itself (the operation cannot be interrupted, and a load always sees either all of a store or none of it - usually true of aligned 32-bit stores and loads anyway) but the ordering property (if you store a and then b, nobody may see the new value of b and then the old value of a).
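Since C++11, the simplest fix for the question's class is to wrap the status in std::atomic, which provides both the atomicity and the ordering discussed above (a sketch based on the question's declarations):

```cpp
#include <atomic>

typedef enum { asStop = 0, asStart, asRestart, asWork, asClose } TAppStatus;

class App {
public:
    // acquire/release ordering: a thread that observes a new status also
    // observes everything the setter wrote before changing it.
    TAppStatus GetStatus() const { return Status.load(std::memory_order_acquire); }
    void SetStatus(TAppStatus aStatus) { Status.store(aStatus, std::memory_order_release); }

private:
    std::atomic<TAppStatus> Status{asStop};
};
```

The switch over App::GetStatus() from the question then works unchanged, without a data race.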
This depends entirely upon the enum representation chosen. For x86, I believe, all assignments of the operating system's word size (32-bit for x86, 64-bit for x64) that are also aligned to that size are atomic, so a simple read and write is atomic.
Even assuming that it is the correct size and alignment, this doesn't mean that these functions are thread-safe; it depends on what the status is used for.
Edit: In addition, the compiler's optimizer may well wreak havoc if there are no uses of atomic operations or other volatile accesses.
Edit to your edit: No, that's not thread-safe at all. If you converted it manually into a jump table then it might be thread-safe; I'll need to think about it for a little while.
On certain architectures this assignment might be atomic (by accident), but even if it is, this code is wrong. The compiler and hardware may perform various optimizations which might break this "atomicity". Look at: http://video.google.com/videoplay?docid=-4714369049736584770#
Use locks or an atomic variable (http://www.stdthread.co.uk/doc/headers/atomic/atomic.html) to fix it.