Fastest Multi-Reader / Single Writer Protection for Shared Resources - C++

I would like confirmation that my approach is extremely fast and appropriate for cross platform protection of a shared resource for a mostly multiple reader, single writer approach using C++. It favors writers: when a writer enters, all current threads are allowed to finish, but all new threads of any type must wait. The reverse of these two functions (the release functions) should be obvious.
The reading I've done suggests that boost shared_mutex and other rwlock types are not implemented very well and should be avoided. In fact, I take it that shared_mutex will not make it into C++0x. See this response by Anthony Williams.
It seems it might even be possible to write to an integer and not need locking of any kind if it is aligned correctly. There are so many articles out there; is there any good reading on this subject so I don't have to sort the wheat from the chaff?
void AquireReadLock(void)
{
mutex::enter();
if(READ_STATE == true)
{
iReaders++;
mutex::leave();
return;
}
else
{
mutex::leave();
sleep(1);
AquireReadLock();
return;
}
}
void AquireWriteLock(void)
{
mutex::enter();
READ_STATE = false;
if (iReaders != 0)
{
mutex::leave();
sleep(1);
AquireWriteLock();
return;
}
else
{
mutex::leave();
return;
}
}

The decision to leave shared_mutex out of C++0x was made independently of any quality issues. The decision was part of the "Kona compromise" made in the Fall of 2007. This compromise was aimed at reducing the feature set of C++0x so as to ship a standard by 2009. It didn't work, but nevertheless, that is the rationale.
shared_mutex will be discussed for inclusion in a technical report (i.e. tr2) after the committee has completed C++0x. The chairman of the library working group has already contacted me on this very subject. That is not to say that shared_mutex will be in tr2. Just that it will be discussed.
Your implementations of AquireReadLock and AquireWriteLock have the disadvantage that they eat a stack frame at the rate of once per second when under contention. And when contention is over, they delay up to a second before reacting. This makes them both stack hungry and poor performing (sorry).
If you are interested, there is a full description and implementation of shared_mutex here:
http://home.roadrunner.com/~hinnant/mutexes/locking.html
The code is not part of boost, but does carry the boost open source license. Feel free to use it, just keep the copyright with the source code. No other strings attached. Here is its analog to your AquireReadLock:
void
shared_mutex::lock_shared()
{
std::unique_lock<mutex_t> lk(mut_);
while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
gate1_.wait(lk);
count_t num_readers = (state_ & n_readers_) + 1;
state_ &= ~n_readers_;
state_ |= num_readers;
}
And here is its analog to your AquireWriteLock:
void
shared_mutex::lock()
{
std::unique_lock<mutex_t> lk(mut_);
while (state_ & write_entered_)
gate1_.wait(lk);
state_ |= write_entered_;
while (state_ & n_readers_)
gate2_.wait(lk);
}
I consider this a well-tested and high-performing, fair read/write mutex implementation for C++. If you have ideas on how to improve it, I would welcome them.
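To make the two-gate idea above easier to experiment with, here is a minimal self-contained sketch. It is not the linked implementation: the class name TwoGateSharedMutex, the separate reader counter (instead of bit flags in state_), the two unlock functions and the demo_two_gate helper are all assumptions made for illustration, and the reader-count overflow check is left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified two-gate reader/writer mutex (a sketch, not the linked code).
class TwoGateSharedMutex {
    std::mutex mut_;
    std::condition_variable gate1_;  // new readers and new writers wait here
    std::condition_variable gate2_;  // the admitted writer waits here for readers to drain
    unsigned readers_ = 0;
    bool write_entered_ = false;

public:
    void lock() {
        std::unique_lock<std::mutex> lk(mut_);
        gate1_.wait(lk, [this] { return !write_entered_; });
        write_entered_ = true;  // from now on, new readers block at gate1_
        gate2_.wait(lk, [this] { return readers_ == 0; });
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> lk(mut_);
            write_entered_ = false;
        }
        gate1_.notify_all();  // wake every waiting reader and writer
    }
    void lock_shared() {
        std::unique_lock<std::mutex> lk(mut_);
        gate1_.wait(lk, [this] { return !write_entered_; });
        ++readers_;
    }
    void unlock_shared() {
        bool wake_writer = false;
        {
            std::lock_guard<std::mutex> lk(mut_);
            --readers_;
            wake_writer = write_entered_ && readers_ == 0;
        }
        if (wake_writer) gate2_.notify_one();  // the last reader lets the writer in
    }
};

// Runs one writer and four readers; returns the final value of the shared counter.
int demo_two_gate(int writes) {
    TwoGateSharedMutex m;
    int shared = 0;
    std::vector<std::thread> threads;
    threads.emplace_back([&] {
        for (int i = 0; i < writes; ++i) { m.lock(); ++shared; m.unlock(); }
    });
    for (int r = 0; r < 4; ++r) {
        threads.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                m.lock_shared();
                int seen = shared;  // safe: no writer can hold the lock here
                assert(seen >= 0 && seen <= writes);
                m.unlock_shared();
            }
        });
    }
    for (auto& t : threads) t.join();
    return shared;
}
```

Unlike the sleep-and-recurse version in the question, waiting threads block on a condition variable and are woken as soon as the state changes, so there is neither stack growth nor a one-second reaction delay.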

"I would like confirmation that my approach is extremely fast and appropriate for cross platform protection of a shared resource for a mostly multiple reader, single writer approach using C++."
Faster than what? Your question seems like a micro-optimization; any solution will require profiling before you can draw an adequate conclusion.
"The reading I've done suggest that boost shared_mutex and other type rwlocks are not implemented very well and should be avoided."
What are your sources for this statement?

Related

How to properly implement a cross-platform spinlock in c++

Essentially, my question is:
What does a "good" implementation of a spinlock look like in C++ which works on the "usual" CPU/OS/Compiler combinations (x86 & arm, Windows & Linux, msvc & clang & g++ (maybe also icc) ).
Explanation:
As I wrote in the answer to a different question, it is fairly easy to write a working spinlock in c++11. However, as pointed out (in the comments as well as in e.g. spinlock-vs-stdmutextry-lock), such an implementation comes with some performance problems in case of congestion, which imho can only be solved by using platform specific instructions (intrinsics / os primitives / assembly?).
I'm not looking for a super optimized version (I expect that would only make sense if you have very precise knowledge about the exact platform and workload and need every last bit of efficiency) but something that lives around the mythical 20/80 tradeoff point i.e. I want to avoid the most important pitfalls in most cases while still keeping the solution as simple and understandable as possible.
In general, I'd expect the result to look something like this:
#include <atomic>
#ifdef _MSC_VER
#include <Windows.h>
#define YIELD_CPU YieldProcessor()
#elif defined(...)
#define YIELD_CPU ...
...
#endif
class SpinLock {
std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
void lock() {
while (locked.test_and_set(std::memory_order_acquire)) {
YIELD_CPU;
}
}
void unlock() {
locked.clear(std::memory_order_release);
}
};
But I don't know
if a YIELD_CPU macro inside the loop is all that's needed or if there are any other problematic aspects (e.g. can/should we indicate if we expect the test_and_set to succeed most of the time)
what the appropriate mapping for YIELD_CPU on the different CPU/OS/Compiler combinations is (and if possible I'd like to avoid dragging in a heavy weight header like Windows.h)
Note: I'm also interested in answers that only cover a subset of the mentioned platforms, but might not mark them as the accepted answer and/or merge them into a separate community answer.
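For reference, here is one hedged sketch of what such a "20/80" spinlock could look like. It swaps std::atomic_flag for std::atomic<bool> so that waiters can spin on a plain load (test-and-test-and-set) in C++11. The SPIN_PAUSE mapping is an assumption to be checked against your toolchain: _mm_pause and __yield via <intrin.h> on MSVC, __builtin_ia32_pause on GCC/Clang for x86, the "yield" instruction on ARM (which older ARM cores may not accept), and std::this_thread::yield as a fallback. The names SPIN_PAUSE and demo_spinlock are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Assumed mapping of a "CPU relax" hint; verify it for your compiler and target.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  #define SPIN_PAUSE() _mm_pause()
#elif defined(_MSC_VER) && (defined(_M_ARM) || defined(_M_ARM64))
  #include <intrin.h>
  #define SPIN_PAUSE() __yield()
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  #define SPIN_PAUSE() __builtin_ia32_pause()
#elif defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__))
  #define SPIN_PAUSE() __asm__ __volatile__("yield")
#else
  #define SPIN_PAUSE() std::this_thread::yield()
#endif

class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // exchange() returns the previous value: false means we took the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Wait on a read-only load so spinning cores do not fight over the cache line.
            while (locked_.load(std::memory_order_relaxed)) SPIN_PAUSE();
        }
    }
    bool try_lock() {
        // Cheap load first, then the real attempt.
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Several threads increment a plain counter under the lock; returns the final count.
long demo_spinlock(int nthreads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    assert(lock.try_lock());  // the lock must be free once every thread is done
    lock.unlock();
    return counter;
}
```

This only addresses the cache-line contention pitfall. It does nothing about a lock holder being descheduled, which is where an OS-level fallback (or simply std::mutex) tends to do better.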

Control degree of parallelism with std::async

Is there a way to explicitly set/limit the degree of parallelism (= the number of separate threads) used by std::async and related classes?
Perusing the thread support library hasn’t turned up anything promising.
As close as I could figure out, std::async implementations (usually?) use a thread pool internally. Is there a standardised API to control this?
For background: I’m in a setting (shared cluster) where I have to manually limit the number of cores used. If I fail to do this, the load sharing scheduler throws a fit and I’m penalised. In particular, std::thread::hardware_concurrency() holds no useful information, since the number of physical cores is irrelevant for the constraints I’m under.
Here’s a relevant piece of code (which, in C++17 with parallelism TS, would probably be written using parallel std::transform):
auto read_data(std::string const&) -> std::string;
auto multi_read_data(std::vector<std::string> const& filenames, int ncores = 2) -> std::vector<std::string> {
auto futures = std::vector<std::future<std::string>>{};
// Haha, I wish.
std::thread_pool::set_max_parallelism(ncores);
for (auto const& filename : filenames) {
futures.push_back(std::async(std::launch::async, read_data, filename));
}
auto ret = std::vector<std::string>(filenames.size());
std::transform(futures.begin(), futures.end(), ret.begin(),
[](std::future<std::string>& f) {return f.get();});
return ret;
}
From a design point of view I’d have expected the std::execution::parallel_policy class (from parallelism TS) to allow specifying that (in fact, this is how I did it in the framework I designed for my master thesis). But this doesn’t seem to be the case.
Ideally I’d like a solution for C++11 but if there’s one for later versions I would still like to know about it (though I can’t use it).
No. std::async is opaque, and you have no control over its usage of threads, thread pools or anything else. As a matter of fact, you do not even have any guarantee that it would use a thread at all - it might as well execute in the same thread (potentially, note @T.C.'s comment below), and such an implementation would still be conformant.
The C++ threading library was never supposed to handle fine-tuning of OS / hardware specifics of thread management, so I am afraid that in your case you will have to code the proper support yourself, potentially using OS-provided thread control primitives.
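One way to "code for the proper support yourself" while staying in standard C++11 is to start only ncores tasks and let each one pull work items from a shared index. The following is a rough sketch under that assumption. bounded_map and fake_read are hypothetical names (fake_read stands in for the question's read_data), and the standard does not promise that each std::launch::async task maps to exactly one OS thread, although common implementations behave that way.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical helper: applies fn to every input using at most ncores worker tasks.
template <class Fn>
std::vector<std::string> bounded_map(std::vector<std::string> const& inputs, Fn fn,
                                     std::size_t ncores = 2) {
    assert(ncores >= 1);
    std::vector<std::string> results(inputs.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed input

    // Each worker claims indices until none are left; every slot is written by one worker only.
    auto worker = [&] {
        for (std::size_t i = next.fetch_add(1); i < inputs.size(); i = next.fetch_add(1))
            results[i] = fn(inputs[i]);
    };

    std::size_t const nworkers = std::min(ncores, inputs.size());
    std::vector<std::future<void>> workers;
    for (std::size_t w = 0; w < nworkers; ++w)
        workers.push_back(std::async(std::launch::async, worker));
    for (auto& w : workers) w.get();  // waits, and rethrows an exception from that worker
    return results;
}

// Stand-in for the question's read_data (assumption: the real one reads a file).
std::string fake_read(std::string const& name) { return "data:" + name; }
```

A call such as bounded_map(filenames, fake_read, ncores) keeps the results in input order. If one worker throws, the futures' destructors still wait for the remaining workers before the local state goes away.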
As other people have noted, std::async doesn't let you do this.
Yet (but see the May 2022 update at the end of the answer)
You're describing one of the simpler use-cases of Executors which are currently still bouncing around the design-space of C++ Standardisation, specifically right now in Study Group 1: Concurrency.
Since reading WG21 standards proposals can be a slog, the authors have helpfully linked to both a prototype header-only reference implementation and some example code.
It even includes a static thread pool, and an example of almost exactly what you want:
async_1.cpp
#include <experimental/thread_pool>
#include <iostream>
#include <tuple>
namespace execution = std::experimental::execution;
using std::experimental::static_thread_pool;
template <class Executor, class Function>
auto async(Executor ex, Function f)
{
return execution::require(ex, execution::twoway).twoway_execute(std::move(f));
}
int main()
{
static_thread_pool pool{1};
auto f = async(pool.executor(), []{ return 42; });
std::cout << "result is " << f.get() << "\n";
}
Thank you to @jared-hoberock for pointing me at P0688R0 as the much simpler followup to P0443R1 which I had referenced in an earlier version of this answer.
This simplification has been applied, and now there's both a paper describing the rationale (P0761R0), and a much simpler version of the standard wording in P0443R2.
As of July 2017, the only actual guess I've seen on delivery of this is: Michael Wong, editor of the Concurrency TS --- the standardisation vehicle for Executors --- feels "confident that it will make it into C++20".
I'm still getting Stack Overflow Points™ for this answer, so here's a May 2022 update:
Executors didn't land in C++20.
"A Unified Executors Proposal for C++" reached revision 14 (P0443R14) in 2020, and a new paper std::execution (P2300R5) is proposed as a follow-on; See sections 1.8 and 1.9 for the reasons for the new paper and differences from P0443.
Notably:
A specific thread pool implementation is omitted, as per LEWG direction.
The "Do in a thread-pool" example from std::execution looks like:
using namespace std::execution;
scheduler auto sch = thread_pool.scheduler();
sender auto begin = schedule(sch);
sender auto hi = then(begin, []{
std::cout << "Hello world! Have an int.";
return 13;
});
sender auto add_42 = then(hi, [](int arg) { return arg + 42; });
auto [i] = this_thread::sync_wait(add_42).value();
There's a lot to process here. And the last decade of work has pretty much abandoned "std::async and related classes", so perhaps the actual answer to this question is no longer "Yet" but "No, and there never will be. There'll be a different model where you can do that instead."
c.f. the Motivation section of P2300R5
std::async/std::future/std::promise, C++11’s intended exposure for asynchrony, is inefficient, hard to use correctly, and severely lacking in genericity, making it unusable in many contexts.
P2453R0 records the rough feelings of the LEWG participants around this current approach, and also how it interacts with the existing Networking TS, i.e. Asio, which has its own concurrency model. My reading of the polls and comments says neither is likely to land in C++23.

Is Mutex required for 1 byte shared memory

In my case, one thread reads the value and wants to decide whether it needs to be changed or not, something like the code below:
void set(bool status)
{
if(status == m_status)
return;
monitor.lock();
m_status = status;
}
Is this possible?
Using a synchronization object for boolean state is overkill.
On Windows you can use Interlocked Variable Access.
For cross platform solution .. see Boost Atomic
std::atomic from C++11 is also a solution
I think you need to clarify your question a bit. Is it possible? Yes. Is it necessary? Probably. Are there other ways to do it? Yes, as another answer has noted.
Don't forget to unlock when you're done with the things you want to change. And just a stylistic note, I find it much clearer to use your 'if' statement to encase the code block instead of return'ing out of the function. Like this:
void set(bool status)
{
if(status != m_status)
{
monitor.lock();
m_status = status;
monitor.unlock();
}
}
Just my opinion, of course.
Generally it's not possible. It will work most of the time on most platforms, but it's formally undefined and there are cases where cache coherency issues will come to haunt you.
If you can get C++11, use std::atomic<bool> from the new <atomic> header. If not, you should use the legacy compiler-specific equivalent: Windows has the Interlocked* functions, GCC has the __sync builtins. There is actually a cross-platform implementation of the important bits of the C++11 standard buried deep in the Boost.Interprocess library, but it's unfortunately not exposed to the user.
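As a small illustration of the std::atomic<bool> suggestion, here is a sketch of the question's set() without a mutex. The class name StatusFlag, the bool return value and the demo function are assumptions added for the example. exchange() writes the new value and returns the old one in a single atomic step, so the "check, then write" race disappears.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class StatusFlag {
    std::atomic<bool> m_status{false};

public:
    // Stores the new status; returns true if this call actually changed the value.
    bool set(bool status) {
        return m_status.exchange(status) != status;
    }
    bool get() const { return m_status.load(); }
};

// Several threads race to set the flag to true; returns how many of them saw a change.
int demo_status_flag(int nthreads) {
    StatusFlag flag;
    std::atomic<int> changes{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&] { if (flag.set(true)) ++changes; });
    for (auto& th : threads) th.join();
    assert(flag.get());
    return changes.load();
}
```

However many threads call set(true) at the same time, exactly one of them observes the transition from false to true, which is usually what "decide if I need to change the value" is trying to achieve.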

Safe cross platform coroutines

All coroutine implementations I've encountered use assembly or inspect the contents of jmp_buf. The problem with this is that it is inherently not cross platform.
I think the following implementation doesn't go off into undefined behavior or rely on implementation details. But I've never encountered a coroutine written like this.
Is there some inherent flaw in using long jump with threads?
Is there some hidden gotcha in this code?
#include <setjmp.h>
#include <condition_variable>
#include <mutex>
#include <thread>
class Coroutine
{
public:
Coroutine( void ) :
m_done( false ),
m_thread( [&](){ this->start(); } )
{ }
~Coroutine( void )
{
std::lock_guard<std::mutex> lock( m_mutex );
m_done = true;
m_condition.notify_one();
m_thread.join();
}
void start( void )
{
if( setjmp( m_resume ) == 0 )
{
std::unique_lock<std::mutex> lock( m_mutex );
m_condition.wait( lock, [&](){ return m_done; } );
}
else
{
routine();
longjmp( m_yield, 1 );
}
}
void resume( void )
{
if( setjmp( m_yield ) == 0 )
{
longjmp( m_resume, 1 );
}
}
void yield( void )
{
if( setjmp( m_resume ) == 0 )
{
longjmp( m_yield, 1 );
}
}
private:
virtual void routine( void ) = 0;
jmp_buf m_resume;
jmp_buf m_yield;
bool m_done;
std::mutex m_mutex;
std::condition_variable m_condition;
std::thread m_thread;
};
UPDATE 2013-05-13 These days there is Boost Coroutine (built on Boost Context, which is not implemented on all target platforms yet, but likely to be supported on all major platforms sooner rather than later).
I don't know whether stackless coroutines fit the bill for your intended use, but I suggest you have a look at them here:
Boost Asio: The Proactor Design Pattern: Concurrency Without Threads
Asio also has a coroutine 'emulation' model based on a single (IIRC) simple preprocessor macro, combined with some cunningly designed template facilities, that comes eerily close to compiler support for stackless coroutines.
The sample HTTP Server 4 is an example of the technique.
The author of Boost Asio (Kohlhoff) explains the mechanism and the sample on his Blog here: A potted guide to stackless coroutines
Be sure to look for the other posts in that series!
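To give a feel for the macro-based technique, here is a minimal sketch of a switch-based stackless coroutine. These macros are hypothetical and written only for illustration; they are not Asio's actual macros or API. The main restriction of the trick is visible in the example: anything that must survive a yield has to live in a member variable, because the function really returns each time.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper macros (not Asio's): the resume point is stored as a line number.
#define CORO_BEGIN(state) switch (state) { case 0:
#define CORO_YIELD(state, value) \
    do { state = __LINE__; return (value); case __LINE__:; } while (0)
#define CORO_END(state) } state = -1

class Counter {
    int state_ = 0;  // 0 = not started, -1 = finished, otherwise the line to resume at
    int i_ = 0;      // loop variable kept as a member so it survives across yields
    int limit_;

public:
    explicit Counter(int limit) : limit_(limit) {}
    bool done() const { return state_ == -1; }

    // Returns the next value of 0..limit-1, or -1 once the sequence is exhausted.
    int next() {
        CORO_BEGIN(state_);
        for (i_ = 0; i_ < limit_; ++i_) {
            CORO_YIELD(state_, i_);  // return i_ now, continue here on the next call
        }
        CORO_END(state_);
        return -1;
    }
};

// Drains a Counter into a vector.
std::vector<int> collect(int limit) {
    assert(limit >= 0);
    Counter c(limit);
    std::vector<int> out;
    for (int v = c.next(); !c.done(); v = c.next()) out.push_back(v);
    return out;
}
```

Everything here is plain standard C++: no assembly, no setjmp/longjmp and no second stack, which is why this style is portable. The price is that a yield can only happen directly inside the coroutine function itself, not in a function it calls.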
There is a C++ standard proposal for coroutine support - N3708 which is written by Oliver Kowalke (who is an author of Boost.Coroutine) and Goodspeed.
I suppose this would be the ultimate clean solution eventually (if it happens…)
Because the C++ compiler gives us no support for switching stacks, coroutines currently need a low-level hack (usually assembly, or setjmp/longjmp), and that is outside the abstraction range of C++. As a result, the implementations are fragile and need help from the compiler to be robust.
For example, it's really hard to set the stack size of a coroutine context, and if you overflow the stack, your program will be corrupted silently. Or crash, if you're lucky. Segmented stacks seem to help with this, but again, this needs compiler-level support.
Once it becomes standard, compiler writers will take care of it. But until that day, Boost.Coroutine would be the only practical solution in C++ for me.
In C, there's libtask written by Russ Cox (who is a member of Go team). libtask works pretty well, but doesn't seem to be maintained anymore.
P.S. If someone knows how to support the standard proposal, please let me know. I really support this proposal.
There is no generalized cross-platform way of implementing co-routines. Although some implementations can fudge co-routines using setjmp/longjmp, such practices are not standards-compliant. If routine1 uses setjmp() to create jmp_buf1, and then calls routine2() which uses setjmp() to create jmp_buf2, any longjmp() to jmp_buf1 will invalidate jmp_buf2 (if it hasn't been invalidated already).
I've done my share of co-routine implementations on a wide variety of CPUs; I've always used at least some assembly code. It often doesn't take much (e.g. four instructions for a task-switch on the 8x51) but using assembly code can help ensure that a compiler won't apply creative optimizations that would break everything.
I don't believe you can fully implement co-routines with long jump. Co-routines are natively supported in WinAPI, they are called fibers. See for example, CreateFiber(). I don't think other operating systems have native co-routine support. If you look at SystemC library, for which co-routines are central part, they are implemented in assembly for each supported platform, except Windows. GBL library also uses co-routines for event-driven simulation based on Windows fibers. It's very easy to make hard to debug errors trying to implement co-routines and event-driven design, so I suggest using existing libraries, which are already thoroughly tested and have higher level abstractions to deal with this concept.

Reader/Writer Locks in C++

I'm looking for a good reader/writer lock in C++. We have a use case of a single infrequent writer and many frequent readers and would like to optimize for this. Preferably, I would like a cross-platform solution; however, a Windows-only one would be acceptable.
Since C++ 17 (VS2015) you can use the standard:
#include <mutex>        // std::unique_lock
#include <shared_mutex> // std::shared_mutex, std::shared_lock
typedef std::shared_mutex Lock;
typedef std::unique_lock< Lock > WriteLock;
typedef std::shared_lock< Lock > ReadLock;
Lock myLock;
void ReadFunction()
{
ReadLock r_lock(myLock);
//Do reader stuff
}
void WriteFunction()
{
WriteLock w_lock(myLock);
//Do writer stuff
}
For older compiler versions and standards you can use boost to create a read-write lock:
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>
typedef boost::shared_mutex Lock;
typedef boost::unique_lock< Lock > WriteLock;
typedef boost::shared_lock< Lock > ReadLock;
Newer versions of boost::thread have read/write locks (1.35.0 and later, apparently the previous versions did not work correctly).
They have the names shared_lock, unique_lock, and upgrade_lock and operate on a shared_mutex.
Using standard pre-tested, pre-built stuff is always good (for example, Boost as another answer suggested), but this is something that's not too hard to build yourself. Here's a dumb little implementation pulled out from a project of mine:
#include <pthread.h>
struct rwlock {
pthread_mutex_t lock;
pthread_cond_t read, write;
unsigned readers, writers, read_waiters, write_waiters;
};
void reader_lock(struct rwlock *self) {
pthread_mutex_lock(&self->lock);
if (self->writers || self->write_waiters) {
self->read_waiters++;
do pthread_cond_wait(&self->read, &self->lock);
while (self->writers || self->write_waiters);
self->read_waiters--;
}
self->readers++;
pthread_mutex_unlock(&self->lock);
}
void reader_unlock(struct rwlock *self) {
pthread_mutex_lock(&self->lock);
self->readers--;
if (self->write_waiters)
pthread_cond_signal(&self->write);
pthread_mutex_unlock(&self->lock);
}
void writer_lock(struct rwlock *self) {
pthread_mutex_lock(&self->lock);
if (self->readers || self->writers) {
self->write_waiters++;
do pthread_cond_wait(&self->write, &self->lock);
while (self->readers || self->writers);
self->write_waiters--;
}
self->writers = 1;
pthread_mutex_unlock(&self->lock);
}
void writer_unlock(struct rwlock *self) {
pthread_mutex_lock(&self->lock);
self->writers = 0;
if (self->write_waiters)
pthread_cond_signal(&self->write);
else if (self->read_waiters)
pthread_cond_broadcast(&self->read);
pthread_mutex_unlock(&self->lock);
}
void rwlock_init(struct rwlock *self) {
self->readers = self->writers = self->read_waiters = self->write_waiters = 0;
pthread_mutex_init(&self->lock, NULL);
pthread_cond_init(&self->read, NULL);
pthread_cond_init(&self->write, NULL);
}
pthreads is not really Windows-native, but the general idea is here. This implementation is slightly biased towards writers (a horde of writers can starve readers indefinitely); just modify writer_unlock if you'd rather the balance be the other way around.
Yes, this is C and not C++. Translation is an exercise left to the reader.
Edit
Greg Rogers pointed out that the POSIX standard does specify pthread_rwlock_*. This doesn't help if you don't have pthreads, but it stirred my mind into remembering: Pthreads-w32 should work! Instead of porting this code to non-pthreads for your own use, just use Pthreads-w32 on Windows, and native pthreads everywhere else.
Whatever you decide to use, benchmark your workload against simple locks, as read/write locks tend to be 3-40x slower than a simple mutex when there is no contention.
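A quick way to get a first number for the uncontended case is a tiny timing loop like the sketch below (C++17, since it uses std::shared_mutex). The names time_uncontended, BenchResult and run_bench are hypothetical. The timings depend heavily on platform, compiler and optimisation level, and a single-threaded loop says nothing about behaviour under contention, so treat it as a starting point rather than a benchmark of your real workload.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <shared_mutex>

// Times `iterations` uncontended lock/unlock pairs on one thread; returns nanoseconds.
template <class Mutex, class Lock>
long long time_uncontended(int iterations, long long& counter) {
    assert(iterations >= 0);
    Mutex m;
    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        Lock guard(m);  // acquired here, released at the end of the loop body
        ++counter;
    }
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

struct BenchResult {
    long long mutex_ns;   // std::mutex with std::lock_guard
    long long shared_ns;  // std::shared_mutex with std::shared_lock (reader side)
    long long counter;    // total number of guarded increments performed
};

BenchResult run_bench(int iterations) {
    long long counter = 0;
    BenchResult r{};
    r.mutex_ns = time_uncontended<std::mutex, std::lock_guard<std::mutex>>(iterations, counter);
    r.shared_ns = time_uncontended<std::shared_mutex, std::shared_lock<std::shared_mutex>>(
        iterations, counter);
    r.counter = counter;
    return r;
}
```

Comparing mutex_ns and shared_ns for a few million iterations gives a rough idea of the fixed cost of the reader path on your machine. The real decision should come from measuring the actual read/write mix.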
Here is some reference
C++17 supports std::shared_mutex. It is supported in MSVC++ 2015 and 2017.
Edit: The MSDN Magazine link isn't available anymore. The CodeProject article is now available on https://www.codeproject.com/Articles/32685/Testing-reader-writer-locks and sums it up pretty nicely. Also I found a new MSDN link about Compound Synchronisation Objects.
There is an article about reader-writer locks on MSDN that presents some implementations of them. It also introduces the Slim reader/writer lock, a kernel synchronisation primitive introduced with Vista. There's also a CodeProject article about comparing different implementations (including the MSDN article's ones).
Intel Thread Building Blocks also provide a couple of rw_lock variants:
http://www.threadingbuildingblocks.org/
They have a spin_rw_mutex for very short periods of contention and a queueing_rw_mutex for longer periods of contention. The former can be used in particularly performance sensitive code. The latter is more comparable in performance to that provided by Boost.Thread or directly using pthreads. But profile to make sure which one is a win for your access patterns.
Since release 1.35.0, Boost.Thread has supported reader-writer locks. The good thing about this is that the implementation is highly cross-platform, peer-reviewed, and is actually a reference implementation for the upcoming C++0x standard.
I can recommend the ACE library, which provides a multitude of locking mechanisms and is ported to various platforms.
Depending on the boundary conditions of your problem, you may find the following classes useful:
ACE_RW_Process_Mutex
ACE_Write_Guard and ACE_Read_Guard
ACE_Condition
http://www.codeproject.com/KB/threads/ReaderWriterLock.aspx
Here is a good and lightweight implementation suitable for most tasks.
Multiple-Reader, Single-Writer Synchronization Lock Class for Win32 by Glenn Slayden
http://www.glennslayden.com/code/win32/reader-writer-lock
#include <mutex>        // std::unique_lock
#include <shared_mutex> // std::shared_mutex, std::shared_lock
class Foo {
public:
void Write() {
std::unique_lock lock{mutex_};
// ...
}
void Read() {
std::shared_lock lock{mutex_};
// ...
}
private:
std::shared_mutex mutex_;
};
You could copy Sun's excellent ReentrantReadWriteLock. It includes features such as optional fairness, lock downgrading, and of course reentrancy.
Yes it's in Java, but you can easily read and transpose it to C++, even if you don't know any Java. The documentation I linked to contains all the behavioral properties of this implementation so you can make sure it does what you want.
If nothing else, it's a guide.