Deleting data in atomic / fenced data - C++

Assume we use the standard consumer/producer pattern in our C++11 program: (from: http://en.cppreference.com/w/cpp/atomic/memory_order)
#include <thread>
#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;
int data;

void producer()
{
    std::string* p = new std::string("Hello");
    ptr.store(p, std::memory_order_release);
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_consume)))
        ;
    assert(*p2 == "Hello"); // never fires: *p2 carries dependency from ptr
    // yea well, it actually uses p2 for quite a while probably....
}

int main()
{
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join(); t2.join();
}
Now, I would like to change the behavior of the producer code just a bit. Instead of simply setting a string, I'd like it to overwrite a string. E.g.:
void producer()
{
    std::string* p = new std::string("Hello");
    ptr.store(p, std::memory_order_release);
    // do some stuff
    std::string* p2 = new std::string("Sorry, should have been Hello World");
    ptr.store(p2, std::memory_order_release);
    // **
}
The producer here is responsible for the generation of the strings, which means that in my simple world it should also be responsible for the destruction of these strings.
In the line marked with '**' we should therefore destroy string 'p', which is what this question is about.
The solution you might consider would be to add (at the marked line):
delete p;
However, this will break the program, since the consumer might be using the string after we've deleted it -- after all, the consumer holds a raw pointer to it. Also, this implies that the producer waits for the consumer, which isn't necessary - we just want our old memory to be cleaned up. Using a ref-counting smart pointer seems to be out of the question, since std::atomic supports only trivially copyable types.
What's the best (most efficient) way to fix this?

You can do an atomic exchange, which will return the previous value of the atomic variable.
The variable ptr then has two states: It can either have no data available, in which case it is equal to nullptr, or have data available for consumption.
To consume data any of the consumers may exchange ptr with a nullptr.
If there was no data, there still will not be any data and the consumer will have to try again later (this effectively builds a spinlock).
If there was data, the consumer now takes ownership and becomes responsible for deleting it when it is no longer needed.
To produce data, a producer exchanges ptr with a pointer to the produced data.
If there was no data, the previous pointer will be equal to nullptr and data was successfully produced.
If there was data, the producer effectively takes back ownership of the previously produced data. It can then either delete the object or - more efficiently - simply reuse it for its next production.
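A minimal sketch of that exchange-based hand-off, building on the question's example (the single-slot design and the acq_rel orderings are my reading of this answer, not code from it):

#include <atomic>
#include <string>

std::atomic<std::string*> slot{nullptr}; // nullptr means "no data available"

void produce(std::string* p)
{
    // Publish p. If the previous item was never consumed, we get it back
    // and own it again, so deleting (or reusing) it is safe: a consumer
    // can only obtain a pointer by swapping the slot to nullptr first.
    std::string* old = slot.exchange(p, std::memory_order_acq_rel);
    delete old; // deleting nullptr is a no-op
}

std::string* consume()
{
    // Take ownership by swapping in nullptr; returns nullptr if no data yet
    // (the caller spins/retries, and later deletes what it claimed).
    return slot.exchange(nullptr, std::memory_order_acq_rel);
}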

Related

What exactly is the Synchronizes-With relationship?

I've been reading Jeff Preshing's post about The Synchronizes-With Relation, and also the "Release-Acquire Ordering" section of the std::memory_order page on cppreference, and I don't really understand:
It seems the standard makes some kind of promise here, and I don't understand why it's necessary. Let's take the example from the CPP reference:
#include <thread>
#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;
int data;

void producer()
{
    std::string* p = new std::string("Hello");
    data = 42;
    ptr.store(p, std::memory_order_release);
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_acquire)))
        ;
    assert(*p2 == "Hello"); // never fires
    assert(data == 42); // never fires
}

int main()
{
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join(); t2.join();
}
The reference explains that:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
As far as I understand, when we write
ptr.store(p, std::memory_order_release)
what we're actually doing is telling both the compiler and the CPU: make sure there is no way that the new value of ptr becomes visible to thread t2 before the writes to data and to the memory pointed to by std::string* p do.
And likewise, when we write
ptr.load(std::memory_order_acquire)
we are telling the compiler and the CPU: make sure the load of ptr happens no later than the loads of *p2 and data.
So I don't understand what further promise we have here?
This ptr.store(p, std::memory_order_release) (L1) guarantees that anything done prior to this line in this particular thread (T1) will be visible to other threads, as long as those other threads are reading ptr in a correct fashion (in this case, using std::memory_order_acquire). This guarantee works only with this pair; on its own, this line guarantees nothing.
Now, the ptr.load(std::memory_order_acquire) (L2) on the other thread (T2), working with its pair from the other thread, guarantees that as long as it read the value written in T1, you can see the other values written prior to that store (in your case, data). So because L1 synchronizes with L2, data = 42; happens before assert(data == 42).
Also, there is a guarantee that ptr is written and read atomically, because, well, it is atomic. No other guarantees or promises are in that code.
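For contrast, a sketch of what is lost without the pairing (my illustration, not part of the original answer): if the consumer loaded ptr with a relaxed ordering, there would be no synchronizes-with edge, and on a weakly ordered machine both asserts could fire:

std::string* p2;
while (!(p2 = ptr.load(std::memory_order_relaxed)))
    ;
// No synchronizes-with edge with the release store in producer():
assert(*p2 == "Hello"); // may fire
assert(data == 42);     // may fire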

std::shared_ptr::unique(), copying and thread safety

I have a shared_ptr stored in a central place, which can be accessed by multiple threads through a method getPointer(). I want to make sure that only one thread uses the pointer at a time. Thus, whenever a thread wants to get the pointer, I test whether the central copy is the only one via the std::shared_ptr::unique() method. If it returns true, I return the copy, assuming that unique() == false as long as that thread works on the copy. Other threads trying to access the pointer at the same time receive a nullptr and have to try again in the future.
Now my question:
Is it theoretically possible that two different threads calling getPointer() can get mutual access to the pointer despite the mutex guard and the testing via unique() ?
std::shared_ptr<int> myPointer; // my pointer is initialized somewhere else, but before the first call to getPointer()
std::mutex myMutex;

std::shared_ptr<int> getPointer()
{
    std::lock_guard<std::mutex> guard(myMutex);
    std::shared_ptr<int> returnValue;
    if (myPointer.unique())
        returnValue = myPointer;
    else
        returnValue = nullptr;
    return returnValue;
}
Regards
Only one "active" copy can exist at a time.
It is protected by the mutex until after a second shared_ptr is created at which point a subsequent call (once it gets the mutex after the first call has exited) will fail the unique test until the initial caller's returned shared_ptr is destroyed.
As noted in the comments, unique is going away in c++20, but you can test use_count == 1 instead, as that is what unique does.
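A drop-in sketch for the test inside getPointer():

if (myPointer.use_count() == 1) // equivalent to the old unique()
    returnValue = myPointer;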
Your solution seems overly complicated. It exploits the internal workings of the shared pointer to deduce a flag value. Why not just make the flag explicit?
std::shared_ptr<int> myPointer;
std::mutex myMutex;
bool myPointerIsInUse = false;

bool GetPermissionToUseMyPointer() {
    std::lock_guard<std::mutex> guard(myMutex);
    bool r = !myPointerIsInUse;
    myPointerIsInUse = true; // no-op if it was already in use
    return r;
}

void RelinquishPermissionToUseMyPointer() {
    std::lock_guard<std::mutex> guard(myMutex);
    myPointerIsInUse = false;
}
P.S., If you wrap that in a class with a few extra bells and whistles, it'll start to look a lot like a semaphore.

Is this inter-thread object sharing strategy sound?

I'm trying to come up with a fast way of solving the following problem:
I have a thread which produces data, and several threads which consume it. I don't need to queue produced data, because data is produced much more slowly than it is consumed (and even if this failed to be the case occasionally, it wouldn't be a problem if a data point were skipped occasionally). So, basically, I have an object that encapsulates the "most recent state", which only the producer thread is allowed to update.
My strategy is as follows (please let me know if I'm completely off my rocker):
I've created three classes for this example: Thing (the actual state object), SharedObject<Thing> (an object that can be local to each thread, and gives that thread access to the underlying Thing), and SharedObjectManager<Thing>, which wraps up a shared_ptr along with a mutex.
The instance of the SharedObjectManager (SOM) is a global variable.
When the producer starts, it instantiates a Thing and tells the global SOM about it. It then makes a copy, and does all of its updating work on that copy. When it is ready to commit its changes to the Thing, it passes the new Thing to the global SOM, which locks its mutex, updates the shared pointer it keeps, and then releases the lock.
Meanwhile, the consumer threads all instantiate SharedObject<Thing>. These objects each keep a pointer to the global SOM, as well as a cached copy of the shared_ptr kept by the SOM... Each keeps this cached copy until update() is explicitly called.
I believe this is getting hard to follow, so here's some code:
#include <mutex>
#include <iostream>
#include <memory>

class Thing
{
private:
    int _some_member = 10;
public:
    int some_member() const { return _some_member; }
    void some_member(int val) { _some_member = val; }
};

// one global instance
template<typename T>
class SharedObjectManager
{
private:
    std::shared_ptr<T> objPtr;
    std::mutex objLock;
public:
    std::shared_ptr<T> get_sptr()
    {
        std::lock_guard<std::mutex> lck(objLock);
        return objPtr;
    }
    void commit_new_object(std::shared_ptr<T> new_object)
    {
        std::lock_guard<std::mutex> lck(objLock);
        objPtr = new_object;
    }
};

// one instance per consumer thread.
template<typename T>
class SharedObject
{
private:
    SharedObjectManager<T>* som;
    std::shared_ptr<T> cache;
public:
    SharedObject(SharedObjectManager<T>* backend) : som(backend)
    { update(); }
    void update()
    {
        cache = som->get_sptr();
    }
    T& operator*()
    {
        return *cache;
    }
    T* operator->()
    {
        return cache.get();
    }
};

// no actual threads in this test, just a quick sanity check.
SharedObjectManager<Thing> glbSOM;

int main(void)
{
    glbSOM.commit_new_object(std::make_shared<Thing>());
    SharedObject<Thing> myobj(&glbSOM);
    std::cout << myobj->some_member() << std::endl;
    // prints "10".
}
The idea for use by the producer thread is:
// initialization - on startup
auto firstStateObj = std::make_shared<Thing>();
glbSOM.commit_new_object(firstStateObj);
// main loop
while (1)
{
// invoke copy constructor to copy the current live Thing object
auto nextState = std::make_shared<Thing>(*(glbSOM.get_sptr()));
// do stuff to nextState, gradually filling out it's new value
// based on incoming data from other sources, etc.
...
// commit the changes to the shared memory location
glbSOM.commit_new_object(nextState);
}
The use by consumers would be:
SharedObject<Thing> thing(&glbSOM);
while (1)
{
    // think about the data contained in thing, and act accordingly...
    doStuffWith(thing->some_member());
    // re-cache the thing
    thing.update();
}
Thanks!
That is way overengineered. Instead, I'd suggest the following:
Create a pointer Thing* theThing together with a protecting mutex. Either a global one, or shared by some other means. Initialize it to nullptr.
In your producer: use two local objects of Thing type - Thing thingOne and Thing thingTwo. Start by populating thingOne. When done, lock the mutex, copy thingOne's address to theThing, unlock the mutex. Start populating thingTwo. When done, see above. Repeat until killed.
In every listener: make sure the pointer is not nullptr. Lock the mutex. Make a copy of the object pointed to by theThing. Unlock the mutex. Work with your copy. Repeat until killed.
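A minimal sketch of that scheme, under my reading of the answer (the names and the buffer-alternation details are illustrative):

#include <mutex>

struct Thing { int some_member = 10; };

Thing* theThing = nullptr; // protected by theMutex
std::mutex theMutex;

void producerLoop()
{
    Thing thingOne, thingTwo;
    bool useOne = true;
    for (;;) {
        Thing& scratch = useOne ? thingOne : thingTwo; // the buffer not currently published
        // ... populate scratch with fresh data (no lock needed: it is unpublished) ...
        {
            std::lock_guard<std::mutex> lock(theMutex);
            theThing = &scratch; // publish
        }
        useOne = !useOne; // next iteration fills the other buffer
    }
}

void listenerLoop()
{
    for (;;) {
        Thing copy;
        {
            std::lock_guard<std::mutex> lock(theMutex);
            if (!theThing) continue; // nothing published yet
            copy = *theThing; // copy under the lock
        }
        // ... work with copy, no lock held ...
    }
}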

Non-blocking way of adding a work item to an array or list

Edit:
I have now finished my queue (overcoming the problem described below, and more). For those interested, it can be found here. I'd be happy to hear any remarks :). Please note the queue isn't just a work-item queue, but rather a template container, which of course could be instantiated with work items.
Original:
After watching Herb Sutter's talk on concurrency in C++11 and 14, I got all excited about non-blocking concurrency.
However, I've not yet been able to find a solution for what I consider a basic problem. So if this is already on here, please be gentle with me.
My problem is quite simple. I'm creating a very simple thread pool. In order to do this, I've got some worker threads running inside the workPool class, and I keep a list of work items.
How do I add a work item in a lock-free way?
The non-lock-free way of doing this would of course be to create a mutex: lock it when adding an item, and read (and lock, of course) the list once the current work item is done.
I don't know how to do this in a lock-free way, however.
Below is a rough idea of what I'm creating. I've written this code for this question, and it's neither complete nor error-free :)
#include <thread>
#include <deque>
#include <vector>
#include <functional>
#include <atomic>

class workPool
{
public:
    workPool(int workerCount) :
        running(true)
    {
        for (int i = workerCount; i > 0; --i)
            workers.push_back(std::thread(&workPool::doWork, this));
    }
    ~workPool()
    {
        running = false;
    }
private:
    std::atomic<bool> running; // written by the destructor, read by the workers
    std::vector< std::thread > workers;
    std::deque< std::function<void()> > workItems;

    void doWork()
    {
        while (running)
        {
            (*workItems.begin())();
            workItems.erase(workItems.begin());
            if (workItems.empty())
            {
                // here the thread should be paused till a new item is added
            }
        }
    }
    void addWorkitem()
    {
        // This is my confusion. How should I do this?
    }
};
I have seen Herb's talks recently and I believe his lock-free linked list should do fine. The only problem is that atomic< shared_ptr<T> > is not yet implemented. I've used the atomic_* function calls as also explained by Herb in his talk.
In the example, I've simplified a task to an int, but it could be anything you want.
The function atomic_compare_exchange_weak takes three arguments: the item to compare, the expected value and the desired value. It returns true or false to indicate success or failure. On failure, the expected value will be changed to the value that was found instead.
#include <memory>
#include <atomic>

using std::shared_ptr;

// Untested code.
struct WorkItem { // Simple linked list implementation.
    int work;
    shared_ptr<WorkItem> next; // remember to use as atomic
};

class WorkList {
    shared_ptr<WorkItem> head; // remember to use as atomic
public:
    // Used by producers to add work to the list. This implementation adds
    // new items to the front (stack), but it can easily be changed to a queue.
    void push_work(int work) {
        shared_ptr<WorkItem> p(new WorkItem()); // The new item we want to add.
        p->work = work;
        p->next = std::atomic_load(&head); // read the current head atomically
        // Do we get to change head to p?
        while (!std::atomic_compare_exchange_weak(&head, &p->next, p)) {
            // Nope, someone got there first, try again with the new p->next,
            // and remember: p->next is automatically changed to the new value of head.
        }
        // Yup, great! Everything's done then.
    }

    // Used by consumers to claim items to process.
    int pop_work() {
        auto p = std::atomic_load(&head); // The item we want to process.
        int work = (p ? p->work : -1);
        // Do we get to change head to p->next?
        while (p && !std::atomic_compare_exchange_weak(&head, &p, p->next)) {
            // Nope, someone got there first, try again with the new p,
            // and remember: p is automatically changed to the new value of head.
            work = (p ? p->work : -1); // Make sure to update work as well!
        }
        // Yup, great! Everything's done then, return the new task.
        return work; // Returns -1 if list is empty.
    }
};
Edit: The reason for using shared_ptr in combination with atomic_* functions is explained in the talk. In a nutshell: popping an item from the linked list might delete it from underneath someone traversing the list, or a different node might get allocated on the same memory address (The ABA Problem). Using shared_ptr will ensure any old readers will hold a valid reference to the original item.
As Herb explained, this makes the pop-function trivial to implement.
Lock-free, in this kind of context where you have a shared resource (a work queue), often boils down to atomics and a CAS loop if you really dig deep.
The basic idea behind a lock-free concurrent stack is rather simple (edit: though perhaps deceptively tricky, as I made a goof in my first post -- all the more reason to appreciate a good lib). I chose a stack for simplicity, but it doesn't take much more to use a queue instead.
Writing to the stack:
  Create a new work item.
  Loop repeatedly:
    Store the top pointer of the stack.
    Set the work item's next pointer to the top of the stack.
    Atomic: compare-and-swap the top pointer with the pointer to the work item.
    If this succeeds and returns the top pointer we stored, break out of the loop.

Popping from the stack:
  Loop:
    Fetch the top pointer.
    If the top pointer is not null:
      Atomic: CAS the top pointer with its next pointer.
      If successful, break.
    Else:
      (Optional) Sleep/yield to avoid burning cycles.
  Process the item pointed to by the previous top pointer.
Now, if you get really elaborate, you can stick in other work for the thread to do when a push or pop fails. A sketch of the steps above follows.
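Below is my C++11 illustration using a raw std::atomic<Node*> (not code from this answer). Note it deliberately sidesteps reclamation: popped nodes are leaked, because freeing them safely runs into exactly the ABA/lifetime problem the shared_ptr-based answer above addresses:

#include <atomic>

struct Node {
    int work;
    Node* next;
};

std::atomic<Node*> top{nullptr};

void push(int work)
{
    Node* n = new Node{work, top.load(std::memory_order_relaxed)};
    // CAS loop: on failure, n->next is updated to the current top, so just retry.
    while (!top.compare_exchange_weak(n->next, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed))
    { /* retry */ }
}

bool pop(int& work)
{
    Node* t = top.load(std::memory_order_acquire);
    // CAS loop: on failure, t is updated to the current top, so just retry.
    while (t && !top.compare_exchange_weak(t, t->next,
                                           std::memory_order_acquire,
                                           std::memory_order_acquire))
    { /* retry */ }
    if (!t)
        return false; // stack was empty
    work = t->work;
    // NOTE: t is leaked here; deleting it safely needs hazard pointers,
    // shared_ptr, or similar (see the ABA discussion above).
    return true;
}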
I do not know how to do this in C++11 (or later); however, here is a solution using C++98 and Boost (v1.50):
This is obviously not a very useful example; it's only for demonstration purposes:
#include <boost/scoped_ptr.hpp>
#include <boost/function.hpp>
#include <boost/bind.hpp>
#include <boost/asio/io_service.hpp>
#include <boost/thread.hpp>

class WorkHandler
{
public:
    WorkHandler();
    ~WorkHandler();
    typedef boost::function<void(void)> Work; // the type of work we can handle
    void AddWork(Work w) { pThreadProcessing->post(w); }
private:
    void ProcessWork();
    boost::scoped_ptr<boost::asio::io_service> pThreadProcessing;
    bool runThread; // Make sure this is atomic; declared before 'thread' so it is initialized first
    boost::thread thread;
};

WorkHandler::WorkHandler()
    : pThreadProcessing(new boost::asio::io_service), // create our io service
      runThread(true), // run the thread
      thread(&WorkHandler::ProcessWork, this) // create our thread last, once its flag is ready
{
}

WorkHandler::~WorkHandler()
{
    runThread = false; // stop running the thread
    thread.join(); // wait for the thread to finish
}

void WorkHandler::ProcessWork()
{
    while (runThread) // while the thread is running
    {
        pThreadProcessing->run(); // process work
        pThreadProcessing->reset(); // prepare for more work
    }
}

int CalculateSomething(int a, int b)
{
    return a + b;
}

int main()
{
    WorkHandler wh; // create a work handler
    // give it some work to do
    wh.AddWork(boost::bind(&CalculateSomething, 4, 5));
    wh.AddWork(boost::bind(&CalculateSomething, 10, 100));
    wh.AddWork(boost::bind(&CalculateSomething, 35, -1));
    // ONLY for demonstration! This just gives the thread a chance to work before we destroy it.
    boost::this_thread::sleep(boost::posix_time::seconds(2));
    return 0;
}
boost::asio::io_service is thread-safe, so you can post work to it without needing mutexes.
NB: Although I haven't made the bool runThread atomic, for thread-safety it should be (I just don't have atomics in my C++).
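With C++11, or with Boost.Atomic (added in Boost 1.53), the fix would simply be to change the declaration, e.g.:

std::atomic<bool> runThread; // or boost::atomic<bool>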

Read-write thread-safe smart pointer in C++, x86-64

I'm developing a lock-free data structure, and the following problem arises.
I have a writer thread that creates objects on the heap and wraps them in a smart pointer with a reference counter. I also have a lot of reader threads that work with these objects. The code can look like this:
SmartPtr ptr;

class Reader : public Thread {
    virtual void Run() {
        for (;;) {
            SmartPtr local(ptr);
            // do smth
        }
    }
};

class Writer : public Thread {
    virtual void Run() {
        for (;;) {
            SmartPtr newPtr(new Object);
            ptr = newPtr;
        }
    }
};

int main() {
    Pool* pool = SystemThreadPool();
    pool->Run(new Reader());
    pool->Run(new Writer());
    for (;;) // wait for crash :(
        ;
}
When I create a thread-local copy of ptr, it means at least:
1. Reading an address.
2. Incrementing the reference counter.
I can't do these two operations atomically, and thus sometimes my readers work with a deleted object.
The question is: what kind of smart pointer should I use to make read-write access from several threads, with correct memory management, possible? A solution should exist, since Java programmers don't even care about such a problem, simply relying on the fact that all objects are references and are deleted only when nobody uses them.
For PowerPC I found http://drdobbs.com/184401888, which looks nice, but it uses the Load-Linked and Store-Conditional instructions, which we don't have on x86.
As far as I understand, the boost pointers provide such functionality only by using locks. I need a lock-free solution.
boost::shared_ptr has atomic_store, which uses a "lock-free" spinlock that should be fast enough for 99% of possible cases.
boost::shared_ptr<Object> ptr;

class Reader : public Thread {
    virtual void Run() {
        for (;;) {
            boost::shared_ptr<Object> local(boost::atomic_load(&ptr));
            // do smth
        }
    }
};

class Writer : public Thread {
    virtual void Run() {
        for (;;) {
            boost::shared_ptr<Object> newPtr(new Object);
            boost::atomic_store(&ptr, newPtr);
        }
    }
};

int main() {
    Pool* pool = SystemThreadPool();
    pool->Run(new Reader());
    pool->Run(new Writer());
    for (;;)
        ;
}
EDIT:
In response to comment below, the implementation is in "boost/shared_ptr.hpp"...
template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r )
{
    boost::detail::spinlock_pool<2>::scoped_lock lock( p );
    p->swap( r );
}

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r )
{
    boost::detail::spinlock & sp = boost::detail::spinlock_pool<2>::spinlock_for( p );
    sp.lock();
    p->swap( r );
    sp.unlock();
    return r; // return std::move( r )
}
With some jiggery-pokery you should be able to accomplish this using InterlockedCompareExchange128. Store the reference count and the pointer in a two-element __int64 array. If the reference count is in array[0] and the pointer in array[1], the atomic update would look like this:
while (true)
{
    __int64 comparand[2];
    comparand[0] = refCount;
    comparand[1] = pointer;
    if (1 == InterlockedCompareExchange128(
            array,
            pointer,      // ExchangeHigh goes to array[1]
            refCount + 1, // ExchangeLow goes to array[0]
            comparand))
    {
        // Pointer is ready for use. Exit the while loop.
        break;
    }
}
If an InterlockedCompareExchange128 intrinsic function isn't available for your compiler then you may use the underlying CMPXCHG16B instruction instead, if you don't mind mucking around in assembly language.
The solution proposed by RobH doesn't work. It has the same problem as the original question: when accessing the reference-count object, it might already have been deleted.
The only way I see of solving the problem without a global lock (as in boost::atomic_store) or conditional read/write instructions is to somehow delay the destruction of the object (or of the shared reference-count object, if such a thing is used). So zennehoy has a good idea, but his method is too unsafe.
The way I might do it is by keeping copies of all the pointers in the writer thread, so that the writer can control the destruction of the objects:
class Writer : public Thread {
    virtual void Run() {
        list<SmartPtr> ptrs; // list that holds all the old ptr values
        for (;;) {
            SmartPtr newPtr(new Object);
            if (ptr)
                ptrs.push_back(ptr); // push previous pointer into the list
            ptr = newPtr;
            // Periodically go through the list and destroy objects that are not
            // referenced by other threads
            for (auto it = ptrs.begin(); it != ptrs.end(); )
                if (it->refCount() == 1)
                    it = ptrs.erase(it);
                else
                    ++it;
        }
    }
};
However, there are still requirements for the smart pointer class. This doesn't work with shared_ptr, as the reads and writes are not atomic. It almost works with boost::intrusive_ptr. Assignment on intrusive_ptr is implemented like this (pseudocode):
// create temporary from rhs
tmp.ptr = rhs.ptr;
if (tmp.ptr)
    intrusive_ptr_add_ref(tmp.ptr);
// swap(tmp, lhs)
T* x = lhs.ptr;
lhs.ptr = tmp.ptr;
tmp.ptr = x;
// destroy temporary
if (tmp.ptr)
    intrusive_ptr_release(tmp.ptr);
As far as I understand, the only thing missing here is a compiler-level memory fence before lhs.ptr = tmp.ptr;. With that added, both reading rhs and writing lhs would be thread-safe under strict conditions: 1) x86 or x64 architecture, 2) atomic reference counting, 3) the rhs refcount must not go to zero during the assignment (guaranteed by the Writer code above), and 4) only one thread writing to lhs (using CAS you could have several writers).
Anyway, you could create your own smart pointer class based on intrusive_ptr with the necessary changes. Definitely easier than re-implementing shared_ptr. And besides, if you want performance, intrusive is the way to go.
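A sketch of what that modified assignment might look like, under this answer's assumptions (single writer, atomic refcount, aligned pointer stores on x86/x64); the helper name is hypothetical, and std::atomic_signal_fence is the C++11 spelling of a compiler-level fence:

#include <atomic>

template<class T>
void assign(T*& lhs, T* rhs) // hypothetical helper, not a boost API
{
    T* tmp = rhs;
    if (tmp)
        intrusive_ptr_add_ref(tmp); // bump the count before publishing
    std::atomic_signal_fence(std::memory_order_seq_cst); // compiler-level fence
    T* old = lhs;
    lhs = tmp; // a single aligned pointer store, atomic on x86/x64
    if (old)
        intrusive_ptr_release(old); // drop the reference to the old object
}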
The reason this works much more easily in Java is garbage collection. In C++, you have to manually ensure that a value is not just starting to be used by a different thread when you want to delete it.
A solution I've used in a similar situation is to simply delay the deletion of the value. I create a separate thread that iterates through a list of things to be deleted. When I want to delete something, I add it to this list with a timestamp. The deleting thread waits until some fixed time after this timestamp before actually deleting the value. You just have to make sure that the delay is large enough to guarantee that any temporary use of the value has completed.
100 milliseconds would have been enough in my case; I chose a few seconds to be safe.
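A minimal sketch of that delayed-deletion list (my C++11 rendering of the idea; the names are illustrative, and the delay must exceed the longest possible reader use):

#include <chrono>
#include <deque>
#include <mutex>
#include <thread>

struct Object { int value; }; // stand-in for the real type

struct Doomed {
    Object* obj;
    std::chrono::steady_clock::time_point when; // timestamp at enqueue
};

std::mutex doomedMutex;
std::deque<Doomed> doomedList;

void scheduleDelete(Object* obj)
{
    std::lock_guard<std::mutex> lock(doomedMutex);
    doomedList.push_back(Doomed{obj, std::chrono::steady_clock::now()});
}

void deleterThread()
{
    const auto delay = std::chrono::seconds(2); // "a few seconds to be safe"
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::lock_guard<std::mutex> lock(doomedMutex);
        const auto now = std::chrono::steady_clock::now();
        while (!doomedList.empty() && now - doomedList.front().when > delay) {
            delete doomedList.front().obj; // no reader can still be using it
            doomedList.pop_front();
        }
    }
}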