LRU with shared_ptr in multi-threaded environment - c++

I ran into a corner case while using a shared_ptr-based "database" that doubles as an LRU.
Since C++17, shared_ptr::use_count() is only approximate in multithreaded code, so I have trouble deciding which elements can safely be removed from the LRU.
I cannot remove still-in-use elements, as that would break contracts in the rest of the code.
As far as I understand it, locking a mutex is not enough, since only a mutex unlock will force a memory barrier; I could still read an outdated value even while holding the lock.
I could of course slap a memory barrier after I lock the mutex inside the LRU, but I'm a bit worried about the performance impact.
Here is an outline of how the LRU works:

template <typename K, typename V>
class DB {
    template <typename... Args>
    std::shared_ptr<V> emplace(const K& key, Args&&... args) {
        std::lock_guard<std::mutex> guard(mutex_);
        remove_elements_if_needed();
        insert_if_new(key, std::forward<Args>(args)...);
        refresh_lru(key);
        return ptr; // the entry's shared_ptr
    }
};

The general solution I'd propose: while under the lock, identify a candidate element to remove, then attempt to remove it by doing this:
1) Create a weak_ptr to an element you think might be a good candidate to remove.
2) Delete the item from the list as you normally would.
3) Try to restore the item as a shared_ptr from the weak_ptr.
4) If the shared_ptr obtained from the weak_ptr is null, then you are done: nobody else held a reference.
5) If the new shared_ptr is not null, then you know someone still has a reference to the item. Put the item back into your data structure exactly as you found it.
Something like this. Since I don't have your code, I'm winging it with respect to your implementation, but I'm guessing you have a collection of "nodes", where each node has a member that is the shared_ptr instance you hand back to callers.
bool remove_unused_element() {
    for (auto it = lru.begin(); it != lru.end(); ++it) {
        std::weak_ptr<X> wp = it->item;
        it->item.reset(); // drop our reference; the weak_ptr keeps the control block alive
        std::shared_ptr<X> sp = wp.lock();
        if (sp != nullptr) {
            it->item = sp; // restore this node, someone is still using it
        } else {
            lru.erase(it); // nobody else held a reference, safe to remove
            return true;
        }
    }
    return false;
}
void remove_elements_if_needed() {
    bool removed = true;
    while ((lru.size() > max_lru_size) && removed) {
        removed = remove_unused_element();
    }
}
All of the above code assumes you have acquired the mutex as you show in your pseudo code.

Related

std::shared_ptr::unique(), copying and thread safety

I have a shared_ptr stored in a central place, which can be accessed by multiple threads through a method getPointer(). I want to make sure that only one thread uses the pointer at a time. Thus, whenever a thread wants the pointer, I test whether the central copy is the only one via the std::shared_ptr::unique() method. If it returns true, I return the copy, assuming that unique() == false for as long as that thread works on its copy. Other threads trying to access the pointer at the same time receive a nullptr and have to try again later.
Now my question:
Is it theoretically possible for two different threads calling getPointer() to both get access to the pointer, despite the mutex guard and the unique() test?
std::shared_ptr<int> myPointer; // initialized somewhere else, but before the first call to getPointer()
std::mutex myMutex;

std::shared_ptr<int> getPointer()
{
    std::lock_guard<std::mutex> guard(myMutex);
    std::shared_ptr<int> returnValue;
    if (myPointer.unique())
        returnValue = myPointer;
    else
        returnValue = nullptr;
    return returnValue;
}
Only one "active" copy can exist at a time.
It is protected by the mutex until after a second shared_ptr is created at which point a subsequent call (once it gets the mutex after the first call has exited) will fail the unique test until the initial caller's returned shared_ptr is destroyed.
As noted in the comments, unique is going away in c++20, but you can test use_count == 1 instead, as that is what unique does.
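For example, getPointer() above rewritten with that substitution (same behavior, just without the deprecated call):

std::shared_ptr<int> getPointer()
{
    std::lock_guard<std::mutex> guard(myMutex);
    if (myPointer.use_count() == 1)
        return myPointer;
    return nullptr;
}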
Your solution seems overly complicated. It exploits the internal workings of the shared pointer to deduce a flag value. Why not just make the flag explicit?
std::shared_ptr<int> myPointer;
std::mutex myMutex;
bool myPointerIsInUse = false;

bool GetPermissionToUseMyPointer() {
    std::lock_guard<std::mutex> guard(myMutex);
    bool granted = !myPointerIsInUse;
    myPointerIsInUse = true; // a no-op if it was already in use
    return granted;
}

void RelinquishPermissionToUseMyPointer() {
    std::lock_guard<std::mutex> guard(myMutex);
    myPointerIsInUse = false;
}
P.S., If you wrap that in a class with a few extra bells and whistles, it'll start to look a lot like a semaphore.
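For illustration, a rough sketch of such a wrapper (the class name and interface are mine, not from the post): a tiny binary-semaphore-like permit built from the flag plus mutex above.

#include <mutex>

class PointerPermit {
    std::mutex m_;
    bool inUse_ = false;
public:
    bool try_acquire() {
        std::lock_guard<std::mutex> guard(m_);
        if (inUse_) return false;
        inUse_ = true; // claim the permit
        return true;
    }
    void release() {
        std::lock_guard<std::mutex> guard(m_);
        inUse_ = false;
    }
};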

Non blocking way of adding a work item to array or list

Edit:
I have now finished my queue (overcoming the problem described below, and more). For those interested, it can be found here. I'd be happy to hear any remarks :). Please note the queue isn't just a work-item queue, but rather a template container, which of course can be instantiated with work items.
Original:
After watching Herb Sutter's talk on concurrency in C++11 and 14, I got all excited about non-blocking concurrency.
However, I've not yet been able to find a solution for what I consider a basic problem. So if this is already on here, please be gentle with me.
My problem is quite simple. I'm creating a very simple thread pool. In order to do this, I've got some worker threads running inside the workPool class, and I keep a list of work items.
How do I add a work item in a lock-free way?
The non-lock-free way of doing this would of course be to create a mutex: lock it when you add an item, and read (and lock, of course) the list once the current work item is done.
I do not know how to do this in a lock-free way, however.
Below is a rough idea of what I'm creating. I wrote this code for this question, and it's neither complete nor error-free :)
#include <thread>
#include <deque>
#include <vector>
#include <functional>
#include <atomic>

class workPool
{
public:
    workPool(int workerCount) :
        running(true)
    {
        for (int i = workerCount; i > 0; --i)
            workers.push_back(std::thread(&workPool::doWork, this));
    }
    ~workPool()
    {
        running = false;
        for (auto &w : workers)
            w.join();
    }
private:
    std::atomic<bool> running; // polled by the workers, so it must be atomic
    std::vector<std::thread> workers;
    std::deque<std::function<void()>> workItems; // access to this is the race in question
    void doWork()
    {
        while (running)
        {
            if (workItems.empty())
            {
                //here the thread should be paused till a new item is added
                continue;
            }
            auto item = workItems.front();
            workItems.pop_front();
            item();
        }
    }
    void addWorkitem()
    {
        //This is my confusion. How should I do this?
    }
};
I have seen Herb's talks recently and I believe his lock-free linked list should do fine. The only problem is that atomic< shared_ptr<T> > is not yet implemented. I've used the atomic_* function calls as also explained by Herb in his talk.
In the example, I've simplified a task to an int, but it could be anything you want.
The function atomic_compare_exchange_weak takes three arguments: the item to compare, the expected value and the desired value. It returns true or false to indicate success or failure. On failure, the expected value will be changed to the value that was found instead.
#include <memory>
#include <atomic>

using std::shared_ptr;

// Untested code.
struct WorkItem { // Simple linked list implementation.
    int work;
    shared_ptr<WorkItem> next; // remember to access only via the atomic_* functions
};

class WorkList {
    shared_ptr<WorkItem> head; // remember to access only via the atomic_* functions
public:
    // Used by producers to add work to the list. This implementation adds
    // new items to the front (stack), but it can easily be changed to a queue.
    void push_work(int work) {
        shared_ptr<WorkItem> p(new WorkItem()); // The new item we want to add.
        p->work = work;
        p->next = std::atomic_load(&head);
        // Do we get to change head to p?
        while (!std::atomic_compare_exchange_weak(&head, &p->next, p)) {
            // Nope, someone got there first, try again with the new p->next,
            // and remember: p->next is automatically changed to the new value of head.
        }
        // Yup, great! Everything's done then.
    }
    // Used by consumers to claim items to process.
    int pop_work() {
        auto p = std::atomic_load(&head); // The item we want to process.
        int work = (p ? p->work : -1);
        // Do we get to change head to p->next?
        while (p && !std::atomic_compare_exchange_weak(&head, &p, p->next)) {
            // Nope, someone got there first, try again with the new p,
            // and remember: p is automatically changed to the new value of head.
            work = (p ? p->work : -1); // Make sure to update work as well!
        }
        // Yup, great! Everything's done then, return the new task.
        return work; // Returns -1 if list is empty.
    }
};
Edit: The reason for using shared_ptr in combination with the atomic_* functions is explained in the talk. In a nutshell: popping an item from the linked list might delete it out from underneath someone traversing the list, or a different node might get allocated at the same memory address (the ABA problem). Using shared_ptr ensures any old readers hold a valid reference to the original item.
As Herb explained, this makes the pop-function trivial to implement.
Lock-free, in this kind of context where you have a shared resource (a work queue), usually boils down to atomics and a CAS loop if you really dig deep.
The basic idea of a lock-free concurrent stack is rather simple (edit: though perhaps a bit deceptively tricky, as I made a goof in my first post -- all the more reason to appreciate a good library). I chose a stack for simplicity, but it doesn't take much more to use a queue instead.
Writing to the stack:
1) Create a new work item.
2) Loop repeatedly:
   a) Read the current top pointer of the stack.
   b) Set the work item's next pointer to that top pointer.
   c) Atomic: compare-and-swap the stack's top pointer, expecting the value read in (a), with the pointer to the new work item.
   d) If the CAS succeeds, break out of the loop.
Popping from the stack:
1) Loop:
   a) Fetch the top pointer.
   b) If the top pointer is not null:
      Atomic: CAS the top pointer with its next pointer.
      If successful, break.
   c) Else:
      (Optional) Sleep/yield to avoid burning cycles.
2) Process the item pointed to by the previously fetched top pointer.
Now if you get really elaborate, you can have the thread do other useful work whenever a push or pop fails.
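Below is a minimal sketch of both loops (the Node type and names are mine, not from the post). It deliberately never frees popped nodes: safe reclamation is exactly the ABA/use-after-free problem that the shared_ptr-based answer above sidesteps.

#include <atomic>

struct Node {
    int work;
    Node* next;
};

std::atomic<Node*> top{nullptr};

void push(int work) {
    Node* n = new Node{work, top.load()};
    // CAS loop: on failure, n->next is updated to the current top, so just retry.
    while (!top.compare_exchange_weak(n->next, n)) {
    }
}

bool pop(int& work) {
    Node* t = top.load();
    // On failure, t is updated to the current top; retry until empty or success.
    while (t && !top.compare_exchange_weak(t, t->next)) {
    }
    if (!t) return false; // stack was empty
    work = t->work;
    // t is intentionally not deleted here; see the note above.
    return true;
}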
I do not know how to do this in C++11 (or later); however, here is a solution for how to do it with C++98 and Boost (v1.50):
This is obviously not a very useful example; it's only for demonstration purposes:
#include <boost/scoped_ptr.hpp>
#include <boost/function.hpp>
#include <boost/bind.hpp>
#include <boost/asio/io_service.hpp>
#include <boost/thread.hpp>

class WorkHandler
{
public:
    WorkHandler();
    ~WorkHandler();
    typedef boost::function<void(void)> Work; // the type of work we can handle
    void AddWork(Work w) { pThreadProcessing->post(w); }
private:
    void ProcessWork();
    boost::scoped_ptr<boost::asio::io_service> pThreadProcessing;
    bool runThread; // Make sure this is atomic; declared before 'thread' so it is initialized before the thread starts
    boost::thread thread;
};

WorkHandler::WorkHandler()
    : pThreadProcessing(new boost::asio::io_service), // create our io service
      runThread(true), // run the thread
      thread(&WorkHandler::ProcessWork, this) // create our thread last, so the other members are ready
{
}

WorkHandler::~WorkHandler()
{
    runThread = false; // stop running the thread
    thread.join(); // wait for the thread to finish
}

void WorkHandler::ProcessWork()
{
    while (runThread) // while the thread is running
    {
        pThreadProcessing->run(); // process work
        pThreadProcessing->reset(); // prepare for more work
    }
}

int CalculateSomething(int a, int b)
{
    return a + b;
}

int main()
{
    WorkHandler wh; // create a work handler
    // give it some work to do
    wh.AddWork(boost::bind(&CalculateSomething, 4, 5));
    wh.AddWork(boost::bind(&CalculateSomething, 10, 100));
    wh.AddWork(boost::bind(&CalculateSomething, 35, -1));
    // ONLY for demonstration! This just gives the thread a chance to work before we destroy it.
    boost::this_thread::sleep(boost::posix_time::seconds(2));
    return 0;
}
boost::asio::io_service is thread-safe, so you can post work to it without needing mutexes.
NB: Although I haven't made the bool runThread atomic, for thread-safety it should be (I just don't have atomic in my C++98).

Updating cache without blocking

I currently have a program that has a cache-like mechanism. I have a thread listening for updates from another server to this cache; this thread updates the cache when it receives an update. Here is some pseudo code:
void cache::update_cache()
{
    cache_ = new std::map<std::string, value>();
    while (true)
    {
        if (recv().compare("update") == 0)
        {
            std::map<std::string, value> *new_info = new std::map<std::string, value>();
            std::map<std::string, value> *tmp;
            //Get new info, store in new_info
            tmp = cache_;
            cache_ = new_info;
            delete tmp;
        }
    }
}

std::map<std::string, value> *cache::get_cache()
{
    return cache_;
}
cache_ is being read from many different threads concurrently. I believe that, as I have it here, I will run into undefined behavior if one of my threads calls get_cache(), then my cache updates, and then the thread tries to access the stored cache.
I am looking for a way to avoid this problem. I know I could use a mutex, but I would rather not block reads from happening, as they have to be as low-latency as possible; but if need be, I can go that route.
I was wondering if this would be a good use case for a unique_ptr. Is my understanding correct that if a thread calls get_cache, and that returns a unique_ptr instead of a standard pointer, then once all threads that have the old version of the cache are finished with it (i.e., leave scope), the object will be deleted?
Is using a unique_ptr the best option for this case, or is there another option that I am not thinking of?
Any input will be greatly appreciated.
Edit:
I believe I made a mistake in my OP. I meant to use and pass a shared_ptr, not a unique_ptr, for cache_. When all threads are finished with cache_, the shared_ptr should delete itself.
A little about my program: it is a webserver that will be using this information to decide what information to return. It is fairly high-throughput (thousands of req/sec). Each request queries the cache once, so telling my other threads when to update is no problem. I can tolerate slightly out-of-date information, and would prefer that over blocking all of my threads from executing if possible. The information in the cache is fairly large, and I would like to limit any copies of values because of this.
update_cache is only run once. It is run in a thread that just listens for an update command and runs the code.
I feel there are multiple issues:
1) Do not leak memory: to that end, never use "delete" in your code and stick with unique_ptr (or shared_ptr in specific cases).
2) Protect accesses to shared data, either with locking (mutex) or a lock-free mechanism (std::atomic).
class Cache {
    using Map = std::map<std::string, value>;
    std::unique_ptr<Map> m_cache;
    std::mutex m_cacheLock;
public:
    void update_cache()
    {
        while (true)
        {
            if (recv().compare("update") == 0)
            {
                std::unique_ptr<Map> new_info { new Map };
                //Get new info, store in new_info
                {
                    std::lock_guard<std::mutex> lock{m_cacheLock};
                    using std::swap;
                    swap(m_cache, new_info);
                }
            }
        }
    }

Note: I don't like update_cache() being part of the public interface for the cache, as it contains an infinite loop. I would probably externalize the loop with the recv and have:
void update_cache(std::unique_ptr<Map> new_info)
{
    { // This inner brace is not useless: we don't want to hold the lock during the deletion
        std::lock_guard<std::mutex> lock{m_cacheLock};
        using std::swap;
        swap(m_cache, new_info);
    }
    // new_info (now holding the old map) is destroyed here, after the lock is released
}
Now, for reading from the cache, use proper encapsulation and don't let the pointer to the member map escape:

value get(const std::string &key)
{
    // lock, fetch, and return.
    // Depending on the value type, you might want to allocate memory
    // before locking.
}

With this signature you have to throw an exception if the value is not present in the cache; another option is to return something like a boost::optional.
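A possible shape for the boost::optional variant (a sketch only; it assumes the Cache members shown above and #include <boost/optional.hpp>):

boost::optional<value> get(const std::string &key)
{
    std::lock_guard<std::mutex> lock{m_cacheLock};
    auto it = m_cache->find(key);
    if (it == m_cache->end())
        return boost::none; // key not present, no exception needed
    return it->second;      // copy the value out while the lock is held
}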
Overall you can keep latency low (everything is relative; I don't know your use case) if you take care to do costly operations (memory allocation, for instance) outside of the locked section.
shared_ptr is very reasonable for this purpose; C++11 has a family of functions for handling shared_ptr atomically. If the data is immutable after creation, you won't even need any additional synchronization:

class cache {
public:
    using map_t = std::map<std::string, value>;
    void update_cache();
    std::shared_ptr<const map_t> get_cache() const;
private:
    std::shared_ptr<const map_t> cache_;
};

void cache::update_cache()
{
    while (true)
    {
        if (recv() == "update")
        {
            auto new_info = std::make_shared<map_t>();
            // Get new info, store in new_info
            // Make immutable & publish
            std::atomic_store(&cache_,
                              std::shared_ptr<const map_t>{std::move(new_info)});
        }
    }
}

auto cache::get_cache() const -> std::shared_ptr<const map_t> {
    return std::atomic_load(&cache_);
}
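On the reader side, you take one snapshot per request and do every lookup through it, so the map cannot change (or be freed) mid-request. A hypothetical helper to illustrate (it assumes value is default-constructible):

value lookup(const cache &c, const std::string &key)
{
    auto snapshot = c.get_cache(); // atomic load; shared ownership keeps the map alive
    auto it = snapshot->find(key);
    return it != snapshot->end() ? it->second : value{};
}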

Wait-free or lock-free initialization

I've been looking at lazy initialization (of global variables, say, but it could be of anything). So far, what I've come up with is something like this:
#include <atomic>
#include <new>

enum state {
    uninitialized,
    initializing,
    initialized
};

std::atomic<state> s{uninitialized};
alignas(T) unsigned char memory[sizeof(T)];

T& initialize() {
    state expected = uninitialized;
    if (!s.compare_exchange_strong(expected, initializing)) {
        // Lost the race: someone else is initializing or has already initialized.
        if (expected == initializing)
            while (s.load() != initialized)
                ; // spin until construction completes
        return *reinterpret_cast<T*>(memory);
    }
    new (memory) T();
    s.store(initialized);
    return *reinterpret_cast<T*>(memory);
}
In the case where it's already been initialized, it's wait-free. But I've got a problem with the case where one thread is initializing: the number of steps required to finish initializing, or to wait for initialization to finish, isn't proportional to the number of threads. If the initializing thread is paused, the other threads all have to wait arbitrarily long until it's resumed. So in the general case it's not lock-free or wait-free.
Is it possible to create lazy initialization which is wait-free or lock-free?
If you're willing to initialize more than one object, then you can make lockfree code by only storing a pointer:
std::atomic<T *> p { nullptr };

T & get()
{
    T * q = p.load();
    if (!q)
    {
        T * r = new T;
        if (p.compare_exchange_strong(q, r))
        {
            return *r;
        }
        else
        {
            delete r; // another thread won the race; q now holds its pointer
            return *q;
        }
    }
    return *q;
}
The cost of lock-free algorithms is that you generally have to "try and fail", so you have to pay the price of trying locally even if you have to discard the result.
As you rightly pointed out, if only a single thread performs the initialization, you always depend on that single thread, and you cannot be lock-free.
You will need corresponding clean-up code, too, if the destructor has side effects:
delete p.exchange(nullptr);
I wouldn't have thought either is possible in the general case. If you did this as part of global initialization, before main is called and before the threads are created, then you stand a chance; otherwise, how can it be wait-free or lock-free?
It could be lock-free if you allow the initialization to fail and retry later.
It can't be wait-free if another thread has the item.
If you care about performance here, you could use the thread id to hash into a bucket of items, to try to reduce the frequency of contention; a sketch of this follows.
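A sketch of that bucketing idea, reusing the pointer-CAS pattern from the answer above (the bucket count, names, and the unspecified T are illustrative): each thread hashes to one slot, so only threads sharing a slot contend on the CAS, at the cost of possibly constructing more than one T.

#include <atomic>
#include <functional>
#include <thread>

constexpr std::size_t kBuckets = 16;
std::atomic<T*> buckets[kBuckets];

T& get_bucketed()
{
    std::size_t i =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
    T* q = buckets[i].load();
    if (!q)
    {
        T* r = new T;
        if (buckets[i].compare_exchange_strong(q, r))
            return *r;
        delete r; // another thread initialized this bucket first
        return *q;
    }
    return *q;
}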

Read-write thread-safe smart pointer in C++, x86-64

I'm developing a lock-free data structure and the following problem arises.
I have a writer thread that creates objects on the heap and wraps them in a smart pointer with a reference counter. I also have a lot of reader threads that work with these objects. The code can look like this:
SmartPtr ptr;

class Reader : public Thread {
    virtual void Run() {
        for (;;) {
            SmartPtr local(ptr);
            // do smth
        }
    }
};

class Writer : public Thread {
    virtual void Run() {
        for (;;) {
            SmartPtr newPtr(new Object);
            ptr = newPtr;
        }
    }
};

int main() {
    Pool* pool = SystemThreadPool();
    pool->Run(new Reader());
    pool->Run(new Writer());
    for (;;) {} // wait for crash :(
}
When I create a thread-local copy of ptr, it means at least:
1) reading an address, and
2) incrementing the reference counter.
I can't do these two operations atomically, and thus sometimes my readers work with a deleted object.
The question is: what kind of smart pointer should I use to make read-write access from several threads, with correct memory management, possible? A solution should exist, since Java programmers don't even care about such a problem, simply relying on the fact that all objects are references and are deleted only when nobody uses them.
For PowerPC I found http://drdobbs.com/184401888, which looks nice, but uses the Load-Linked and Store-Conditional instructions, which we don't have on x86.
As far as I understand, boost pointers provide such functionality only using locks. I need a lock-free solution.
boost::shared_ptr has atomic_store, which uses a "lock-free" spinlock that should be fast enough for 99% of possible cases.
boost::shared_ptr<Object> ptr;

class Reader : public Thread {
    virtual void Run() {
        for (;;) {
            boost::shared_ptr<Object> local(boost::atomic_load(&ptr));
            // do smth
        }
    }
};

class Writer : public Thread {
    virtual void Run() {
        for (;;) {
            boost::shared_ptr<Object> newPtr(new Object);
            boost::atomic_store(&ptr, newPtr);
        }
    }
};

int main() {
    Pool* pool = SystemThreadPool();
    pool->Run(new Reader());
    pool->Run(new Writer());
    for (;;) {}
}
EDIT:
In response to comment below, the implementation is in "boost/shared_ptr.hpp"...
template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r )
{
    boost::detail::spinlock_pool<2>::scoped_lock lock( p );
    p->swap( r );
}

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r )
{
    boost::detail::spinlock & sp = boost::detail::spinlock_pool<2>::spinlock_for( p );
    sp.lock();
    p->swap( r );
    sp.unlock();
    return r; // return std::move( r )
}
With some jiggery-pokery you should be able to accomplish this using InterlockedCompareExchange128. Store the reference count and pointer in a two-element __int64 array. If the reference count is in array[0] and the pointer in array[1], the atomic update would look like this:

while (true)
{
    __int64 comparand[2];
    comparand[0] = refCount;
    comparand[1] = pointer;
    if (1 == InterlockedCompareExchange128(
            array,
            pointer,      // exchange "high" part: the pointer, unchanged
            refCount + 1, // exchange "low" part: the incremented reference count
            comparand))
    {
        // Pointer is ready for use. Exit the while loop.
        break;
    }
    // On failure the intrinsic stores the current contents of 'array'
    // into 'comparand', so retry with those values.
    refCount = comparand[0];
    pointer = comparand[1];
}
If an InterlockedCompareExchange128 intrinsic function isn't available for your compiler then you may use the underlying CMPXCHG16B instruction instead, if you don't mind mucking around in assembly language.
The solution proposed by RobH doesn't work. It has the same problem as the original question: when accessing the reference count object, it might already have been deleted.
The only way I see of solving the problem without a global lock (as in boost::atomic_store) or conditional read/write instructions is to somehow delay the destruction of the object (or the shared reference count object if such thing is used). So zennehoy has a good idea but his method is too unsafe.
The way I might do it is by keeping copies of all the pointers in the writer thread so that the writer can control the destruction of the objects:
class Writer : public Thread {
    virtual void Run() {
        std::list<SmartPtr> ptrs; // list that holds all the old ptr values
        for (;;) {
            SmartPtr newPtr(new Object);
            if (ptr)
                ptrs.push_back(ptr); // push previous pointer into the list
            ptr = newPtr;
            // Periodically go through the list and destroy objects that are not
            // referenced by other threads
            for (auto it = ptrs.begin(); it != ptrs.end(); )
                if (it->refCount() == 1)
                    it = ptrs.erase(it);
                else
                    ++it;
        }
    }
};
However there are still requirements for the smart pointer class. This doesn't work with shared_ptr as the reads and writes are not atomic. It almost works with boost::intrusive_ptr. The assignment on intrusive_ptr is implemented like this (pseudocode):
//create temporary from rhs
tmp.ptr = rhs.ptr;
if (tmp.ptr)
    intrusive_ptr_add_ref(tmp.ptr);
//swap(tmp, lhs)
T* x = lhs.ptr;
lhs.ptr = tmp.ptr;
tmp.ptr = x;
//destroy temporary
if (tmp.ptr)
    intrusive_ptr_release(tmp.ptr);
As far as I understand, the only thing missing here is a compiler-level memory fence before lhs.ptr = tmp.ptr;. With that added, both reading rhs and writing lhs would be thread-safe under strict conditions: 1) x86 or x64 architecture, 2) atomic reference counting, 3) the rhs refcount must not go to zero during the assignment (guaranteed by the Writer code above), and 4) only one thread writing to lhs (using CAS you could have several writers).
Anyway, you could create your own smart pointer class based on intrusive_ptr with the necessary changes. It's definitely easier than re-implementing shared_ptr. And besides, if you want performance, intrusive is the way to go.
The reason this works so much more easily in Java is garbage collection. In C++, you have to manually ensure that a value is not just starting to be used by a different thread when you want to delete it.
A solution I've used in a similar situation is to simply delay the deletion of the value. I create a separate thread that iterates through a list of things to be deleted. When I want to delete something, I add it to this list with a timestamp. The deleting thread waits until some fixed time after this timestamp before actually deleting the value. You just have to make sure that the delay is large enough to guarantee that any temporary use of the value has completed.
100 milliseconds would have been enough in my case; I chose a few seconds to be safe.
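A minimal sketch of that scheme (all names are mine, not from the original): deleted values go onto a timestamped list, and a dedicated reaper thread frees them only after the grace period has elapsed.

#include <chrono>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class DelayedDeleter {
    using clock = std::chrono::steady_clock;
    std::deque<std::pair<clock::time_point, T*>> pending_;
    std::mutex m_;
    std::chrono::seconds delay_{3}; // grace period; tune to your workload
public:
    // Called instead of `delete p`: schedules p for deletion later.
    void schedule(T* p) {
        std::lock_guard<std::mutex> g(m_);
        pending_.emplace_back(clock::now() + delay_, p);
    }
    // Called periodically from the dedicated deleting thread.
    void reap() {
        std::lock_guard<std::mutex> g(m_);
        const auto now = clock::now();
        while (!pending_.empty() && pending_.front().first <= now) {
            delete pending_.front().second;
            pending_.pop_front();
        }
    }
};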