Memory management in a lock-free queue - C++

We've been looking into using a lock-free queue in our code to reduce lock contention between the single producer and the single consumer in our current implementation. There are plenty of queue implementations out there, but I haven't been clear on how best to manage the memory of the nodes.
For example, the producer looks like this:
queue.Add( new WorkUnit(...) );
And the consumer looks like:
WorkUnit* unit = queue.RemoveFront();
unit->Execute();
delete unit;
We currently use a memory pool for allocation. You'll notice that the producer allocates the memory and the consumer deletes it. Since we're using pools, we need to add another lock to the memory pool to protect it properly, and that seems to negate the performance advantage of a lock-free queue in the first place.
So far, I think our options are:
Implement a lock-free memory pool.
Dump the memory pool and rely on a threadsafe allocator.
Are there any other options we can explore? We're trying to avoid implementing a lock-free memory pool, but we may have to take that path.
Thanks.

Only the producer should be able to create objects and destroy them when they're no longer needed. The consumer only uses objects and marks them as used. That's the point: you don't need to share ownership of the memory in this case. This is the only efficient lock-free queue implementation I know of.
Read this great article that describes such an algorithm in detail.

One more possibility is to have a separate memory pool for each thread, so only one thread is ever using a heap, and you can avoid locking for the allocation.
That leaves you with the problem of freeing a block without a lock. Fortunately, that's pretty easy: maintain a linked list of free blocks for each heap. You put the block's own address into (the memory you'll use as) the link field for the linked list, and then do an atomic exchange with the pointer holding the head of the list.
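A minimal sketch of that free-block push, assuming C++11 atomics (the atomic exchange is written here as a compare-and-swap loop so the link field and the head stay consistent; all names are illustrative):

#include <atomic>

struct FreeBlock
{
    FreeBlock* next;   // the freed block's own memory doubles as the link field
};

// One free-list head per per-thread heap.
std::atomic<FreeBlock*> free_head{nullptr};

// Any thread can return a block to this heap without taking a lock.
void free_block(void* block)
{
    FreeBlock* node = static_cast<FreeBlock*>(block);
    FreeBlock* head = free_head.load(std::memory_order_relaxed);
    do {
        node->next = head;   // link to the current head
    } while (!free_head.compare_exchange_weak(
        head, node, std::memory_order_release, std::memory_order_relaxed));
}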

You should look at Intel's TBB. I don't know how much, if anything, it costs for commercial projects, but they already have concurrent queues, concurrent memory allocators, that sort of thing.
Your queue interface also looks seriously dodgy. For example, what does RemoveFront() do if the queue is empty? The new and delete calls also look redundant. Intel's TBB and Microsoft's PPL (included in VS2010) do not suffer from these issues.
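For a sense of the shape, a hedged sketch of the question's producer/consumer on top of tbb::concurrent_queue in recent TBB versions (the WorkUnit stub is an assumption standing in for the question's class):

#include <tbb/concurrent_queue.h>

struct WorkUnit          // stub for the question's class
{
    void Execute() {}
};

tbb::concurrent_queue<WorkUnit*> queue;

void Producer()
{
    queue.push(new WorkUnit());
}

void Consumer()
{
    WorkUnit* unit = nullptr;
    if (queue.try_pop(unit))   // returns false when empty instead of misbehaving
    {
        unit->Execute();
        delete unit;
    }
}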

I'm not sure exactly what your requirements are, but if your scenario is that a producer creates data and pushes it onto a queue, and a single consumer takes that data, uses it, and then destroys it, you just need a thread-safe queue, or you can roll your own singly linked list that is thread-safe for this scenario:
create a list element
append the data to the list element
change the next pointer of the tail element from null to the new element (or the equivalent in other languages; see the sketch below)
The consumer can be implemented in any way, and most linked lists are by default thread-safe for this kind of operation (but check that with your implementation).
In this scenario the consumer should free the memory, or it can return it to the producer by setting a predefined flag. The producer does not need to check all the flags (if there are more than, say, 1000 of them) to find which bucket is free; the flags can be arranged as a tree, enabling O(log n) search for an available bucket. I am sure this can be done in O(1) time without locking, but I have no idea how.
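A minimal sketch of the producer-side steps above, assuming C++11 atomics (the names and the int payload are illustrative):

#include <atomic>

struct Node
{
    int                data;            // stand-in payload
    std::atomic<Node*> next{nullptr};
};

// Single producer: create the element, fill it, then publish it by
// flipping the tail's next pointer from null to the new element.
void append(Node*& tail, int value)
{
    Node* fresh = new Node;
    fresh->data = value;                                  // steps 1-2
    tail->next.store(fresh, std::memory_order_release);   // step 3: publish
    tail = fresh;
}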

I had the exact same concern, so I wrote my own lock-free queue (single-producer, single-consumer) that manages the memory allocation for you (it allocates a series of contiguous blocks, sort of like std::vector).
I recently released the code on GitHub. (Also, I posted on my blog about it.)
If you create the node on the stack, enqueue it, then dequeue it to another node on the stack, you won't need to use pointers/manual allocation at all. Additionally, if your node implements move semantics for construction and assignment, it will be automatically moved instead of copied :-)
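For example, assuming the GitHub library referenced is moodycamel's readerwriterqueue (an assumption; adapt the names to whatever SPSC queue you actually use):

#include "readerwriterqueue.h"
#include <string>
#include <utility>

int main()
{
    moodycamel::ReaderWriterQueue<std::string> q;

    std::string msg = "work";      // value on the producer's stack
    q.enqueue(std::move(msg));     // moved in, not copied

    std::string out;               // value on the consumer's stack
    if (q.try_dequeue(out))        // false if the queue was empty
    {
        // process `out`
    }
}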

Related

How to wipe some contents of boost managed_shared_memory?

The boost::interprocess::managed_shared_memory manual and most other resources I checked always shows examples where there is a parent process and a bunch of children spawned by it.
In my case, I have several processes spawned by a third-party application, and I can only control the "children". That means I cannot have a central brain to allocate and deallocate the shared memory segment; all my processes must be able to do so (and therefore I can't erase data on exit).
My idea was to open_or_create a segment and, using a lock stored (find_or_construct'ed) in this area, I check a certain hash to see if the memory area was created by this same software version.
If this is not true, the memory segment must be wiped to avoid breaking code.
Ideally, I would want to keep the lock object because there could be other processes already waiting on it.
Things I thought of:
List all object names and delete all but the lock.
This cannot be done because the objects might be using different implementations.
Also, I couldn't find where to list the names.
Use shared_memory_object::truncate.
I could not find much about it.
With a managed_shared_memory, I don't know how reliable it would be, because I'm not sure the lock was the first data allocated.
Refcount the processes and wipe data on last one
Prone to fatal termination problems.
Use a separated shared memory area just for this bookkeeping.
Sounds reasonable, but overkill?
Any suggestions or insights?
This sounds like a "shared ownership" scenario.
What you'd usually think of in such a scenario, would be shared pointers:
http://www.boost.org/doc/libs/1_58_0/doc/html/interprocess/interprocess_smart_ptr.html#interprocess.interprocess_smart_ptr.shared_ptr
Interprocess has specialized shared pointers (and ditto make_shared) for exactly this purpose.
Creating the shared memory realm can be done "optimistically" from each participating process (open_or_create). Note that creation needs to be synchronized. Further segment manager operations are usually already implicitly synchronized:
Whenever the same managed shared memory is accessed from different processes, operations such as creating, finding, and destroying objects are automatically synchronized. If two programs try to create objects with different names in the managed shared memory, the access is serialized accordingly. To execute multiple operations at one time without being interrupted by operations from a different process, use the member function atomic_func() (see Example 33.11).
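A minimal sketch of that optimistic pattern, assuming Boost.Interprocess (the segment name, size, and mutex label are illustrative):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>

namespace bip = boost::interprocess;

int main()
{
    // Every process runs the same code; whichever gets there first creates it.
    bip::managed_shared_memory segment(bip::open_or_create, "MySegment", 65536);

    // find_or_construct is synchronized by the segment manager.
    auto* mtx = segment.find_or_construct<bip::interprocess_mutex>("lock")();

    bip::scoped_lock<bip::interprocess_mutex> guard(*mtx);
    // ... check the version hash here and wipe/rebuild the contents if needed ...
}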

Linux/C++, how to protect two data structures at the same time?

I'm developing some programs in C/C++ on Linux. My question is:
I have an upper-level class called Vault, inside of which there is an OrderMap, which uses an unordered_map as its data structure, and an OrderBook, which has two std::lists inside.
Both the OrderMap and the OrderBook store Order* as their elements; they share Order objects allocated on the heap, so either the OrderBook or the OrderMap can modify the orders within.
I have two threads that do read and write operations on them. The two threads can insert/modify/retrieve (read)/delete the elements.
My question is: how can I protect this big "Vault" structure? I can protect the map or the list individually, but I don't know how to protect them both at the same time.
Can anybody give me some ideas?
Add a mutex and lock it before accessing either of the two.
Make them private, so you know accesses are made via your member functions (which take the lock properly).
Consider using std::shared_ptr instead of Order* (a sketch combining these ideas follows).
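A minimal sketch of those suggestions combined (the member names come from the question; the key type and the methods are assumptions):

#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>

struct Order { /* ... */ };

class Vault
{
public:
    void insert(int id, std::shared_ptr<Order> order)
    {
        std::lock_guard<std::mutex> guard(mutex_);   // one lock covers both containers
        orderMap_[id] = order;
        orderBook_.push_back(order);
    }

    std::shared_ptr<Order> find(int id)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = orderMap_.find(id);
        return it != orderMap_.end() ? it->second : nullptr;
    }

private:
    std::mutex mutex_;   // protects both containers together
    std::unordered_map<int, std::shared_ptr<Order>> orderMap_;
    std::list<std::shared_ptr<Order>> orderBook_;
};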
I don't think I've ever seen a multi-thread usage of an order book. I really think you'd be better off using it in one thread only.
But, to get back to your question, I'll assume you're staying with 2 threads for whatever reasons.
This data structure is too complex to make it lock free. So these are the multi-thread options I can see:
1. If you use a single Vault instance for both threads, you will have to lock around it. I assume you don't mind burning some CPU time, so I strongly suggest you use a spin lock, not a mutex.
2. If you can allow two Vault instances, things can improve: each thread keeps its own private instance and exchanges modifications with the other thread by some other means, such as a lock-free queue.
3. If your book is fast enough to copy, you can have a single central Vault pointer, make a copy on every update (or group of updates), CAS it onto the central pointer, and reuse the old one so you don't have to allocate every time. You'll end up with one spare instance per thread. Like this:
std::atomic<Vault*> centralVault;            // the shared current snapshot
thread_local Vault* cachedVault = nullptr;   // per-thread spare instance

Vault* old_ptr = centralVault.load();
Vault* new_ptr = cachedVault ? cachedVault : new Vault;
do {
    *new_ptr = *old_ptr;   // copy the current snapshot
    makeChanges(new_ptr);
    // on failure, compare_exchange_weak reloads old_ptr, so we retry on a fresh copy
} while (!centralVault.compare_exchange_weak(old_ptr, new_ptr));
cachedVault = old_ptr;     // recycle the replaced instance next time

Good allocator for cross thread allocation and free

I am planning to write a C++ networked application where:
I use a single thread to accept TCP connections and also to read data from them. I am planning to use epoll/select to do this. The data is written into buffers that are allocated using some arena allocator say jemalloc.
Once there is enough data from a single TCP client to form a protocol message, the data is published on a ring buffer. The ring buffer structures contain the fd for the connection and a pointer to the buffer containing the relevant data.
A worker thread processes entries from the ring buffers and sends some result data to the client. After processing each event, the worker thread frees the actual data buffer, returning it to the arena allocator for reuse.
I am leaving out details on how the publisher makes data written by it visible to the worker thread.
So my question is: Are there any allocators which optimize for this kind of behavior i.e. allocating objects on one thread and freeing on another?
I am worried specifically about having to take locks to return memory to an arena that is not the thread-affinitized arena. I am also worried about false sharing, since the producer thread and the worker thread will both write to the same region. It seems that neither jemalloc nor tcmalloc optimizes for this.
Before you go down the path of implementing a highly optimized allocator for your multi-threaded application, you should first just use the standard new and delete operators for your implementation. After you have a correct implementation of your application, you can move to address bottlenecks that are discovered through profiling it.
If you get to the stage where it is obvious that the standard new and delete allocators are a bottleneck to the application, the following is the approach I have used:
Assumption: The number of threads are fixed and are statically created.
Each thread has their own arena.
Each object taken from an arena has a reference back to the arena it came from.
Each arena has a separate garbage list for each thread.
When a thread frees an object, it goes back to the arena it came from, but is placed on the thread-specific garbage list.
The thread that actually owns the arena treats its garbage list as the real free list.
Periodically, the thread that owns an arena performs a garbage collection pass to fold objects from the other thread garbage lists into the real free list.
The "periodical" garbage collection pass doesn't necessarily have to be time based. A subset of the garbage could be reaped on every allocation and free, for example.
The best way to deal with memory allocation and deallocation issues is to not deal with it.
You mention a ring buffer. Those are usually of a fixed size. If you can come up with a fixed maximum size for your protocol messages, you can allocate all the memory you will ever need at program start. When deallocating, keep the memory but reset it to a fresh state.
Now, your program may need to allocate and deallocate memory while dealing with each message but that will be done in each thread and cross-thread issues will not come into play.
This can work even if your maximum message size is too large to preallocate for, as long as you can allocate the amount of memory that most messages will use and have handlers for allocating more when necessary.
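For instance, a sketch of the preallocation idea (all the sizes are made up):

#include <array>
#include <cstddef>

constexpr std::size_t kMaxMessage = 4096;   // assumed protocol message cap
constexpr std::size_t kRingSlots  = 1024;   // assumed ring capacity

struct Slot
{
    int         fd  = -1;              // connection the data came from
    std::size_t len = 0;               // bytes currently valid
    char        data[kMaxMessage];     // reused for the program's lifetime
};

// All the memory the program will ever need, allocated at start-up.
std::array<Slot, kRingSlots> ring;

// "Deallocation": keep the memory, just reset the slot to a fresh state.
void recycle(Slot& s) { s.fd = -1; s.len = 0; }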

Clearing STL in dedicated thread

In one of my projects, I identified a call to std::deque<T>::clear() as a major bottleneck.
I therefore decided to move this operation in a dedicated, low-priority thread:
template <class T>
void SomeClass::parallelClear(T& c)
{
    if (!c.empty())
    {
        T* temp = new T;
        c.swap(*temp); // swap contents (fast)
        // deallocate on a separate thread
        boost::thread deleteThread([=] () { delete temp; });
        // Windows-specific: lower the thread's priority
        // (SetThreadPriority is the per-thread call; SetPriorityClass applies to a process)
        SetThreadPriority(deleteThread.native_handle(), THREAD_PRIORITY_BELOW_NORMAL);
    }
}

void SomeClass::clear(std::deque<ComplexObject>& hugeDeque)
{
    parallelClear(hugeDeque);
}
This seems to work fine (VisualC++ 2010), but I wonder if I overlooked any major flaw. I would welcome your comments about the above code.
Additional information:
SomeClass::clear() is called from a GUI thread, and the user interface is unresponsive until the call returns. The hugeDeque, on the other hand, is unlikely to be accessed by that thread for several seconds after clearing.
This is only valid if you guarantee that access to the heap is serialized. Windows does serialize access to the primary heap by default, but it's possible to turn this behaviour off, and there's no guarantee that it holds across platforms or libraries. As such, I'd be careful depending on it: make sure it's explicitly documented that this depends on the heap being shared between threads and being thread-safe to access.
Personally, I would suggest that using a custom allocator to match the allocation/deallocation pattern would be the best performance improvement here; remember that creating threads has a non-trivial overhead.
Edit: If you are using a GUI-thread/worker-thread design, then really, you should create, manage, and destroy the deque on the worker thread.
Be aware that it's not certain this will improve the overall performance of your application. The Windows standard heap (the low-fragmentation heap too) is not designed for frequently handing allocations from one thread to another. It will work, but it might incur quite a lot of processing overhead.
The documentation of the Hoard memory allocator might be a starting point for digging deeper into this:
http://www.cs.umass.edu/~emery/hoard/hoard-documentation.html
Your approach will improve responsiveness, though.
In addition to the things mentioned by the other posters, you should consider whether the objects contained in the collection have thread affinity; for example, COM objects in a single-threaded apartment may not be amenable to this kind of trick.

What is the best way to synchronize container access between multiple threads in a real-time application?

I have std::list<Info> infoList in my application that is shared between two threads. These 2 threads are accessing this list as follows:
Thread 1: uses push_back(), pop_front() or clear() on the list (depending on the situation)
Thread 2: uses an iterator to iterate through the items in the list and do some actions.
Thread 2 is iterating the list like the following:
for (std::list<Info>::iterator i = infoList.begin(); i != infoList.end(); ++i)
{
    DoAction(i);
}
The code is compiled using GCC 4.4.2.
Sometimes ++i causes a segfault and crashes the application. The error occurs in stl_list.h, line 143, at the following line:
_M_node = _M_node->_M_next;
I guess this is a race condition. The list might have been changed, or even cleared, by thread 1 while thread 2 was iterating over it.
I used a mutex to synchronize access to this list, and all went well during my initial test. But the system just freezes under stress testing, making this solution totally unacceptable. This is a real-time application, and I need to find a solution so that both threads can run as fast as possible without hurting total application throughput.
My question is this:
Thread 1 and thread 2 need to execute as fast as possible, since this is a real-time application. What can I do to prevent this problem while maintaining application performance? Are there any lock-free algorithms available for such a problem?
It's OK if I miss some newly added Info objects in thread 2's iteration, but what can I do to prevent the iterator from becoming a dangling pointer?
Thanks
Your for() loop can potentially keep a lock for a relatively long time, depending on how many elements it iterates over. You can get into real trouble if it "polls" the queue, constantly checking whether a new element has become available. That makes the thread own the mutex for an unreasonably long time, giving the producer thread few opportunities to break in and add an element, and burning lots of unnecessary CPU cycles in the process.
You need a "bounded blocking queue". Don't write it yourself; the lock design is not trivial. Good examples are hard to find, and most of them are .NET code. This article looks promising; a minimal sketch of the shape follows.
In general it is not safe to use STL containers this way. You will have to implement specific methods to make your code thread-safe. The solution you choose depends on your needs. I would probably solve this by maintaining two lists, one in each thread, and communicating the changes through a lock-free queue (mentioned in the comments to this question). You could also limit the lifetime of your Info objects by wrapping them in a boost::shared_ptr, e.g.:
typedef boost::shared_ptr<Info> InfoReference;
typedef std::list<InfoReference> InfoList;

enum CommandValue
{
    Insert,
    Delete
};

struct Command
{
    CommandValue operation;
    InfoReference reference;
};

typedef LockFreeQueue<Command> CommandQueue;

class Thread1
{
public:
    Thread1(CommandQueue& queue) : m_commands(queue) {}

    void run()
    {
        while (!finished)
        {
            // Process items and use
            // deleteInfo() or addInfo()
        }
    }

    void deleteInfo(InfoReference reference)
    {
        Command command;
        command.operation = Delete;
        command.reference = reference;
        m_commands.produce(command);
    }

    void addInfo(InfoReference reference)
    {
        Command command;
        command.operation = Insert;
        command.reference = reference;
        m_commands.produce(command);
    }

private:
    CommandQueue& m_commands;
    InfoList m_infoList;
    bool finished;
};

class Thread2
{
public:
    Thread2(CommandQueue& queue) : m_commands(queue) {}

    void run()
    {
        while (!finished)
        {
            processQueue();
            processList();
        }
    }

    void processQueue()
    {
        Command command;
        while (m_commands.consume(command))
        {
            switch (command.operation)
            {
            case Insert:
                m_infoList.push_back(command.reference);
                break;
            case Delete:
                m_infoList.remove(command.reference);
                break;
            }
        }
    }

    void processList()
    {
        // Iterate over m_infoList
    }

private:
    CommandQueue& m_commands;
    InfoList m_infoList;
    bool finished;
};

int main()
{
    CommandQueue commands;
    Thread1 thread1(commands);
    Thread2 thread2(commands);
    thread1.start();
    thread2.start();
    waitforTermination();
}
This has not been compiled. You still need to make sure that access to your Info objects is thread safe.
I would like to know what the purpose of this list is; it would be easier to answer the question then.
As Hoare said, it is generally a bad idea to try to share data to communicate between two threads; rather, you should communicate to share data: i.e., messaging.
If this list is modelling a queue, for example, you might simply use one of the various ways to communicate (such as sockets) between the two threads. Consumer / Producer is a standard and well-known problem.
If your items are expensive, then only pass the pointers around during communication, you'll avoid copying the items themselves.
In general, it's exquisitely difficult to share data, although it is unfortunately the only way of programming we hear of in school. Normally only low-level implementation of "channels" of communication should ever worry about synchronization and you should learn to use the channels to communicate instead of trying to emulate them.
To prevent your iterator from being invalidated, you have to lock the whole for loop. But then the first thread may have difficulty updating the list. I would try to give it a chance to do its job on each (or every Nth) iteration.
In pseudo-code that would look like:
mutex_lock();
for (...) {
    doAction();
    mutex_unlock();
    thread_yield();                  // give the first thread a chance
    mutex_lock();
    if (iterator_invalidated_flag)   // set by the first thread
        reset_iterator();
}
mutex_unlock();
You have to decide which thread is the more important. If it is the update thread, then it must signal the iterator thread to stop, wait and start again. If it is the iterator thread, it can simply lock the list until iteration is done.
The best way to do this is to use a container that is internally synchronized. TBB and Microsoft's concurrent_queue both do this. Anthony Williams also has a good implementation on his blog here.
Others have already suggested lock-free alternatives, so I'll answer as if you were stuck using locks...
When you modify a list, iterators to the affected elements can become invalidated because they no longer point to valid memory. To prevent invalidated iterators, you could make the producer block on a mutex while your consumer traverses the list, but that would mean needless waiting for the producer.
Life would be easier if you used a queue instead of a list, and had your consumer use a synchronized queue<Info>::pop_front() call instead of iterators that can be invalidated behind your back. If your consumer really needs to gobble chunks of Info at a time, use a condition variable that makes your consumer block until queue.size() >= minimum.
The Boost library has a nice portable implementation of condition variables (that even works with older versions of Windows), as well as the usual threading library stuff.
For a producer-consumer queue that uses (old-fashioned) locking, check out the BlockingQueue template class of the ZThreads library. I have not used ZThreads myself, being worried about lack of recent updates, and because it didn't seem to be widely used. However, I have used it as inspiration for rolling my own thread-safe producer-consumer queue (before I learned about lock-free queues and TBB).
A lock-free queue/stack library seems to be in the Boost review queue. Let's hope we see a new Boost.Lockfree in the near future! :)
If there's interest, I can write up an example of a blocking queue that uses std::queue and Boost thread locking.
EDIT:
The blog referenced by Rick's answer already has a blocking queue example that uses std::queue and Boost condvars. If your consumer needs to gobble chunks, you can extend the example as follows:
void wait_for_data(size_t how_many)
{
    boost::mutex::scoped_lock lock(the_mutex);
    while (the_queue.size() < how_many)
    {
        the_condition_variable.wait(lock);
    }
}
You may also want to tweak it to allow time-outs and cancellations.
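For example, a possible time-out variant (a sketch; it assumes the same the_mutex/the_queue/the_condition_variable names as the blog example and uses Boost's timed_wait):

bool wait_for_data_for(size_t how_many, boost::posix_time::time_duration timeout)
{
    boost::system_time const deadline = boost::get_system_time() + timeout;
    boost::mutex::scoped_lock lock(the_mutex);
    while (the_queue.size() < how_many)
    {
        if (!the_condition_variable.timed_wait(lock, deadline))
            return false;   // timed out before enough data arrived
    }
    return true;
}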
You mentioned that speed is a concern. If your Infos are heavyweight, you should consider passing them around by shared_ptr. You can also try making your Infos fixed-size and using a memory pool (which can be much faster than the heap).
As you mentioned that you don't care if your iterating consumer misses some newly added entries, you could use a copy-on-write list underneath. That allows the iterating consumer to operate on a consistent snapshot of the list from when it first started while, in other threads, updates to the list yield fresh but consistent copies without perturbing any of the extant snapshots.
The trade-off here is that each update to the list requires locking for exclusive access long enough to copy the entire list. This technique is biased toward having many concurrent readers and less frequent updates.
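A hedged sketch of the copy-on-write idea in C++11 terms (shared_ptr snapshots; the names are illustrative):

#include <list>
#include <memory>
#include <mutex>

struct Info { /* ... */ };

class CowList
{
public:
    // Readers grab an immutable snapshot; iterating it never races with writers.
    std::shared_ptr<const std::list<Info>> snapshot() const
    {
        std::lock_guard<std::mutex> guard(m_);
        return data_;
    }

    // Writers copy the whole list, modify the copy, then publish it.
    void push_back(const Info& x)
    {
        std::lock_guard<std::mutex> guard(m_);
        auto fresh = std::make_shared<std::list<Info>>(*data_);
        fresh->push_back(x);
        data_ = std::move(fresh);
    }

private:
    mutable std::mutex m_;
    std::shared_ptr<const std::list<Info>> data_ =
        std::make_shared<std::list<Info>>();
};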
Trying to add intrinsic locking to the container first requires you to think about which operations need to behave as atomic groups. For instance, checking whether the list is empty before trying to pop the first element requires an atomic pop-if-not-empty operation; otherwise, the answer to whether the list is empty can change between when the caller receives the answer and when it attempts to act upon it.
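A minimal sketch of such a pop-if-not-empty, with the emptiness check and the pop under one lock (C++11; the names are illustrative):

#include <list>
#include <mutex>

template <typename T>
bool try_pop_front(std::list<T>& list, std::mutex& m, T& out)
{
    std::lock_guard<std::mutex> guard(m);
    if (list.empty())
        return false;   // the answer can't go stale: we still hold the lock
    out = list.front();
    list.pop_front();
    return true;
}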
It's not clear in your example above what guarantees the iteration must obey. Must every element in the list eventually be visited by the iterating thread? Does it make multiple passes? What does it mean for one thread to remove an element from the list while another thread is running DoAction() against it? I suspect that working through these questions will lead to significant design changes.
You're working in C++, and you mentioned needing a queue with a pop-if-not-empty operation. I wrote a two-lock queue many years ago using the ACE library's concurrency primitives, as the Boost thread library was not yet ready for production use and the chance of the C++ Standard Library including such facilities was a distant dream. Porting it to something more modern would be easy.
This queue of mine -- called concurrent::two_lock_queue -- allows access to the head of the queue only via RAII. This ensures that acquiring the lock to read the head will always be mated with a release of the lock. A consumer constructs a const_front (const access to head element), a front (non-const access to head element), or a renewable_front (non-const access to head and successor elements) object to represent the exclusive right to access the head element of the queue. Such "front" objects can't be copied.
Class two_lock_queue also offers a pop_front() function that waits until at least one element is available to be removed, but, in keeping with std::queue's and std::stack's style of not mixing container mutation and value copying, pop_front() returns void.
In a companion file, there's a type called concurrent::unconditional_pop, which allows one to ensure through RAII that the head element of the queue will be popped upon exit from the current scope.
The companion file error.hh defines the exceptions that arise from use of the function two_lock_queue::interrupt(), used to unblock threads waiting for access to the head of the queue.
Take a look at the code and let me know if you need more explanation as to how to use it.
If you're using C++0x, you could internally synchronize list iteration this way:
Assume the class has a templated list named objects_ and a boost::mutex named mutex_.
The toAll method is a member method of the list wrapper:
void toAll(std::function<void (T*)> lambda)
{
    // Name the lock: an unnamed temporary would unlock again immediately.
    boost::mutex::scoped_lock lock(this->mutex_);
    for (auto it = this->objects_.begin(); it != this->objects_.end(); ++it)
    {
        T* object = *it;   // objects_ is a list of T*, so dereference the iterator
        if (object != nullptr)
        {
            lambda(object);
        }
    }
}
Calling:
synchronizedList1->toAll(
    [&](T* object) -> void   // or the type that your list holds
    {
        for (auto it = this->knownEntities->begin(); it != this->knownEntities->end(); ++it)
        {
            // Do something
        }
    }
);
You must be using some threading library. If you are using Intel TBB, you can use concurrent_vector or concurrent_queue. See this.
If you want to continue using std::list in a multi-threaded environment, I would recommend wrapping it in a class with a mutex that provides locked access to it. Depending on the exact usage, it might make sense to switch to an event-driven queue model where messages are passed on a queue that multiple worker threads consume (hint: producer/consumer).
I would seriously take Matthieu's thought into consideration. Many problems that are being solved using multi-threaded programming are better solved using message-passing between threads or processes. If you need high throughput and do not absolutely require that the processing share the same memory space, consider using something like the Message-Passing Interface (MPI) instead of rolling your own multi-threaded solution. There are a bunch of C++ implementations available - OpenMPI, Boost.MPI, Microsoft MPI, etc. etc.
I don't think you can get away without any synchronisation at all in this case, as certain operations will invalidate the iterators you are using. With a list this is fairly limited (basically, only if both threads are trying to manipulate iterators to the same element at the same time), but there is still a danger that you'll be removing an element at the same time you're trying to append one to it.
Are you by any chance holding the lock across DoAction(i)? You obviously only want to hold the lock for the absolute minimum amount of time you can get away with, in order to maximise performance. From the code above, I think you'll want to decompose the loop somewhat in order to speed up both sides of the operation.
Something along the lines of:
while (processItems) {
    Info item;
    bool haveItem = false;
    lock(mutex);
    if (!infoList.empty()) {
        item = infoList.front();
        infoList.pop_front();
        haveItem = true;
    }
    unlock(mutex);
    if (haveItem)
        DoAction(item);   // only act when we actually popped something
    delayALittle();
}
And the insert function would still have to look like this:
lock(mutex);
infoList.push_back(item);
unlock(mutex);
Unless the queue is likely to be massive, I'd be tempted to use something like a std::vector<Info> or even a std::vector<boost::shared_ptr<Info> > to minimize the copying of the Info objects (assuming that these are somewhat more expensive to copy than a boost::shared_ptr). Generally, operations on a vector tend to be a little faster than on a list, especially if the objects stored in the vector are small and cheap to copy.