Shared-memory IPC synchronization (lock-free)

Consider the following scenario:
Requirements:
Intel x64 Server (multiple CPU-sockets => NUMA)
Ubuntu 12, GCC 4.6
Two processes sharing large amounts of data over (named) shared-memory
Classical producer-consumer scenario
Memory is arranged in a circular buffer (with M elements)
Program sequence (pseudo code):
Process A (Producer):
int bufferPos = 0;
while( true )
{
if( isBufferEmpty( bufferPos ) )
{
writeData( bufferPos );
setBufferFull( bufferPos );
bufferPos = ( bufferPos + 1 ) % M;
}
}
Process B (Consumer):
int bufferPos = 0;
while( true )
{
if( isBufferFull( bufferPos ) )
{
readData( bufferPos );
setBufferEmpty( bufferPos );
bufferPos = ( bufferPos + 1 ) % M;
}
}
Now the age-old question: How to synchronize them effectively!?
Protect every read/write access with mutexes
Introduce a "grace period", to allow writes to complete: Read data in buffer N, when buffer(N+3) has been marked as full (dangerous, but seems to work...)
?!?
Ideally I would like something along the lines of a memory-barrier, that guarantees that all previous reads/writes are visible across all CPUs, along the lines of:
writeData( i );
MemoryBarrier();
//All data written and visible, set flag
setBufferFull( i );
This way, I would only have to monitor the buffer flags and then could read the large data chunks safely.
Generally I'm looking for something along the lines of acquire/release fences as described by Preshing here:
http://preshing.com/20130922/acquire-and-release-fences/
(if I understand it correctly the C++11 atomics only work for threads of a single process and not along multiple processes.)
However the GCC-own memory barriers (__sync_synchronize in combination with the compiler barrier asm volatile( "" ::: "memory" ) to be sure) don't seem to work as expected, as writes become visible after the barrier, when I expected them to be completed.
Any help would be appreciated...
BTW: Under windows this just works fine using volatile variables (a Microsoft specific behaviour)...

Boost Interprocess has support for Shared Memory.
Boost Lockfree has a Single-Producer Single-Consumer queue type (spsc_queue). This is basically what you refer to as a circular buffer.
Here's a demonstration that passes IPC messages (in this case, of type string) using this queue, in a lock-free fashion.
Defining the types
First, let's define our types:
namespace bip = boost::interprocess;
namespace shm
{
template <typename T>
using alloc = bip::allocator<T, bip::managed_shared_memory::segment_manager>;
using char_alloc = alloc<char>;
using shared_string = bip::basic_string<char, std::char_traits<char>, char_alloc >;
using string_alloc = alloc<shared_string>;
using ring_buffer = boost::lockfree::spsc_queue<
shared_string,
boost::lockfree::capacity<200>
// alternatively, pass
// boost::lockfree::allocator<string_alloc>
>;
}
For simplicity I chose to demo the compile-time-size spsc_queue implementation (selected by the boost::lockfree::capacity<> option), arbitrarily requesting a capacity of 200 elements.
The shared_string typedef defines a string that transparently allocates from the shared memory segment, so its contents are also "magically" shared with the other process.
The consumer side
This is the simplest, so:
int main()
{
// create segment and corresponding allocator
bip::managed_shared_memory segment(bip::open_or_create, "MySharedMemory", 65536);
shm::char_alloc char_alloc(segment.get_segment_manager());
shm::ring_buffer *queue = segment.find_or_construct<shm::ring_buffer>("queue")();
This opens the shared memory area and locates the shared queue, creating it if it does not exist yet. NOTE: This should be synchronized in real life.
Now for the actual demonstration:
while (true)
{
std::this_thread::sleep_for(std::chrono::milliseconds(10));
shm::shared_string v(char_alloc);
if (queue->pop(v))
std::cout << "Processed: '" << v << "'\n";
}
}
The consumer just monitors the queue indefinitely for pending jobs and processes at most one every ~10ms.
The Producer side
The producer side is very similar:
int main()
{
bip::managed_shared_memory segment(bip::open_or_create, "MySharedMemory", 65536);
shm::char_alloc char_alloc(segment.get_segment_manager());
shm::ring_buffer *queue = segment.find_or_construct<shm::ring_buffer>("queue")();
Again, add proper synchronization to the initialization phase. Also, you would probably make the producer in charge of freeing the shared memory segment in due time. In this demonstration, I just "let it hang". This is nice for testing, see below.
So, what does the producer do?
for (const char* s : { "hello world", "the answer is 42", "where is your towel" })
{
std::this_thread::sleep_for(std::chrono::milliseconds(250));
queue->push({s, char_alloc});
}
}
Right, the producer produces precisely 3 messages in ~750ms and then exits.
Consequently, if we run the following (assuming a POSIX shell with job control):
./producer& ./producer& ./producer&
wait
./consumer&
the consumer will print 3x3 messages "immediately" and keep running. Running
./producer& ./producer& ./producer&
again after this will show the messages "trickle in" in real time (in bursts of 3 at ~250ms intervals), because the consumer is still running in the background.
See the full code online in this gist: https://gist.github.com/sehe/9376856
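If Boost is not an option, the flag-publication pattern from the question can be sketched directly with C++11 atomics. The sketch below is an illustration only: Slot, M, produce and consume are hypothetical names, and the two sides run as threads here. For real cross-process use the ring would have to live in the shared memory segment, and it assumes std::atomic<int> is lock-free on the target platform (the standard recommends, but does not require, that lock-free atomics be address-free).

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Hypothetical slot layout: a payload plus a flag that publishes it.
struct Slot {
    std::atomic<int> full{0};  // 0 = empty, 1 = full
    int payload[64];
};

constexpr std::size_t M = 8;  // number of slots in the ring

// Producer: fill slots in order; the release store publishes the payload.
void produce(Slot* ring, int count) {
    std::size_t pos = 0;
    for (int i = 0; i < count;) {
        Slot& s = ring[pos];
        if (s.full.load(std::memory_order_acquire) == 0) {
            for (int& x : s.payload) x = i;              // writeData
            s.full.store(1, std::memory_order_release);  // setBufferFull
            pos = (pos + 1) % M;
            ++i;
        }
    }
}

// Consumer: the acquire load makes the producer's payload writes visible.
long long consume(Slot* ring, int count) {
    long long sum = 0;
    std::size_t pos = 0;
    for (int i = 0; i < count;) {
        Slot& s = ring[pos];
        if (s.full.load(std::memory_order_acquire) == 1) {
            for (int x : s.payload) sum += x;            // readData
            s.full.store(0, std::memory_order_release);  // setBufferEmpty
            pos = (pos + 1) % M;
            ++i;
        }
    }
    return sum;
}
```

The release store on the flag plays the role of the MemoryBarrier() call in the question: everything written to the slot before it is visible to whoever observes the flag with an acquire load.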

Related

Which types of memory_order should be used for non-blocking behaviour with an atomic_flag?

I'd like, instead of having my threads wait, doing nothing, for other threads to finish using data, to do something else in the meantime (like checking for input, or re-rendering the previous frame in the queue, and then returning to check to see if the other thread is done with its task).
I think this code that I've written does that, and it "seems" to work in the tests I've performed, but I don't really understand how std::memory_order_acquire and std::memory_order_release work exactly, so I'd like some expert advice on whether I'm using them correctly to achieve the behaviour I want.
Also, I've never seen multithreading done this way before, which makes me a bit worried. Are there good reasons not to have a thread do other tasks instead of waiting?
/*test program
intended to test if atomic flags can be used to perform other tasks while shared
data is in use, instead of blocking
each thread enters the flag protected part of the loop 20 times before quitting
if the flag indicates that the if block is already in use, the thread is intended to
execute the code in the else block (only up to 5 times to avoid cluttering the output)
debug note: this doesn't work with std::cout because all the threads are using it at once
and it's not thread safe so it all gets garbled. at least it didn't crash
real world usage
one thread renders and draws to the screen, while the other checks for input and
provides frameData for the renderer to use. neither thread should ever block*/
#include <fstream>
#include <atomic>
#include <thread>
#include <string>
struct ThreadData {
int numTimesToWriteToDebugIfBlockFile;
int numTimesToWriteToDebugElseBlockFile;
};
class SharedData {
public:
SharedData() {
threadData = new ThreadData[10];
for (int a = 0; a < 10; ++a) {
threadData[a] = { 20, 5 };
}
flag.clear();
}
~SharedData() {
delete[] threadData;
}
void runThread(int threadID) {
while (this->threadData[threadID].numTimesToWriteToDebugIfBlockFile > 0) {
if (!this->flag.test_and_set(std::memory_order_acquire)) { // test_and_set returns the previous value: false means this thread took the flag
std::string fileName = "debugIfBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", running, output #" << this->threadData[threadID].numTimesToWriteToDebugIfBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugIfBlockFile -= 1;
this->flag.clear(std::memory_order_release);
}
else {
if (this->threadData[threadID].numTimesToWriteToDebugElseBlockFile > 0) {
std::string fileName = "debugElseBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", standing by, output #" << this->threadData[threadID].numTimesToWriteToDebugElseBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugElseBlockFile -= 1;
}
}
}
}
private:
ThreadData* threadData;
std::atomic_flag flag;
};
void runThread(int threadID, SharedData* sharedData) {
sharedData->runThread(threadID);
}
int main() {
SharedData sharedData;
std::thread thread[10];
for (int a = 0; a < 10; ++a) {
thread[a] = std::thread(runThread, a, &sharedData);
}
thread[0].join();
thread[1].join();
thread[2].join();
thread[3].join();
thread[4].join();
thread[5].join();
thread[6].join();
thread[7].join();
thread[8].join();
thread[9].join();
return 0;
}

The memory ordering you're using here is correct.
The acquire memory order when you test and set your flag (to take your hand-written lock) has the effect, informally speaking, of preventing any memory accesses of the following code from becoming visible before the flag is tested. That's what you want, because you want to ensure that those accesses are effectively not done if the flag was already set. Likewise, the release order on the clear at the end prevents any of the preceding accesses from becoming visible after the clear, which is also what you need so that they only happen while the lock is held.
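A minimal sketch of that acquire/release pairing, assuming a small hypothetical wrapper (TryFlagLock and count_with_flag are illustrative names, not part of the question). One detail that is easy to get backwards: test_and_set returns the previous value, so the lock was taken only when it returns false.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical non-blocking lock built on std::atomic_flag.
class TryFlagLock {
public:
    // Returns true if the caller now owns the lock.
    // test_and_set yields the PREVIOUS value, so false means we set it first.
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Each thread performs `rounds` protected increments, yielding when the lock is busy.
int count_with_flag(int threads, int rounds) {
    TryFlagLock lock;
    int shared = 0;  // only touched while the lock is held
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &shared, rounds] {
            int done = 0;
            while (done < rounds) {
                if (lock.try_lock()) {
                    ++shared;  // critical section
                    ++done;
                    lock.unlock();
                } else {
                    std::this_thread::yield();  // placeholder for other work
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```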
However, it's probably simpler to just use a std::mutex. If you don't want to wait to take the lock, but instead do something else if you can't, that's what try_lock is for.
class SharedData {
// ...
private:
std::mutex my_lock;
};
// ...
if (my_lock.try_lock()) {
// lock was taken, proceed with critical section
my_lock.unlock();
} else {
// lock not taken, do non-critical work
}
This may have a bit more overhead, but avoids the need to think about atomicity and memory ordering. It also gives you the option to easily do a blocking wait if that later becomes useful. If you've designed your program around an atomic_flag and later find a situation where you must wait to take the lock, you may find yourself stuck with either spinning while continually retrying the lock (which is wasteful of CPU cycles), or something like std::this_thread::yield(), which may wait for longer than necessary after the lock is available.
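The try_lock pattern can also be written with std::unique_lock and std::try_to_lock, so the mutex is released automatically when the guard goes out of scope. This is only a sketch; Counters and run are hypothetical names, and the yield stands in for whatever non-critical work the thread would do.

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct Counters {
    std::mutex lock;
    int critical = 0;  // protected by lock
};

// Each thread performs `rounds` critical updates and does other work
// (here: just a yield) whenever the lock is busy.
int run(int threads, int rounds) {
    Counters c;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&c, rounds] {
            int done = 0;
            while (done < rounds) {
                std::unique_lock<std::mutex> guard(c.lock, std::try_to_lock);
                if (guard.owns_lock()) {
                    ++c.critical;  // critical section; guard unlocks at scope exit
                    ++done;
                } else {
                    std::this_thread::yield();  // placeholder for non-critical work
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return c.critical;
}
```

If a blocking wait is needed later, the same mutex can simply be locked with std::lock_guard in that code path.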
It's true this pattern is somewhat unusual. If there is always non-critical work to be done that doesn't need the lock, commonly you'd design your program to have a separate thread that just does the non-critical work continuously, and then the "critical" thread can just block as it waits for the lock.

Moving from thread-based pipelining to task-based parallelism? (C++)

I'm looking at how to migrate some existing C++ code from thread-based to task-based parallelism, and whether that migration is desirable. Here's my scenario:
Suppose I have some function to execute on an event. Say I have a camera and each time a frame arrives I want to do some heavy processing and save the results. Some of the processing is serial, so if I just process each frame serially in the same thread, I don't get full CPU utilization. Say the frames arrive every 33ms and the processing latency for a frame is close to 100ms.
So in my current implementation I create say 3 threads that process frames and assign each new frame to one of these worker thread in a round-robin. So thread T0 might process frames F0, F3, F6, etc. Now I get full CPU utilization and I don't have to drop frames to maintain real-time rates.
Since the processing requires various big, temporary resources, I can allocate those up-front for each worker thread. So they do not have to be re-allocated for every frame. This strategy of per-thread resources works well for granularity: if they were being allocated per-frame this would take too long, but with many more worker threads we would run out of resources.
I do not see a way to replace this thread-based parallelism with task-based parallelism using standard C++11 or Microsoft's PPL library. If there is a pattern for doing so that could be sketched below, I would be very happy to learn it.
The question is where to store the state - the allocated temporary resources (e.g. GPU memory) - which can be re-used for subsequent frames but must not conflict with the resources for a currently processing frame.
Is it even desirable to migrate to task-based parallelism in this kind of case?

I figured this out. Here is an example solution:
#include <chrono>
#include <iostream>
#include <memory>
#include <ppltasks.h>
#include <thread>
#include <vector>
using PipelineState = int;
using PipelineStateArg = std::shared_ptr<PipelineState>;
using FrameState = int;
struct Pipeline
{
PipelineStateArg state;
concurrency::task<void> task;
};
std::vector<Pipeline> pipelines;
void proc(const FrameState& fs, PipelineState& ps)
{
std::cout << "Process frame " << fs << " in pipeline " << ps << std::endl;
}
void on_frame(int index)
{
FrameState frame = index;
if (index < 2)
{
// Start a new pipeline
auto state = std::make_shared<PipelineState>(index);
pipelines.push_back({state, concurrency::create_task([=]()
{
proc(frame, *state);
})});
}
else
{
// Use an existing pipeline
auto& pipeline = pipelines[index & 1];
auto state = pipeline.state;
pipeline.task = pipeline.task.then([=]()
{
proc(frame, *state);
});
}
}
int main()
{
for (int i = 0; i < 100; ++i)
{
on_frame(i);
std::this_thread::sleep_for(std::chrono::milliseconds(33));
}
for (auto& pipeline : pipelines)
pipeline.task.wait();
}

Word-Net Thread safety

I am using Word-Net in a C++ project (although the library is in C). Specifically, I am calling only two functions:
findtheinfo_ds
traceptrs_ds
Now, if I understand the underlying structure correctly (it's quite old, as it was written in the late nineties, I think), the library uses files as the database from which it retrieves the buffer results I get.
However, I am not sure about the thread safety of the library.
My current algorithm is:
SynsetPtr syn = findtheinfo_ds( query , NOUN, HYPERPTR, ALLSENSES );
if ( syn )
{
// Iterate all senses
while ( syn )
{
for ( int i = 0; i < syn->wcount; i++ )
std::cout << "synonym: " << syn->words[i] << std::endl;
int i = 0;
SynsetPtr ptr = traceptrs_ds( syn, HYPERPTR, NOUN, 1 );
while ( ptr )
{
for ( int x = 0; x <= i; x++ )
std::cout << "\t";
for ( int i = 0; i < ptr->wcount; i++ )
std::cout << ptr->words[i] << ", ";
std::cout << std::endl;
i++;
auto old_ptr = ptr;
ptr = traceptrs_ds( ptr, HYPERPTR, NOUN, 1 );
free_syns( old_ptr );
}
free_syns( ptr );
syn = syn->nextss;
}
free_syns( syn );
}
}
However, I want to run parallel threads, searching for different words at the same time.
I understand that most UNIX/Linux distributions of today have thread-safe file system calls.
Furthermore, I intend to access to the above loop, per one thread only.
What I am worried about, is that before this loop above, a
wninit();
call has to take place, which makes me assume that a singleton is initialized somewhere in the library. I cannot take a peek at the code as it is closed-source, and I do not have access to that singleton, as wninit() only returns an int for success.
Is there any way to either:
Ensure thread-safety in this scenario, or
Find out (through any possible way), if the library is thread safe?
It is loaded dynamically, from a Debian package called wordnet-base, which installs libwordnet-3.0.so
Many thanks to anyone who can help!

Well, the only way to ensure that a library is really thread-safe is to analyze its code. Or simply ask its author and then trust his/her answer :). Usually data stored on disk isn't the cause of thread unsafety, but there are a lot of places where code may break in a multi-threaded environment. One has to check for global variables, the existence of variables declared static inside library functions, etc.
There is, however, a solution which could be used if you don't have the time and/or intent to study the code. You may use a multiprocess technique where parallel tasks are performed in worker processes, not worker threads, and a director process prepares job units for the workers and collects the results. Depending on the task, such workers may be implemented as FastCGI, or communicate with the parent using Boost.Interprocess.
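A rough POSIX-only sketch of the director/worker idea, using plain fork and pipes rather than FastCGI or Boost.Interprocess. The lookup function is a hypothetical stand-in for the calls into the library (it just returns the word length), and run_in_workers is an illustrative name; each worker gets its own copy of the library's global state because it runs in its own process.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical stand-in for a call into a library that may not be thread-safe.
int lookup(const std::string& word) { return static_cast<int>(word.size()); }

// Director: forks one worker per word and collects one int result from each.
std::vector<int> run_in_workers(const std::vector<std::string>& words) {
    struct Worker { pid_t pid; int read_fd; };
    std::vector<Worker> workers;
    for (const auto& w : words) {
        int fds[2];
        if (pipe(fds) != 0) break;
        pid_t pid = fork();
        if (pid < 0) {  // fork failed: stop creating workers
            close(fds[0]);
            close(fds[1]);
            break;
        }
        if (pid == 0) {  // worker process: compute, report, exit
            close(fds[0]);
            int result = lookup(w);
            ssize_t written = write(fds[1], &result, sizeof result);
            _exit(written == static_cast<ssize_t>(sizeof result) ? 0 : 1);
        }
        close(fds[1]);  // director keeps only the read end
        workers.push_back({pid, fds[0]});
    }
    std::vector<int> results;
    for (const auto& wk : workers) {
        int result = -1;  // -1 marks a worker that failed to report
        if (read(wk.read_fd, &result, sizeof result) !=
            static_cast<ssize_t>(sizeof result)) {
            result = -1;
        }
        close(wk.read_fd);
        waitpid(wk.pid, nullptr, 0);
        results.push_back(result);
    }
    return results;
}
```

A real director would reuse a fixed pool of workers instead of forking per job, but the isolation argument is the same.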

What can you do to stop running out of stack space when multithreading?

I've implemented a working multithreaded merge sort in C++, but I've hit a wall.
In my implementation, I recursively split an input vector into two parts, and then thread these two parts:
void MergeSort(vector<int> *in)
{
if(in->size() < 2)
return;
vector<int>::iterator ite = in->begin();
vector<int> left = vector<int> (ite, ite + in->size()/2);
vector<int> right = vector<int> (ite + in->size()/2, in->end() );
//current thread spawns 2 threads HERE
thread t1 = thread(MergeSort, &left);
thread t2 = thread(MergeSort, &right);
t1.join();
t2.join();
vector<int> ret;
ret.reserve(in->size() );
ret = MergeSortMerge(left, right);
in->clear();
in->insert(in->begin(), ret.begin(), ret.end() );
return;
}
The code looks pretty, but it's one of the most vicious pieces of code I've ever written. Trying to sort an array of more than 1000 int values causes so many threads to spawn that I run out of stack space, and my computer BSODs :( Consistently.
I am well aware of the reason why this code spawns so many threads, which isn't so good, but technically (if not theoretically), is this not a proper implementation?
Based on a bit of Googling, I seem to have found the need for a threadpool. Would the use of a threadpool resolve the fundamental issue I am running into, the fact that I am trying to spawn too many threads? If so, do you have any recommendations on libraries?
Thank you for the advice and help!

As zdan explained, you should limit the number of threads. Two things determine the limit:
The number of CPU cores. In C++11, you can use std::thread::hardware_concurrency() to query the number of hardware threads. However, this function may return 0, meaning the value could not be determined; in that case, you may assume a value such as 2 or 4.
The amount of data to be processed. You could keep dividing the data among threads until each thread handles a single element, but the per-thread overhead would far outweigh the work done. For example, you might decide not to divide any further once a chunk has fewer than 50 elements. The maximum number of threads required is then something like total_data_number / 50 + 1.
Then take the minimum of the two values as the limit.
In your case, because you are creating threads by recursion, you can derive a limit on the recursion depth in a similar way.
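A minimal sketch of such a depth limit, assuming an in-place merge via std::inplace_merge instead of the question's MergeSortMerge. The names merge_sort, spawn_depth and parallel_merge_sort are hypothetical, and the cutoff of 50 elements and the fallback of 2 cores are the arbitrary values mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

using Iter = std::vector<int>::iterator;

// Sorts [first, last); spawns a thread only while depth > 0 and the
// range is large enough to be worth it.
void merge_sort(Iter first, Iter last, unsigned depth) {
    const std::ptrdiff_t n = last - first;
    if (n < 2) return;
    Iter mid = first + n / 2;
    if (depth > 0 && n >= 50) {
        std::thread left(merge_sort, first, mid, depth - 1);
        merge_sort(mid, last, depth - 1);  // right half runs on the current thread
        left.join();
    } else {
        merge_sort(first, mid, 0);  // plain recursion, no new threads
        merge_sort(mid, last, 0);
    }
    std::inplace_merge(first, mid, last);
}

// Smallest depth such that 2^depth >= number of hardware threads.
unsigned spawn_depth() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;  // value unknown: assume 2
    unsigned depth = 0;
    while ((1u << depth) < cores) ++depth;
    return depth;
}

void parallel_merge_sort(std::vector<int>& v) {
    merge_sort(v.begin(), v.end(), spawn_depth());
}
```

With this shape the number of live threads is bounded by roughly the core count, regardless of the input size.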

I don't think a threadpool is going to help you. Since your algorithm is recursive you'll get to a point where all threads in your pool are consumed and the pool won't want to create any more threads and your algorithm will block.
You could probably just limit your thread creation recursion depth to 2 or 3 (unless you've got a LOT of CPUs, going deeper won't make any difference in performance).

You can set your limits on stack space but it is futile. Too many threads, even with a pool, will eat it up at log2(N)*cost per thread. Go for an iterative approach and reduce your overhead. Overhead is the killer.
As far as performance goes, you will probably find that some level of overcommit of N threads, where N is the hardware concurrency, yields the best results. There will be a good balance between overhead and work per core. If N gets very large, like on a GPU, then other options exist (bitonic sort) that make different trade-offs to reduce the communication (waiting/joining) overhead.
Assuming you have a task manager and a semaphore that is constructed for N notifies before allowing the waiting task through,
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>
#include <sometaskmanager.h>
void parallel_merge( size_t N ) {
std::array<int, 1000> ary {0};
// fill array...
intmax_t stride_size = ary.size( )/N; //TODO: Put a MIN size here
auto semaphore = make_semaphore( N );
using iterator = typename std::array<int, 1000>::iterator;
std::vector<std::pair<iterator, iterator>> ranges;
auto last_it = ary.begin( );
for( size_t n=0; n<N; ++n ) {
// the last range absorbs any remainder
auto next_it = ( n + 1 == N ) ? ary.end( ) : std::next( last_it, stride_size );
ranges.emplace_back( last_it, next_it );
last_it = next_it;
}
for( auto const & rng: ranges ) {
add_task( [&semaphore,rng]( ) {
std::sort( rng.first, rng.second );
semaphore.notify( ); // signal that this range is sorted
});
}
semaphore.wait( );
while( ranges.size( ) > 1 ) {
std::vector<std::pair<iterator, iterator>> new_rng;
semaphore = make_semaphore( ranges.size( )/2 );
// merge adjacent pairs; an unpaired last range is carried over below
for( size_t n=0; n+1<ranges.size( ); n+=2 ) {
auto first=ranges[n].first;
auto last=ranges[n+1].second;
add_task( [&semaphore, first, mid=ranges[n].second, last]( ) {
std::inplace_merge( first, mid, last );
semaphore.notify( );
});
new_rng.emplace_back( first, last );
}
if( ranges.size( ) % 2 != 0 ) {
new_rng.push_back( ranges.back( ) );
}
semaphore.wait( ); // all merges of this pass must finish before the next pass
ranges = new_rng;
}
}
As you can see, the bottleneck is in the merge phase, as there is a lot of coordination that must be done. Sean Parent explains how to build a task manager if you don't have one, along with a relative performance analysis, in his presentation Better Code: Concurrency, http://sean-parent.stlab.cc/presentations/2016-11-16-concurrency/2016-11-16-concurrency.pdf . TBB and PPL have task managers.

critical section problem in Windows 7

Why does the code sample below cause one thread to execute way more than another but a mutex does not?
#include <windows.h>
#include <conio.h>
#include <process.h>
#include <iostream>
using namespace std;
typedef struct _THREAD_INFO_ {
COORD coord; // a structure containing x and y coordinates
INT threadNumber; // each thread has its own number
INT count;
}THREAD_INFO, * PTHREAD_INFO;
void gotoxy(int x, int y);
BOOL g_bRun;
CRITICAL_SECTION g_cs;
unsigned __stdcall ThreadFunc( void* pArguments )
{
PTHREAD_INFO info = (PTHREAD_INFO)pArguments;
while(g_bRun)
{
EnterCriticalSection(&g_cs);
//if(TryEnterCriticalSection(&g_cs))
//{
gotoxy(info->coord.X, info->coord.Y);
cout << "T" << info->threadNumber << ": " << info->count;
info->count++;
LeaveCriticalSection(&g_cs);
//}
}
return 0; // returning ends a _beginthreadex thread; ExitThread would skip CRT cleanup
}
int main(void)
{
// OR unsigned int
unsigned int id0, id1; // a place to store the thread IDs returned from _beginthreadex
HANDLE h0, h1; // handles to threads
THREAD_INFO tInfo[2]; // only one of these - not optimal!
g_bRun = TRUE;
ZeroMemory(&tInfo, sizeof(tInfo)); // win32 function - memset(&buffer, 0, sizeof(buffer))
InitializeCriticalSection(&g_cs);
// setup data for the first thread
tInfo[0].threadNumber = 1;
tInfo[0].coord.X = 0;
tInfo[0].coord.Y = 0;
h0 = (HANDLE)_beginthreadex(
NULL, // no security attributes
0, // default stack size
&ThreadFunc, // pointer to function
&tInfo[0], // each thread gets its own data to output
0, // 0 for running or CREATE_SUSPENDED
&id0 ); // return thread id - reused here
// setup data for the second thread
tInfo[1].threadNumber = 2;
tInfo[1].coord.X = 15;
tInfo[1].coord.Y = 0;
h1 = (HANDLE)_beginthreadex(
NULL, // no security attributes
0, // default stack size
&ThreadFunc, // pointer to function
&tInfo[1], // each thread gets its own data to output
0, // 0 for running or CREATE_SUSPENDED
&id1 ); // return thread id - reused here
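_getch();
g_bRun = FALSE;
// wait for both threads to finish before tearing anything down
WaitForSingleObject(h0, INFINITE);
WaitForSingleObject(h1, INFINITE);
CloseHandle(h0);
CloseHandle(h1);
DeleteCriticalSection(&g_cs);
return 0;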
}
void gotoxy(int x, int y) // x=column position and y=row position
{
HANDLE hdl;
COORD coords;
hdl = GetStdHandle(STD_OUTPUT_HANDLE);
coords.X = x;
coords.Y = y;
SetConsoleCursorPosition(hdl, coords);
}

That may not answer your question but the behavior of critical sections changed on Windows Server 2003 SP1 and later.
If you have bugs related to critical sections on Windows 7 that you can't reproduce on an XP machine you may be affected by that change.
My understanding is that on Windows XP critical sections used a FIFO based strategy that was fair for all threads while later versions use a new strategy aimed at reducing context switching between threads.
There's a short note about this on the MSDN page about critical sections
You may also want to check this forum post

Critical sections, like mutexes, are designed to protect a shared resource against conflicting access (such as concurrent modification). Critical sections are not meant to replace thread priority.
You have artificially introduced a shared resource (the screen) and made it into a bottleneck. As a result, the critical section is highly contended. Since both threads have equal priority, there is no reason for Windows to prefer one thread over the other on that basis. Reducing context switches, however, is a reason to pick one thread over another. As a result of that reduction, the utilization of the shared resource goes up. That is a good thing; it means that one thread will finish a lot earlier and the other thread will finish a bit earlier.
To see the effect graphically, compare
A B A B A B A B A B
to
AAAAA BBBBB
The second sequence is shorter because there's only one switch from A to B.

In hand-wavy terms:
CriticalSection is saying the thread wants control to do some things together.
Mutex is making a marker to show 'being busy' so others can wait and notifying of completion so somebody else can start. Somebody else already waiting for the mutex will grab it before you can start the loop again and get it back.
So what you are getting with CriticalSection is a failure to yield between loops. You might see a difference if you had Sleep(0); after LeaveCriticalSection.

I can't say why you're observing this particular behavior, but it's probably to do with the specifics of the implementation of each mechanism. What I CAN say is that unlocking then immediately locking a mutex is a bad thing. You will observe odd behavior eventually.

From some MSDN docs (http://msdn.microsoft.com/en-us/library/ms682530.aspx):
Starting with Windows Server 2003 with Service Pack 1 (SP1), threads waiting on a critical section do not acquire the critical section on a first-come, first-serve basis. This change increases performance significantly for most code
