First of all, I know that it can be implemented with a mutex and condition variable, but I want the most efficient implementation possible.
I would like a semaphore with a fast-path when there's no contention. On Linux this is easy with a futex; for example, here's a wait:
if (AtomicDecremenIfPositive(_counter) > 0) return; // Uncontended
AtomicAdd(&_waiters, 1);
do
{
if (syscall(SYS_futex, &_counter, FUTEX_WAIT_PRIVATE, 0, nullptr, nullptr, 0) == -1) // Sleep
{
AtomicAdd(&_waiters, -1);
throw std::runtime_error("Failed to wait for futex");
}
}
while (AtomicDecrementIfPositive(_counter) <= 0);
AtomicAdd(&_waiters, -1);
and post:
AtomicAdd(&_counter, 1);
if (Load(_waiters) > 0 && syscall(SYS_futex, &_counter, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0) == -1) throw std::runtime_error("Failed to wake futex"); // Wake one
At first I thought for Windows to just use NtWaitForKeyedEvent(). The problem is it's not a direct substitution because it doesn't atomically check the value at _counter before going into the kernel, and so can miss the wake from NtReleaseKeyedEvent(). Worse, then NtReleaseKeyedEvent() would block.
What's the best solution?
Windows has native semaphores with CreateSemaphore. Until and unless you have some kind of documented performance problem doing it the normal way, you shouldn't even consider optimizations that are fragile or hardware-specific.
I think something like this should work:
// bottom 16 bits: post count
// top 16 bits: wait count
struct Semaphore { unsigned val; }
wait(struct Semaphore *s)
{
retry:
do
old = s->val;
if old had posts (bottom 16 bits != 0)
new = old - 1
wait = false
else
new = old + 65536
wait = true
until successful CAS of &s->val from old to new
if wait == true
wait on keyed event
goto retry;
}
post(struct Semaphore *s)
{
do
old = s->val;
if old had waiters (top 16 bits != 0)
// perhaps new = old - 65536 and remove the "goto retry" above?
// not sure, but this is safer...
new = old - 65536 + 1
release = true
else
new = old + 1
release = false
until successful CAS of &s->val from old to new
if release == true
release keyed event
}
edit: that said, I'm not sure this would help you a lot. Your thread pool usually should be big enough that a thread is always ready to process your request. This means that not only waits, but also posts will always take the slow path and go to the kernel. So, counting semaphores are probably the one primitive where you do not really care about a userspace-only fastpath. Stock Win32 semaphores should be good enough. That said, I'm happy to be proven wrong!
I vote for your first idea, e.g critical section and condition variable. Critical section is fast enough and it does use interlocked operation before it goes to sleep. Or, you can experiment with SRWLocks instead of critical section. Condition variables (and SRWLocks) are very fast - their only problem is that there are no conditions on XP, but maybe you do not need to target this platform .
Qt has all kinds of things like QMutex, QSemaphore which are implemented in spirit like what you presented in your question.
Actually, I would suggest replacing the futex stuff with the usual OS-provided synchronization primitives; it should not matter much since that is the slow path anyway.
Related
I have been programming an pthread application. The application has mutex locks shared across threads by the parent thread. For some reason, it throws the following error:
../nptl/pthread_mutex_lock.c:428: __pthread_mutex_lock_full: Assertion `e != ESRCH || !robust' failed.
The application is for capturing high speed network traffic using packet_mmap based approach where there are multiple threads each associated with a socket. I am not sure why it is happening. It happens during testing and I am not able to reproduce the error all times. I googled a lot but I am not able to know about the cause. Thanks for your help.
The cause of the error is due to file read. When the line of file read is commented, the error does not occur. It happens in this line:
fread(this->bit_array, sizeof(int), this->m , fp);
where bit_array is an integer array which is dynamically allocated and m is the size of the array.
Thanks.
In GLIBC 2.31, you were running the following source code of pthread_mutex_lock():
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
newval, 0);
if (oldval != 0)
{
/* The mutex is locked. The kernel will now take care of
everything. */
int private = (robust
? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
: PTHREAD_MUTEX_PSHARED (mutex));
int e = futex_lock_pi ((unsigned int *) &mutex->__data.__lock,
NULL, private);
if (e == ESRCH || e == EDEADLK)
{
assert (e != EDEADLK
|| (kind != PTHREAD_MUTEX_ERRORCHECK_NP
&& kind != PTHREAD_MUTEX_RECURSIVE_NP));
/* ESRCH can happen only for non-robust PI mutexes where
the owner of the lock died. */
assert (e != ESRCH || !robust);
/* Delay the thread indefinitely. */
while (1)
lll_timedwait (&(int){0}, 0, 0 /* ignored */, NULL,
private);
}
oldval = mutex->__data.__lock;
assert (robust || (oldval & FUTEX_OWNER_DIED) == 0);
}
In the above code, the current value of the mutex is atomically read and it appears that it is different than 0 meaning that the mutex is locked.
Then, the assert is triggered because the owner of the mutex died and the mutex was not a robust one (meaning that the mutex has not been automatically released upon the end of the owner thread).
If you can modify the source code, you may need to add the "robust" attribute to the mutex (pthread_mutexattr_setrobust()) in order to make the system release it automatically when the owner dies. But it is error prone as the corresponding critical section of code may have not reached a sane point and so may leave some un-achieved work...
So, it would be better to find the reason why a thread may die without unlocking the mutex. Either it is an error, either you forgot to release the mutex in the termination branch.
Per this copy of GLIBC's pthread_mutex_lock.c:
/* ESRCH can happen only for non-robust PI mutexes where
the owner of the lock died. */
assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
Either one of your threads/processes ended without releasing all its locked resources, or you're using pthread_cancel()/kill() and killing threads or processes while they're running.
Let’s say that I want to increment a property var incremental:Int32 every time a kernel thread is executed:
//SWIFT
var incremental:Int32 = 0
var incrementalBuffer:MTLBuffer!
var incrementalPointer: UnsafeMutablePointer<Int32>!
init(metalView: MTKView) {
...
incrementalBuffer = Renderer.device.makeBuffer(bytes: &incremental, length: MemoryLayout<Int32>.stride)
incrementalPointer = incrementalBuffer.contents().bindMemory(to: Int32.self, capacity: 1)
}
func draw(in view: MTKView) {
...
computeCommandEncoder.setComputePipelineState(computePipelineState)
let width = computePipelineState.threadExecutionWidth
let threadsPerGroup = MTLSizeMake(width, 1, 1)
let threadsPerGrid = MTLSizeMake(10, 1, 1)
computeCommandEncoder.setBuffer(incrementalBuffer, offset: 0, index: 0)
computeCommandEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerGroup)
computeCommandEncoder.endEncoding()
commandBufferCompute.commit()
commandBufferCompute.waitUntilCompleted()
print(incrementalPointer.pointee)
}
//METAL
kernel void compute_shader (device int& incremental [[buffer(0)]]){
incremental++;
}
So I expect outputs:
10
20
30
...
but I get:
1
2
3
...
EDIT:
After some work based on the answer of #JustSomeGuy, Caroline from raywenderlich and one Apple Engineer I get:
[[kernel]] void compute_shader (device atomic_int& incremental [[buffer(0)]],
ushort lid [[thread_position_in_threadgroup]] ){
threadgroup atomic_int local_atomic;
if (lid==0) atomic_store_explicit(&local_atomic, 0, memory_order_relaxed);
atomic_fetch_add_explicit(&local_atomic, 1, memory_order_relaxed);
threadgroup_barrier(mem_flags::mem_threadgroup);
if(lid == 0) {
int local_non_atomic = atomic_load_explicit(&local_atomic, memory_order_relaxed);
atomic_fetch_add_explicit(&incremental, local_non_atomic, memory_order_relaxed);
}
}
and works as expected
The reason you are seeing this problem is because ++ is not atomic. It basically comes down to a code like this
auto temp = incremental;
incremental = temp + 1;
temp;
which means that because the threads are executed in "parallel" (it's not really true cause a number of threads forms a SIMD-group which executes in step-lock, but it's not really important here).
Since the access is not atomic, the result is basically undefined, because there's no way to tell which thread observed which value.
A quick fix is to use atomic_fetch_add_explicit(incremental, 1, memory_order_relaxed). This makes all accesses to incremental atomic. memory_order_relaxed here means that guarantees on the order of operations is relaxed, so this will work only if you are just adding or just subtracting from the value. memory_order_relaxed is the only memory_order supported in MSL. You can read more on this in Metal Shading Language Specification, section 6.13.
But this quick fix is pretty bad because it's going to be slow, because access to incremental will have to be synchronized across all the threads. The other way is to use a common pattern where all threads in threadgroup update a value in threadgroup memory and then one or more of threads atomically update the device memory. So the kernel will looks something like
kernel void compute_shader (device int& incremental [[buffer(0)]], threadgroup int& local [[threadgroup(0)]], ushort lid [[thread_position_in_threadgroup]] ){
atomic_fetch_add_explicit(local, 1, memory_order_relaxed);
threadgroup_barrier(mem_flags::mem_threadgroup);
if(lid == 0) {
atomic_fetch_add_explicit(incremental, local, memory_order_relaxed);
}
}
Which basically means: every thread in threadgroup should add atomically 1 to local, wait until every thread is done (threadgroup_barrier) and then exactly one thread adds atomically the total local to incremental.
atomic_fetch_add_explicit on a threadgroup variable will use threadgroup atomics instead of global atomics which should be faster.
You can read specification I linked above to learn more, these patterns are mentioned in samples there.
I'm trying to implement a readers writers solution in C++ with std::thread.
I create several reader-threads that run in an infinite loop, pausing for some time in between each read access. I tried to recreate the algorithm presented in Tanenbaum's book of Operating Systems:
rc_mtx.lock(); // lock for incrementing readcount
read_count += 1;
if (read_count == 1) // if this is the first reader
db_mtx.lock(); // then make a lock on the database
rc_mtx.unlock();
cell_value = data_base[cell_number]; // read data from database
rc_mtx.lock();
read_count -= 1; // when finished 'sign this reader off'
if (read_count == 0) // if this was the last one
db_mtx.unlock(); // release the lock on the database mutex
rc_mtx.unlock();
Of course, the problem is that the thread who might satisfy the condition of being the last reader (and therefore want to do the unlock) has never acquired the db_mtx.
I tried to open another 'mother' thread for the readers to take care of
acquiring and releasing the mutex, but I go lost during the process.
If there is an elegant way to overcome this issue (thread might try to release a mutex that has never been acquired) in an elegant way I'd love to hear!
You can use a condition variable to pause writers if readers are in progress, instead of using a separate lock.
// --- read code
rw_mtx.lock(); // will block if there is a write in progress
read_count += 1; // announce intention to read
rw_mtx.unlock();
cell_value = data_base[cell_number];
rw_mtx.lock();
read_count -= 1; // announce intention to read
if (read_count == 0) rw_write_q.notify_one();
rw_mtx.unlock();
// --- write code
std::unique_lock<std::mutex> rw_lock(rw_mtx);
write_count += 1;
rw_write_q.wait(rw_lock, []{return read_count == 0;});
data_base[cell_number] = cell_value;
write_count -= 1;
if (write_count > 0) rw_write_q.notify_one();
This implementation has a fairness issue, because new readers can cut in front of waiting writers. A completely fair implementation would probably involve a proper queue that would allow new readers to wait behind waiting writers, and new writers to wait behind any waiting readers.
In C++14, you can use a shared_timed_mutex instead of mutex to achieve multiple readers/single writer access.
// --- read code
std::shared_lock<std::shared_timed_mutex> read_lock(rw_mtx);
cell_value = data_base[cell_number];
// --- write code
std::unique_lock<std::shared_timed_mutex> write_lock(rw_mtx);
data_base[cell_number] = cell_value;
There will likely be a plain shared_mutex implementation in the next C++ standard (probably C++17).
I'm working on a program that simulates a gas station. Each car at the station is it's own thread. Each car must loop through a single bitmask to check if a pump is open, and if it is, update the bitmask, fill up, and notify other cars that the pump is now open. My current code works but there are some issues with load balancing. Ideally all the pumps are used the same amount and all cars get equal fill-ups.
EDIT: My program basically takes a number of cars, pumps, and a length of time to run the test for. During that time, cars will check for an open pump by constantly calling this function.
int Station::fillUp()
{
// loop through the pumps using the bitmask to check if they are available
for (int i = 0; i < pumpsInStation; i++)
{
//Check bitmask to see if pump is open
stationMutex->lock();
if ((freeMask & (1 << i)) == 0 )
{
//Turning the bit on
freeMask |= (1 << i);
stationMutex->unlock();
// Sleeps thread for 30ms and increments counts
pumps[i].fillTankUp();
// Turning the bit back off
stationMutex->lock();
freeMask &= ~(1 << i);
stationCondition->notify_one();
stationMutex->unlock();
// Sleep long enough for all cars to have a chance to fill up first.
this_thread::sleep_for(std::chrono::milliseconds((((carsInStation-1) * 30) / pumpsInStation)-30));
return 1;
}
stationMutex->unlock();
}
// If not pumps are available, wait until one becomes available.
stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
return -1;
}
I feel the issue has something to do with locking the bitmask when I read it. Do I need to have some sort of mutex or lock around the if check?
It looks like every car checks the availability of pump #0 first, and if that pump is busy it then checks pump #1, and so on. Given that, it seems expected to me that pump #0 would service the most cars, followed by pump #1 serving the second-most cars, all the way down to pump #(pumpsInStation-1) which only ever gets used in the (relatively rare) situation where all of the pumps are in use simultaneously at the time a new car pulls in.
If you'd like to get better load-balancing, you should probably have each car choose a different random ordering to iterate over the pumps, rather than having them all check the pumps' availability in the same order.
Normally I wouldn't suggest refactoring as it's kind of rude and doesn't go straight to the answer, but here I think it would help you a bit to break your logic into three parts, like so, to better show where the contention lies:
int Station::acquirePump()
{
// loop through the pumps using the bitmask to check if they are available
ScopedLocker locker(&stationMutex);
for (int i = 0; i < pumpsInStation; i++)
{
// Check bitmask to see if pump is open
if ((freeMask & (1 << i)) == 0 )
{
//Turning the bit on
freeMask |= (1 << i);
return i;
}
}
return -1;
}
void Station::releasePump(int n)
{
ScopedLocker locker(&stationMutex);
freeMask &= ~(1 << n);
stationCondition->notify_one();
}
bool Station::fillUp()
{
// If a pump is available:
int i = acquirePump();
if (i != -1)
{
// Sleeps thread for 30ms and increments counts
pumps[i].fillTankUp();
releasePump(i)
// Sleep long enough for all cars to have a chance to fill up first.
this_thread::sleep_for(std::chrono::milliseconds((((carsInStation-1) * 30) / pumpsInStation)-30));
return true;
}
// If no pumps are available, wait until one becomes available.
stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
return false;
}
Now when you have the code in this form, there is a load balancing issue which is important to fix if you don't want to "exhaust" one pump or if it too might have a lock inside. The issue lies in acquirePump where you are checking the availability of free pumps in the same order for each car. A simple tweak you can make to balance it better is like so:
int Station::acquirePump()
{
// loop through the pumps using the bitmask to check if they are available
ScopedLocker locker(&stationMutex);
for (int n = 0, i = startIndex; n < pumpsInStation; ++n, i = (i+1) % pumpsInStation)
{
// Check bitmask to see if pump is open
if ((freeMask & (1 << i)) == 0 )
{
// Change the starting index used to search for a free pump for
// the next car.
startIndex = (startIndex+1) % pumpsInStation;
// Turning the bit on
freeMask |= (1 << i);
return i;
}
}
return -1;
}
Another thing I have to ask is if it's really necessary (ex: for memory efficiency) to use bit flags to indicate whether a pump is used. If you can use an array of bool instead, you'll be able to avoid locking completely and simply use atomic operations to acquire and release pumps, and that'll avoid creating a traffic jam of locked threads.
Imagine that the mutex has a queue associated with it, containing the waiting threads. Now, one of your threads manages to get the mutex that protects the bitmask of occupied stations, checks if one specific place is free. If it isn't, it releases the mutex again and loops, only to go back to the end of the queue of threads waiting for the mutex. Firstly, this is unfair, because the first one to wait is not guaranteed to get the next free slot, only if that slot happens to be the one on its loop counter. Secondly, it causes an extreme amount of context switches, which is bad for performance. Note that your approach should still produce correct results in that no two cars collide while accessing a single filling station, but the behaviour is suboptimal.
What you should do instead is this:
lock the mutex to get exclusive access to the possible filling stations
locate the next free filling station
if none of the stations are free, wait for the condition variable and restart at point 2
mark the slot as occupied and release the mutex
fill up the car (this is where the sleep in the simulation actually makes sense, the other one doesn't)
lock the mutex
mark the slot as free and signal the condition variable to wake up others
release the mutex again
Just in case that part isn't clear to you, waiting on a condition variable implicitly releases the mutex while waiting and reacquires it afterwards!
I have a program that spawns 3 worker threads that do some number crunching, and waits for them to finish like so:
#define THREAD_COUNT 3
volatile LONG waitCount;
HANDLE pSemaphore;
int main(int argc, char **argv)
{
// ...
HANDLE threads[THREAD_COUNT];
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
waitCount = 0;
for (int j=0; j<THREAD_COUNT; ++j)
{
threads[j] = CreateThread(NULL, 0, Iteration, p+j, 0, NULL);
}
WaitForMultipleObjects(THREAD_COUNT, threads, TRUE, INFINITE);
// ...
}
The worker threads use a custom Barrier function at certain points in the code to wait until all other threads reach the Barrier:
void Barrier(volatile LONG* counter, HANDLE semaphore, int thread_count = THREAD_COUNT)
{
LONG wait_count = InterlockedIncrement(counter);
if ( wait_count == thread_count )
{
*counter = 0;
ReleaseSemaphore(semaphore, thread_count - 1, NULL);
}
else
{
WaitForSingleObject(semaphore, INFINITE);
}
}
(Implementation based on this answer)
The program occasionally deadlocks. If at that point I use VS2008 to break execution and dig around in the internals, there is only 1 worker thread waiting on the Wait... line in Barrier(). The value of waitCount is always 2.
To make things even more awkward, the faster the threads work, the more likely they are to deadlock. If I run in Release mode, the deadlock comes about 8 out of 10 times. If I run in Debug mode and put some prints in the thread function to see where they hang, they almost never hang.
So it seems that some of my worker threads are killed early, leaving the rest stuck on the Barrier. However, the threads do literally nothing except read and write memory (and call Barrier()), and I'm quite positive that no segfaults occur. It is also possible that I'm jumping to the wrong conclusions, since (as mentioned in the question linked above) I'm new to Win32 threads.
What could be going on here, and how can I debug this sort of weird behavior with VS?
How do I debug weird thread behaviour?
Not quite what you said, but the answer is almost always: understand the code really well, understand all the possible outcomes and work out which one is happening. A debugger becomes less useful here, because you can either follow one thread and miss out on what is causing other threads to fail, or follow from the parent, in which case execution is no longer sequential and you end up all over the place.
Now, onto the problem.
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
From the MSDN documentation:
lInitialCount [in]: The initial count for the semaphore object. This value must be greater than or equal to zero and less than or equal to lMaximumCount. The state of a semaphore is signaled when its count is greater than zero and nonsignaled when it is zero. The count is decreased by one whenever a wait function releases a thread that was waiting for the semaphore. The count is increased by a specified amount by calling the ReleaseSemaphore function.
And here:
Before a thread attempts to perform the task, it uses the WaitForSingleObject function to determine whether the semaphore's current count permits it to do so. The wait function's time-out parameter is set to zero, so the function returns immediately if the semaphore is in the nonsignaled state. WaitForSingleObject decrements the semaphore's count by one.
So what we're saying here, is that a semaphore's count parameter tells you how many threads are allowed to perform a given task at once. When you set your count initially to THREAD_COUNT you are allowing all your threads access to the "resource" which in this case is to continue onwards.
The answer you link uses this creation method for the semaphore:
CreateSemaphore(0, 0, 1024, 0)
Which basically says none of the threads are permitted to use the resource. In your implementation, the semaphore is signaled (>0), so everything carries on merrily until one of the threads manages to decrease the count to zero, at which point some other thread waits for the semaphore to become signaled again, which probably isn't happening in sync with your counters. Remember when WaitForSingleObject returns it decreases the counter on the semaphore.
In the example you've posted, setting:
::ReleaseSemaphore(sync.Semaphore, sync.ThreadsCount - 1, 0);
Works because each of the WaitForSingleObject calls decrease the semaphore's value by 1 and there are threadcount - 1 of them to do, which happen when the threadcount - 1 WaitForSingleObjects all return, so the semaphore is back to 0 and therefore unsignaled again, so on the next pass everybody waits because nobody is allowed to access the resource at once.
So in short, set your initial value to zero and see if that fixes it.
Edit A little explanation: So to think of it a different way, a semaphore is like an n-atomic gate. What you do is usually this:
// Set the number of tickets:
HANDLE Semaphore = CreateSemaphore(0, 20, 200, 0);
// Later on in a thread somewhere...
// Get a ticket in the queue
WaitForSingleObject(Semaphore, INFINITE);
// Only 20 threads can access this area
// at once. When one thread has entered
// this area the available tickets decrease
// by one. When there are 20 threads here
// all other threads must wait.
// do stuff
ReleaseSemaphore(Semaphore, 1, 0);
// gives back one ticket.
So the use we're putting semaphores to here isn't quite the one for which they were designed.
It's a bit hard to guess exactly what you might be running into. Parallel programming is one of those places that (IMO) it pays to follow the philosophy of "keep it so simple it's obviously correct", and unfortunately I can't say that your Barrier code seems to qualify. Personally, I think I'd have something like this:
// define and initialize the array of events use for the barrier:
HANDLE barrier_[thread_count];
for (int i=0; i<thread_count; i++)
barrier_[i] = CreateEvent(NULL, true, false, NULL);
// ...
Barrier(size_t thread_num) {
// Signal that this thread has reached the barrier:
SetEvent(barrier_[thread_num]);
// Then wait for all the threads to reach the barrier:
WaitForMultipleObjects(thread_count, barrier_, true, INFINITE);
}
Edit:
Okay, now that the intent has been clarified (need to handle multiple iterations), I'd modify the answer, but only slightly. Instead of one array of Events, have two: one for the odd iterations and one for the even iterations:
// define and initialize the array of events use for the barrier:
HANDLE barrier_[2][thread_count];
for (int i=0; i<thread_count; i++) {
barrier_[0][i] = CreateEvent(NULL, true, false, NULL);
barrier_[1][i] = CreateEvent(NULL, true, false, NULL);
}
// ...
Barrier(size_t thread_num, int iteration) {
// Signal that this thread has reached the barrier:
SetEvent(barrier_[iteration & 1][thread_num]);
// Then wait for all the threads to reach the barrier:
WaitForMultipleObjects(thread_count, &barrier[iteration & 1], true, INFINITE);
ResetEvent(barrier_[iteration & 1][thread_num]);
}
In your barrier, what prevents this line:
*counter = 0;
to be executed while this other one is executed by another thread?
LONG wait_count =
InterlockedIncrement(counter);