Counting semaphore that supports multiple acquires in an atomic call - how would this be implemented in terms of other primitives? - concurrency

Suppose that I have many tasks to run that each will consume a known amount of some finite resource (say, memory or disk space) while they are running. Each task will fit within the available resource constraints individually, but there is not enough available to run everything simultaneously. My goal is to allow tasks to "claim" the resources that they will need, or wait until those resources are available. I don't care about optimizing what order the tasks run in (so, for example, first-come first-served would be fine), but the solution must not deadlock and should meet a reasonable notion of "as concurrent as possible" (so "just do tasks one at a time", while technically correct, is not a solution).
I'm imagining a synchronization object somewhat like a semaphore, but instead of acquiring a single permit from a fixed-size pool, each task acquires a "slice" of the appropriate size. While some programming languages appear to offer this functionality (for example, Java's Semaphore supports it), it doesn't seem common, and it's not clear if these implementations can deadlock or not. It's possible that I'm just missing the right search term, though.
A naive approach would be to use a counting semaphore with a large initial value (say, 1 per MiB of memory available), and for each task to acquire permits from the semaphore repeatedly - so a task needing 10 MiB of memory would acquire ten permits. However, this is prone to deadlock - tasks could be blocked partway through acquiring their permits, preventing other smaller tasks from running.
Is there a name for this synchronization object? How could it be safely implemented in terms of more common primitives such as locks, condition variables, and (ordinary) counting semaphores?

This can be implemented using a condition variable (CV) and a private counter, as long as the CV supports notifying all waiting tasks. The lock associated with the CV provides exclusive access to the counter, and the CV itself is used to notify threads to re-check the counter if they're waiting for it to be large enough.
To initialize, set the private counter to the initial value of the multi-acquire semaphore. To implement acquire(count), use the CV to wait for the counter to be at least count, decrement the counter by count, and return. To implement release(count), take the CV's lock, increment the counter by count, and notify all waiters on the CV so that the waiting tasks can re-check the value of the counter.
An example of this approach in Python:
from asyncio import Condition

class Mlemaphore:
    """Like a Semaphore, but with atomic acquire(count). Not thread-safe.

    Has a silly name until someone tells me what these are supposed to be called."""

    def __init__(self, value: int = 1):
        if value < 0:
            raise ValueError("Mlemaphore initial value must be >= 0")
        self._counter = value
        self._condition = Condition()

    async def acquire(self, count: int = 1) -> None:
        """Acquire count permits atomically, or wait until they are available."""
        async with self._condition:
            await self._condition.wait_for(lambda: self._counter >= count)
            self._counter -= count

    def locked(self, count: int = 1) -> bool:
        """Return True if acquire(count) would not return immediately."""
        return self._counter < count

    async def release(self, count: int = 1) -> None:
        """Release count permits.

        Note that unlike in Semaphore, this method is async."""
        async with self._condition:
            self._counter += count
            self._condition.notify_all()
Note: the warning about thread-safety is because Python has both "asyncio", which is a single-threaded approach to concurrency using an event loop, and threading. This sample code is based on the asyncio CV primitive, and thus is asyncio-safe but not thread-safe. Building the exact same logic out of a thread-safe CV (in Python or any other language) would result in a thread-safe Mlemaphore.
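For illustration, here is a minimal sketch of the same logic built on a thread-safe CV, using C++'s std::mutex and std::condition_variable (the class and member names here are placeholders, not part of the original answer):

#include <condition_variable>
#include <mutex>

// Sketch of a thread-safe multi-acquire semaphore (names are illustrative).
class MultiSemaphore {
public:
    explicit MultiSemaphore(unsigned value = 1) : counter_(value) {}

    void acquire(unsigned count = 1) {
        std::unique_lock<std::mutex> lk(mutex_);
        cv_.wait(lk, [&] { return counter_ >= count; });  // re-check on every wakeup
        counter_ -= count;
    }

    void release(unsigned count = 1) {
        {
            std::lock_guard<std::mutex> lk(mutex_);
            counter_ += count;
        }
        cv_.notify_all();  // all waiters re-evaluate their predicates
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    unsigned counter_;
};

As in the Python version, notify_all (rather than notify_one) matters: different waiters may need different counts, and each must re-check whether its own request can now be satisfied.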

Related

Creating Mutexes with Binary Semaphore

I'm working with a simple system that does NOT have mutexes, but rather a limited array of hardware binary semaphores. Typically, all multithreading has been done with heavyweight semaphore techniques that make code both poor in performance and difficult to write correctly without deadlocks.
A naive implementation is to use one semaphore globally in order to ensure atomic access to a critical section. However, this means that unrelated objects (even of different types) will block if any critical section is being accessed.
My current solution to this problem is to use a single global semaphore to ensure atomic access to a guard byte that then ensures atomic access to the particular critical section. I currently have this so far:
while (true) {
    while (mutexLock == Mutex::Locked) {
    } // wait on mutex
    Semaphore semaLock(SemaphoreIndex::Mutex); // RAII semaphore object
    if (mutexLock == Mutex::Unlocked) {
        mutexLock = Mutex::Locked;
        break;
    }
} // Semaphore is released by destructor here
// ... atomically safe code
mutexLock = Mutex::Unlocked;
I have a few questions: Is this the best way to approach this problem? Is this code thread-safe? Is this the same as a "double checked lock"? If so, does it suffer from the same problems and therefore need memory barriers?
EDIT: A few notes on the system this is being implemented on...
It is a RISC 16-bit processor with 32kB RAM. While it has heavy multithreading capabilities, its memory model is very primitive. Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, and there is one core with many threads. Memory barriers are mostly for the compiler to know it should reload memory into general-purpose registers, not for any hardware reason (no cache).
Even if the processor has heavy multithreading capabilities (not sure what you mean by that, by the way), if it is a single-core processor then only one thread can execute at any one time.
A task-switching system must be employed (which would normally run in a privileged mode). Using this system, you must be able to define critical sections in which a (software-implemented) mutex lock/unlock executes atomically.
When you say "one core, many threads" does that mean that you have some kind of kernel running on your processor? The task switching system will be implemented by that kernel.
It might pay to look through the documentation of your kernel or ask your vendor for any tips.
Good luck.
Right now we still don't have that much information about your system (for example: what kind of registers are available for each instruction in parallel? Do you use a banked architecture? How many simultaneous instructions can you actually execute?), but hopefully what I suggest will help you.
If I understand your situation correctly, you have a piece of hardware that does not have true cores, but simply MIMD ability via vectorized operation (based on your reply). It is a "RISC 16-bit processor with 32kB RAM" where:
Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads
The key here is that loads and stores are atomic. Note that you won't be able to do loads and stores larger than 16 bits atomically, since they will be compiled to two separate atomic instructions (and the pair as a whole is therefore not atomic).
Here is the functionality of a mutex:
attempt to lock
unlock
When locking, you might run into issues if several threads attempt to lock at once. For example, say in your hardware N = 4 (the number of processes run in parallel). If instruction streams I1 and I2 both try to lock, they will both succeed in locking: since your loads and stores are atomic but the load-check-store sequence as a whole is not, both processes can see "unlocked" at the same time, and both acquire the lock.
This means you can't do the following:
if mutex_1 unlocked:
lock mutex_1
which might look like this in an arbitrary assembly language:
load arithmetic mutex_addr
or arithmetic immediate(1) // unlocked = 0000 or 1 = 0001, locked = 0001 or 1 = 0001
store mutex_addr arithmetic // not putting in conditional label to provide better synchronization.
jumpifzero MUTEXLABEL arithmetic
To get around this, you will need each "thread" to either know whether the lock it is currently trying to acquire is one someone else is also trying to acquire, or to avoid simultaneous lock access entirely.
I only see one way this can be done on your system (via flag/mutex-ID checking). Associate with each thread the ID of the mutex it is currently trying to acquire, and check all other threads' entries to see whether you can actually take the lock. Your binary semaphores don't really help here, because you would need to associate an individual mutex with a semaphore to use them (and you would still have to load the mutex from RAM).
A simple implementation of this check-every-thread lock and unlock: each mutex has an ID and a state. To avoid per-instruction race conditions, the mutex a thread is currently handling is advertised well before it is actually acquired. Having "identify which lock you want to use" and "actually try to get the lock" happen as two separate steps prevents accidental acquisition on simultaneous access. With this method you can have 2^16 - 1 mutexes (because 0 is used to say no lock is wanted), and your "threads" can exist on any instruction pipe.
// init to zero
volatile uint16_t CURRENT_LOCK_ATTEMPT[NUM_THREADS] = {0};

// thread id is associated with priority (lower id = higher priority)
bool tryAcquireLock(uint16_t mutex_id, bool& mutex_lock_state) {
    if (mutex_lock_state == false) {
        // Do not actually attempt to take the lock until everything is checked.
        // No race condition can happen now: you won't have actually set the lock,
        // and if two threads attempt to acquire the same lock at the same time,
        // both will be able to see that someone else is trying as well.
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // Check all lower (higher-priority) threads; some predetermined
        // priority is needed to decide who wins the lock.
        for (int i = 0; i < MY_THREAD_ID; i++) {
            if (CURRENT_LOCK_ATTEMPT[i] == mutex_id) {
                // clear our attempt and back off
                CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
                return false;
            }
        }
        // Make sure to lock the mutex before clearing which mutex you are currently handling.
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}

// It's your fault if you didn't make sure you owned the lock in the first place.
// If you did own it, there's no race condition, because loads and stores are atomic.
// If you happen to clear the state while another thread is attempting to
// acquire the lock, it simply won't get the lock and no race condition occurs.
void unlock(bool& mutex_lock_state) {
    mutex_lock_state = false;
}
If you want more equal access to resources, you can change the indexing: instead of going from i = 0 up to i < MY_THREAD_ID, you randomly pick a "starting point" and circle back around to MY_THREAD_ID using modulo arithmetic. For example:
bool tryAcquireLock(uint16_t mutex_id, bool& mutex_lock_state, uint16_t per_mutex_random_seed) {
    if (mutex_lock_state == false) {
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // need a per-thread linear congruential generator for speed and consistency
        std::minstd_rand0 random(per_mutex_random_seed);
        for (int i = random() % TOTAL_NUM_THREADS; i != MY_THREAD_ID; i = (i + 1) % TOTAL_NUM_THREADS)
        {
            // same as before
        }
        // if we actually acquired the lock
        GLOBAL_SEED = global_random(); // use another generator to set the next seed to be used
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}
In general, your lack of test-and-set ability really throws a wrench into everything, meaning you are forced to use other algorithms for mutexes. For more information on algorithms that work on architectures without test-and-set, check out this SO post and these Wikipedia algorithms, which rely only on atomic loads and stores:
Dekker's Algorithm
Eisenberg & McGuire Algorithm
Peterson's Algorithm
Lamport's Algorithm
Szymanski's Algorithm
All of these algorithms basically decompose into going through everyone else's flags to check whether you can access the resource safely.
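For a concrete example of the flag-checking idea, here is a minimal sketch of Peterson's algorithm for two threads. It is written with std::atomic for portability; on the hardware described above (atomic plain loads and stores, no cache), ordinary volatile 16-bit variables would play the same role:

#include <atomic>

// Peterson's algorithm for two threads, using only loads and stores.
std::atomic<bool> wants[2] = {{false}, {false}};  // wants[i]: thread i wants the lock
std::atomic<int>  turn{0};                        // whose turn it is to yield

void lock(int self) {
    int other = 1 - self;
    wants[self].store(true);
    turn.store(other);                            // politely let the other thread go first
    while (wants[other].load() && turn.load() == other) {
        // spin until the other thread is done or it is our turn
    }
}

void unlock(int self) {
    wants[self].store(false);
}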

NSOperationQueue vs pthread priority

I have this problem:
I have some C++ code that uses threads. These threads are pthreads.
In my iPhone app I use NSOperationQueue and also some C++ code.
The problem is this: the C++ pthreads always end up with lower priority than the NSOperationQueue operations.
How can I fix this? I have also tried giving a low priority to the NSOperationQueue, but this does not work.
If you have to resort to twiddling priority (notably upwards), it's usually indicative of a design flaw in concurrent models. This should be reserved for very special cases, like a realtime thread (e.g. audio playback).
First assess how your threads and tasks operate, and make sure you have no other choice. Typically, you can do something simple, like reducing the operation queue's max operation count, reducing total thread count, or by grouping your tasks by the resource they require.
What method are you using to determine the threads' priorities?
Also note that setting an operation's priority affects the ordering of enqueued operations (not the thread itself).
I've always been able to solve this problem by tweaking how the work is distributed. You should stop reading now :)
Available, but NOT RECOMMENDED:
To lower an operation's priority, you could approach it in your operation's main:
- (void)main
{
    @autoreleasepool {
        const double priority = [NSThread threadPriority];
        const bool isMainThread = [NSThread isMainThread];
        if (!isMainThread) {
            [NSThread setThreadPriority:priority * 0.5];
        }
        do_your_work_here
        if (!isMainThread) {
            [NSThread setThreadPriority:priority];
        }
    }
}
If you really need to push the kernel after all that, this is how you can set a pthread's priority:
pthreads with real time priority
How to increase thread priority in pthreads?
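For completeness, a rough sketch of what bumping a pthread's priority looks like (the exact policy and values are platform-dependent, real-time policies such as SCHED_RR usually require elevated privileges, and error handling is omitted here):

#include <pthread.h>
#include <sched.h>

// Assumption: 'thread' is a valid pthread_t obtained elsewhere.
void raise_thread_priority(pthread_t thread) {
    struct sched_param param;
    int policy = SCHED_RR;                          // a real-time, round-robin policy
    param.sched_priority = sched_get_priority_min(policy) + 1;
    pthread_setschedparam(thread, policy, &param);  // may fail without privileges
}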

Is a ticket lock bounded wait-free? (under certain assumptions)

I am talking about a ticket lock that might look as follows (in pseudo-C syntax):
unsigned int ticket_counter = 0, lock_counter = 0;

void lock() {
    unsigned int my_ticket = fetch_and_increment(ticket_counter);
    while (my_ticket != lock_counter) {}
}

void unlock() {
    atomic_increment(lock_counter);
}
Let's assume such a ticket lock synchronizes access to a critical section S that is wait-free, i.e. executing the critical section takes exactly c cycles/instructions. Assuming that there are at most p threads in the system, is the synchronization of S using the ticket lock bounded wait-free?
In my opinion, it is, since the ticket lock is fair and thus the upper bound for waiting is O(p * c).
Am I making a mistake? I am a little bit confused. I always thought locking implies not being (bounded) wait-free, because of the following statement:
"It is impossible to construct a wait-free implementation of a queue, stack, priority queue, set, or list from a set of atomic register." (Corollary 5.4.1 in Art of Multiprocessor Programming, Herlihy and Shavit)
However, if the ticket lock (and maybe any other fair locking mechanism) is bounded wait-free under the mentioned assumptions, it might enable the construction of bounded wait-free implementations of queues, stacks, etc. (That's the question I am actually faced with.)
Recall the definition of bounded wait-free in "Art of Multiprocessor Programming", p.59 by Herlihy and Shavit:
"A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps. It is bounded wait-free if there is a bound on the number of steps a method call can take."
Well, I believe you are correct, with some caveats.
Namely, the bounded wait-free property holds only if the critical section S is non-preemptive, which I suppose you can guarantee only for kernel space code (by disabling interrupts in the critical section). Otherwise the OS might decide to switch to another thread while one thread is in the critical section, and then the wait time is unbounded, no?
Also, for kernel code, I suppose p is not the number of software threads but rather the number of hardware threads (or cores, for CPUs that don't support several threads per core). This is because at most p software threads will be runnable at a time, and since S is non-preemptive you have no sleeping threads waiting on the lock.
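As a side note, here is roughly how the pseudo-code above maps onto C++11 atomics (fetch_and_increment becomes fetch_add; memory orderings are left at the sequentially consistent default for simplicity):

#include <atomic>

std::atomic<unsigned int> ticket_counter{0};
std::atomic<unsigned int> lock_counter{0};

void lock() {
    // take a ticket; fetch_add atomically returns the previous value
    unsigned int my_ticket = ticket_counter.fetch_add(1);
    while (lock_counter.load() != my_ticket) {
        // spin until it is our turn
    }
}

void unlock() {
    // hand the lock to the next ticket holder
    lock_counter.fetch_add(1);
}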

How many mutex and cond variable?

I am working on a pthread pool; there will be five separate threads and one queue. All five threads compete to get a job from the queue, and I know the basic idea that I need to do lock/unlock and wait/signal.
But I am not sure how many mutexes and condition variables I should have. Right now I have only one mutex and one condition variable, and all five threads use them.
One mutex and at least one condition variable.
One mutex because there is one 'thing' (i.e. piece of memory) to synchronize access to: the shared state between all workers and the thread pushing the work.
One condition variable per, well, condition that one or more threads need to wait on. At the very least you need one condition variable for waiting on new jobs, the condition here being: "is there more stuff to do?" (or the converse: "is the work queue empty?").
A somewhat more substantive answer would be that there is a one-to-many relationship between a mutex and its associated condition variables, and a one-to-one relationship between shared states and mutexes. From what you've told us, and since you're learning, I recommend using only one shared state for your design. When or if you need more than one state, I'd recommend looking for some higher-level concepts (e.g. channels, futures/promises) to build on for abstraction.
In any case, don't use the same condition variable with different mutexes.
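As a minimal sketch of that design in C++ (one mutex, one condition variable, and a plain std::queue of jobs; the names and the job type are illustrative, and shutdown handling is omitted):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::mutex queue_mutex;                          // protects the shared state below
std::condition_variable work_available;         // "is there more stuff to do?"
std::queue<std::function<void()>> jobs;

void push_job(std::function<void()> job) {
    {
        std::lock_guard<std::mutex> lk(queue_mutex);
        jobs.push(std::move(job));
    }
    work_available.notify_one();                 // wake one waiting worker
}

void worker_loop() {
    for (;;) {
        std::function<void()> job;
        {
            std::unique_lock<std::mutex> lk(queue_mutex);
            work_available.wait(lk, [] { return !jobs.empty(); });
            job = std::move(jobs.front());
            jobs.pop();
        }
        job();                                   // run the job outside the lock
    }
}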
To elaborate on @Ivan's solution...
Instead of a mutex + condition variable, you can use a counting semaphore + atomic operations to create a very efficient queue.
semaphore dequeue_sem = 0;
semaphore enqueue_sem = 37; // or however large you want to bound your queue
Enqueue operation is just:
wait_for(enqueue_sem)
atomic_add_to_queue(element)
signal(dequeue_sem)
Dequeue operation is:
wait_for(dequeue_sem)
element = atomic_remove_from_queue()
signal(enqueue_sem)
The "atomic_add_to_queue" and "atomic_remove_from_queue" are typicaly implemented using atomic compare&exchange in a tight loop.
In addition to its symmetry, this formulation bounds the maximum size of the queue; if a thread calls enqueue() on a full queue, it will block. This is almost certainly what you want for any queue in a multi-threaded environment. (Your computer has finite memory; consuming it without bound should be avoided when possible.)
If you do stick with a mutex and condition variables, you want two conditions, one for enqueue to wait on (and dequeue to signal) and one for the other way around. The conditions mean "queue not full" and "queue not empty", respectively, and the enqueue/dequeue code is similarly symmetric.
I think that you could steal work from the queue without locking at all via Interlocked operations, if you organize it as a stack / linked list (it will require a semaphore instead of a condition variable, to prevent the problem described in the comments to this answer).
The pseudo-code is something like this:
1. candidate = head;
2. if (candidate == null) wait_for_semaphore;
3. if (candidate == InterlockedCompareExchange(head, candidate->next, candidate)) perform_work(candidate->data);
4. else goto 1;
Of course, adding work to the queue should also be done via InterlockedCompareExchange in this case, plus signaling the semaphore.

Design options for a C++ thread-safe object cache

I'm in the process of writing a template library for data caching in C++ where concurrent reads can be done, and concurrent writes too, but not for the same key. The pattern can be explained with the following environment:
A mutex for the cache write.
A mutex for each key in the cache.
This way, if a thread requests a key from the cache and it is not present, it can start a locked calculation for that unique key. In the meantime other threads can retrieve or calculate data for other keys, but a thread that tries to access the first key gets a locked wait.
The main constraints are:
Never calculate the value for the same key in two threads at the same time.
Calculating the value for 2 different keys can be done concurrently.
Data-retrieval must not lock other threads from retrieve data from other keys.
My other constraints, already resolved, are:
fixed (known at compile time) maximum cache size with MRU-based (most recently used) eviction.
retrieval by reference (implies mutexed shared reference counting).
I'm not sure using one mutex for each key is the right way to implement this, but I didn't find any substantially different way.
Do you know of other patterns to implement this, or do you find this a suitable solution? I don't like the idea of having about 100 mutexes (the cache size is around 100 keys).
You want to lock and you want to wait. Thus there shall be "conditions" somewhere (as pthread_cond_t on Unix-like systems).
I suggest the following:
There is a global mutex which is used only to add or remove keys in the map.
The map maps keys to values, where values are wrappers. Each wrapper contains a condition and potentially a value. The condition is signaled when the value is set.
When a thread wishes to obtain a value from the cache, it first acquires the global mutex. It then looks in the map:
If there is a wrapper for that key, and that wrapper contains a value, then the thread has its value and may release the global mutex.
If there is a wrapper for that key but no value yet, then this means that some other thread is currently busy computing the value. The thread then blocks on the condition, to be awakened by the other thread when it has finished.
If there is no wrapper, then the thread registers a new wrapper in the map, and then proceeds to computing the value. When the value is computed, it sets the value and signals the condition.
In pseudo code this looks like this:
mutex_t global_mutex
hashmap_t map

lock(global_mutex)
w = map.get(key)
if (w == NULL) {
    w = new Wrapper
    map.put(key, w)
    unlock(global_mutex)
    v = compute_value()
    lock(global_mutex)
    w.set(v)
    broadcast(w.cond)
    unlock(global_mutex)
    return v
} else {
    v = w.get()
    while (v == NULL) {
        unlock-and-wait(global_mutex, w.cond)
        v = w.get()
    }
    unlock(global_mutex)
    return v
}
In pthreads terms, lock is pthread_mutex_lock(), unlock is pthread_mutex_unlock(), unlock-and-wait is pthread_cond_wait(), and broadcast is pthread_cond_broadcast() (a broadcast rather than a signal, because several threads may be waiting on the same wrapper). unlock-and-wait atomically releases the mutex and marks the thread as waiting on the condition; when the thread is awakened, the mutex is automatically reacquired.
This means that each wrapper will have to contain a condition. This embodies your various requirements:
No thread holds the mutex for a long period of time (neither while blocking nor while computing a value).
When a value is to be computed, only one thread does it, the other threads which wish to access the value just wait for it to be available.
Note that when a thread wishes to get a value and finds out that some other thread is already busy computing it, the thread ends up locking the global mutex twice: once in the beginning, and once when the value is available. A more complex solution, with one mutex per wrapper, may avoid the second locking, but unless contention is very high, I doubt that it is worth the effort.
About having many mutexes: mutexes are cheap. A mutex is basically an int; it costs little more than the four-or-so bytes of RAM used to store it. Beware of Windows terminology: in Win32, what I call a mutex here corresponds to a CRITICAL_SECTION; what Win32 creates when CreateMutex() is called is something quite different, which is accessible from several distinct processes and is much more expensive, since acquiring it involves round trips to the kernel. Note that in Java, every single object instance contains a mutex, and Java developers do not seem to be overly grumpy on that subject.
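A compact C++ sketch of this scheme, for illustration only (Key, Value, and compute_value are placeholders, and the map and wrapper layout are assumptions rather than part of the original answer):

#include <condition_variable>
#include <memory>
#include <mutex>
#include <unordered_map>

struct Wrapper {
    std::condition_variable cond;
    bool ready = false;
    Value value;                                  // 'Value' is a placeholder type
};

std::mutex global_mutex;
std::unordered_map<Key, std::shared_ptr<Wrapper>> cache;   // 'Key' is a placeholder type

Value get_or_compute(const Key& key) {
    std::unique_lock<std::mutex> lk(global_mutex);
    auto it = cache.find(key);
    if (it == cache.end()) {
        auto w = std::make_shared<Wrapper>();
        cache.emplace(key, w);
        lk.unlock();                              // compute without holding the lock
        Value v = compute_value(key);             // caller-supplied, assumed to exist
        lk.lock();
        w->value = v;
        w->ready = true;
        w->cond.notify_all();                     // wake every thread waiting on this key
        return v;
    }
    auto w = it->second;
    w->cond.wait(lk, [&] { return w->ready; });   // wait until the value is set
    return w->value;
}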
You could use a mutex pool instead of allocating one mutex per resource. As reads are requested, first check the slot in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that slot and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
A much simpler possibility would be to use a single reader/writer lock on the entire cache. Given that you know there is a maximum number of entries (and it is relatively small), it sounds like adding new keys to the cache is a "rare" event. The general logic would be:
acquire read lock
search for key
if found
    use the key
else
    release read lock
    acquire write lock
    // re-check for the key first: another thread may have added it while the lock was released
    add key
    release write lock
    // acquire the read lock again and use it (probably encapsulate in a method)
endif
Not knowing more about the usage patterns, I can't say for sure if this is a good solution. It is very simple, though, and if the usage is predominantly reads, then it is very inexpensive in terms of locking.
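As an illustration, here is a minimal sketch of that approach using C++17's std::shared_mutex (Key, Value, and compute_value are again placeholders); note the re-check after taking the write lock, since the lock is dropped between the read and write phases:

#include <shared_mutex>
#include <unordered_map>

std::shared_mutex cache_lock;
std::unordered_map<Key, Value> simple_cache;      // 'Key'/'Value' are placeholder types

Value lookup_or_add(const Key& key) {
    {
        std::shared_lock<std::shared_mutex> read(cache_lock);   // many readers at once
        auto it = simple_cache.find(key);
        if (it != simple_cache.end())
            return it->second;
    }
    std::unique_lock<std::shared_mutex> write(cache_lock);      // exclusive writer
    auto it = simple_cache.find(key);             // re-check: someone may have added it
    if (it == simple_cache.end())
        it = simple_cache.emplace(key, compute_value(key)).first;  // compute_value assumed to exist
    return it->second;
}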