Please take a look at the following pseudo-code:
boolean blocked[2];
int turn;

void P(int id) {
    while (true) {
        blocked[id] = true;
        while (turn != id) {
            while (blocked[1-id])
                /* do nothing */;
            turn = id;
        }
        /* critical section */
        blocked[id] = false;
        /* remainder */
    }
}

void main() {
    blocked[0] = false;
    blocked[1] = false;
    turn = 0;
    parbegin(P(0), P(1)); // run P0 and P1 in parallel
}
I thought I could implement a simple mutual-exclusion solution using the code above, but it's not working. Has anyone got an idea why?
Any help would really be appreciated!
Mutual exclusion is not guaranteed in this example, for the following reason:
We begin with the following situation:
blocked = {false, false};
turn = 0;
P1 now executes: it sets blocked[1] = true, passes the inner busy-wait (blocked[0] is false), and is preempted just before executing turn = id;.
The situation is now:
blocked = {false, true};
turn = 0;
Now P0 executes. Since turn == 0, it passes the outer while loop immediately and is ready to execute the critical section. When P1 then resumes, it sets turn to 1, finds the loop condition false, and is also ready to execute the critical section.
Btw, this method was originally proposed by Hyman, who sent it to Communications of the ACM in 1966.
Mutual exclusion is not guaranteed in this example, because of the following:
We begin with the following situation:
turn = 1;
blocked = {false, false};
The execution runs as follows:
P0: while (true) {
P0: blocked[0] = true;
P0: while (turn != 0) {
P0: while (blocked[1]) {
P0: }
P1: while (true) {
P1: blocked[1] = true;
P1: while (turn != 1) {
P1: }
P1: criticalSection(P1);
P0: turn = 0;
P0: while (turn != 0)
P0: }
P0: criticalSection(P0);
Is this homework, or some embedded platform? Is there any reason why you can't use pthreads or Win32 (as relevant) synchronisation primitives?
Maybe you need to declare blocked and turn as volatile, but without specifying the programming language there is no way to know.
Concurrency cannot be implemented like this, especially in a multi-processor (or multi-core) environment: different cores/processors have different caches, and those caches may not be coherent. The pseudo-code below could execute in the order shown, with the results shown:
get blocked[0] -> false // cpu 0
set blocked[0] = true // cpu 1 (stored in CPU 1's L1 cache)
get blocked[0] -> false // cpu 0 (retrieved from CPU 0's L1 cache)
get blocked[0] -> false // cpu 2 (retrieved from main memory)
You need hardware-level support (atomic operations or memory barriers) to implement concurrency correctly.
The compiler might have optimized away the "empty" while loop. Declaring the variables as volatile might help, but that is not guaranteed to be sufficient on multiprocessor systems.
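As the answers above suggest, a software-only fix needs both a correct protocol and real memory-ordering guarantees. For contrast, Peterson's algorithm is a correct two-process solution built from the same ingredients (a per-process flag plus a turn variable). Here is a minimal C++ sketch, where std::atomic's default sequentially consistent ordering supplies the guarantees the posted pseudo-code silently assumes:

#include <atomic>

std::atomic<bool> flag[2] = {false, false};
std::atomic<int> turn{0};

void P(int id) {
    while (true) {
        flag[id].store(true);           // announce intent to enter
        turn.store(1 - id);             // politely yield priority to the other process
        while (flag[1 - id].load() && turn.load() == 1 - id)
            ;                           // busy-wait
        /* critical section */
        flag[id].store(false);          // leave
        /* remainder */
    }
}

The key difference from Hyman's version is that each process yields the turn to the other before waiting, and waits on both conditions at once.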
Let’s say that I want to increment a property var incremental:Int32 every time a kernel thread is executed:
//SWIFT
var incremental: Int32 = 0
var incrementalBuffer: MTLBuffer!
var incrementalPointer: UnsafeMutablePointer<Int32>!

init(metalView: MTKView) {
    ...
    incrementalBuffer = Renderer.device.makeBuffer(bytes: &incremental, length: MemoryLayout<Int32>.stride)
    incrementalPointer = incrementalBuffer.contents().bindMemory(to: Int32.self, capacity: 1)
}

func draw(in view: MTKView) {
    ...
    computeCommandEncoder.setComputePipelineState(computePipelineState)
    let width = computePipelineState.threadExecutionWidth
    let threadsPerGroup = MTLSizeMake(width, 1, 1)
    let threadsPerGrid = MTLSizeMake(10, 1, 1)
    computeCommandEncoder.setBuffer(incrementalBuffer, offset: 0, index: 0)
    computeCommandEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerGroup)
    computeCommandEncoder.endEncoding()
    commandBufferCompute.commit()
    commandBufferCompute.waitUntilCompleted()
    print(incrementalPointer.pointee)
}

//METAL
kernel void compute_shader (device int& incremental [[buffer(0)]]) {
    incremental++;
}
So I expect outputs:
10
20
30
...
but I get:
1
2
3
...
EDIT:
After some work based on the answer from @JustSomeGuy, Caroline from raywenderlich, and an Apple engineer, I got:
[[kernel]] void compute_shader (device atomic_int& incremental [[buffer(0)]],
                                ushort lid [[thread_position_in_threadgroup]] ){
    threadgroup atomic_int local_atomic;
    if (lid == 0) atomic_store_explicit(&local_atomic, 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup); // make sure the counter is zeroed before anyone adds to it
    atomic_fetch_add_explicit(&local_atomic, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (lid == 0) {
        int local_non_atomic = atomic_load_explicit(&local_atomic, memory_order_relaxed);
        atomic_fetch_add_explicit(&incremental, local_non_atomic, memory_order_relaxed);
    }
}
and it works as expected.
The reason you are seeing this problem is that ++ is not atomic. It basically comes down to code like this:
auto temp = incremental; // read the current value
incremental = temp + 1;  // write the incremented value back
temp;                    // the value of the expression
The threads execute in "parallel" (not strictly true, since a number of threads form a SIMD-group that executes in lock-step, but that's not important here), so the reads and writes of different threads can interleave. And since the access is not atomic, the result is basically undefined, because there's no way to tell which thread observed which value.
A quick fix is to use atomic_fetch_add_explicit(&incremental, 1, memory_order_relaxed), declaring the argument as device atomic_int& incremental. This makes all accesses to incremental atomic. memory_order_relaxed means that the guarantees on the ordering of operations are relaxed, so this works only if you are just adding to or just subtracting from the value. memory_order_relaxed is the only memory_order supported in MSL. You can read more on this in the Metal Shading Language Specification, section 6.13.
But this quick fix is pretty bad, because it's going to be slow: every access to incremental has to be synchronized across all the threads. A better way is to use a common pattern where all threads in a threadgroup update a value in threadgroup memory, and then one or more threads atomically update device memory. So the kernel will look something like this:
kernel void compute_shader (device atomic_int& incremental [[buffer(0)]],
                            threadgroup atomic_int& local [[threadgroup(0)]],
                            ushort lid [[thread_position_in_threadgroup]] ){
    if (lid == 0) atomic_store_explicit(&local, 0, memory_order_relaxed); // threadgroup memory starts uninitialized
    threadgroup_barrier(mem_flags::mem_threadgroup);
    atomic_fetch_add_explicit(&local, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (lid == 0) {
        int total = atomic_load_explicit(&local, memory_order_relaxed);
        atomic_fetch_add_explicit(&incremental, total, memory_order_relaxed);
    }
}
Which basically means: every thread in the threadgroup atomically adds 1 to local, waits until every thread is done (threadgroup_barrier), and then exactly one thread atomically adds the threadgroup total to incremental.
atomic_fetch_add_explicit on a threadgroup variable uses threadgroup atomics instead of global atomics, which should be faster.
You can read the specification I linked above to learn more; these patterns are shown in the samples there.
I'm looking at the spin lock implementation in the HotSpot JVM from OpenJDK 12. Here is how it is implemented (comments preserved):
// Polite TATAS spinlock with exponential backoff - bounded spin.
// Ideally we'd use processor cycles, time or vtime to control
// the loop, but we currently use iterations.
// All the constants within were derived empirically but work over
// the spectrum of J2SE reference platforms.
// On Niagara-class systems the back-off is unnecessary but
// is relatively harmless. (At worst it'll slightly retard
// acquisition times). The back-off is critical for older SMP systems
// where constant fetching of the LockWord would otherwise impair
// scalability.
//
// Clamp spinning at approximately 1/2 of a context-switch round-trip.
// See synchronizer.cpp for details and rationale.
int Monitor::TrySpin(Thread * const Self) {
  if (TryLock()) return 1;
  if (!os::is_MP()) return 0;

  int Probes  = 0;
  int Delay   = 0;
  int SpinMax = 20;
  for (;;) {
    intptr_t v = _LockWord.FullWord;
    if ((v & _LBIT) == 0) {
      if (Atomic::cmpxchg (v|_LBIT, &_LockWord.FullWord, v) == v) {
        return 1;
      }
      continue;
    }

    SpinPause();

    // Periodically increase Delay -- variable Delay form
    // conceptually: delay *= 1 + 1/Exponent
    ++Probes;
    if (Probes > SpinMax) return 0;
    if ((Probes & 0x7) == 0) {
      Delay = ((Delay << 1)|1) & 0x7FF;
      // CONSIDER: Delay += 1 + (Delay/4); Delay &= 0x7FF ;
    }

    // Stall for "Delay" time units - iterations in the current implementation.
    // Avoid generating coherency traffic while stalled.
    // Possible ways to delay:
    //   PAUSE, SLEEP, MEMBAR #sync, MEMBAR #halt,
    //   wr %g0,%asi, gethrtime, rdstick, rdtick, rdtsc, etc. ...
    // Note that on Niagara-class systems we want to minimize STs in the
    // spin loop. N1 and brethren write-around the L1$ over the xbar into the L2$.
    // Furthermore, they don't have a W$ like traditional SPARC processors.
    // We currently use a Marsaglia Shift-Xor RNG loop.
    if (Self != NULL) {
      jint rv = Self->rng[0];
      for (int k = Delay; --k >= 0;) {
        rv = MarsagliaXORV(rv);
        if (SafepointMechanism::should_block(Self)) return 0;
      }
      Self->rng[0] = rv;
    } else {
      Stall(Delay);
    }
  }
}
Link to source
where Atomic::cmpxchg is implemented on x86 as
template<>
template<typename T>
inline T Atomic::PlatformCmpxchg<8>::operator()(T exchange_value,
                                                T volatile* dest,
                                                T compare_value,
                                                atomic_memory_order /* order */) const {
  STATIC_ASSERT(8 == sizeof(T));
  __asm__ __volatile__ ("lock cmpxchgq %1,(%3)"
                        : "=a" (exchange_value)
                        : "r" (exchange_value), "a" (compare_value), "r" (dest)
                        : "cc", "memory");
  return exchange_value;
}
Link to source
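For orientation, the "polite TATAS spinlock with exponential backoff" the comments describe boils down to the following pattern. This is a rough, self-contained C++ sketch of the idea, not HotSpot's actual code; the names and constants are made up:

#include <atomic>
#include <immintrin.h>  // _mm_pause

std::atomic<int> lock_word{0};

bool try_spin() {
    int delay = 0, probes = 0;
    for (;;) {
        // Test: read the lock word without writing (no coherency traffic while it's held).
        if (lock_word.load(std::memory_order_relaxed) == 0) {
            // Test-and-set: only now attempt the atomic read-modify-write.
            int expected = 0;
            if (lock_word.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire))
                return true;                      // acquired
            continue;
        }
        if (++probes > 20) return false;          // bounded spin: give up
        if ((probes & 0x7) == 0)
            delay = ((delay << 1) | 1) & 0x7FF;   // exponential backoff
        for (int k = delay; k >= 0; --k)
            _mm_pause();                          // stall without touching the lock line
    }
}

The test before the CAS keeps the spinner reading a (hopefully shared) cache line instead of hammering it with atomic read-modify-writes, and the growing pause loop keeps it off the interconnect between probes.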
The thing that I don't understand is the reason behind the backoff on "older SMP" systems. The comments say that:
The back-off is critical for older SMP systems where constant fetching
of the LockWord would otherwise impair scalability.
The reason I can imagine is that on older SMP systems, fetching and then CASing the LockWord always asserts a bus lock (not a cache lock). As the Intel Manual, Vol. 3, Section 8.1.4 says:
For the Intel486 and Pentium processors, the LOCK# signal is always
asserted on the bus during a LOCK operation, even if the area of
memory being locked is cached in the processor. For the P6 and more
recent processor families, if the area of memory being locked during a
LOCK operation is cached in the processor that is performing the LOCK
operation as write-back memory and is completely contained in a cache
line, the processor may not assert the LOCK# signal on the bus.
Is that the actual reason, or is it something else?
I'm trying to implement an algorithm using threads that must be synchronized at certain points. More or less, the sequence for each thread should be:
1. Try to find a solution with the current settings.
2. Synchronize the solution with the other threads.
3. If any of the threads found a solution, end work.
4. (empty - to be in line with the example below)
5. Modify the parameters for the algorithm and jump to 1.
Here is a toy example with the algorithm changed to plain random number generation - all threads should end if at least one of them finds 0.
#include <iostream>
#include <condition_variable>
#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <mutex>
#include <thread>
#include <vector>

const int numOfThreads = 8;
std::condition_variable cv1, cv2;
std::mutex m1, m2;
int lockCnt1 = 0;
int lockCnt2 = 0;
int solutionCnt = 0;

void workerThread()
{
    while (true) {
        // 1. do some important work
        int r = rand() % 1000;

        // 2. synchronize and get results from all threads
        {
            std::unique_lock<std::mutex> l1(m1);
            ++lockCnt1;
            if (r == 0) ++solutionCnt; // gather solutions
            if (lockCnt1 == numOfThreads) {
                // last thread ends here
                lockCnt2 = 0;
                cv1.notify_all();
            }
            else {
                cv1.wait(l1, [&] { return lockCnt1 == numOfThreads; });
            }
        }

        // 3. if solution found then quit all threads
        if (solutionCnt > 0) return;

        // 4. if not, then set lockCnt1 to 0 to have section 2. working again
        {
            std::unique_lock<std::mutex> l2(m2);
            ++lockCnt2;
            if (lockCnt2 == numOfThreads) {
                // last thread ends here
                lockCnt1 = 0;
                cv2.notify_all();
            }
            else {
                cv2.wait(l2, [&] { return lockCnt2 == numOfThreads; });
            }
        }

        // 5. Setup new algorithm parameters and repeat.
    }
}

int main()
{
    srand(time(NULL));
    std::vector<std::thread> v;
    for (int i = 0; i < numOfThreads; ++i) v.emplace_back(std::thread(workerThread));
    for (int i = 0; i < numOfThreads; ++i) v[i].join();
    return 0;
}
The questions I have are about sections 2 and 4 in the code above.
A) In section 2 there is synchronization of all threads and gathering of solutions (if found), all done using the lockCnt1 variable. Compared to a single use of a condition_variable, I found it hard to reset lockCnt1 to zero safely so that section 2 can be reused the next time around. That is why I introduced section 4. Is there a better way to do this (without introducing section 4)?
B) It seems that most examples show condition_variable used in a 'producer-consumer' scenario. Is there a better way to synchronize all threads in a case where all of them are 'producers'?
Edit: Just to be clear, I didn't want to describe the algorithm details, since they are not important here. In any case, it is necessary to have all solutions (or none) from a given loop iteration; mixing solutions from different iterations is not allowed. The described sequence of execution must be followed, and the question is how to achieve such synchronization between threads.
A) You could just not reset lockCnt1 to 0 and keep incrementing it instead. The condition lockCnt1 == numOfThreads then changes to lockCnt1 % numOfThreads == 0. You can then drop block 4. In the future you could also use std::experimental::barrier to get the threads to meet; a sketch of the underlying idea follows.
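For illustration, here is a minimal reusable barrier with a generation counter; it is essentially what std::experimental::barrier provides out of the box. The Barrier class is a hypothetical helper, not part of the original code:

#include <condition_variable>
#include <mutex>

// Reusable barrier: the generation counter lets waiters from one round
// ignore arrivals of the next round, which solves the reset problem.
class Barrier {
    std::mutex m;
    std::condition_variable cv;
    const int count;
    int waiting = 0;
    int generation = 0;
public:
    explicit Barrier(int n) : count(n) {}
    void arrive_and_wait() {
        std::unique_lock<std::mutex> l(m);
        const int gen = generation;
        if (++waiting == count) {
            waiting = 0;     // reset for the next use
            ++generation;    // release everyone waiting on this generation
            cv.notify_all();
        } else {
            cv.wait(l, [&] { return gen != generation; });
        }
    }
};

With such a barrier, sections 2 and 4 of the question collapse into single arrive_and_wait() calls.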
B) I would suggest using std::atomic for solutionCnt; then you can drop all the other counters, the mutexes, and the condition variables. Just atomically increase it by one in the thread that found a solution and then return. In all the threads, check after every iteration whether the value is greater than zero; if it is, return. The advantage is that the threads do not have to meet regularly, but can each try to find a solution at their own pace. A sketch is below.
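A minimal sketch of this suggestion, keeping the toy rand()-based "work" from the question (rand() itself is not guaranteed to be thread-safe; it only stands in for the real algorithm):

#include <atomic>
#include <cstdlib>
#include <ctime>
#include <thread>
#include <vector>

std::atomic<int> solutionCnt{0};

void workerThread()
{
    while (solutionCnt.load() == 0) {  // 3. stop as soon as anyone succeeded
        int r = rand() % 1000;         // 1. try to find a solution
        if (r == 0) {
            solutionCnt.fetch_add(1);  // announce the solution
            return;
        }
        // 5. adjust parameters and try again
    }
}

int main()
{
    srand(time(NULL));
    std::vector<std::thread> v;
    for (int i = 0; i < 8; ++i) v.emplace_back(workerThread);
    for (auto &t : v) t.join();
}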
Out of curiosity, I tried to solve your problem using std::async. For every attempt to find a solution, we call async. Once all parallel attempts have finished, we process feedback, adjust parameters, and repeat. An important difference with your implementation is that feedback is processed in the calling (main) thread. If processing feedback takes too long — or if we don't want to block the main thread at all — then the code in main() can be adjusted to also call std::async.
The code is supposed to be quite efficient, provided that the implementation of async uses a thread pool (e.g., Microsoft's implementation does that).
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <future>
#include <iostream>
#include <vector>

const int numOfThreads = 8;

struct Parameters{};

struct Feedback {
    int result;
};

Feedback doTheWork(const Parameters &){
    // do the work and provide result and feedback for future runs
    return Feedback{rand() % 1000};
}

bool isSolution(const Feedback &f){
    return f.result == 0;
}

// Runs doTheWork in parallel. Number of parallel tasks is same as size of params vector
std::vector<Feedback> findSolutions(const std::vector<Parameters> &params){
    // 1. Run async tasks to find solutions. Normally threads are not created each time but re-used from a pool
    std::vector<std::future<Feedback>> futures;
    for (auto &p: params){
        futures.push_back(std::async(std::launch::async,
                                     [&p](){ return doTheWork(p); }));
    }

    // 2. Synchronize: wait for all tasks
    std::vector<Feedback> feedback(futures.size());
    for (auto nofRunning = futures.size(), iFuture = size_t{0}; nofRunning > 0; ){
        // Check if the task has finished (future is invalid if we already handled it during an earlier iteration)
        auto &future = futures[iFuture];
        if (future.valid() && future.wait_for(std::chrono::milliseconds(1)) != std::future_status::timeout){
            // Collect feedback for next attempt
            // Alternatively, we could already check if a solution has been found and cancel the other tasks [if our algorithm supports cancellation]
            feedback[iFuture] = std::move(future.get());
            --nofRunning;
        }
        if (++iFuture == futures.size())
            iFuture = 0;
    }
    return feedback;
}

int main()
{
    srand(time(NULL));
    std::vector<Parameters> params(numOfThreads);
    // 0. Set initial parameter values here
    // If we don't want to block the main thread while the algorithm is running, we can use std::async here too
    while (true){
        auto feedbackVector = findSolutions(params);
        auto itSolution = std::find_if(std::begin(feedbackVector), std::end(feedbackVector), isSolution);
        // 3. If any of the threads has found a solution, we stop
        if (itSolution != feedbackVector.end())
            break;
        // 5. Use feedback to re-configure parameters for next iteration
    }
    return 0;
}
I am running a thread in the Windows kernel that communicates with an application over shared memory. Everything works fine except that the communication is slow due to a Sleep loop. I have been investigating spin locks, mutexes, and interlocked operations, but can't really figure this one out. I have also considered Windows events, but I don't know about their performance. Please advise on a faster solution that keeps the communication over shared memory, possibly using Windows events.
KERNEL CODE
typedef struct _SHARED_MEMORY
{
    BOOLEAN mutex;
    CHAR data[BUFFER_SIZE];
} SHARED_MEMORY, *PSHARED_MEMORY;

ZwCreateSection(...)
ZwMapViewOfSection(...)

while (TRUE) {
    if (((PSHARED_MEMORY)SharedSection)->mutex == TRUE) {
        //... do work...
        ((PSHARED_MEMORY)SharedSection)->mutex = FALSE;
    }
    KeDelayExecutionThread(KernelMode, FALSE, &PollingInterval);
}
APPLICATION CODE
OpenFileMapping(...)
MapViewOfFile(...)
...
RtlCopyMemory(&SM->data, WriteData, Size);
SM->mutex = TRUE;
while (SM->mutex != FALSE) {
    Sleep(1); // Slow and removing it will cause an infinite loop
}
RtlCopyMemory(ReadData, &SM->data, Size);
UPDATE 1
Currently this is the fastest solution I have come up with:
while(InterlockedCompareExchange(&SM->mutex, FALSE, FALSE));
However, I find it funny that you need to do an exchange and that there is no function that does only a compare.
You don't want to use InterlockedCompareExchange. It burns the CPU, saturates core resources that might be needed by another thread sharing that physical core, and can saturate inter-core buses.
You do need to do two things:
1) Write an InterlockedGet function and use it.
2) Prevent the loop from burning CPU resources and from taking the mother of all mispredicted branches when it finally gets unblocked.
For 1, this is known to work on all compilers that support InterlockedCompareExchange, at least last time I checked:
__inline static int InterlockedGet(int *val)
{
    return *((volatile int *)val);
}
For 2, put this as the body of the wait loop:
__asm
{
    rep nop
}
For x86 CPUs, rep nop encodes the PAUSE instruction, which is specified to solve exactly these resource-saturation and branch-misprediction problems.
Putting it together:
while ((*(volatile int *) &SM->mutex) != FALSE) {
    __asm
    {
        rep nop
    }
}
Change int as needed if it's not appropriate.
Well, recently I came across strict alternation while studying operating systems concepts. To reduce the chance of a race condition between two processes, we go like this:
Process 0:
while (TRUE) {
    while (turn != 0);   // wait
    critical_section();
    turn = 1;
    noncritical_section();
}
Process 1:
while (TRUE) {
    while (turn != 1);   // wait
    critical_section();
    turn = 0;
    noncritical_section();
}
But I'm wondering how I can handle 3 processes to reduce the chance of a race condition even more?
My approach is:
Process 0:
while (turn != 0 && turn != 2);   // wait
critical_section();
turn = 1;
noncritical_section();

Process 1:
while (turn != 1 && turn != 0);   // wait
critical_section();
turn = 2;
noncritical_section();

Process 2:
while (turn != 1 && turn != 2);   // wait
critical_section();
turn = 0;
noncritical_section();
Is my approach okay? What do you guys suggest? And is there anything better out there?
Thanks
With what you have, it wouldn't necessarily strictly alternate anyway: for instance, after turn = 0, either the turn = 1 or the turn = 2 code could run next. My suggestion would be to use OS-level events, one for each code path, with each process triggering the one that follows; see the sketch below.
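To illustrate the idea, here is a hypothetical sketch using Win32 auto-reset events, with threads standing in for processes (a real multi-process version would use named events and separate programs):

#include <windows.h>
#include <stdio.h>

// ev[i] being signaled means "it is i's turn".
HANDLE ev[3];

DWORD WINAPI worker(LPVOID arg) {
    int id = (int)(INT_PTR)arg;
    for (int i = 0; i < 5; ++i) {
        WaitForSingleObject(ev[id], INFINITE); // wait for my turn (auto-reset consumes the signal)
        printf("critical section of process %d\n", id);
        SetEvent(ev[(id + 1) % 3]);            // hand the turn to the next process
    }
    return 0;
}

int main(void) {
    for (int i = 0; i < 3; ++i)
        ev[i] = CreateEvent(NULL, FALSE, FALSE, NULL); // auto-reset, initially unsignaled
    HANDLE th[3];
    for (int i = 0; i < 3; ++i)
        th[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);
    SetEvent(ev[0]);                                   // process 0 goes first
    WaitForMultipleObjects(3, th, TRUE, INFINITE);
    return 0;
}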
With all the justified criticism of the presented concept, no one has yet cared to correct the error in the posted approach.
The general pattern is simple in each process:
while it's not my turn, wait
do critical section
set turn to next process
…
So, the posted while (turn …) conditions for the 3-process case are, as SoronelHaetir noted, wrong. For processes 0 and 1 they have to stay the same as in the 2-process case; in general, for process i, it's while (turn != i); // wait.
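Putting that together, the corrected pattern for process i with N = 3 processes, in the same pseudo-code style as the question:

// process i (i = 0, 1, 2), N = 3
while (TRUE) {
    while (turn != i);       // wait until it's my turn
    critical_section();
    turn = (i + 1) % N;      // pass the turn to the next process
    noncritical_section();
}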