Example about memory ordering in C++

Consider the following example:
-Thread 1-
y.store (20, memory_order_relaxed);
x.store (10, memory_order_release);
-Thread 2-
if (x.load(memory_order_acquire) == 10) {
assert (y.load(memory_order_relaxed) == 20);
y.store (10, memory_order_release);
}
-Thread 3-
if (y.load(memory_order_acquire) == 10)
assert (x.load(memory_order_relaxed) == 10);
In this example the second assert will fire (am I correct?). Is it because there is no store to x in thread 2 before y.store(10, memory_order_release)?
(in cppreference.com they say this sentence about release: "A store operation with this memory order performs the release operation: prior writes to other memory locations become visible to the threads that do a consume or an acquire on the same location.")
Can I change the memory order of the store to y in thread 2 from release to seq_cst to solve the problem?

Your example isn't complete because you haven't specified initial values for x & y. But let's assume that the thread that starts all threads has initialized both to 0.
Then if thread 2 does its store to y, it must have read thread 1's store to x, and that acquire load synchronizes with thread 1's release store. If thread 3's load from y reads thread 2's store to y, it synchronizes with thread 2 as well. Therefore the store to x in thread 1 happens before the load of x in thread 3 (and happens after the initialization store to x), so thread 3's x.load must get the value 10. Happens-before is transitive in the absence of consume.
I suggest using CDSChecker on these examples to see what values are possible.

Related

How to use Maxima if then else

Does anyone know how to use the standard if-then-else structure in Maxima when you need more than one instruction after then and else, like a block in a standard programming language?
Thanks
leon
You can put multiple expressions into block(...) or (...). The difference is that block allows local variables, e.g. block([a, b], a: ..., b: ...).
For both block(...) and (...), the result value is whatever was evaluated last. When there aren't any control structures such as if, that is simply the last expression in the block(...) or (...). Otherwise, the result of the block(...) or (...) is whatever the control structure yields. See also return.
Example:
if x < 4
then block([y], print("Hi, x is less than 4"), y: 2*x, y - 1)
else (print("I guess x >= 4"), x^3 - 10);

C++ atomic acquire/release operations: what they actually mean

I was going through this page and trying to understand the memory model synchronization modes. In the example below, extracted from there:
-Thread 1-
y = 1
x.store (2);
-Thread 2-
if (x.load() == 2)
assert (y == 1)
the page states that the store to 'y' happens-before the store to 'x' in thread 1. Is 'y' here a normal global variable, or is it atomic?
Further, if the load of 'x' in thread 2 gets the result of the store that happened in thread 1, it must see all operations that happened before the store in thread 1, even unrelated ones.
So does this mean that the x.store() operation guarantees that all prior reads/writes to memory have their values made visible?
And std::memory_order_relaxed means "no thread can count on a specific ordering from another thread" - what does that mean? Is it that reordering will be done by the compiler, so that the value of y may not be updated even though y.store() is called?
-Thread 1-
y.store (20, memory_order_relaxed)
x.store (10, memory_order_relaxed)
-Thread 2-
if (x.load (memory_order_relaxed) == 10)
{
assert (y.load(memory_order_relaxed) == 20) /* assert A */
y.store (10, memory_order_relaxed)
}
-Thread 3-
if (y.load (memory_order_relaxed) == 10)
assert (x.load(memory_order_relaxed) == 10) /* assert B */
The acquire/release memory model is described as similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables.
Assuming 'x' and 'y' are initially 0:
-Thread 1-
y.store (20, memory_order_release);
-Thread 2-
x.store (10, memory_order_release);
-Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load (memory_order_acquire) == 0)
-Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load (memory_order_acquire) == 10)
What does this mean in explicit terms?
-Thread 1-
y = 1
x.store (2);
-Thread 2-
if (x.load() == 2)
assert (y == 1)
Naturally, the compiler may reorder operations that are not dependent on each other to boost performance.
But when std::memory_order_seq_cst is in action, any atomic operation works as a memory barrier.
This does not mean that the variable y is atomic; the compiler just guarantees that y = 1; happens before x.store (2);. If there were another thread 3 manipulating variable y, the assertion could fail because of that thread.
If my explanation is hard to understand, look up memory barriers and the happens-before relationship.
If a "A happens-before B" relationship is established, then any thread that has seen B's side effect must also see A's side effect.
-Thread 1-
y.store (20, memory_order_relaxed) // 1-1
x.store (10, memory_order_relaxed) // 1-2
-Thread 2-
if (x.load (memory_order_relaxed) == 10) // 2-1
{
assert (y.load(memory_order_relaxed) == 20) /* assert A */
y.store (10, memory_order_relaxed) // 2-2
}
-Thread 3-
if (y.load (memory_order_relaxed) == 10) // 3-1
assert (x.load(memory_order_relaxed) == 10) /* assert B */
To understand std::memory_order_relaxed, you need to understand data dependency. Clearly, x and y do not have any dependency on each other, so the compiler may change the order of execution for thread 1 - unlike under std::memory_order_seq_cst, where y.store(20) MUST be executed before x.store(10).
Let's see how each assertion may fail. I've added a tag to each instruction.
assert A : 1-2 → 2-1 → assert A → FAILED (2-1 sees x == 10, but 1-1's side effect need not be visible yet)
assert B : see the post for the detailed answer.
In short, thread 3 may see the final updated variable y and get 10, but not the side effect of 1-2. Even though thread 2 must have seen that side effect in order to store 10 into y, nothing guarantees that the side effect has been synchronized between threads (no happens-before).
On the other hand, the example below from the page shows instruction order being preserved when the instructions have a data dependency (here, the same variable x). assert(y <= z) is guaranteed to pass.
-Thread 1-
x.store (1, memory_order_relaxed)
x.store (2, memory_order_relaxed)
-Thread 2-
y = x.load (memory_order_relaxed)
z = x.load (memory_order_relaxed)
assert (y <= z)
2-2. is it that reordering will be done by compiler that value of y may not be updated even though y.store() is called?
NO. As I described in 2., it means the compiler may change the order of instructions that do not have a data dependency on each other. Of course y must be updated when y.store() is called; after all, that's the definition of an atomic instruction.
Assuming 'x' and 'y' are initially 0:
-Thread 1-
y.store (20, memory_order_release);
-Thread 2-
x.store (10, memory_order_release);
-Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load (memory_order_acquire) == 0)
-Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load (memory_order_acquire) == 10)
Sequentially consistent mode requires a happens-before relationship (a single total order) over all such operations. So under the consistent mode, y.store() must happen before x.store(), or vice versa.
If thread 3's assert passes, it means y.store() happened before x.store(). Then if thread 4 sees x.load() == 10, it must also see y.load() == 20, so its assert fails. The same reasoning applies if thread 4's assert passes.
The acquire/release memory model does not enforce a happens-before relationship between independent variables, so the following order can occur without violating any rules:
thread 4 y.load() → thread 1 y.store() → thread 3 y.load() → thread 3 x.load() → thread 4 x.load()
resulting in both assertions passing.

Verify the number of times a CUDA kernel is called

Say you have a cuda kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However, now say for debugging purposes, when you're calling the kernel millions of times, you want to verify that you're actually calling the kernel that many times.
What I did was pass a pointer to the kernel and ++'d the value it points to every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
kernelCount[0]++; // Add to the pointer
}
However when I copied that pointer back to the main function I get "2".
At first it baffled me; then, after 5 minutes of coffee and pacing back and forth, I realized this probably makes sense, because the CUDA kernel runs 1024 instances of itself at the same time, which means the threads overwrite kernelCount[0] instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id > kernelCount[0]){
kernelCount[0] = id;
}
}
Genius!! This was guaranteed to work, I thought. Until I ran it and got all sorts of numbers between 0 and 2000, which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?
Assuming this is a simplified example, and you are not in fact trying to do profiling as others have already suggested, but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which will ensure that the increment operation is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
atomicAdd(kernelCount, 1); // atomically increment the counter
}
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see here for description of PTX instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases its local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases its local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
Your second solution is wrong even if the threads are launched sequentially because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id + 1 > kernelCount[0]){
kernelCount[0] = id + 1;
}
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
@%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;
It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.

acquire/release memory ordering example

I don't understand this sample from here:
" Assuming 'x' and 'y' are initially 0:
-Thread 1-
y.store (20, memory_order_release);
-Thread 2-
x.store (10, memory_order_release);
-Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load (memory_order_acquire) == 0)
-Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load (memory_order_acquire) == 10)
Both of these asserts can pass since there is no ordering imposed between the stores in thread 1 and thread 2.
If this example were written using the sequentially consistent model, then one of the stores must happen-before the other (although the order isn't determined until run-time), the values are synchronized between threads, and if one assert passes, the other assert must therefore fail. "
Why can both asserts pass under acquire/release?
When your memory model is not sequentially consistent, different threads can see different states of the world, in such a way that there is no single, global state (or sequence of states) consistent with what both threads see.
In the example, thread 3 could see the world as follows:
x = 0
y = 0
y = 20 // x still 0
And thread 4 could see the world as follows:
x = 0
y = 0
x = 10 // y still 0
There is no global sequence of state changes of the world that is compatible with both those views at once, but that's exactly what's allowed if the memory model is not sequentially consistent.
(In fact, the example doesn't contain anything that demonstrates the affirmative ordering guarantees provided by release/acquire. So there's a lot more to this than what is captured here, but it's a good, simple demonstration of the complexities of relaxed memory orders.)

How do Binary Semaphores proceed?

I was studying binary semaphores when the following question turned up:
Suppose there are 3 concurrent processes and 3 binary semaphores. The semaphores are initialised as S0=1, S1=0, S2=0. The processes have the following code:
Process P0:
while (true) {
wait(S0);
print '0';
release(S1);
release(S2);
}
Process P1:
wait(S1);
release(S0);
Process P2:
wait(S2);
release(S0);
Now the question is: how many times will '0' be printed?
Let me explain how I was solving it. Suppose the first statements of the three processes are executed concurrently, i.e. the while statement of process P0, wait(S1) of process P1, and wait(S2) of process P2. Now, wait(S1) and wait(S2) will both make the semaphore values -1, and processes P1 and P2 will be blocked. Then wait(S0) of process P0 will be executed; S0's value becomes 0 and P0 moves into the blocked state. As a result, all the processes are blocked and we are in deadlock!! But unfortunately that's not the answer. Please tell me where I am wrong and how the solution proceeds? :|
EDIT:
I was wrong in my approach to binary semaphores.. they can take only 0 and 1!
OK, so here I am answering my own question :P
The solution proceeds as below:
Only process P0 can execute first. That's because the semaphore used by process P0, i.e. S0, has an initial value of 1. When P0 calls wait on S0, the value of S0 becomes 0, implying that S0 has been taken by P0. As far as processes P1 and P2 are concerned, when they call wait on S1 and S2 respectively, they can't proceed because those semaphores are initialised as taken, i.e. 0, so they have to wait until S1 and S2 are released!
P0 proceeds first and prints 0. The next statements release S1 and S2. When S1 is released, process P1's wait is over: the value of S1 goes up to 1 and it is flagged not taken. P1 takes S1, making it taken again. The same goes for process P2 with S2.
Now suppose P2 executes next (P1 and P2 can run in either order). It releases S0 and terminates.
Let P1 execute next. P1 starts, releases S0, and terminates.
Now only P0 can execute, because it's in a while loop whose condition is always true. P0 executes, prints 0 a second time, and releases S1 and S2. But P1 and P2 have already terminated, so P0 waits forever for a release of S0 that never comes.
Here's a second schedule, which prints 0 three times:
P0 starts, prints 0, and releases S1 and S2.
Let P2 execute. P2 starts, releases S0, and terminates. After this, only P0 or P1 can execute.
Let P0 execute. It prints 0 a second time and releases S1 and S2. At this point only P1 can execute.
P1 starts, releases S0, and terminates. At this point only P0 can execute, because it's in a while loop whose condition is always true!
P0 starts, prints 0 for the 3rd time, and releases S1 and S2. It then waits for someone to release S0, which never happens.
So the answer is exactly twice or exactly thrice, which can also be stated as "at least twice"!
Please tell me if I am wrong anywhere!!