acquire/release memory ordering example - c++

I don't understand this sample, quoted from the linked page:
" Assuming 'x' and 'y' are initially 0:
-Thread 1-
y.store (20, memory_order_release);
-Thread 2-
x.store (10, memory_order_release);
-Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load (memory_order_acquire) == 0)
-Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load (memory_order_acquire) == 10)
Both of these asserts can pass since there is no ordering imposed between the stores in thread 1 and thread 2.
If this example were written using the sequentially consistent model, then one of the stores must happen-before the other (although the order isn't determined until run-time), the values are synchronized between threads, and if one assert passes, the other assert must therefore fail. "
Why can both asserts pass under acquire/release?

When your memory model is not sequentially consistent, different threads can see different states of the world, in such a way that there is no single, global state (or sequence of states) that is consistent with what both threads see.
In the example, thread 3 could see the world as follows:
x = 0
y = 0
y = 20 // x still 0
And thread 4 could see the world as follows:
x = 0
y = 0
x = 10 // y still 0
There is no global sequence of state changes of the world that is compatible with both those views at once, but that's exactly what's allowed if the memory model is not sequentially consistent.
(In fact, the example doesn't contain anything that demonstrates the affirmative ordering guarantees provided by release/acquire. So there's a lot more to this than what is captured here, but it's a good, simple demonstration of the complexities of relaxed memory orders.)
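To illustrate the affirmative guarantee mentioned above, here is a minimal sketch of the usual release/acquire "message passing" pattern. It is not part of the quoted example; the function name publish_and_read and the variables data and ready are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hands `payload` from one thread to another through
// a release store / acquire load pair and returns what the consumer saw.
int publish_and_read(int payload) {
    int data = 0;                    // plain, non-atomic variable
    std::atomic<bool> ready{false};  // synchronization flag
    int seen = -1;

    std::thread producer([&] {
        data = payload;                                // A: ordinary write
        ready.store(true, std::memory_order_release);  // B: release store
    });
    std::thread consumer([&] {
        // C: acquire load; spin until it reads the value written by B
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = data;  // D: must observe A, because B synchronizes with C
    });

    producer.join();
    consumer.join();
    assert(seen == payload);  // holds on every run
    return seen;
}
```

Calling publish_and_read(42) should return 42: once the acquire load reads true, the earlier plain write to data is guaranteed to be visible. With memory_order_relaxed on both sides that guarantee would be gone.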

C++ atomic acquire / release operations what it actually means

I was going through this page, trying to understand memory model synchronization modes. In the example below, extracted from there:
-Thread 1-
y = 1
x.store (2);
-Thread 2-
if (x.load() == 2)
assert (y == 1)
the page states that the store to 'y' happens-before the store to x in thread 1. Is the variable 'y' here a normal global variable, or is it atomic?
Further, if the load of 'x' in thread 2 gets the result of the store that happened in thread 1, it must also see all operations that happened before the store in thread 1, even unrelated ones.
So does x.store() mean that all earlier reads and writes to memory must have their values made visible?
Then for std::memory_order_relaxed, the page says "no thread can count on a specific ordering from another thread". What does that mean? Is it that the compiler may reorder operations, so that the value of y may not be updated even though y.store() is called?
-Thread 1-
y.store (20, memory_order_relaxed)
x.store (10, memory_order_relaxed)
-Thread 2-
if (x.load (memory_order_relaxed) == 10)
{
assert (y.load(memory_order_relaxed) == 20) /* assert A */
y.store (10, memory_order_relaxed)
}
-Thread 3-
if (y.load (memory_order_relaxed) == 10)
assert (x.load(memory_order_relaxed) == 10) /* assert B */
The acquire/release memory model is described as similar to the sequentially consistent mode, except that it only applies a happens-before relationship to dependent variables.
Assuming 'x' and 'y' are initially 0:
-Thread 1-
y.store (20, memory_order_release);
-Thread 2-
x.store (10, memory_order_release);
-Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load (memory_order_acquire) == 0)
-Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load (memory_order_acquire) == 10)
What does this mean in explicit terms?
-Thread 1-
y = 1
x.store (2);
-Thread 2-
if (x.load() == 2)
assert (y == 1)
Naturally, the compiler may reorder operations that do not depend on each other to boost performance.
But when std::memory_order_seq_cst is in effect, every atomic operation acts as a memory barrier.
This does not mean that y is atomic; the compiler just guarantees that y = 1; happens before x.store (2);. If a third thread also modified y, the assertion could fail because of that thread.
If my explanation is hard to understand (due to my poor English), please look up memory barriers and happens-before.
If A happens-before B, then any thread that has observed the side effect of B must also see the side effect of A.
-Thread 1-
y.store (20, memory_order_relaxed) // 1-1
x.store (10, memory_order_relaxed) // 1-2
-Thread 2-
if (x.load (memory_order_relaxed) == 10) // 2-1
{
assert (y.load(memory_order_relaxed) == 20) /* assert A */
y.store (10, memory_order_relaxed) // 2-2
}
-Thread 3-
if (y.load (memory_order_relaxed) == 10) // 3-1
assert (x.load(memory_order_relaxed) == 10) /* assert B */
To understand std::memory_order_relaxed, you need to understand data dependency. Clearly, x and y do not depend on each other, so the compiler may change the order of execution in thread 1, unlike with std::memory_order_seq_cst, where y.store(20) MUST be executed before x.store(10).
Let's see how each assertion may fail. I've added a tag to each instruction.
assert A : 1-2 → 2-1 → assert A FAILED
assert B : See post for detailed answer.
In short, thread 3 may see the final value of y and get 10, but not the side effect of 1-2. Even though thread 2 must have seen that side effect in order to store 10 into y, the compiler does not guarantee that an instruction's side effects are synchronized between threads (there is no happens-before).
On the other hand, the example below, from the same page, shows instruction order being preserved when the instructions have a data dependency (here, they all operate on the same variable x). assert(y <= z) is guaranteed to pass.
-Thread 1-
x.store (1, memory_order_relaxed)
x.store (2, memory_order_relaxed)
-Thread 2-
y = x.load (memory_order_relaxed)
z = x.load (memory_order_relaxed)
assert (y <= z)
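A minimal sketch of the same-variable example above as real C++, assuming x starts at 0 (the quoted snippet does not say). The function name read_twice is made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Runs the two-thread example once and returns the pair (y, z) that
// thread 2 read from x.
std::pair<int, int> read_twice() {
    std::atomic<int> x{0};  // assumed initial value
    int y = 0, z = 0;

    std::thread writer([&] {
        x.store(1, std::memory_order_relaxed);
        x.store(2, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        y = x.load(std::memory_order_relaxed);
        z = x.load(std::memory_order_relaxed);
        // Both loads read the same atomic object, so the second load
        // cannot observe an older value than the first one did.
        assert(y <= z);
    });

    writer.join();
    reader.join();
    return {y, z};
}
```

Which pair comes back varies from run to run, for example (0, 0), (0, 2) or (2, 2), but the first element is never larger than the second.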
2-2. Is it that the compiler may reorder operations, so that the value of y may not be updated even though y.store() is called?
No. As described in 2., it means the compiler may change the order of instructions that have no data dependency. Of course y is updated when y.store() is called; after all, that is the definition of an atomic instruction.
Assuming 'x' and 'y' are initially 0:
-Thread 1-
y.store (20, memory_order_release);
-Thread 2-
x.store (10, memory_order_release);
-Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load (memory_order_acquire) == 0)
-Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load (memory_order_acquire) == 10)
Sequentially consistent mode requires a happens-before relationship across all data. So under sequentially consistent mode, y.store() must happen-before x.store(), or vice versa.
If thread 3's assert passes, it means y.store() happened before x.store(). So thread 4 must see y.load() == 20 whenever it sees x.load() == 10, and therefore its assert fails. The same reasoning applies if thread 4's assert passes.
The acquire/release memory model does not enforce a happens-before relationship between independent variables. So the order below is possible without violating any rules.
thread 4 y.load() → thread 1 y.store() → thread 3 y.load() → thread 3 x.load() → thread 2 x.store() → thread 4 x.load()
As a result, both assertions pass.

Advice about how to make Z3 evaluate simple constraints faster

I'm trying to use Z3 (with C++ API) to check if lots of variable configurations satisfy my constraints, but I'm having big performance issues.
I'm looking for advice about which logic or parameter setting I might be able to use to improve the runtime, or hints about how I could try and feed the problem to Z3 in a different way.
Short description of what I'm doing and how I'm doing it:
//_______________Pseudocode and example_______________
context ctx()
solver s(ctx)
// All my variables are finite domain, maybe some 20 values at most, but usually less.
// They can only be ints, bools, or enums.
// There are not that many variables, maybe 10 or 20 for now.
//
// Since I need to be able to solve constraints of the type (e == f), where
// e and f are two different enum variables, all my
// enum types are actually contained in only one enumeration_sort(), populated
// with all the different values.
sort enum_sort = {"green", "red", "yellow", "blue", "null"}
expr x = ctx.int_const("x")
expr y = ctx.int_const("y")
expr b = ctx.bool_const("b")
expr e = ctx.constant("e", enum_sort)
expr f = ctx.constant("f", enum_sort)
// now I assert the finite domains, for each variable
// enum_value(s) is a helper function, that returns the matching enum expression
//
// Let's say that these are the domains:
//
// int x is from {1, 3, 4, 7, 8}
// int y is from {1, 2, 3, 4}
// bool b is from {0, 1}
// enum e is from {"green", "red", "yellow"}
// enum f is from {"red", "blue", "null"}
s.add(x == 1 || x == 3 || x == 4 || x == 7 || x == 8)
s.add(y == 1 || y == 2 || y == 3 || y == 4)
s.add(b == 0 || b == 1)
s.add(e == enum_value("green") || e == enum_value("red") || e == enum_value("yellow"))
s.add(f == enum_value("red") || f == enum_value("blue") || f == enum_value("null"))
// now I add in my constraints. There are also about 10 or 20 of them,
// and each one is pretty short
s.add(b => (x + y >= 5))
s.add((x > 1) => (e != f))
s.add((y == 4 && x == 1) || b)
// setup of the solver is now done. Here I start to query different combinations
// of values, and ask the solver if they are "sat" or "unsat"
// some values are left empty, because I don't care about them
expr_vector vec1 = {x == 1, y == 3, b == 1, e == "red"}
print(s.check(vec1))
expr_vector vec2 = {x == 4, e == "green", f == "null"}
print(s.check(vec2))
....
// I want to answer many such queries.
Of course, in my case this isn't hardcoded, but I read and parse the constraints, variables and their domains from files, then feed the info to Z3.
But it's slow.
Even for something like ten thousand queries, my program is already running over 10s. All of this is inside s.check(). Is it possible to make it run faster?
Hopefully it is, because what I'm asking of the solver doesn't look like it's overly difficult.
No quantifiers, finite domain, no functions, everything is a whole number or an enum, domains are small, the values of the numbers are small, there's only simple arithmetic, constraints are short, etc.
If I try to use parameters for parallel processing, or set the logic to "QF_FD", the runtime doesn't change at all.
Thanks in advance for any advice.
Is it always slow? Or does it get progressively slower as you query for more and more configurations using the same solver?
If it's the former, then your problem is just too hard and this is the price to pay. I don't see anything obviously wrong in what you've shown; though you should never use booleans as integers. (Just looking at your b variable in there. Stick to booleans as booleans, and integers as integers, and unless you really have to, don't mix the two together. See this answer for some further elaboration on this point: Why is Z3 slow for tiny search space?)
If it's the latter, you might want to create a solver from scratch for each query, to clean up all the extra state the solver has accumulated. While additional lemmas often help, they can also hurt performance if the solver cannot make good use of them in subsequent queries. And if you follow this path, you can simply "parallelize" the problem yourself in your C++ program; i.e., create many threads and call the solver separately for each problem, taking advantage of the many cores your computer no doubt has and of OS-level multi-tasking.
Admittedly, this is very general advice and may not apply directly to your situation. But, without a particular "running" example that we can see and inspect, it's hard to be any more specific than this.
Some Ideas:
1. Replace x == 1 || x == 3 || x == 4 || x == 7 || x == 8 with (1 <= x && x <= 8) && (x <= 1 || 3 <= x) && (x <= 4 || 7 <= x). Make a similar change for y.
Rationale: the linear arithmetic solver now knows that x is always confined to the interval [1,8], which can be useful information for other linear equalities/inequalities. It may also be useful for it to learn the trivial mutual exclusion constraints not(x <= 1) || not(3 <= x) and not(x <= 4) || not(7 <= x). There are now exactly 3 Boolean assignments that cover your original 5 cases, which makes the reasoning of the linear arithmetic solver more cost-efficient, because each invocation deals with a larger chunk of the search space. (Furthermore, clauses learned from conflicts are more likely to be useful in subsequent calls to the solver.)
(Your queries may also contain sets of values rather than specific value assignments; this may allow one to prune some unsatisfiable ranges of values with fewer queries.)
2. Just as @alias mentioned, Boolean variables ought to be Booleans and not 0/1 Integer variables. The example you provided is a bit confusing: b is declared as a bool const, but then you state b == 0 || b == 1.
3. I am not familiar with the enum_sort of z3, meaning that I don't know how it is internally encoded and what solving techniques are applied to deal with it. Therefore, I am not sure whether the solver may try to generate trivially inconsistent truth assignments in which e == enum_value("green") and e == enum_value("red") are both assigned true at the same time. This might be worth a bit of investigation. For instance, another possibility could be to declare e and f as Int and give them an appropriate interval domain (as contiguous as possible) with the same approach shown in 1., which your software would then interpret as a list of enum values. This should remove a number of Boolean assignments from the search space, make conflict clauses more effective and possibly speed up the search.
4. Given the small number of problem variables, values and constraints, I would suggest trying to encode everything using just the Bit-Vector theory and nothing else (using small-but-big-enough domains). If you then configure the solver to encode Bit-Vectors eagerly, everything is bit-blasted into SAT, and z3 should only use Boolean Constraint Propagation for satisfiability, which is the cheapest technique.
This might be an XY problem: why are you performing thousands of queries, and what are you trying to achieve? Are you trying to explore all possible combinations of values? Are you trying to perform model counting?
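The rewrite in idea 1 can be sanity-checked without Z3. The sketch below is plain C++ with helper names made up here; it compares the original disjunction (taking the domain {1, 3, 4, 7, 8} from the question) with the interval form over a range of integers.

```cpp
// Original domain constraint for x: a disjunction of five equalities.
constexpr bool in_domain_disjunction(int x) {
    return x == 1 || x == 3 || x == 4 || x == 7 || x == 8;
}

// Interval-based rewrite: one bounding interval plus two clauses that
// cut out the gaps {2} and {5, 6}.
constexpr bool in_domain_intervals(int x) {
    return (1 <= x && x <= 8) && (x <= 1 || 3 <= x) && (x <= 4 || 7 <= x);
}

// Returns true when both encodings agree on every integer in [lo, hi].
constexpr bool encodings_agree(int lo, int hi) {
    for (int x = lo; x <= hi; ++x) {
        if (in_domain_disjunction(x) != in_domain_intervals(x)) return false;
    }
    return true;
}

// Checked at compile time (needs C++14 or later for the constexpr loop).
static_assert(encodings_agree(-100, 100), "the two encodings differ");
```

This only shows that the two formulas describe the same set of values; whether the interval form actually makes Z3 faster on your constraints is something you would have to measure.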

Alternate way around programming multiple variables

I am pretty new to programming and I have this task on my hands. I have 10 battery packs that communicate various data to my host system. I have 3 main fail-safe processes, called Fail Safe A, Fail Safe B and Fail Safe C, and a No Fail.
Here are some details:
Fail safe A is high (meaning == 1) when there are voltage or relay failures. Here is an example of the scenarios in which it is set:
(lb_fail == 00 and (lb_status == 2 or lb_status == 3 or lb_status == 6
or lb_status ==7)) or voltage > 400 or voltage < 190 etc.
There are about 14 different conditions under which Fail safe A is 1. The system only prints Fail safe A = 1 when any of these conditions is met. However, in order to understand which condition caused Fail safe A to be 1, I am introducing another variable, reason_bitA.
For example
reason_bitA == 0001b when (lb_fail == 00 and (lb_status == 2))
reason_bitA == 0010b when (lb_fail == 00 and (lb_status == 3))
reason_bitA == 0011b when (lb_fail == 00 and (lb_status == 6)) and so on.
Similarly there are 12 conditions for Fail safe B to be 1 and 12 conditions for fail safe C to be 1. Fail safe B and C will have reason_bitB and reason_bitC respectively.
There are Fail Safe A, B and C for each of 10 battery packs.
I only know of a primitive method of coding it this way:
if (Fail_safe_A1 == 1) // Fail_safe_A1 corresponds to battery pack 1
{
if (lb_fail == 0 && lb_status == 2)
cout << "reason_bitA1 = 0001" << endl;
if (lb_fail == 0 && lb_status == 3)
cout << "reason_bitA1 = 0010" << endl;
if (lb_fail == 0 && lb_status == 6)
cout << "reason_bitA1 = 0011" << endl;
}
and so on.
I am going to have to do this 14 times for Fail safe A and 12 times each for Fail safe B and C, and all that is just for one pack. I have 9 more packs that need the same thing.
Coding it this way only increases the number of lines in the code. I feel it's not the most effective way of coding this. Could somebody please help me with input on how to code this more efficiently? I am using C++.
My apologies, for a long question! Appreciate your inputs and patience.
Make yourself a class that represents all flags for a single pack:
class pack_status {
int status_bits;
int lb_fail;
int lb_status;
...
};
Now you can make an array or a vector of pack_status for each battery pack, rather than duplicating the code for each one:
pack_status pack_stat[10];
Further, you could code up a class for fail_safe, and use it inside pack_status class to avoid duplicating individual conditions.
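A rough sketch of how that could look, assuming a table of (reason code, condition) rows. Only the conditions quoted in the question are real; the reason codes for the two voltage checks, and the names pack_inputs, rule and fail_safe_a_reasons, are placeholders.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Inputs reported by one battery pack (field names taken from the question).
struct pack_inputs {
    int lb_fail;
    int lb_status;
    int voltage;
};

// One row of the rule table: a reason code and the condition that sets it.
struct rule {
    std::uint8_t reason;
    bool (*applies)(const pack_inputs&);
};

// Hypothetical subset of the Fail Safe A table; the real one has about 14 rows.
const rule fail_safe_a_rules[] = {
    {0b0001, [](const pack_inputs& p) { return p.lb_fail == 0 && p.lb_status == 2; }},
    {0b0010, [](const pack_inputs& p) { return p.lb_fail == 0 && p.lb_status == 3; }},
    {0b0011, [](const pack_inputs& p) { return p.lb_fail == 0 && p.lb_status == 6; }},
    {0b0100, [](const pack_inputs& p) { return p.voltage > 400; }},  // code assumed
    {0b0101, [](const pack_inputs& p) { return p.voltage < 190; }},  // code assumed
};

// Returns every reason code whose condition holds for this pack.
// An empty result means Fail Safe A is not set.
std::vector<std::uint8_t> fail_safe_a_reasons(const pack_inputs& p) {
    std::vector<std::uint8_t> reasons;
    for (const rule& r : fail_safe_a_rules) {
        assert(r.applies != nullptr);  // every row needs a condition
        if (r.applies(p)) reasons.push_back(r.reason);
    }
    return reasons;
}
```

With this, the ten packs become a loop: for each element of a pack_inputs array, call fail_safe_a_reasons and print the codes. Fail Safe B and C would get their own tables, so adding a condition means adding one row instead of another if block per pack.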

correct use of ``progress`` label

According to the man pages,
Progress labels are used to define correctness claims. A progress label states the requirement that the labeled global state must be visited infinitely often in any infinite system execution. Any violation of this requirement can be reported by the verifier as a non-progress cycle.
and
Spin has a special mode to prove absence of non-progress cycles. It does so with the predefined LTL formula:
(<>[] np_)
which formalizes non-progress as a standard Buchi acceptance property.
But let's take a look at the very primitive promela specification
bool initialised = 0;
init{
progress:
initialised++;
assert(initialised == 1);
}
In my understanding, the assert should hold but verification should fail, because initialised++ is executed exactly once whereas the progress label claims it should be possible to execute it arbitrarily often.
However, even with the above LTL formula, this verifies just fine in ispin (see below).
How do I correctly test whether a statement can be executed arbitrarily often (e.g. for a locking scheme)?
(Spin Version 6.4.7 -- 19 August 2017)
+ Partial Order Reduction
Full statespace search for:
never claim + (:np_:)
assertion violations + (if within scope of claim)
non-progress cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 28 byte, depth reached 7, errors: 0
6 states, stored (8 visited)
3 states, matched
11 transitions (= visited+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.293 actual memory usage for states
64.000 memory used for hash table (-w24)
0.343 memory used for DFS stack (-m10000)
64.539 total actual memory usage
unreached in init
(0 of 3 states)
pan: elapsed time 0.001 seconds
No errors found -- did you verify all claims?
UPDATE
Still not sure how to use this ...
bool initialised = 0;
init{
initialised = 1;
}
active [2] proctype reader()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
progress_reader:
assert(true);
od
}
active [2] proctype writer()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
(initialised == 0)
progress_writer:
assert(true);
od
}
And let's select testing for non-progress cycles. Then ispin runs this as
spin -a test.pml
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DNP -DNOCLAIM -w -o pan pan.c
./pan -m10000 -l
Which verifies without error.
So let's instead try this with ltl properties ...
/*pid: 0 = init, 1-2 = reader, 3-4 = writer*/
ltl progress_reader1{ []<> reader[1]#progress_reader }
ltl progress_reader2{ []<> reader[2]#progress_reader }
ltl progress_writer1{ []<> writer[3]#progress_writer }
ltl progress_writer2{ []<> writer[4]#progress_writer }
bool initialised = 0;
init{
initialised = 1;
}
active [2] proctype reader()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
progress_reader:
assert(true);
od
}
active [2] proctype writer()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
(initialised == 0)
progress_writer:
assert(true);
od
}
Now, first of all,
the model contains 4 never claims: progress_writer2, progress_writer1, progress_reader2, progress_reader1
only one claim is used in a verification run
choose which one with ./pan -a -N name (defaults to -N progress_reader1)
or use e.g.: spin -search -ltl progress_reader1 test.pml
Fine, I don't care, I just want this to finally run, so let's just keep progress_writer1 and worry about how to stitch it all together later:
/*pid: 0 = init, 1-2 = reader, 3-4 = writer*/
/*ltl progress_reader1{ []<> reader[1]#progress_reader }*/
/*ltl progress_reader2{ []<> reader[2]#progress_reader }*/
ltl progress_writer1{ []<> writer[3]#progress_writer }
/*ltl progress_writer2{ []<> writer[4]#progress_writer }*/
bool initialised = 0;
init{
initialised = 1;
}
active [2] proctype reader()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
progress_reader:
assert(true);
od
}
active [2] proctype writer()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
(initialised == 0)
progress_writer:
assert(true);
od
}
ispin runs this with
spin -a test.pml
ltl progress_writer1: [] (<> ((writer[3]#progress_writer)))
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c
./pan -m10000
Which does not yield an error, but instead reports
unreached in claim progress_writer1
_spin_nvr.tmp:3, state 5, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:3, state 5, "(1)"
_spin_nvr.tmp:8, state 10, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:10, state 13, "-end-"
(3 of 13 states)
Yeah? Splendid! I have absolutely no idea what to do about this.
How do I get this to run?
The problem with your code example is that it does not have any infinite system execution.
Progress labels are used to define correctness claims. A progress
label states the requirement that the labeled global state must be
visited infinitely often in any infinite system execution. Any
violation of this requirement can be reported by the verifier as a
non-progress cycle.
Try this example instead:
short val = 0;
init {
do
:: val == 0 ->
val = 1;
// ...
val = 0;
:: else ->
progress:
// super-important progress state
printf("progress-state\n");
assert(val != 0);
od;
};
A normal check does not find any error:
~$ spin -search test.pml
(Spin Version 6.4.3 -- 16 December 2014)
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 12 byte, depth reached 2, errors: 0
3 states, stored
1 states, matched
4 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.292 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in init
test.pml:12, state 5, "printf('progress-state\n')"
test.pml:13, state 6, "assert((val!=0))"
test.pml:15, state 10, "-end-"
(3 of 10 states)
pan: elapsed time 0 seconds
whereas, checking for progress yields the error:
~$ spin -search -l test.pml
pan:1: non-progress cycle (at depth 2)
pan: wrote test.pml.trail
(Spin Version 6.4.3 -- 16 December 2014)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim + (:np_:)
assertion violations + (if within scope of claim)
non-progress cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 20 byte, depth reached 7, errors: 1
4 states, stored
0 states, matched
4 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.292 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
pan: elapsed time 0 seconds
WARNING: make sure to write -l after the option -search, otherwise it is not handed over to the verifier.
You ask:
How do I correctly test whether a statement can be executed arbitrarily often (e.g. for a locking scheme)?
Simply write a liveness property:
ltl prop { [] <> proc[0]#label };
This checks that the process with name proc and pid 0 executes the statement corresponding to label infinitely often.
Since your edit substantially changes the question, I write a new answer to avoid confusion. This answer addresses only the new content. Next time, consider creating a new, separate, question instead.
This is one of those cases in which paying attention to the unreached in ... warning message is really important, because it affects the outcome of the verification process.
The warning message:
unreached in claim progress_writer1
_spin_nvr.tmp:3, state 5, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:3, state 5, "(1)"
_spin_nvr.tmp:8, state 10, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:10, state 13, "-end-"
(3 of 13 states)
relates to the content of file _spin_nvr.tmp that is created during the compilation process:
...
never progress_writer1 { /* !([] (<> ((writer[3]#progress_writer)))) */
T0_init:
do
:: (! (((writer[3]#progress_writer)))) -> goto accept_S4 // state 5
:: (1) -> goto T0_init
od;
accept_S4:
do
:: (! (((writer[3]#progress_writer)))) -> goto accept_S4 // state 10
od;
} // state 13 '-end-'
...
Roughly speaking, you can view this as the specification of a Buchi Automaton which accepts executions of your writer process with _pid equal to 3 in which it does not reach the statement with label progress_writer infinitely often, i.e. it does so only a finite number of times.
To understand this you should know that, to verify an ltl property φ, spin builds an automaton containing all paths in the original Promela model that do not satisfy φ. This is done by computing the synchronous product of the automaton modeling the original system with the automaton representing the negation of the property φ you want to verify. In your example, the negation of φ is given by the excerpt of code above taken from _spin_nvr.tmp and labeled with never progress_writer1. Then, Spin checks if there is any possible execution of this automaton:
if there is, then property φ is violated and such execution trace is a witness (aka counter-example) of your property
otherwise, property φ is verified.
The warning tells you that in the resulting synchronous product none of those states is ever reachable. Why is this the case?
Consider this:
active [2] proctype writer()
{
1: assert(_pid >= 1);
2: (initialised == 1)
3: do
4: :: else ->
5: (initialised == 0);
6: progress_writer:
7: assert(true);
8: od
}
At line 2:, you check that initialised == 1. This statement forces writer to block at line 2: until initialised is set to 1. Luckily, this is done by the init process.
At line 5:, you check that initialised == 0. This statement forces writer to block at line 5: until initialised is set to 0. However, no process ever sets initialised to 0 anywhere in the code. Therefore, the line of code labeled with progress_writer: is effectively unreachable.
See the documentation:
(1) /* always executable */
(0) /* never executable */
skip /* always executable, same as (1) */
true /* always executable, same as skip */
false /* always blocks, same as (0) */
a == b /* executable only when a equals b */
A condition statement can only be executed (passed) if it holds. [...]

example about memory ordering in c++

Consider the following example:
-Thread 1-
y.store (20, memory_order_relaxed);
x.store (10, memory_order_release);
-Thread 2-
if (x.load(memory_order_acquire) == 10) {
assert (y.load(memory_order_relaxed) == 20)
y.store (10, memory_order_release)
}
-Thread 3-
if (y.load(memory_order_acquire) == 10)
assert (x.load(memory_order_relaxed) == 10)
In this example the second assert will fire (am I correct?). Is it because there is no store to x in thread 2 before y.store (10, memory_order_release)?
(in cppreference.com they say this sentence about release: "A store operation with this memory order performs the release operation: prior writes to other memory locations become visible to the threads that do a consume or an acquire on the same location.")
Can I change the memory order of the store to y in thread 2 from release to seq_cst to solve the problem?
Your example isn't complete because you haven't specified initial values for x & y. But let's assume that the thread that starts all threads has initialized both to 0.
Then if thread 2 does a store to y, it must have read from thread 1's store to x and synchronized with it. If thread 3's load from y reads thread 2's store to y, it must synchronize with that store as well. Therefore, the store to x in thread 1 must happen before the load of x in thread 3, and it must happen after the initialization store to x. Thus thread 3's x.load must get the value 10. In the absence of consume, happens-before is transitive.
I suggest using CDSChecker on these examples to see what values are possible.
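For reference, a runnable sketch of the three threads, assuming both atomics start at 0 as discussed above. The function name run_chain and the variable seen_x are made up for illustration. By the transitivity argument, neither assert should ever fire.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the three threads once. Returns the value of x that thread 3 read,
// or -1 if thread 3 did not see y == 10 and therefore never read x.
int run_chain() {
    std::atomic<int> x{0}, y{0};  // assumed initial values
    int seen_x = -1;

    std::thread t1([&] {
        y.store(20, std::memory_order_relaxed);
        x.store(10, std::memory_order_release);
    });
    std::thread t2([&] {
        if (x.load(std::memory_order_acquire) == 10) {
            // Synchronized with t1's release store, so y == 20 is visible.
            assert(y.load(std::memory_order_relaxed) == 20);
            y.store(10, std::memory_order_release);
        }
    });
    std::thread t3([&] {
        if (y.load(std::memory_order_acquire) == 10) {
            // Synchronized with t2, which had synchronized with t1.
            seen_x = x.load(std::memory_order_relaxed);
            assert(seen_x == 10);
        }
    });

    t1.join();
    t2.join();
    t3.join();
    return seen_x;
}
```

Depending on scheduling, run_chain() returns either 10 or -1 (when thread 3 ran before thread 2 stored to y). A stress run like this cannot prove the absence of a bad outcome, which is why a model checker is the better tool for exploring all allowed executions.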