What exactly is the problem that memory barriers deal with? - c++

I'm trying to wrap my head around the issue of memory barriers right now. I've been reading and watching videos about the subject, and I want to make sure I understand it correctly, as well as ask a question or two.
I start with understanding the problem accurately. Let's take the following classic example as the basis for the discussion: Suppose we have 2 threads running on 2 different cores
This is pseudo-code!
We start with int f = 0; int x = 0; and then run those threads:
# Thread 1
while(f == 0);
print(x)
# Thread 2
x = 42;
f = 1;
Of course, the desired result of this program is that thread 1 will print 42.
NOTE: I leave "compile-time reordering" out of this discussion, I only want to focus on what happens in runtime, so ignore all kinds of optimizations that the compiler might do.
Ok so from what I understand the problem here is what is called "memory reordering": the CPU is free to reorder memory operations as long as the end result is what the program expects it to be. In this case, within thread 2, f = 1 may be executed before x = 42. If that happens, thread 1 will print 0, which is not what the programmer wants.
At this point, Wikipedia points at another possible scenario that may occur:
Similarly, thread #1's load operations may be executed out-of-order and it is possible for x to be read before f is checked
Since we're talking right now about "out of order execution" - let's set aside the cores' caches for a moment. So let's analyze what happens here. Start with thread 2 - the compiled instructions will look (in pseudo-assembly) something like:
1 put 42 into register1
2 write register1 to memory location of x
3 put 1 into register 2
4 write register2 to memory location of f
Ok so I understand that 3-4 may be executed before 1-2. But I don't understand the equivalent in thread 1:
Let's say the instructions of thread 1 will be something like:
1 load f to register1
2 if f is 0 - jump to 1
3 load x to register2
4 print register2
What exactly may be out of order here? 3 can be before 1-2?
Let's go on: Up until now we talked about out-of-order execution, which brings me to my primary confusion:
In this great post the author describes the problem as such: each core has its own cache, and the core performs its memory operations against that cache, not against main memory. The movement of data from the core-specific caches to main memory (or a shared cache) happens at unpredictable times and in unpredictable order. So in our example - even if thread 2 executes its instructions in order - the write of x=42 will occur before f=1, but only in core 2's cache. The propagation of these values to shared memory may happen in the opposite order, and hence the problem.
So I don't understand - when we talk about "memory reordering" - do we talk about Out-of-order execution, or are we talking about the movement of data across caches?

when we talk about "memory reordering" - do we talk about Out-of-order execution, or are we talking about the movement of data across caches?
When a thread observes changes of values in a particular order, then from the programmer's perspective it is indistinguishable whether that was due to out-of-order execution of loads, a store buffer delaying stores relative to loads and possibly letting them commit out of order (regardless of execution order), or (hypothetically in a CPU without coherent cache) cache synchronization.
Or even by forwarding store data between logical cores without going through cache, before it commits to cache and becomes visible to all cores. Some POWER CPUs can do this in real life but few if any others.
Real CPUs have coherent caches; once a value commits to cache, it's visible to all cores; it can't happen until other copies are already invalidated, so this is not the mechanism for reading "stale" data. Memory reordering on real-world CPUs is something that happens within a core, with reads and writes of coherent cache possibly happening in a different order than program order. Cache doesn't re-sync after getting out of sync; it maintains coherency in the first place.
The important effect, regardless of mechanism, is that another thread observing the same variables you're reading/writing, can see the effects happen in a different order than assembly program-order.
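Whichever mechanism is at work, the portable C++ fix is to state the required ordering explicitly. Here is a minimal sketch of the f/x example using std::atomic with release/acquire ordering (the names and structure are illustrative, not taken from the question):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> f{0};
int x = 0;

void writer() {
    x = 42;                                 // plain store of the payload
    f.store(1, std::memory_order_release);  // publish: x = 42 may not be reordered after this store
}

void reader() {
    while (f.load(std::memory_order_acquire) == 0) {}  // acquire: pairs with the release store
    std::printf("%d\n", x);                 // guaranteed to print 42
}

int main() {
    std::thread t1(reader), t2(writer);
    t1.join();
    t2.join();
}
The release store and the acquire load that reads it form a synchronizes-with edge, so the write of x is visible to the reader regardless of which reordering mechanism the hardware uses.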

The two main questions you have both have the same answer (yes!), but for different reasons.
First let's look at this particular piece of pseudo-machine-code
Let's say the instructions of thread 1 will be something like:
1 load f to register1
2 if f is 0 - jump to 1
3 load x to register2
4 print register2
What exactly may be out of order here? 3 can be before 1-2?
To answer your question, this is a resounding "YES!". Since the contents of register1 are not tied in any way to the contents of register2, the CPU may happily (and correctly, for that matter) preload register2, so that when the 1,2 loop finally breaks, it can immediately go to 4.
For a practical example, register1 might be an I/O peripheral register tied to a polled serial clock, and the CPU is just waiting for the clock to transition to low, so that it can bit-bang the next value onto the data output lines. Doing it that way for one saves precious time on the data fetch and more importantly may avoid contention on the peripheral data bus.
So, yes, this kind of reordering is perfectly fine and allowed, even with optimizations turned off and happening on a single-threaded, single-core CPU. The only way to make sure that register2 is definitely read after the loop breaks is to insert a barrier.
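To make "insert a barrier" concrete in C++, one spelling uses explicit fences together with relaxed atomic accesses. This is only a sketch, under the assumption that f is made std::atomic; the acquire fence forbids exactly the reordering discussed above, i.e. the load of x can no longer be hoisted above the spin loop:
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> f{0};
int x = 0;

void thread2() {
    x = 42;
    std::atomic_thread_fence(std::memory_order_release);  // "write barrier": x = 42 may not sink below it
    f.store(1, std::memory_order_relaxed);
}

void thread1() {
    while (f.load(std::memory_order_relaxed) == 0) {}     // spin until the flag is observed
    std::atomic_thread_fence(std::memory_order_acquire);  // "read barrier": the load of x may not move above it
    std::printf("%d\n", x);                                // now guaranteed to print 42
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
}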
The second question is about cache coherence. And again, the answer to whether memory barriers are needed is "yes, you need them". Cache coherence is an issue because modern CPUs don't talk to the system memory directly, but through their caches. As long as you're dealing with only a single CPU core and a single cache, coherence is not an issue, since all the threads running on the same core work against the same cache. However, the moment you have multiple cores with independent caches, their individual views of the system memory contents may differ, and some form of memory consistency model is required - either through explicit insertion of memory barriers, or at the hardware level.

From my point of view you missed the most important thing!
As the compiler does not see any side effect from the changes to x or f, it is allowed to optimize all of that away. The loop with the condition f==0 also collapses to "nothing": the compiler only sees the constant f=0 propagated beforehand, so it can assume f==0 will always be true and optimize the loop away.
To prevent all of that, you have to tell the compiler that something will happen which is not visible from the given flow of code. That can be something like a call to some semaphore/mutex/... or other IPC functionality, or the use of atomic variables.
If you compile your code, I assume you get more or less "nothing", as neither code part has any visible effect; the compiler does not see that the variables are used from two thread contexts and optimizes everything away.
If we implement the code as in the following example, we see it fails and prints 0 on my system.
#include <iostream>
#include <thread>

int main()
{
    int f = 0;
    int x = 0;
    std::thread s( [&f,&x](){ x = 42; f = 1; } );  // data race: f and x are plain ints shared between threads
    while( f == 0 );                               // the compiler may assume f never changes here
    std::cout << x << std::endl;                   // printed 0 on my system
    s.join();
}
and if we change int f = 0; to std::atomic<int> f = 0 we get the expected result.
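Putting the suggested change together, here is a sketch of the fixed program (only f needs to become atomic; its default seq_cst accesses order the plain accesses to x around it):
#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    std::atomic<int> f{0};   // the flag is now atomic: no data race, loop can't be optimized away
    int x = 0;
    std::thread s( [&f,&x](){ x = 42; f = 1; } );  // f = 1 is an atomic (seq_cst) store
    while( f == 0 );                               // a real atomic load on every iteration
    std::cout << x << std::endl;                   // prints 42
    s.join();
}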

Related

With memory_order_relaxed how is total order of modification of an atomic variable assured on typical architectures?

As I understand it, memory_order_relaxed is meant to avoid the costly memory fences that may be needed for more constrained orderings on a particular architecture.
In that case how is total modification order for an atomic variable achieved on popular processors?
EDIT:
#include <atomic>
#include <cstdio>
#include <thread>

using namespace std;

atomic<int> a;

void thread_proc()
{
    int b = a.load(memory_order_relaxed);
    int c = a.load(memory_order_relaxed);
    printf("first value %d, second value %d\n", b, c);
}

int main()
{
    thread t1(thread_proc);
    thread t2(thread_proc);
    a.store(1, memory_order_relaxed);
    a.store(2, memory_order_relaxed);
    t1.join();
    t2.join();
}
What will guarantee that the output won’t be:
first value 1, second value 2
first value 2, second value 1
?
Multi-processors often use the MESI protocol to ensure total store order on a location. Information is transferred at cache-line granularity. The protocol ensures that before a processor modifies the contents of a cache line, all other processors relinquish their copy of the line, and must reload a copy of the modified line. Hence in the example where a processor writes x and then y to the same location, if any processor sees the write of x, it must have reloaded from the modified line, and must relinquish the line again before the writer writes y.
There is usually a specific set of assembly instructions that corresponds to operations on std::atomics, for example an atomic addition on x86 is lock xadd.
By specifying memory order relaxed you can conceptually think of it as telling the compiler "you must use this technique to increment the value, but I impose no other restrictions outside of the standard as-if optimisation rules on top of that". So literally just replacing an add with a lock xadd is likely sufficient under a relaxed ordering constraint.
Also keep in mind that memory_order_relaxed specifies a minimum that the compiler has to respect. Some intrinsics on some platforms will have implicit hardware barriers; being more strongly ordered than requested does not violate the constraint.
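For a feel of where relaxed is enough, the textbook case is a shared counter where only atomicity of the increment matters and no other data is published through it. A sketch (illustrative, not from the answer):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> hits{0};

void worker() {
    for (int i = 0; i < 1000000; ++i)
        hits.fetch_add(1, std::memory_order_relaxed);  // atomic RMW; on x86 typically a lock-prefixed add
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::printf("%ld\n", hits.load());  // always 2000000: no increment is ever lost
}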
All atomic operations act in accord with [intro.races]/14:
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
The two stores from the main thread are required to happen in that order, since the two operations are ordered within the same thread. Therefore, they cannot happen outside of that order. If someone sees the value 2 in the atomic, then the first thread must have executed past the point where the value was set to 1, per [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
This of course only applies to atomic operations on a specific atomic object; ordering with respect to other things doesn't exist when using relaxed ordering (which is the point).
How does this get achieved on real machines? In whatever way the compiler sees fit to do so. The compiler could decide that, since you're overwriting the value of the variable you just set, then it can remove the first store per the as-if rule. Nobody ever seeing the value 1 is a perfectly legitimate implementation according to the C++ memory model.
But otherwise, the compiler is required to emit whatever is needed to make it work. Note that out-of-order processors aren't typically allowed to complete dependent operations out of order, so that's typically not a problem.
There are two parts to inter-thread communication:
a core that can do loads and stores
the memory system which consists of coherent caches
The issue is the speculative execution in the CPU core.
A processor's load and store unit always needs to compare addresses in order to avoid reordering two writes to the same location (if it reorders writes at all), or prefetching a stale value from a location that has just been written to (when reads are done early, before previous writes).
Without that feature, any sequence of executable code would be at risk of having its memory accesses completely randomized, seeing values written by a following instruction, etc. All memory locations would be "renamed" in crazy ways with no way for a program to refer to the same (originally named) location twice in a row.
All programs would break.
On the other hand, memory locations in potentially running code can have two "names":
the location that can hold a modifiable value, in L1d
the location that can be decoded as executable code, in L1i
And these are not connected in any way until a special "reload code" instruction is performed; not only the L1i but also the instruction decoder can hold cached copies of locations that are otherwise modifiable.
[Another complication is when two virtual addresses (used by speculative loads or stores) refer to the same physical addresses (aliasing): that's another conflict that needs to be dealt with.]
Summary: In most cases, a CPU will naturally provide an order for accesses to each data memory location.
EDIT:
A core needs to keep track of operations that invalidate speculative execution, mainly writes to a location later read by a speculative instruction. Reads don't conflict with each other, and a CPU core might want to keep track of modifications of cached memory after a speculative read (making reads visibly happen in advance). If reads can be executed out of order, it's conceivable that a later read completes before an earlier read; as for why the system would begin a later read first, a possible cause is that its address computation is simpler and finishes first.
So on a system that can begin reads out of order, that considers them completed as soon as a value is made available by the cache, that treats them as valid as long as no write by the same core ends up conflicting with either read, and that does not monitor L1d cache invalidations caused by another CPU wanting to modify a nearby memory location (possibly that very location), the following sequence is possible:
decompose the soon-to-be-executed instructions into sequence A, a long list of sequenced operations ending with a result in r1, and sequence B, a shorter sequence ending with a result in r2
run both in parallel, with B producing a result earlier
speculatively try load (r2), noting that a write to that address may invalidate the speculation (suppose the location is available in L1d)
then another CPU annoys us by stealing the cache line holding the location (r2)
A completes, making the value of r1 available, and we can speculatively do load (r1) (which happens to be the same address as (r2)); this stalls until our cache gets its cache line back
the value of the last load to complete can be different from that of the first
Neither the speculation of A nor that of B invalidated any memory location, as the system doesn't consider either the loss of the cache line or the return of a different value by the last load to be an invalidation of a speculation (which would be easy to implement, as we have all the information locally).
Here the system sees any read as non-conflicting with any local operation that isn't a local write, and the loads are done in an order that depends on the complexity of A and B, not on whichever comes first in program order (the description above doesn't even say that the program order was changed, just that it was ignored by speculation: I have never described which of the loads was first in the program).
So for a relaxed atomic load, a special instruction would be needed on such a system.
The cache system
Of course the cache system doesn't change the order of requests; it works like a global random-access system with temporary ownership of lines by cores.

Do we need to use lock for multi-threaded x32 system for just reading or writing into a uint32_t variable

I have a question:
Consider an x32 system;
therefore, for a uint32_t variable, does the system read and write it atomically?
Meaning, can the entire read or write operation be completed in one instruction cycle?
If this is the case, then for a multi-threaded x32 system we won't have to use locks for just reading or writing a uint32_t variable.
Please confirm my understanding.
It is only atomic if you write the code in assembler and pick the appropriate instruction. When using a higher-level language, you don't have any control over which instructions will get picked.
If you have some C code like a = b; then the machine code generated might be "load b into register x", "store register x in the memory location of a", which is more than one instruction. An interrupt or another thread executed between those two will mean data corruption if it uses the same variable. Suppose the other thread writes a different value to a - then that change will be lost when returning to the original thread.
Therefore you must use some manner of protection mechanism, such as _Atomic qualifier, mutex or critical sections.
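The C++ counterpart of that protection, shown as a sketch: making the variable std::atomic makes each individual load and store indivisible (and removes the data-race undefined behaviour), although it still says nothing about ordering relative to other variables:
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> shared{0};

void writer() {
    shared.store(0xDEADBEEF, std::memory_order_relaxed);  // a single indivisible store
}

uint32_t reader() {
    return shared.load(std::memory_order_relaxed);        // sees either 0 or 0xDEADBEEF, never a torn value
}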
Yes, one needs to use locks or some other appropriate mechanism, like the atomics.
C11 5.1.2.4p4:
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
C11 5.1.2.4p25:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Additionally, if you've got a variable that is not volatile-qualified then the C standard does not even require that the changes hit the memory at all; unless you use some synchronization mechanism, the data races can have much longer spans in an optimized program than one would perhaps initially think possible - for example, the writes can be totally out of order and so forth.
The usage of locks is not (only) to ensure atomicity; 32-bit variables are already written atomically.
Your problem is to protect against simultaneous writes:
int x = 0;
Function 1: x++;
Function 2: x++;
If there is no synchronization, x might end up as 1 instead of 2, because function 2 might read x = 0 before function 1 modifies x. The worst thing in all this is that it might or might not happen at random (or only on your client's PC), so debugging is difficult.
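A sketch of the lock-based fix for exactly that lost-update scenario (assuming C++; std::atomic<int> with fetch_add would work just as well here):
#include <mutex>

int x = 0;
std::mutex m;

void function1() {
    std::lock_guard<std::mutex> lock(m);  // the whole read-modify-write is one critical section
    x++;
}

void function2() {
    std::lock_guard<std::mutex> lock(m);
    x++;
}
// The interleaving "both read 0, both write 1" is now impossible; x always ends up as 2.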
The issue is that variables aren't updated instantly.
Each processor core has its own private memory (L1 and L2 caches). So if you modify a variable, say x++, in two different threads on two different cores, then each core updates its own copy of x.
Atomic operations and mutexes ensure synchronization of these variables with the shared memory (RAM / L3 cache).

Is memory barrier related to some specific memory location?

I'm trying to learn the basics about low-level concurrency.
From Linux documentation:
A write memory barrier gives a guarantee that all the STORE operations
specified before the barrier will appear to happen before all the STORE
operations specified after the barrier with respect to the other
components of the system.
I think that "all the STORE operations" must mean that there are more instances of a particular barrier type than one and there is probably 1:N relationship between a barrier instance and a STORE. Where can I find confirmation for this?
Memory barriers are not related to any specific memory locations.
It's not about "write to memory address x should happen before write to address y", it's about execution order of instructions, e.g. for program
x = 2
y = 1
the processor may decide: "I don't want to wait until 2 is finally stored in x, I can start writing 1 to y while x = 2 is still in progress" (also known as out-of-order execution/reordering), so a reader on another core may observe 0 in x (its initial value) after observing 1 in y, which is counter-intuitive behaviour.
If you place a write barrier between the two stores, then the reader can be sure that if it observes the result of the second store, the first one has also happened; so if it reads y == 1, then it's known that x == 2. (It's not quite that easy though, because reads can be executed out of order too, so you also need a read barrier on the reading side.) In other words, such a barrier forbids executing y = 1 while x = 2 is not finished.
As @RafaelWinterhalter mentioned, there is an awesome guide for JVM compiler writers, which has many concrete examples of how barriers are mapped to real code.
As additional reading see the Preshing blog; it has many articles about low-level concurrency, e.g. this one about barriers.

(C/C++) Why is it in/valid to synchronize a single reader and a single writer with a global variable?

Let's assume there is a data structure like a std::vector and a global variable int syncToken initialized to zero.
Also given, exactly two threads as reader/writer, why is the following (pseudo) code (in)valid?
void reader_thread(){
while(1){
if(syncToken!=0){
while(the_vector.length()>0){
// ... process the std::vector
}
syncToken = 0; // let the writer do its work
}
sleep(1);
}
}
void writer_thread(){
while(1){
std::string data = waitAndReadDataFromSomeResource(the_resource);
if(syncToken==0){
the_vector.push(data);
syncToken = 1; // would syncToken++; be a difference here?
}
// drop data in case we couldn't write to the vector
}
}
Although this code is not (time-)efficient, as far as I can see the code is valid, because the two threads only synchronize on the global variable's value in a way such that no undefined behaviour could result. The only problem might be using the vector concurrently, but that shouldn't happen because the synchronization value only switches between zero and one, right?
UPDATE
Since I made the mistake of asking just a yes/no question, I updated my question to why in hope of getting a very specific case as an answer.
It also seems that the question itself draws the wrong picture based on the answers so I'll elaborate more on what my problem/question is with above code.
Beforehand, I want to point out that I'm asking for a specific use case/example/proof/detailed explanation which demonstrates exactly what goes out of sync. Even example C code that makes a counter behave non-monotonically would just answer the yes/no question but not the why!
I'm interested in the why. So, if you provide an example which demonstrates that it has a problem I'm interested in the why.
By (my) definition above code shall be named synchronized if and only if the code within the if statement, excluding the syncToken assignment at the bottom of the if block, can only be executed by exactly one of those two given threads at a given time.
Based on this thought I'm searching for a, maybe assembler based, example where both threads execute the if block at the same time - meaning they are out of sync or namely not synchronized.
As a reference, let's look at the relevant part of assembler code produced by gcc:
; just the declaration of an integer global variable on a 64bit cpu initialized to zero
syncToken:
.zero 4
.text
.globl main
.type main, #function
; writer (Cpu/Thread B): if syncToken == 0, jump not equal to label .L1
movl syncToken(%rip), %eax
testl %eax, %eax
jne .L1
; reader (Cpu/Thread A): if syncToken != 0, jump to Label L2
movl syncToken(%rip), %eax
testl %eax, %eax
je .L2
; set syncToken to be zero
movl $0, syncToken(%rip)
Now my problem is that I don't see how those instructions can get out of sync.
Assume both threads run on their own CPU core like Thread A runs on core A, Thread B runs on core B. The initialization is global and done before both threads begin execution, so we can ignore the initialization and assume both Threads start with syncToken=0;
Example:
Cpu A: movl syncToken(%rip), %eax
Cpu A: context switch (saving all registers)
Cpu B: movl syncToken(%rip), %eax
Cpu B: testl %eax, %eax
Cpu B: jne .L1 ; this one is false => execute writer if block
Cpu B: context switch
Cpu A: context switch to thread (restoring all registers)
Cpu A: testl %eax, %eax
Cpu A: je .L2 ; this is false => not executing if block
Honestly, I've constructed an example which works well, but it demonstrates that I don't see why the variable should go out of sync such that both threads execute the if block concurrently.
My point is: although the context switch will result in an inconsistency between %eax and the actual value of syncToken in RAM, the code should do the right thing and just not execute the if block if it is not the single only thread allowed to run it.
UPDATE 2
It can be assumed that syncToken will only be used as in the code shown. No other function (like waitAndReadDataFromSomeResource) is allowed to use it in any way.
UPDATE 3
Let's go one step further by asking a slightly different question: Is it possible to synchronize two threads, one reader and one writer, using an int syncToken such that the threads never go out of sync by executing the if block concurrently? If yes - that's very interesting ^^
If no - why?
The basic problem is you are assuming updates to syncToken are atomic with updates to the vector, which they aren't.
There's no guarantee that on a multi core CPU these two threads won't be running on different cores. And there's no guarantee of the sequence in which memory updates get written to main memory or that cache gets refreshed from main memory.
So when in the read thread you set syncToken to zero, it could be that the writer thread sees that change before it sees the change to the vector memory. So it could start pushing stuff to an out of date end of the vector.
Similarly, when you set the token in the writer thread, the reader may start accessing an old version of the contents of the vector. Even more fun, depending on how the vector is implemented, the reader might see the vector header containing an old pointer to the contents of the memory.
void reader_thread(){
while(1){
if(syncToken!=0){
while(the_vector.length()>0){
// ... process the std::vector
}
syncToken = 0; // let the writer do its work
}
sleep(1);
This sleep will cause a memory flush as it goes to the OS, but there's no guarantee of the order of the memory flush or in which order the writer thread will see it.
}
}
void writer_thread(){
while(1){
std::string data = waitAndReadDataFromSomeResource(the_resource);
This might cause a memory flush. On the other hand it might not.
if(syncToken==0){
the_vector.push(data);
syncToken = 1; // would syncToken++; be a difference here?
}
// drop data in case we couldn't write to the vector
}
}
Using syncToken++ would (in general) not help, as that performs a read/modify/write, so if the other end happens to be doing a modification at the same time, you could get any sort of result out of it.
To be safe you need to use memory synchronisation or locks to ensure memory gets read/written in the correct order.
In this code, you would need to use a read synchronisation barrier before you read syncToken and a write synchronisation barrier before you write it.
Using the write synchronisation ensures that all memory updates up to that point are visible to main memory before any updates afterwards are - so that the_vector is appropriately updated before syncToken is set to one.
Using the read synchronisation before you read syncToken will ensure that what is in your cache will be correct with main memory.
Generally this can be rather tricky to get right, and you'd be better off using mutexes or semaphores to ensure the synchronisation, unless performance is very critical.
As noted by Anders, the compiler is still free to re-order access to syncToken with accesses to the_vector (if it can determine what these functions do, which with std::vector it probably can) - adding memory barriers will stop this re-ordering. Making syncToken volatile will also stop the reordering, but it won't address the issues with memory coherency on a multicore system, and it won't allow you to safely do read/modify/writes to the same variable from 2 threads.
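Here is a sketch of what that looks like with C++ atomics rather than raw barriers: making syncToken a std::atomic<int>, with release stores and acquire loads, provides exactly the ordering described above. The structure follows the question's pseudo-code; waitAndReadDataFromSomeResource stays a placeholder, and this still only works for exactly one reader and one writer:
#include <atomic>
#include <string>
#include <vector>

std::vector<std::string> the_vector;
std::atomic<int> syncToken{0};

void reader_thread() {
    while (true) {
        if (syncToken.load(std::memory_order_acquire) != 0) {  // acquire: the writer's push is visible
            while (!the_vector.empty()) {
                // ... process the_vector.back() ...
                the_vector.pop_back();
            }
            syncToken.store(0, std::memory_order_release);     // release: hand the vector back to the writer
        }
        // sleep(1);
    }
}

void writer_thread() {
    while (true) {
        std::string data = /* waitAndReadDataFromSomeResource(the_resource) */ "";
        if (syncToken.load(std::memory_order_acquire) == 0) {  // acquire: the reader's pops are visible
            the_vector.push_back(data);
            syncToken.store(1, std::memory_order_release);     // release: the push cannot move after the flag
        }
        // else: drop data, as in the original
    }
}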
Short answer: No, this example is not properly synchronized and will not (always) work.
For software it is generally understood that working sometimes but not always is the same thing as broken. Now, you could ask something like "would this work for synchronizing an interrupt controller with the foreground task on an ACME brand 32-bit micro-controller with XYZ compiler at optimization level -O0" and the answer might certainly be yes. But in the general case, the answer is no. In fact, the likelihood of this working in any real situation is low because the intersection of "uses the STL" and "hardware and compiler simple enough to just work" is probably empty.
As other comments/answers have stated, it is also technically Undefined Behavior (UB). Real implementations are free to make UB work properly too. So just because it is not "standard" it may still work, but it will not be strictly conforming or portable. Whether it works depends on the exact situation, based heavily on the processor and the compiler, and perhaps also the OS.
What works
As your (code) comment implies, it is very possible that data will be dropped so this is presumed to be intentional. This example will have poor performance because the only time the vector needs to be "locked" is just when data is being added, removed, or length tested. However reader_thread() owns the vector until it is done testing, removing and processing all of the items. This is longer than desired, so it is more likely to drop data than it otherwise would need to be.
However, as long as variable accesses were synchronous and the statements occurred in "naive" program order, the logic appears to be correct. The writer_thread() does not access the vector until it "owns" it (syncToken == 0). Similarly, reader_thread() does not access the vector until it owns it (syncToken == 1). Even without atomic writes/reads (say this was a 16-bit machine and syncToken was 32 bits), this would still "work".
Note 1: the pattern if(flag) { ... flag = x } is a non-atomic test-and-set. Ordinarily this would be a race condition. But in this very specific case, that race is side-stepped. In general (e.g. more than one reader or writer) that would be a problem too.
Note 2: syncToken++ is less likely to be atomic than syncToken = 1. Normally this would be another bellwether of misbehavior because it involves a read-modify-write. In this specific case, it should make no difference.
What goes wrong
What if the writes to syncToken are not synchronous with the other threads? What if writes to syncToken go to a register and not to memory? In that case the likelihood is that reader_thread() will never execute at all, because it will never see syncToken set. Even though syncToken is a normal global variable, it might only be written back to memory when waitAndReadDataFromSomeResource() is called, or just randomly when register pressure happens to be high enough. But since the writer_thread() function is an infinite while loop and never exits, it is also entirely possible that this never happens. To work around this, syncToken would have to be declared volatile, forcing every write and read to go to memory.
As other comments/answers mentioned, the possibility of caching may be a problem. But for most architectures, in normal system memory, it would not be. The hardware will, via cache coherency protocols like MESI, ensure that all caches on all processors maintain coherency. If syncToken is written to the L1 cache on processor P1, then when P2 tries to access the same location, the hardware ensures the dirty cache line from P1 is flushed before P2 loads it. So for normal cache-coherent system memory this is probably "OK".
However, this scenario is not entirely far-fetched if the writes were to device or I/O memory where caches and buffers are not automatically synchronized. For example, the PowerPC EIEIO instruction is required to synchronize external bus memory, and PCI posted writes may be buffered by bridges and must be flushed programmatically. If either the vector or syncToken were not stored in normal cache-coherent system memory, this could also cause a synchronization problem.
More realistically, if synchronization isn't the problem, then re-ordering by the compiler's optimizer will be. The optimizer can decide that, since the_vector.push(data) and syncToken = 1 have no dependency, it is free to move the syncToken = 1 first. Obviously that breaks things by allowing reader_thread() to be messing with the vector at the same time as writer_thread().
Simply declaring syncToken as volatile would not be enough either. Volatile accesses are only guaranteed to be ordered against other volatile accesses, but not between volatile and non-volatile accesses. So unless the vector was also volatile, this will still be a problem. Since vector is probably an STL class, it is not obvious that declaring it volatile would even work.
Presume now that synchronization issues and the compiler optimizers have been beaten into submission. You review the assembler code and see clearly that everything now appears in the proper order. The final problem is that modern CPUs have a habit of executing and retiring instructions out-of-order. Since there is no dependency between the last instruction in whatever the_vector.push(data) compiles into and syncToken = 1, then the processor can decide to do the movl $0x1, syncToken(%rip) before other instructions that are part of the_vector.push(data) have finished, for example, saving the new length field. This is regardless of what the order of the assembly language opcodes appear to be.
Normally the CPU knows that instruction #3 depends on the result of instruction #1 so it knows that #3 must be done after #1. Perhaps instruction #2 has no dependency on either and could be before or after either of them. This scheduling occurs dynamically at runtime based on whatever CPU resources are available at the moment.
What goes wrong is that there is no explicit dependency between the instructions that access the_vector and those that access syncToken. Yet the program still implicitly requires them to be ordered for correct operation. There is no way for the CPU to know this.
The only way to prevent the reordering would be to use a memory fence, barrier, or other synchronizing instruction specific to the particular CPU. For example, the intel mfence instruction or PPC sync could be inserted between touching the_vector and syncToken. Just which instruction or series of instructions, and where they are required to be placed is very specific to the CPU model and situation.
At the end of the day, it would be much easier to use "proper" synchronization primitives. Synchronization library calls also handle placing compiler and CPU barriers in the right places. Furthermore, if you did something like the following, it would perform better and not need to drop data (although the sleep(1) is still dodgy - better to use a condition variable or semaphore):
void reader_thread(){
while(1){
MUTEX_LOCK()
if(the_vector.length()>0){
std::string data = the_vector.pop();
MUTEX_UNLOCK();
// ... process the data
} else {
MUTEX_UNLOCK();
}
sleep(1);
}
}
void writer_thread(){
while(1){
std::string data = waitAndReadDataFromSomeResource(the_resource);
MUTEX_LOCK();
the_vector.push(data);
MUTEX_UNLOCK();
}
}
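For completeness, a sketch of the condition-variable variant suggested above, which removes the sleep(1) polling entirely (same caveats as before: waitAndReadDataFromSomeResource is a placeholder, names are illustrative):
#include <condition_variable>
#include <mutex>
#include <string>
#include <vector>

std::vector<std::string> the_vector;
std::mutex m;
std::condition_variable cv;

void writer_thread() {
    while (true) {
        std::string data = /* waitAndReadDataFromSomeResource(the_resource) */ "";
        {
            std::lock_guard<std::mutex> lock(m);
            the_vector.push_back(data);
        }                      // unlock before notifying so the reader doesn't wake into a held lock
        cv.notify_one();
    }
}

void reader_thread() {
    while (true) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, []{ return !the_vector.empty(); });  // sleeps until the writer signals new data
        std::string data = the_vector.back();
        the_vector.pop_back();
        lock.unlock();
        // ... process data outside the lock ...
    }
}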
That program could have worked correctly about 20 years ago. Those days are over and done with and are not likely to come back any time soon. People buy processors that are fast and consume little power. They don't buy the ones that give programmers an easier time writing code like this.
Modern processor design is an exercise in dealing with latency. The most severe latency problem by a long shot is the speed of memory. Typical RAM access time (the affordable kind) hovers around 100 nanoseconds. A modern core can easily execute a thousand instructions in that time. Processors are filled to the brim with tricks to deal with that huge difference.
Power is a problem, they cannot make processors faster anymore. Practical clock speeds topped out at ~3.5 gigahertz. Going faster requires more power and, beyond draining a battery too fast, there's an upper limit to how much heat you can effectively deal with. Having a thumbnail size sliver of silicon generate a hundred watts is where it stops getting practical. Only other thing that processor designers could do to make processor more powerful is by adding more execution cores. On the theory that you would know how to write code to use them effectively. That requires using threads.
The memory latency problem is addressed by giving the processor caches. Local copies of the data in memory. Sitting physically close to the execution unit and thus having less latency. Modern cores have 64 KB of L1 cache, the smallest and therefore the closest and therefore the fastest. A bigger and slower L2 cache, 256 KB typically. And a yet bigger and slower L3 cache, 4 MB typ that's shared between all the cores on the chip.
The caches still do squat if they don't have a copy of data stored in the memory location that your program needs. So processors have a prefetcher, a logical circuit that looks ahead in the instruction stream and guesses which locations will be required. In other words, it reads memory before your program uses it.
Another circuit deals with writes, the store buffer. It accepts a write instruction from the execution core so it doesn't have to wait for the physical write to be completed. In other words, it writes memory after your program updates it.
Perhaps you start seeing the bear-trap: when your program reads the syncToken variable, it can get a stale value, one that easily mismatches the logical value. Another core could have updated it a handful of nanoseconds earlier but your program will not be aware of that. Producing a logical error in your code. Very hard to debug since it so critically depends on timing, nanoseconds.
Avoiding such undebuggable nasty bugs requires using fences, special instructions that ensure that the memory access is synchronized. They are expensive, they cause the processor to stall. They are wrapped in C++ by std::atomic.
They can however only solve part of the problem; note another undesirable trait of your code. As long as you can't obtain the syncToken, your code is spinning in the while-loop. Burning 100% of a core and not getting the job done. That's okay if another thread isn't holding on to it for too long. It is not okay when it starts to take microseconds. Then you need to get the operating system involved; it needs to put the thread on hold so another thread of another program can get some useful work done. Wrapped by std::mutex and friends.
They say that the reasons such C++ code is not thread safe are:
The compiler may reorder instructions. (This was not the case here, as you've demonstrated in assembler, but with different compiler settings the reordering might happen. To prevent the reordering, make syncToken volatile.)
The processors' caches are out of sync: the reader thread's CPU sees the new syncToken, but the old vector.
The processor hardware might reorder the instructions. Plus, the assembly instructions might not be atomic; internally they could be a bunch of microcode that in turn could be reordered. That is, the assembly you saw could be different from the actual micro-operations the CPU executes. So the syncToken updaaaatiiing and the vector updaaaatiiing could be interleaved.
One can prevent all these by following thread-safe patterns.
On a particular CPU, or particular vendor, with particular compiler your code may work fine. It may even work on all platforms that you target. But it is not portable.
Given
that syncToken is of type int and
you use syncToken!=0 and syncToken==0 as sync conditions (to say it in your terms) and
copy assignments syncToken = 1 and syncToken = 0 to update the sync conditions
the conclusion is
no, it is not valid
because
syncToken!=0, syncToken==0, syncToken = 1 and syncToken = 0 are not atomic
If you run enough tests you might encounter desynchronized effects in some of them.
C++ provides facilities in the standard library to deal with threads, mutexes, tasks, etc. I recommend reading up on those. You are likely to find simple examples on the internet.
In your case (I think fairly similar) you could refer to this answer: https://stackoverflow.com/a/9291792/1566187
This type of synchronization is not the correct way.
For example:
To test the condition "syncToken==0" the CPU might execute more than one assembly language instruction in series,
MOV DX, #syncToken
CMP DX, 00 ; Compare the DX value with zero
JE L7 ; If yes, then jump to label L7
Similarly, to change the value of the syncToken variable the CPU might execute more than one assembly language instruction in series.
In the case of multithreading, the operating system may pre-empt (context switch) threads during execution.
Now let's consider:
Thread A is executing the condition "syncToken==0" and the OS switches the context as indicated below:
assembly lang instr 1
assembly lang instr 2
Context switch to Thread B
assembly lang instr 3
assembly lang instr 4
And Thread B is executing the assignment "syncToken=1" and the OS switches the context as indicated below:
assembly lang instr 1
assembly lang instr 2
assembly lang instr 3
Context switch to Thread A
assembly lang instr 4
In this case the accesses to the variable syncToken may overlap, which will cause problems.
Even if you make the syncToken variable atomic and continue with this approach, it is not good for performance.
Hence, I would suggest using a mutex for synchronization, or, depending on the use case, a reader-writer lock.
You assume that the value of SyncToken is written to and read from memory every time you change it or read it. It is not. It is cached in the CPU and may not be written to memory.
If you consider this, the writer thread would think that SyncToken is 1 (since it set it that way) and the reader thread would think that SyncToken is 0 (since it set it that way), and neither will do any work until the CPU cache is flushed (which could take forever, who knows).
Defining it as volatile/atomic/interlocked would prevent this caching effect and cause your code to run the way you intended it to.
Edit:
Another thing you should consider is what happens to your code with out-of-order-execution. I could write about it myself but this answer covers it: Handling out of order execution
So, pitfall 1 is that the threads might stop working at some point, and pitfall 2 is that an out-of-order execution might cause SyncToken to be updated prematurely.
I would recommend using boost lockfree queue for such tasks.

If we use memory fences to enforce consistency, how does "thread-thrashing" ever occur?

Before I knew of the CPU's store buffer I thought thread-thrashing simply occurred when two threads wanted to write to the same cacheline. One would prevent the other from writing. However, this seems pretty synchronous. I later learnt that there is a store buffer, which temporarily holds the writes. It is forced to flush through the SFENCE instruction, kinda implying there is no synchronous prevention of multiple cores accessing the same cacheline....
I am totally confused how thread-thrashing occurs, if we have to be careful and use SFENCEs? Thread-thrashing implies blocking, whereas SFENCEs imply the writes are done asynchronously and the programmer must manually flush the write??
(My understanding of SFENCEs may be confused too - because I also read the Intel memory model is "strong" and therefore memory fences are only required for string x86 instructions.)
Could somebody please remove my confusion?
"Thrashing" meaning multiple cores retrieving the same cpu cacheline and this causing latency overhead for other cores competing for the same cacheline.
So, at least in my vocabulary, thread-thrashing happens when you have something like this:
// global variable
int x;
// Thread 1
void thread1_code()
{
while(!done)
x++;
}
// Thread 2
void thread2_code()
{
while(!done)
x++;
}
(This code is of course total nonsense - I'm making it ridiculously simple but pointless, so that we don't have complicated code whose own workings would need explaining on top of what is going on between the threads)
For simplicity, we'll assume thread 1 always runs on processor 1, and thread 2 always runs on processor 2 [1]
If you run these two threads on an SMP system - and we've JUST started this code [both threads start, by magic, at almost exactly the same time, not like in a real system, many thousand clock-cycles apart], thread one will read the value of x, update it, and write it back. By now, thread 2 is also running, and it will also read the value of x, update it, and write it back. To do that, it needs to actually ask the other processor(s) "do you have (new value for) x in your cache, if so, can you please give me a copy". And of course, processor 1 will have a new value because it has just stored back the value of x. Now, that cache-line is "shared" (our two threads both have a copy of the value). Thread two updates the value and writes it back to memory. When it does so, another signal is sent from this processor saying "If anyone is holding a value of x, please get rid of it, because I've just updated the value".
Of course, it's entirely possible that BOTH threads read the same value of x, update to the same new value, and write it back as the same new modified value. And sooner or later one processor will write back a value that is lower than the value written by the other processor, because it's fallen behind a bit...
A fence operation will help ensure that the data written to memory has actually got all the way to cache before the next operation happens, because as you say, there are write-buffers to hold memory updates before they actually reach memory. If you don't have a fence instruction, your processors will probably get seriously out of phase, and update the value more than once before the other has had time to say "do you have a new value for x?" - however, it doesn't really help prevent processor 1 asking for the data from processor 2 and processor 2 immediately asking for it "back", thus ping-ponging the cache-content back and forth as quickly as the system can achieve.
To ensure that ONLY ONE processor updates some shared value, it is required that you use a so-called atomic instruction. These special instructions are designed to operate in conjunction with write buffers and caches, such that they ensure that ONLY one processor actually holds an up-to-date value for the cache-line that is being updated, and NO OTHER processor is able to update the value until this processor has completed the update. So you never get "read the same value of x and write back the same value of x" or any similar thing.
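In C++ those atomic read-modify-write instructions are reached through std::atomic; a compare-exchange loop is the general-purpose form, sketched here (illustrative, not part of the answer):
#include <atomic>

std::atomic<int> x{0};

void increment() {
    int expected = x.load();
    // The exchange only succeeds if no other core modified x between our load and our store;
    // on failure, expected is refreshed with the current value and we simply retry.
    while (!x.compare_exchange_weak(expected, expected + 1)) {
    }
}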
Since caches don't work on single bytes or single integer sized things, you can also have "false sharing". For example:
int x, y;
void thread1_code()
{
while(!done) x++;
}
void thread2_code()
{
while(!done) y++;
}
Now, x and y are not actually THE same variable, but they are (quite plausibly, but we can't know for 100% sure) located within the same cache-line of 16, 32, 64 or 128 bytes (depending on processor architecture). So although x and y are distinct, when one processor says "I've just updated x, please get rid of any copies", the other processor will get rid of its (still correct) value of y at the same time as getting rid of x. I had such an example where some code was doing:
struct {
int x[num_threads];
... lots more stuff in the same way
} global_var;
void thread_code()
{
...
global_var.x[my_thread_number]++;
...
}
Of course, two threads would then update values right next to each other, and the performance was RUBBISH - about 6x slower than when we fixed it by doing:
struct
{
int x;
... more stuff here ...
} global_var[num_threads];
void thread_code()
{
...
global_var[my_thread_number].x++;
...
}
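A sketch of the same fix written with explicit alignment, so each thread's counter is guaranteed its own cache line regardless of what else sits in the struct (the 64-byte figure is an assumption; C++17 offers std::hardware_destructive_interference_size where implemented):
#include <cstddef>

constexpr std::size_t kCacheLineSize = 64;   // assumed; check the target CPU if it matters

struct alignas(kCacheLineSize) PaddedCounter {
    int x;                                   // sizeof(PaddedCounter) is padded up to a full cache line
};

PaddedCounter global_var[16];                // one slot per thread; 16 is a placeholder for num_threads

void thread_code(int my_thread_number) {
    global_var[my_thread_number].x++;        // neighbouring threads no longer share a cache line
}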
Edit to clarify:
fence does not (as my recent edit explains) "help" against ping-ponging the cache-content between threads. It also doesn't, in and of itself, prevent data from being updated out of sync between the processors - it does, however, ensure that the processor performing the fence operation doesn't continue doing OTHER memory operations until this particular operation's memory content has got "out of" the processor core itself. Since there are various pipeline stages, and most modern CPUs have multiple execution units, one unit may well be "ahead" of another that is technically "behind" in the execution stream. A fence will ensure that "everything has been done here". It's a bit like the man with the big stop-board in Formula 1 racing, who ensures that the driver doesn't drive off from the tyre-change until ALL new tyres are securely on the car (if everyone does what they should).
The MESI or MOESI protocol is a state-machine system that ensures that operations between different processors are done correctly. A processor can have a modified value (in which case a signal is sent to all other processors to "stop using the old value"), a processor may "own" the value (it is the holder of this data, and may modify the value), a processor may have an "exclusive" value (it's the ONLY holder of the value, everyone else has got rid of their copy), it may be "shared" (more than one processor has a copy, but this processor should not update the value - it is not the "owner" of the data), or Invalid (data is not present in the cache). MESI doesn't have the "owned" mode, which means a little more traffic on the snoop bus ("snoop" means "Do you have a copy of x?", "Please get rid of your copy of x", etc.)
[1] Yes, processor numbers usually start with zero, but I can't be bothered to go back and rename thread1 to thread0 and thread2 to thread1 by the time I wrote this additional paragraph.