Race Condition with writing same value in C++? - c++

Is there any issue with having a race condition in your code when the operation writes a single constant value? For example, suppose a parallel loop populates a seen array for every value that appears in another array arr (assuming no out-of-bounds indices). The critical section could be the code below:
//parallel body with index i
int val = arr[i];
seen[val] = true;
Since the only value being written is true, does that make a mutex unnecessary, and possibly detrimental to performance? Even if threads stomp on each other, they would just be filling the address with the same value, correct?

The C++ memory model does not give you a free pass for writing the same value.
If two threads are writing to a non-atomic object without synchronization, that is simply a race condition. And a race condition means your program executes undefined behavior. And undefined behavior occurring anywhere in your program's execution means that the behavior of your program, both before and after the point of undefined behavior, is not restricted by the C++ standard in any way.
A given compiler is free to provide a more permissive memory model as an extension. I'm unaware of any that do.
One thing you must understand is that C++ is not an assembler macro language. It doesn't have to produce the naive assembler you imagine in your head. C++ instead tries to make it easy for your compiler to produce assembler, which is a very different thing.
Compilers can and do determine "if X happens, we get undefined behavior; so I'll optimize around the fact that X does not happen" when generating code. In this case, the compiler can assume that no program with defined behavior could ever have the same val in two different unsynchronized threads.
All of this can happen long before any assembly is generated.
And at the assembly level, some hardware might do funny things with unaligned assignment to multi-byte values. Some hardware could (in theory; I'm unaware of any in practice) raise traps when instructions that claim to be single-thread writes occur in two different cores on the same bytes.
So this is UB in C++. And once you have UB, you have to audit the assembly your compiler produces everywhere the code that touches this UB is visible. If you do LTO, that means your entire program, or at least everywhere that calls or interacts with the code that does UB, to an unclear distance.
Just write defined behavior. And only if this turns out to be a mission critical performance bottleneck should you spend more effort on optimizing it (first faster defined behavior, and only if that fails do you even consider UB).
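For reference, a minimal sketch of a defined-behavior version of the seen[] pattern from the question, assuming a relaxed atomic flag per element is acceptable (names like SEEN_SIZE and mark are mine, not the asker's):
#include <atomic>
#include <cstddef>

constexpr std::size_t SEEN_SIZE = 1024;       // illustrative size, an assumption
std::atomic<bool> seen[SEEN_SIZE];            // static storage: starts out all false

void mark(const int* arr, std::size_t n)
{
    // parallel body with index i (the parallel-loop machinery itself is elided)
    for (std::size_t i = 0; i < n; ++i) {
        int val = arr[i];
        // Atomicity alone removes the data race; relaxed ordering is enough
        // because every thread only ever writes the same value, true.
        seen[val].store(true, std::memory_order_relaxed);
    }
}
Whether this is faster than a mutex or a per-thread bitmap is something you would have to measure.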

There may be an architecture-dependent constraint requiring your seen array elements to be separated by a certain amount to prevent competing threads from destroying values that collide in the same machine word (or cache line, even).
That is, if seen is defined as bool seen[N]; then seen is N bytes long and each element is directly adjacent to its neighbor. If one thread changes element 0 and another thread changes element 2, both of these changes occur within the same 64-bit machine word. If these two changes are made concurrently by different cores (or even on different CPUs of a multi-CPU system), they will attempt to resolve the collision as an entire 64-bit machine word (or larger, in some cases). The result can be that one of the trues that was written is turned back to its previous state (probably false) by the winning thread's update to a neighboring element.
If instead you define seen as an array of structs, each of which is as large as a cache line, then you may have competing threads mash a bool value within that struct... but this is risky, because not all CPUs share the same coherence strategies, cache-line sizes, and the like... and inevitably there will be a CPU on which it fails.
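If you do want each flag isolated at the hardware level, a minimal sketch of the padding idea (the 64-byte figure is an assumption; real line sizes are CPU-specific, and C++17's std::hardware_destructive_interference_size can be used where available):
// One flag per (assumed) 64-byte cache line, so writers to different elements
// never share a line. N = 1000 is only an illustrative size.
struct alignas(64) PaddedFlag {
    bool value = false;
};
static_assert(sizeof(PaddedFlag) == 64, "padding assumption");
PaddedFlag seen[1000];
Note that padding only addresses the hardware concern described here; at the language level the concurrent writes still need to be atomic or otherwise synchronized to avoid the data race discussed in the previous answer.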

Related

Does compiler need to care about other threads during optimizations?

This is a spin-off from a discussion about C# thread safety guarantees.
I had the following presupposition:
in the absence of thread-aware primitives (mutexes, std::atomic*, etc.; let's exclude volatile as well for simplicity), a valid C++ compiler may do any kind of transformation, including introducing reads from memory (or, e.g., writes if it wants to), if the semantics of the code in the current thread (that is, output and [excluded in this question] volatile accesses) remain the same from the current thread's point of view, that is, disregarding the existence of other threads. The fact that introducing reads/writes may change other threads' behavior (e.g. because the other threads read the data without proper synchronization or perform other kinds of UB) can be totally ignored by a standard-conforming compiler.
Is this presupposition correct or not? I would expect this to follow from the as-if rule. (I believe it is, but other people seem to disagree with me.) If possible, please include the appropriate normative references.
Yes, C++ defines data race UB as potentially-concurrent access to non-atomic objects when not all the accesses are reads. Another recent Q&A quotes the standard, including:
[intro.races]/2 - Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21 ... The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
That gives the compiler freedom to optimize code in ways that preserve the behaviour of the thread executing a function, but not what other threads (or a debugger) might see if they go looking at things they're not supposed to. (i.e. data race UB means that the order of reading/writing non-atomic variables is not part of the observable behaviour an optimizer has to preserve.)
introducing reads/writes may change other threads' behavior
The as-if rule allows you to invent reads, but no, you can't invent writes to objects this thread didn't already write. That's why if(a[i] > 10) a[i] = 10; is different from a[i] = a[i]>10 ? 10 : a[i].
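A minimal illustration of that difference (my own sketch, not code from the linked Q&A):
// A compiler must not turn the first function into the second: the second
// unconditionally stores to a[i], which could stomp on a concurrent write by
// another thread in the case where the condition is false.
void clamp_no_invented_write(int* a, int i) {
    if (a[i] > 10) a[i] = 10;        // stores only when the abstract machine stores
}

void clamp_always_stores(int* a, int i) {
    a[i] = a[i] > 10 ? 10 : a[i];    // always stores, even if only the old value
}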
It's legal for two different threads to write a[1] and a[2] at the same time, and one thread loading a[0..3] and then storing back some modified and some unmodified elements could step on the store by the thread that wrote a[2].
Crash with icc: can the compiler invent writes where none existed in the abstract machine? is a detailed look at a compiler bug where ICC did that when auto-vectorizing with SIMD blends. Including links to Herb Sutter's atomic weapons talk where he discusses the fact that compilers must not invent writes.
By contrast, AVX-512 masking and AVX vmaskmovps etc, like ARM SVE and RISC-V vector extensions I think, do have proper masking with fault suppression to actually not store at all to some SIMD elements, without branching.
When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements? AVX-512 masking does indeed do fault-suppression for read-only or unmapped pages that masked-off elements extend into.
AVX-512 and Branching - auto-vectorizing with stores inside an if() vs. branchless.
It's legal to invent atomic RMWs (except without the Modify part), e.g. an 8-byte lock cmpxchg [rcx], rdx if you want to modify some of the bytes in that region. But in practice that's more costly than just storing modified bytes individually so compilers don't do that.
Of course a function that does unconditionally write a[2] can write it multiple times, and with different temporary values before eventually updating it to the final value. (Probably only a Deathstation 9000 would invent different-valued temporary contents, like turning a[2] = 3 into a[2] = 2; a[2]++;)
For more about what compilers can legally do, see Who's afraid of a big bad optimizing compiler? on LWN. The context for that article is Linux kernel development, where they rely on GCC to go beyond the ISO C standard and actually behave in sane ways that make it possible to roll their own atomics with volatile int* and inline asm. It explains many of the practical dangers of reading or writing a non-atomic shared variable.

With memory_order_relaxed how is total order of modification of an atomic variable assured on typical architectures?

As I understand memory_order_relaxed is to avoid costly memory fences that may be needed with more constrained ordering on a particular architecture.
In that case how is total modification order for an atomic variable achieved on popular processors?
EDIT:
#include <atomic>
#include <cstdio>
#include <thread>
using namespace std;

atomic<int> a;

void thread_proc()
{
    int b = a.load(memory_order_relaxed);
    int c = a.load(memory_order_relaxed);
    printf("first value %d, second value %d\n", b, c);
}

int main()
{
    thread t1(thread_proc);
    thread t2(thread_proc);
    a.store(1, memory_order_relaxed);
    a.store(2, memory_order_relaxed);
    t1.join();
    t2.join();
}
What will guarantee that the output won’t be:
first value 1, second value 2
first value 2, second value 1
?
Multi-processors often use the MESI protocol to ensure total store order on a location. Information is transferred at cache-line granularity. The protocol ensures that before a processor modifies the contents of a cache line, all other processors relinquish their copy of the line, and must reload a copy of the modified line. Hence in the example where a processor writes x and then y to the same location, if any processor sees the write of x, it must have reloaded from the modified line, and must relinquish the line again before the writer writes y.
There is usually a specific set of assembly instructions that corresponds to operations on std::atomics, for example an atomic addition on x86 is lock xadd.
By specifying memory_order_relaxed you can conceptually think of it as telling the compiler "you must use this technique to increment the value, but I impose no other restrictions beyond the standard as-if optimisation rules on top of that". So literally just replacing an add with a lock xadd is likely sufficient under a relaxed ordering constraint.
Also keep in mind 'memory_order_relaxed' specifies a minimum standard that the compiler has to respect. Some intrinsics on some platforms will have implicit hardware barriers, which doesn't violate the constraint by being more strongly ordered than required.
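As a small sketch of what that looks like in source form (on x86 a compiler will typically emit lock xadd, or lock add when the old value is unused, though the exact instruction choice is the compiler's):
#include <atomic>

std::atomic<int> counter{0};

void bump() {
    // The read-modify-write itself is atomic; relaxed only means it imposes
    // no ordering on surrounding non-atomic loads and stores.
    counter.fetch_add(1, std::memory_order_relaxed);
}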
All atomic operations act in accord with [intro.races]/14:
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
The two stores from the main thread are required to happen in that order, since the two operations are ordered within the same thread. Therefore, they cannot happen outside of that order. If someone sees the value 2 in the atomic, then the storing thread must have executed past the point where the value was set to 1, per [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
This of course only applies to atomic operations on a specific atomic object; ordering with respect to other things doesn't exist when using relaxed ordering (which is the point).
How does this get achieved on real machines? In whatever way the compiler sees fit to do so. The compiler could decide that, since you're overwriting the value of the variable you just set, it can remove the first store per the as-if rule. Nobody ever seeing the value 1 is a perfectly legitimate implementation according to the C++ memory model.
But otherwise, the compiler is required to emit whatever is needed to make it work. Note that out-of-order processors aren't typically allowed to complete dependent operations out of order, so that's typically not a problem.
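A concrete sketch of that as-if freedom, using the two stores from the question:
#include <atomic>

std::atomic<int> a{0};

void two_stores() {
    a.store(1, std::memory_order_relaxed);
    a.store(2, std::memory_order_relaxed);
    // A compiler may coalesce these into a single store of 2: the modification
    // order of 'a' only requires that nobody observes 1 after 2, not that
    // anyone ever observes 1 at all. (In practice compilers rarely do this.)
}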
There are two parts in an inter thread communication:
a core that can do loads and stores
the memory system which consists of coherent caches
The issue is the speculative execution in the CPU core.
A processor's load and store unit always needs to compare addresses in order to avoid reordering two writes to the same location (if it reorders writes at all), or to avoid pre-fetching a stale value from a location that has just been written to (when reads are done early, before previous writes).
Without that feature, any sequence of executable code would be at risk of having its memory accesses completely randomized, seeing values written by a following instruction, etc. All memory locations would be "renamed" in crazy ways with no way for a program to refer to the same (originally named) location twice in a row.
All programs would break.
On the other hand, memory locations in potentially running code can have two "names":
the location that can hold a modifiable value, in L1d
the location that can be decoded as executable code, in L1i
And these are not connected in any way until a special "reload code" instruction is performed; not only the L1i but also the instruction decoder can have cached copies of locations that are otherwise modifiable.
[Another complication is when two virtual addresses (used by speculative loads or stores) refer to the same physical addresses (aliasing): that's another conflict that needs to be dealt with.]
Summary: In most cases, a CPU will naturally provide an order for accesses to each data memory location.
EDIT:
A core needs to keep track of operations that invalidate speculative execution, mainly a write to a location later read by a speculative instruction. Reads don't conflict with each other, and a CPU core might want to keep track of modifications to cached memory after a speculative read (making reads happen visibly in advance); if reads can be executed out of order, it's conceivable that a later read completes before an earlier read. As for why the system would begin a later read first: a possible cause is that its address computation is simpler and completes first.
So consider a system that can begin reads out of order, that considers them complete as soon as a value is made available by the cache and valid as long as no write by the same core ends up conflicting with either read, and that does not monitor L1d cache invalidations caused by another CPU wanting to modify a nearby memory location (possibly that very location). On such a system this sequence is possible:
decompose the soon-to-be-executed instructions into sequence A, a long list of sequenced operations ending with a result in r1, and sequence B, a shorter one ending with a result in r2
run both in parallel, with B producing a result earlier
speculatively try load (r2), noting that a write to that address may invalidate the speculation (suppose the location is available in L1d)
then another CPU annoys us by stealing the cache line holding the location of (r2)
A completes, making the r1 value available, and we can speculatively do load (r1) (which happens to be the same address as (r2)); this stalls until our cache gets its cache line back
the value of the last completed load can be different from the first
Neither the speculation of A nor that of B invalidated any memory location, as the system doesn't consider either the loss of the cache line or the return of a different value by the last load to be an invalidation of a speculation (which would be easy to implement, as we have all the information locally).
Here the system sees any read as non-conflicting with any local operation that isn't a local write, and the loads are done in an order that depends on the complexity of A and B rather than on which comes first in program order (the description above doesn't even say that program order was changed, just that it was ignored by speculation: I never said which of the loads came first in the program).
So for a relaxed atomic load, a special instruction would be needed on such a system.
The cache system
Of course the cache system doesn't change the order of requests, as it works like a global random-access system with temporary ownership by cores.

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
It's the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? i.e. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.
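A sketch of that storage-reuse scenario (the placement-new mechanics and names are mine; the point is only where each object's lifetime begins):
#include <atomic>
#include <new>

alignas(std::atomic<int>) unsigned char storage[sizeof(std::atomic<int>)];

void reuse_storage()
{
    int* older = new (storage) int{42};               // the original, non-atomic int
    // ... single-threaded accesses to *older here ...
    auto* newer = new (storage) std::atomic<int>{0};  // reuses the storage, ending
                                                      // the int's lifetime
    newer->store(1, std::memory_order_relaxed);       // atomic accesses: fine
    // A concurrent, non-atomic access through 'older' from another thread that is
    // not ordered before the construction of *newer is exactly the conflicting
    // access described above.
    (void)older;
}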
disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
You can write it in code and it will compile, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kind of problems they solve.
As you might know, what we briefly call "memory" is a multi-layered set of entities capable of holding data.
First we have the RAM, then the cache lines, then the registers.
On single-core processors we don't have any synchronization problem. On multi-core processors we have all of them. Every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering: the CPU may decide at runtime to scramble some read/write instructions to make the code run faster. This may yield strange results that are completely invisible in the high-level code that produced this set of instructions. The most classic example of this phenomenon is the "two threads, two integers" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
Logically, the result "00" cannot happen: either a finishes first and the result may be "01", or b finishes first and the result may be "10"; if both finish at the same time, the result may be "11". Yet, if you build a small program which imitates this situation and run it in a loop, you will quickly see the result "00".
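A runnable sketch of that experiment (my own code, not the answer's; relaxed atomics keep the test itself free of data races while still permitting the "00" outcome; note that spawning fresh threads each iteration adds so much overhead that the reordering shows up far less often than it would with long-lived threads):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> i{0}, j{0};
int r_a, r_b;

int main()
{
    for (int iter = 0; iter < 100000; ++iter) {
        i.store(0, std::memory_order_relaxed);
        j.store(0, std::memory_order_relaxed);
        std::thread a([] { i.store(1, std::memory_order_relaxed);
                           r_a = j.load(std::memory_order_relaxed); });
        std::thread b([] { j.store(1, std::memory_order_relaxed);
                           r_b = i.load(std::memory_order_relaxed); });
        a.join();
        b.join();
        if (r_a == 0 && r_b == 0) {        // the "impossible" outcome
            std::printf("\"00\" observed on iteration %d\n", iter);
            return 0;
        }
    }
    std::printf("\"00\" not observed this run\n");
}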
Another problem is memory invisibility. As I mentioned before, a variable's value may be cached in one of the cache lines, or be stored in one of the registers. When the CPU updates a variable's value, it may delay writing the new value back to the RAM: it may keep the value in the cache/register because it was told (by compiler optimizations) that the value will be updated again soon, so to make the program faster it updates the value again and only then writes it back to the RAM. It may cause undefined behaviour if another CPU (and consequently a thread or a process) depends on the new value.
For example, look at this pseudo-code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
The character 'a' may be printed infinitely, because b may be cached and never updated.
There are many more problems when dealing with parallelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from the RAM correctly, without unwanted scrambling (read about memory orders). A memory order may force the CPU to write its values back to the RAM, or to read the values from the RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread -> b = (non-atomically) false.
So although one thread re-reads the value of b from the RAM again and again, the other thread may never write false back to the RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behavior.
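For completeness, a sketch of the flag example with both sides atomic (keeping the 4-second sleep from the pseudo-code; relaxed ordering suffices here because only the flag itself matters, not ordering relative to other data):
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> b{true};

int main()
{
    std::thread t([] {
        std::this_thread::sleep_for(std::chrono::seconds(4));
        b.store(false, std::memory_order_relaxed);   // atomic write: no data race
    });
    while (b.load(std::memory_order_relaxed))        // atomic read: actually re-checks b
        std::printf("a");
    t.join();
}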
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and, much more usefully, a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non-atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context, I think that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

Thread safe access to shared data - read/write actually happens and no reordering takes place

From here: https://stackoverflow.com/a/2485177/462608
For thread-safe accesses to shared data, we need a guarantee that
the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later)
that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?
In which cases does compiler stores the value in a register and defers updating main memory? [with respect to the above quote]
What is the "re-ordering" that the above quote is talking about? In what cases does it happen?
Q: In which cases does compiler stores the value in a register and defers updating main memory?
A: (This is a broad and open-ended question which is perhaps not very well suited to the stackoverflow format.) The short answer is that whenever the semantics of the source language (C++ per your tags) allow it and the compiler thinks it's profitable.
Q: What is the "re-ordering" that the above quote is talking about?
A: That the compiler and/or CPU issues load and store instructions in an order different from the one dictated by a 1-to-1 translation of the original program source.
Q: In what cases does it happen?
A: For the compiler, similarly to the answer of the first question, anytime the original program semantics allow it and the compiler thinks it's profitable. For the CPU it's similar, the CPU can, depending on the architecture memory model, typically reorder memory accesses as long as the original (single-threaded!) result is identical. For instance, both the compiler and the CPU can try to hoist loads as early as possible, since load latency is often critical for performance.
In order to enforce stricter ordering, e.g. for implementing synchronization primitives, CPU's offer various atomic and/or fence instructions, and compilers may, depending on the compiler and source language, provide ways to prohibit reordering.
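In C++, for example, one way to prohibit that reordering around a flag is a release/acquire fence pairing; a sketch with names of my own choosing:
#include <atomic>

int data;                         // ordinary non-atomic payload
std::atomic<bool> ready{false};

void producer()
{
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);  // keeps the payload write
    ready.store(true, std::memory_order_relaxed);         // ahead of the flag store
}

void consumer()
{
    while (!ready.load(std::memory_order_relaxed)) { }
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    int x = data;                                          // guaranteed to read 42
    (void)x;
}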
Well...found this when searching for "volatile" keyword..lol
1. Register access is a lot faster than memory even with cache. For example, if you have things like below:
for(i = 0; i < 10000; i++)
{
// whatever...
}
If variable i is stored in a register, the loop gets much better performance, so some compilers might generate code that keeps i in a register. The update to that variable might not happen in memory until the loop ends. It's even entirely possible that i is never written to memory (e.g., i is never used later) or is spilled inside the loop body (e.g., there is a heavier nested loop inside to be optimized and no registers left for it). That technique is called register allocation. Generally there are no rules for the optimizer as long as the language standard allows it. There are tons of different algorithms for it, so it's hard to answer when it happens. That's why janneb said so.
If a variable is not updated in time, for multi-threaded code it might be really bad.
For example, if you have code like this:
bool objReady = false;
createThread(prepareObj); // objReady will be turned on in the prepareObj thread.
while(!objReady) Sleep(100);
obj->doSomething();
It's possible that the optimizer generates code that tests objReady only once (when control flow enters the loop), since it's not changed inside the loop.
That's why we need to make sure that reads and writes really happen as we designed them in multi-threaded code.
Reordering is more complicated than register allocation: both the compiler and your CPU might change the execution order of your code.
void prepareObj()
{
obj = &something;
objReady = true;
}
From the prepareObj function's point of view, it doesn't matter whether we set objReady first or set the obj pointer first. The compiler and CPU might reverse the order of the two instructions for different reasons, like better parallelism on a particular CPU pipeline or better data locality for cache hits. You can read the book "Computer Architecture: A Quantitative Approach" suggested by janneb. If my memory serves, appendix A is about reordering (if not, try appendix B or C..lol).
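To make that prepareObj example well defined and correctly ordered, the usual C++11 pattern is a release store on the flag paired with an acquire load in the waiting thread; a sketch with placeholder types (Obj and something are assumptions):
#include <atomic>

struct Obj { void doSomething() {} };   // stand-in for the real object type
Obj something;

Obj* obj = nullptr;
std::atomic<bool> objReady{false};

void prepareObj()
{
    obj = &something;                                  // plain write to the pointer
    objReady.store(true, std::memory_order_release);   // cannot be reordered before it
}

void waiter()
{
    while (!objReady.load(std::memory_order_acquire)) { }  // pairs with the release store
    obj->doSomething();                                     // guaranteed to see obj
}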

Is concurrently overwriting a variable with the same value safe?

I have the following situation (caused by a defect in the code):
There's a shared variable of primitive type (let it be int) that is initialized during program startup from strictly one thread to value N (let it be 0). Then (strictly after the variable is initialized) during the program runtime various threads are started and they in some random order either read that variable or overwrite it with the very same value N (0 in this example). There's no synchronization around accessing the variable.
Can this situation cause unexpected behavior in the program?
It's incredibly unlikely but not impossible according to the standard.
There's nothing stating what the underlying representation of an integer is, nor does the standard specify how the values are loaded.
I can envisage, however weird, an implementation where the underlying bit pattern for 0 is 10101010 and the architecture only supports loading data into memory by bit-shifting it over eight cycles but reading it as a single unit in one cycle.
If another thread reads the value while the bit pattern is being shifted in (e.g., 00000001, 00000010, 00000101, and so on), you will have a problem.
The chances of anyone designing such a bizarre architecture is so close to zero as to be negligible. But, unfortunately, it's not zero. All I'm trying to get across is that you shouldn't rely on assumptions at all when it comes to standards compliance.
And please, before you vote me down, feel free to quote the part of the standard that states this is not possible :-)
Since C++ does not currently have a standard concurrency model, it would depend entirely on your threading implementation and whatever guarantees it gives. It is all but certainly unsafe in the general case, however, because of the potential for torn reads. There might be specific cases where it would "work" or at least "appear to work."
In C++0x (which does have a standard concurrency model), your scenario would formally result in undefined behavior. There is a long, detailed, hard-to-read specification of the concurrency model in the C++0x Final Committee Draft §1.10, but it basically boils down to this:
Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location (§1.10/3).
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior (§1.10/14).
Your expression evaluations clearly conflict because they modify and read the same memory location, and since the object is not atomic and access is not synchronized using a lock, you have undefined behavior.
No. Of course, you could end up with a data race if one of the threads later tries to change the value. You will also end up with a little cache contention, but I doubt this will have a noticeable effect.
You cannot really rely on it. For primitive types you should be fine, and if the operation is atomic (e.g. a correctly aligned int on most platforms) then writing and reading different values is safe (note that by this I mean something like "x = 5;", not "x += 5;", which is never atomic and is not thread safe).
For non-primitive types, even if it's the same value, all bets are off, since there may be a copy constructor that does something unsafe (like allocating memory).
Yes, it is possible for unexpected behavior to happen in this scenario. Consider the case where the initial value of the variable was not 0. It is possible for one thread to start setting it to 0 and for another thread to see the variable with only some of the bytes set.
For a type like int this is very unlikely, as most processors have atomic assignment of word-sized values. However, once you hit 8-byte numeric values (long on some platforms) or large structs, this begins to be an issue.
If no other thread (and this includes the main thread) can change the value from 0 to anything else (let's say 1) while those threads are initializing, then you will not have problems. But if any other thread had the potential to change the value during the start-up phase, you could have a problem. You are playing a dangerous game, and I would recommend locking before reading the value.