In C++, I feel like I've always been led to believe that things like var++ and var-- are reasonably thread-safe - that is, you have a guarantee that your value will increase or decrease at some point in time.
It was upon this belief that I built my understanding of non-blocking algorithms for thread-safe operations.
Yet, today I'm in shock, as I have a variable that is not getting incremented - and am therefore questioning the validity of a large amount of my past work.
In a small program, I have a global variable initialized at 0.
Eight POSIX threads (pthreads) are started up, each calling varname++ a total of 1024 times, for 8*1024 increments in total.
Yet after all threads have finished executing, the value of varname is significantly less than 8*1024.
Did I miss the boat here?
Can someone please enlighten me?
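A minimal sketch of the kind of program I mean (using std::thread rather than raw pthreads, for brevity); the printed total routinely comes out below 8*1024:

#include <iostream>
#include <thread>
#include <vector>

int counter = 0;   // plain int shared by all threads

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 1024; ++i)
                ++counter;   // load, increment, store: three separate steps that race
        });
    for (auto& th : threads)
        th.join();
    std::cout << counter << '\n';   // expected 8192, usually prints less
}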
What exactly would lead you to the belief that those were thread-safe? In general they are not. The reason is simple: arithmetic is generally done on registers, so var++ might be transformed into something like the following:
load var to R1
inc R1
store R1 to var
If another thread modifies var between the load and the store you will obviously lose that update. In reality the problem is even worse, since the compiler can decide to keep the variable in a register for as long as it wants (well, for as long as it can prove that var isn't accessed through any pointers to it in the same thread).
Having multiple threads access the same variable is defined to be a data race (and therefore undefined behaviour) by the (C++11) standard, unless none of the threads modifies the variable (if all of them only read it, you are fine).
For thread-safe operations, you need to use either locking (e.g. std::mutex in C++11) or atomic operations. If you use C++11, you can use std::atomic<int> (or whatever type your counter is) as the type of var to get thread-safe modifications. (Arithmetic) operations on std::atomic<T> (like the increment and decrement operators) are guaranteed to be thread-safe by the standard.
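For example, a minimal sketch of the counter from the question, fixed with std::atomic<int> (a std::mutex around a plain int would work as well):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> counter{0};   // each ++ is now a single atomic read-modify-write

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 1024; ++i)
                ++counter;   // equivalent to counter.fetch_add(1)
        });
    for (auto& th : threads)
        th.join();
    std::cout << counter.load() << '\n';   // reliably prints 8192
}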
C++'s pre- and post-increment operators are not thread-safe.
Similar question here: I've heard i++ isn't thread safe, is ++i thread-safe?
Yes, you have missed something--namely that your reads and writes are not atomic. So a number of threads could read the value, then increment it, then write it back, and if those operations are all happening "in parallel", the value will only be incremented by one.
You can fix this using C++11's (or Boost's) std::atomic type wrapper, described here:
http://en.cppreference.com/w/cpp/atomic/atomic
Related
Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
It's the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? i.e. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure, because I think the second access to the int would be invalid anyway, as the type of the accessing expression (int) doesn't match the object's type at that time (std::atomic<int>). However, "the object's type at that time" assumes a single linear progression of time, which doesn't hold in a multi-threaded environment. C++11 generally solves this by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit into that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former happens before the latter.
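A hedged sketch of that storage-reuse scenario (the names and the placement-new details are illustrative, not from the question):

#include <atomic>
#include <new>

// Storage big enough and suitably aligned for either object.
alignas(std::atomic<int>) unsigned char storage[sizeof(std::atomic<int>)];

void reuse_example() {
    int* older = new (storage) int(42);               // non-atomic object created first
    int snapshot = *older;                            // ordinary, non-atomic access
    (void)snapshot;

    auto* newer = new (storage) std::atomic<int>(0);  // reuses the bytes; ends the int's lifetime
    newer->fetch_add(1);                              // atomic access to the newer object

    // Any other thread still reading *older concurrently with the fetch_add above
    // performs a conflicting non-atomic access to the same location: a data race.
}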
Disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
You can write it in the code and it will compile, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kind of problems they solve.
As you might know, what we simply call "memory" is really a multi-layered set of entities capable of holding data.
First we have the RAM, then the cache lines, then the registers.
On single-core processors we don't have these synchronization problems; on multi-core processors we have all of them. Every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering - the CPU may decide at runtime to shuffle some read/write instructions to make the code run faster. This may yield strange results that are completely invisible in the high-level code that produced those instructions. The most classic example of this phenomenon is the "two threads - two integers" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
Logically, the result "00" cannot happen: if a finishes first, the result may be "01"; if b finishes first, the result may be "10"; if both finish at the same time, the result may be "11". Yet, if you build a small program which imitates this situation and run it in a loop, you will quite quickly see the result "00".
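A hedged sketch of such a program, written with relaxed atomics so that the experiment itself contains no data race; the "impossible" 00 result can still appear because neither thread's store is ordered before its own subsequent load (with per-trial thread creation the window is small, so it may take many runs to see it):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> i{0}, j{0};
int ri = 0, rj = 0;

int main() {
    for (int trial = 0; trial < 100000; ++trial) {
        i.store(0, std::memory_order_relaxed);
        j.store(0, std::memory_order_relaxed);

        std::thread a([] {                             // thread a: i = 1, then read j
            i.store(1, std::memory_order_relaxed);
            rj = j.load(std::memory_order_relaxed);
        });
        std::thread b([] {                             // thread b: j = 1, then read i
            j.store(1, std::memory_order_relaxed);
            ri = i.load(std::memory_order_relaxed);
        });
        a.join();
        b.join();

        if (ri == 0 && rj == 0)                        // the "impossible" result
            std::printf("got 00 on trial %d\n", trial);
    }
}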
Another problem is memory visibility. As I mentioned before, a variable's value may be cached in one of the cache lines, or stored in one of the registers. When the CPU updates a variable's value, it may delay writing the new value back to RAM. It may keep the value in the cache/register because it was told (by compiler optimizations) that the value will be updated again soon, so in order to make the program faster it updates the value again and only then writes it back to RAM. This can cause undefined behaviour if another CPU (and consequently another thread or process) depends on the new value.
For example, look at this pseudo-code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
The character 'a' may be printed infinitely, because b may be cached and never updated.
There are many more problems when dealing with parallelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from RAM correctly without doing unwanted reordering (read about memory orders). A memory order may force the CPU to write its values back to RAM, or to read the values from RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread -> b = (non-atomically) false
So although one thread re-reads the value of b from RAM again and again, the other thread may never write false back to RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behavior.
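For contrast, a sketch in which both sides use the atomic; the writer's store is then guaranteed to become visible to the reader's loads:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> b{true};

int main() {
    std::thread t([] {
        std::this_thread::sleep_for(std::chrono::seconds(4));
        b.store(false);              // atomic store: guaranteed to become visible
    });
    while (b.load())                 // atomic load: genuinely re-reads the shared flag
        std::cout << 'a';
    t.join();
}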
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed, by some devious casting, to persuade my C11 compiler to let me access an integer (and, much more usefully, a pointer) both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My possibly simplistic interpretation of that is that the sizes of the atomic and non-atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context; I think that would be doomed to fail. However, I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds: the fastest possible access (and much simpler syntax) in serial, and safely managed contention in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.
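I haven't seen the casts in question, but for what it's worth, C++20's std::atomic_ref expresses the same idea without them: plain accesses while the code is serial, atomic operations on the very same int during the parallel phase. A sketch, assuming the two phases never overlap:

#include <atomic>
#include <thread>
#include <vector>

int counter = 0;   // ordinary int, touched "directly" while only one thread runs

void serial_phase() {
    counter = 0;   // plain, fast access: no other thread exists yet
}

void parallel_phase() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            std::atomic_ref<int> ref(counter);   // atomic view of the non-atomic object
            ref.fetch_add(1);                    // every concurrent access must go through atomic_ref
        });
    for (auto& w : workers)
        w.join();
}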
I have two threads running. They share an array. One of the threads adds new elements to the array (and removes them) and the other uses this array (read operations only).
Is it necessary for me to lock the array before I add/remove to/from it or read from it?
Further details:
I will need to keep iterating over the entire array in the other thread. No write operations over there as previously mentioned. "Just scanning something like a fixed-size circular buffer"
The easy thing to do in such cases is to use a lock. However, locks can be very slow. I did not want to use locks if their use could be avoided. Also, as came out of the discussions, it might not be necessary (it actually isn't) to lock all operations on the array; just locking the management of an iterator for the array (the count variable that the other thread will use) is enough.
I don't think the question is "too broad". If it still comes out to be so, please let me know. I know the question isn't perfect. I had to combine at least 3 answers in order to be able to solve the question - which suggests most people were not able to fully understand all the issues and were forced to do some guesswork. But most of it came out through the comments, which I have tried to incorporate in the question. The answers helped me solve my problem quite objectively, and I think the answers provided here are quite a helpful resource for someone starting out with multithreading.
If two threads perform an operation on the same memory location, and at least one operation is a write operation, you have a so-called data race. According to C11 and C++11, the behaviour of programs with data races is undefined.
So, you have to use some kind of synchronization mechanism, for example:
std::atomic
std::mutex
If you are writing and reading the same location from multiple threads you will need to perform locking or use atomics. We can see this by looking at the C11 draft standard (the C++11 standard looks almost identical; the equivalent section would be 1.10), which says the following in section 5.1.2.4 Multi-threaded executions and data races:
Two expression evaluations conflict if one of them modifies a memory
location and the other one reads or modifies the same memory location.
and:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
and:
Compiler transformations that introduce assignments to a potentially
shared memory location that would not be modified by the abstract
machine are generally precluded by this standard, since such an
assignment might overwrite another assignment by a different thread in
cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
If you were just adding data to the array then in the C++ world a std::atomic index would be sufficient, since you could add the new elements and then atomically increment the index. But since you want to grow and shrink the array, you will need to use a mutex; in the C++ world std::lock_guard would be a typical choice.
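A hedged sketch of that second case (all names are made up): one std::mutex guards both the resizing writer and the scanning reader:

#include <mutex>
#include <vector>

std::vector<int> shared_data;
std::mutex data_mutex;

void writer(int value) {
    std::lock_guard<std::mutex> lock(data_mutex);  // released automatically at scope exit
    shared_data.push_back(value);                  // may reallocate: must be exclusive
}

int reader_sum() {
    std::lock_guard<std::mutex> lock(data_mutex);  // same mutex as the writer
    int sum = 0;
    for (int v : shared_data) sum += v;
    return sum;
}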
To answer your question: maybe.
Simply put, the way that the question is framed doesn't provide enough information about whether or not a lock is required.
In most standard use cases, the answer would be yes. And most of the answers here are covering that case pretty well.
I'll cover the other case.
When would you not need a lock given the information you have provided?
There are some other questions here that would help better define whether you need a lock, whether you can use a lock-free synchronization method, or whether or not you can get away with no explicit synchronization.
Will writing data ever be non-atomic? Meaning, will writing data ever result in "torn data"? If your data is a single 32 bit value on an x86 system, and your data is aligned, then you would have a case where writing your data is already atomic. It's safe to assume that if your data is of any size larger than the size of a pointer (4 bytes on x86, 8 on x64), then your writes cannot be atomic without a lock.
Will the size of your array ever change in a way that requires reallocation? If your reader is walking through your data, will the data suddenly be "gone" (memory has been "delete"d)? Unless your reader takes this into account (unlikely), you'll need a lock if reallocation is possible.
When you write data to your array, is it ok if the reader "sees" old data?
If your data can be written atomically, your array won't suddenly not be there, and it's ok for the reader to see old data... then you won't need a lock. Even with those conditions being met, it would be appropriate to use the built in atomic functions for reading and storing. But, that's a case where you wouldn't need a lock :)
Probably safest to use a lock, since you were unsure enough to ask this question. But if you want to play around with the edge case where you don't need a lock... there you go :)
One of the threads adds new elements to the array [...] and the other [reads] this array
In order to add and remove elements to/from an array, you need an index that specifies the last place in the array where valid data is stored. Such an index is necessary because arrays cannot be resized without potential reallocation (which is a different story altogether). You may also need a second index to mark the initial location from which reading is allowed.
If you have an index or two like this, and assuming that you never re-allocate the array, it is not necessary to lock when you write to the array itself, as long as you lock the writes of valid indexes.
int lastValid = 0;
int shared[MAX];
...
int count = toAddCount;
// Add the new data
for (int i = lastValid ; count != 0 ; count--, i++) {
    shared[i] = new_data(...);
}
// Lock a mutex before modifying lastValid
// You need to use the same mutex to protect reads of the lastValid variable
lock_mutex(lastValid_mutex);
lastValid += toAddCount;
unlock_mutex(lastValid_mutex);
The reason this works is that while you perform the writes to shared[] outside the locked region, the reader never "looks" past the lastValid index. Once the writing is complete, you lock the mutex, which acts as a memory barrier, so the writes to shared[] are guaranteed to be visible before the reader is allowed to see the updated index.
Lock? No. But you do need some synchronization mechanism.
What you're describing sounds an awful lot like an "SPSC" (Single Producer Single Consumer) queue, of which there are tons of lock-free implementations out there, including one in Boost.Lockfree.
The general way these work is that under the covers you have a circular buffer containing your objects and an index. The writer knows the last index it wrote to, and if it needs to write new data it (1) writes to the next slot, (2) updates the index by setting it to the previous slot + 1, and then (3) signals the reader. The reader then reads until it catches up with the writer's index and waits for the next signal. Deletes are implicit, since new items in the buffer overwrite previous ones.
You need a way to atomically update the index, which is provided by atomic<> and has direct hardware support. You need a way for the writer to signal the reader. You also might need memory fences, depending on the platform, so that (1)-(3) occur in order. You don't need anything as heavy as a lock.
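A hedged sketch of that scheme: a minimal single-producer/single-consumer ring of ints with monotonically increasing atomic indices (no reader signalling, the reader just polls; N must be a power of two):

#include <atomic>
#include <cstddef>
#include <optional>

template <std::size_t N>
class SpscRing {
    int buf_[N];
    std::atomic<std::size_t> head_{0};  // next slot the producer fills
    std::atomic<std::size_t> tail_{0};  // next slot the consumer drains
public:
    bool push(int v) {                  // called by the single producer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N)
            return false;               // full
        buf_[h % N] = v;
        head_.store(h + 1, std::memory_order_release);  // publish after the write
        return true;
    }
    std::optional<int> pop() {          // called by the single consumer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return std::nullopt;        // empty
        int v = buf_[t % N];
        tail_.store(t + 1, std::memory_order_release);  // free the slot after reading
        return v;
    }
};

The producer calls push() from its single thread and the consumer calls pop() from its single thread; any other combination breaks the invariants.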
"Classical" POSIX would indeed need a lock for such a situation, but this is overkill. You just have to ensure that the reads and writes are atomic. C and C++ have that in the language since their 2011 versions of their standards. Compilers start to implement it, at least the latest versions of Clang and GCC have it.
It depends. One situation where it could be bad is if you are removing an item in one thread and then reading the last item by its index in your read thread. That read would go out of bounds.
As far as I know, this is exactly the use case for a lock. Two threads which access one array concurrently must ensure that one thread has finished its work before the other reads it.
Thread B might read unfinished data if thread A has not finished its work.
If it's a fixed-size array, and you don't need to communicate anything extra like indices written/updated, then you can avoid mutual exclusion with the caveat that the reader may see:
no updates at all
If your memory ordering is relaxed enough that this happens, you need a store fence in the writer and a load fence in the consumer to fix it
partial writes
if the stored type is not atomic on your platform (int generally should be)
or your values are unaligned, and especially if they may span cache lines
This is all dependent on your platform though - hardware, OS and compiler can all affect it. You haven't told us what they are.
The portable C++11 solution is to use an array of atomic<int>. You still need to decide what memory ordering constraints you require, and what that means for correctness and performance on your platform.
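A hedged sketch of that portable approach (sizes and memory orders are illustrative): a fixed-size array of std::atomic<int> with release stores in the writer and acquire loads in the reader:

#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t kSlots = 64;              // fixed-size buffer, never reallocated
std::array<std::atomic<int>, kSlots> slots{};   // every element individually atomic

void write_slot(std::size_t i, int value) {     // writer thread
    slots[i % kSlots].store(value, std::memory_order_release);
}

int read_slot(std::size_t i) {                  // reader thread
    return slots[i % kSlots].load(std::memory_order_acquire);
}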
If you use e.g. std::vector for your array (so that it can grow dynamically), then reallocation may occur during the writes; you lose.
If you use data entries larger than what can be written and read atomically (virtually any complex data type), you lose.
If the compiler / optimizer decides to keep certain things in registers (such as the counter holding the number of valid entries in the array) during some operations, you lose.
Or even if the compiler / optimizer decides to switch order of execution for your array element assignments and counter increments/decrements, you lose.
So you certainly do need some sort of synchronization. What the best way to do so is (for example, it may be worthwhile to lock only parts of the array) depends on your specifics (how often and in what pattern the threads access the array).
If I update a variable in one thread like this:
receiveCounter++;
and then from another thread I only ever read this variable and write its value to a GUI.
Is that safe? Or could this instruction be interrupted in the middle so the value in receiveCounter is wrong when it is read by another thread? It must be, right, since ++ is not atomic - it is several instructions.
I don't care about synchronizing reads and writes, it just needs to be incremented and then update in the GUI but this does not have to happen directly after each other.
What I care about is that the value cannot be wrong. Like the ++ operation being interrupted in the middle so the read value is completely off.
Do I need to lock this variable? I really do not want to, since it is updated very often. I guess I could solve this by posting a message to the MAIN thread and copying the value to a queue (which would then need to be locked, but I would not do this on every update).
But I am interested in the above problem anyway.
If one thread changes the value of a variable and another thread reads that value, and the program does not synchronize the accesses, it has a data race, and the behavior of the program is undefined. Change the type of receiveCounter to std::atomic<int> (assuming it's an int to begin with).
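A hedged sketch for this particular counter: since the GUI only needs a reasonably recent value, relaxed ordering is enough and the increment stays cheap:

#include <atomic>

std::atomic<int> receiveCounter{0};

// Receiving thread: one atomic read-modify-write, no lock needed.
void on_receive() {
    receiveCounter.fetch_add(1, std::memory_order_relaxed);
}

// GUI thread: reads whatever the latest published value happens to be.
int current_count() {
    return receiveCounter.load(std::memory_order_relaxed);
}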
At its core it is a read-modify-write operation, they are not atomic. There are some processors around that have a dedicated instruction for it. Like Intel/AMD cores, very common, they have an INC instruction.
While that sounds like that could be atomic, since it is a single instruction, it still isn't. The x86/x64 instruction set doesn't have much to do anymore with the way the execution engine is actually implemented. Which executes RISC-like "micro-ops", the INC instruction is translated to multiple micro-ops. It can be made atomic with the LOCK prefix on the instruction. But compilers don't emit it unless they know that an atomic update is desired.
You will thus need to be explicit about it. The C++11 std::atomic<> standard addition is a good way. Your operating system or compiler will also have intrinsics for it, usually named something like "Interlocked" or "__builtin".
Simple answer: NO.
Because i++; is the same as i = i + 1;, it consists of a load, a math operation, and a store of the value, so in general it is not atomic.
However, the actually executed operations depend on the instruction set of the CPU and might be atomic, depending on the architecture. But by default it is still not atomic.
Generally speaking, it is not thread-safe, because the ++ operator consists of one read and one write, a pair which is not atomic and can be interrupted in between.
Then, it probably also depends on the language/compiler/architecture, but in a typical case the increment operation is probably not thread safe.
(edited)
As for the read and write operations themselves, as long as the counter is not a 64-bit value on a 32-bit platform, they should be atomic, so if you don't mind other threads occasionally seeing the value off by 1, it may be OK for your case.
No. Increment operation is not atomic, hence not thread-safe.
In your case it is safe to use this operation if you don't care about its value at any specific time (i.e. if you only read this variable from another thread and never try to write to it). It will eventually increment receiveCounter's value; you just don't have any guarantees about the order of operations.
++ is equivalent to i=i+1.
++ is not an atomic operation, so it's NOT thread-safe. Only reads and writes of primitive variables (except long and double) are atomic and hence thread-safe.
I have the following situation (caused by a defect in the code):
There's a shared variable of primitive type (let it be int) that is initialized during program startup from strictly one thread to value N (let it be 0). Then (strictly after the variable is initialized) during the program runtime various threads are started and they in some random order either read that variable or overwrite it with the very same value N (0 in this example). There's no synchronization around accessing the variable.
Can this situation cause unexpected behavior in the program?
It's incredibly unlikely but not impossible according to the standard.
There's nothing stating what the underlying representation of an integer is, nor does the standard specify how the values are loaded.
I can envisage, however weird, an implementation where the underlying bit pattern for 0 is 10101010 and the architecture only supports loading data into memory by bit-shifting it over eight cycles but reading it as a single unit in one cycle.
If another thread reads the value while the bit pattern is being shifted in (e.g., 00000001, 00000010, 00000101 and so on), you will have a problem.
The chances of anyone designing such a bizarre architecture are so close to zero as to be negligible. But, unfortunately, they're not zero. All I'm trying to get across is that you shouldn't rely on assumptions at all when it comes to standards compliance.
And please, before you vote me down, feel free to quote the part of the standard that states this is not possible :-)
Since C++ does not currently have a standard concurrency model, it would depend entirely on your threading implementation and whatever guarantees it gives. It is all but certainly unsafe in the general case, however, because of the potential for torn reads. There might be specific cases where it would "work" or at least "appear to work."
In C++0x (which does have a standard concurrency model), your scenario would formally result in undefined behavior. There is a long, detailed, hard-to-read specification of the concurrency model in the C++0x Final Committee Draft §1.10, but it basically boils down to this:
Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location (§1.10/3).
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior (§1.10/14).
Your expression evaluations clearly conflict because they modify and read the same memory location, and since the object is not atomic and access is not synchronized using a lock, you have undefined behavior.
No. Of course, you could end up with a data race if one of the threads later tries to change the value. You will also end up with a little cache contention, but I doubt this will have a noticeable effect.
You cannot really rely on it. For primitive types you should be fine, and if the operation is atomic (e.g. a correctly aligned int on most platforms) then writing and reading different values is safe (note that by this I mean something like "x = 5;", not "x += 5;", which is never atomic and is not thread-safe).
For non-primitive types, even if it's the same value, all bets are off, since there may be a copy constructor that does something unsafe (like allocating memory).
Yes it is possible for unexpected behavior to happen in this scenario. Consider the case where the initial value of the variable was not 0. It is possible for one thread to start the set to 0 and another thread see the variable with only some of the bytes set.
For an int this is very unlikely, as most processors have atomic assignment of word-sized values. However, once you hit 8-byte numeric values (long on some platforms) or large structs, this begins to be an issue.
If no other thread (and this includes the main thread) can change the value from 0 to anything else (let's say 1) while those threads are initializing, then you will not have problems. But if any other thread had the potential to change the value during the start-up phase, you could have a problem. You are playing a dangerous game, and I would recommend locking before reading the value.
I need to provide synchronization to some members of a structure.
If the structure is something like this
struct SharedStruct {
    int Value1;
    int Value2;
};
and I have a global variable
SharedStruct obj;
I want that the write from a processor
obj.Value1 = 5; // Processor B
to be immediately visible to the other processors, so that when I test the value
if(obj.Value1 == 5) { DoSmth(); } // Processor A
else DoSmthElse();
to get the new value, not some old value from the cache.
At first I thought that using volatile when writing/reading the values would be enough, but I have read that volatile can't solve this kind of issue.
The members are guaranteed to be properly aligned on 2/4/8 byte boundaries, and writes should be atomic in this case, but I'm not sure how the cache could interfere with this.
Would using memory barriers (mfence, sfence, etc.) be enough? Or are some interlocked operations required?
Or maybe something like
lock mov addr, REGISTER
?
The easiest would obviously be some locking mechanism, but speed is critical and can't afford locks :(
Edit
Maybe I should clarify a bit. The value is set only once (behaves like a flag). All the other threads need just to read it. That's why I think that it may be a way to force the read of this new value without using locks.
Thanks in advance!
There Ain't No Such Thing As A Free Lunch. If your data is being accessed from multiple threads, and it is necessary that updates are immediately visible by those other threads, then you have to protect the shared struct by a mutex, or a readers/writers lock, or some similar mechanism.
Performance is a valid concern when synchronizing code, but it is trumped by correctness. Generally speaking, aim for correctness first and then profile your code. Worrying about performance when you haven't yet nailed down correctness is premature optimization.
Use explicit atomic instructions. I believe most compilers offer these as intrinsics. Compare-and-exchange is another good one.
If you intend to write a lockless algorithm, you need to write it so that your changes only take effect when conditions are as expected.
For example, if you intend to insert a node into a linked list, use the compare/exchange stuff so that it only inserts if the pointer still points at the same location when you actually do the update.
Or if you are going to decrement a reference count and free the memory at count 0, you will want to pre-free it by making it unavailable somehow, check that the count is still 0 and then really free it. Or something like that.
Using a lock, operate, unlock design is generally a lot easier. The lock-free algorithms are really difficult to get right.
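For instance, a hedged sketch of the linked-list insertion idea above, using compare_exchange so the new node goes in only if the head still points where we last saw it (the classic lock-free stack push; names are illustrative):

#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int value) {
    Node* node = new Node{value, head.load(std::memory_order_relaxed)};
    // Retry until no other thread has changed head between our load and our swap.
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // On failure, node->next has been refreshed with the current head; just try again.
    }
}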
All the other answers here seem to hand wave about the complexities of updating shared variables using mutexes, etc. It is true that you want the update to be atomic.
And you could use various OS primitives to ensure that, and that would be good programming style.
However, on most modern processors (certainly the x86), writes of small, aligned scalar values are atomic and immediately visible to other processors due to cache coherency.
So in this special case, you don't need all the synchronizing junk; the hardware does the atomic operation for you. Certainly this is safe with 4-byte values (e.g., "int" in 32-bit C compilers).
So you could just initialize Value1 with an uninteresting value (say 0) before you start the parallel threads, and simply write other values there. If the goal is to exit a loop on a fixed value (e.g., if Value1 == 5), this will be perfectly safe.
If you insist on capturing the first value written, this won't work. But if you have a parallel set of threads, and any value written other than the uninteresting one will do, this is also fine.
I second peterb's answer to aim for correctness first. Yes, you can use memory barriers here, but they will not do what you want.
You said "immediately". However, no matter how immediate the update is, you could (and will) end up with the if() condition being evaluated, then the flag being set, and then DoSmthElse() being executed afterwards. This is called a race condition...
You want to synchronize something, it seems, but it is not this flag.
Making the field volatile should make the change "immediately" visible in other threads, but there is no guarantee that the instant at which thread A executes the update doesn't occur after thread B tests the value but before thread B executes the body of the if/else statement.
It sounds like what you really want to do is make that if/else statement atomic, and that will require either a lock, or an algorithm that is tolerant of this sort of situation.
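A hedged sketch of the set-once flag from the question using std::atomic (DoSmth and DoSmthElse are assumed to be defined elsewhere); note that if only one thread may act on the transition, the test-and-act itself still needs a compare-exchange or a lock:

#include <atomic>

void DoSmth();        // defined elsewhere, as in the question
void DoSmthElse();

std::atomic<int> Value1{0};   // standing in for obj.Value1

void writer() {                                        // "Processor B"
    Value1.store(5, std::memory_order_release);        // becomes visible to acquire loads
}

void reader() {                                        // "Processor A"
    if (Value1.load(std::memory_order_acquire) == 5)
        DoSmth();
    else
        DoSmthElse();
}

// If exactly one thread may act on the transition, the test-and-act must itself
// be atomic, e.g. by claiming the flag with a compare-exchange (5 -> 6):
void reader_exclusive() {
    int expected = 5;
    if (Value1.compare_exchange_strong(expected, 6))
        DoSmth();
    else
        DoSmthElse();
}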