Since std::atomic::is_lock_free() may not genuinely reflect the reality [ref], I'm considering writing a genuine runtime test instead. However, when I got down to it, I found that it's not the trivial task I thought it to be. I'm wondering whether there is some clever idea that could do it.
Other than performance, the standard doesn't guarantee any way you can tell; that's more or less the point.
If you are willing to introduce some platform-specific UB, you could do something like cast an atomic<int64_t>* to a volatile int64_t* and see if you observe "tearing" when another thread reads the object. (When to use volatile with multi threading? - normally never, but real hardware has coherent caches between the cores that run threads, so plain asm loads/stores are basically like relaxed atomics.)
If this test succeeds (i.e. the plain C++ type was naturally atomic with just volatile), that tells you any sane compiler will make it lock-free very cheaply. But if it fails, it doesn't tell you very much. A lock-free atomic for that type may be only slightly more expensive than the plain version for loads/stores, or the compiler may not make it lock-free at all. e.g. on 32-bit x86, lock-free int64_t is efficient with only small overhead (using SSE2 or x87), but volatile int64_t* will produce tearing because most compilers compile it to two separate 4-byte integer loads or stores.
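A rough sketch of the kind of tearing test I mean (it relies on the platform-specific UB described above, so treat it as an experiment rather than portable code):

#include <atomic>
#include <cstdint>
#include <thread>

// Sketch only: casting away the atomic wrapper is UB, but on real hardware
// it lets us observe whether plain 8-byte loads/stores tear.
bool plain_int64_tears() {
    std::atomic<std::int64_t> obj{0};
    volatile std::int64_t* raw = reinterpret_cast<volatile std::int64_t*>(&obj);
    std::atomic<bool> stop{false}, tore{false};

    std::thread writer([&] {
        while (!stop.load(std::memory_order_relaxed)) {
            *raw = 0;      // all-zeros
            *raw = -1;     // all-ones
        }
    });
    std::thread reader([&] {
        for (int i = 0; i < 100000000; ++i) {
            std::int64_t v = *raw;
            if (v != 0 && v != -1) { tore = true; break; }   // mixed halves => torn access
        }
        stop = true;
    });
    writer.join();
    reader.join();
    return tore;
}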
On any specific platform / target architecture, you can single-step your code in a debugger and see what asm instructions run. (Including stepping into libatomic function calls like __atomic_store_16). This is the only 100% reliable way. (Plus consulting ISA documentation to check atomicity guarantees for different instructions, e.g. whether ARM load/store pair is guaranteed, under what conditions.)
(Fun fact: gcc7 with statically linked libatomic may always use locking for 16-byte objects on x86-64, because it doesn't have an opportunity to do runtime CPU detection at dynamic link time and use lock cmpxchg16b on CPUs that support it, with the same mechanism glibc uses to pick optimal memcpy / strchr implementations for the current system.)
You could portably look for a performance difference (e.g. scalability with multiple readers), but x86-64 lock cmpxchg16b doesn't scale (see footnote 1). Multiple readers contend with each other, unlike 8-byte and narrower atomic objects where pure asm loads are atomic and can be used. lock cmpxchg16b acquires exclusive access to a cache line before executing; abusing the side-effect of atomically loading the old value on failure to implement .load() is much worse than an 8-byte atomic load which compiles to just a regular load instruction.
That's part of the reason that gcc7 decided to stop returning true for is_lock_free() on 16-byte objects, as described in the GCC mailing list message about the change you're asking about.
Also note that clang on 32-bit x86 uses lock cmpxchg8b to implement std::atomic<int64_t>, just like for 16-byte objects in 64-bit mode. So you would see a lack of parallel read scaling with it, too. (https://bugs.llvm.org/show_bug.cgi?id=33109)
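If you do want to try the performance route, here's a rough sketch of the kind of read-scaling measurement I mean (the 16-byte struct, thread counts and iteration counts are placeholders you'd tune; treat the result as a hint, not proof):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

struct Pair { std::int64_t lo, hi; };   // a 16-byte object to test with

// Measure aggregate .load() throughput with N readers.  If loads are plain
// loads, throughput scales with N; if they're lock cmpxchg16b (or an actual
// lock), readers contend with each other and throughput stays roughly flat.
double loads_per_second(int nthreads) {
    static std::atomic<Pair> shared{Pair{0, 0}};
    const long iters = 10000000;
    std::vector<std::thread> readers;
    auto t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < nthreads; ++t)
        readers.emplace_back([&] {
            std::int64_t sink = 0;
            for (long i = 0; i < iters; ++i)
                sink += shared.load(std::memory_order_relaxed).lo;
            volatile std::int64_t keep = sink; (void)keep;   // defeat dead-code elimination
        });
    for (auto& th : readers) th.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return nthreads * static_cast<double>(iters) / dt.count();
}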
std::atomic<> implementations that use locking usually still don't make the object larger by including a lock byte or word in each object. It would change the ABI, but lock-free vs. locking is already an ABI difference. The standard allows this, I think, but weird hardware might need extra bytes in the object even when lock-free. Anyway sizeof(atomic<T>) == sizeof(T) doesn't tell you anything either way. If it's larger it's most likely that your implementation added a mutex, but you can't be sure without checking the asm. (If the size wasn't a power of 2, it could have widened it for alignment.)
(In C11, there's much less scope for including a lock in the object: it has to work even with minimal initialization (e.g. statically to 0), and no destructor. Compilers / ABIs generally want their C stdatomic.h atomics to be compatible with their C++ std::atomic atomics.)
The normal mechanism is to use the address of the atomic object as a key for a global hash table of locks. Two objects aliasing / colliding and sharing the same lock is extra contention, but not a correctness problem. These locks are only taken/released from library functions, not while holding other such locks, so it can't create a deadlock.
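Roughly what that mechanism looks like (a hypothetical sketch with made-up names; real libatomic uses spinlocks and its own table size and hash function, but the idea is the same):

#include <cstdint>
#include <mutex>

// Hypothetical sketch of the address-keyed lock table described above.
static std::mutex g_lock_table[64];

static std::mutex& lock_for(const volatile void* addr) {
    // Objects in the same cache line (low 6 bits dropped) share a lock; collisions
    // between unrelated objects are just extra contention, not a correctness bug.
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(addr);
    return g_lock_table[(a >> 6) % 64];
}

// A non-lock-free "atomic" load, the way a library might implement it.
template <class T>
T locked_load(const volatile T* obj) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    return *const_cast<const T*>(obj);
}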
You could detect this by using shared memory between two different processes (so each process would have its own hash table of locks).
Is C++11 atomic<T> usable with mmap?
Check that std::atomic<T> is the same size as T (so the lock isn't in the object itself).
Map a shared memory segment from two separate processes that don't otherwise share any of their address space. It doesn't matter if you map it to a different base address in each process.
Store patterns like all-ones and all-zeros from one process while reading from the other (and look for tearing). Same as what I suggested with volatile above.
Also test atomic increment: have each process do 1G increments and check that the result is 2G every time (see the sketch below). Even if pure load and pure store are naturally atomic (the tearing test), read-modify-write operations like fetch_add / operator++ need special support: Can num++ be atomic for 'int num'?
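Here's a rough sketch of the increment half of that test on POSIX, using fork and an anonymous shared mapping (error handling omitted; the 1G/2G counts are the ones from above):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    // If the lock were embedded in the object, the sizes would differ and
    // this whole test would be meaningless.
    static_assert(sizeof(std::atomic<std::uint64_t>) == sizeof(std::uint64_t),
                  "lock may be embedded in the object");

    void* mem = mmap(nullptr, sizeof(std::atomic<std::uint64_t>),
                     PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto* counter = new (mem) std::atomic<std::uint64_t>(0);

    const std::uint64_t iters = 1000000000;   // 1G increments per process
    pid_t child = fork();                     // both processes run the loop below
    for (std::uint64_t i = 0; i < iters; ++i)
        counter->fetch_add(1, std::memory_order_relaxed);

    if (child == 0)
        _exit(0);                 // child is done
    wait(nullptr);                // parent waits, then checks for exactly 2G
    std::printf("%llu (expect %llu)\n",
                (unsigned long long)counter->load(),
                (unsigned long long)(2 * iters));
}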
From the C++11 standard, the intent is that this should still be atomic for lock-free objects. It might also work for non-lock-free objects (if they embed the lock in the object), which is why you have to rule that out by checking sizeof().
To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state.
If you see tearing between two processes, the object wasn't lock-free (at least not the way C++11 intended, and not the way you'd expect on normal shared-memory CPUs.)
I'm not sure why address-free matters if the processes don't have to share any address-space other than 1 page containing the atomic object (see footnote 2). (Of course, C++11 doesn't require that the implementation uses pages at all. Or maybe an implementation could put the hash table of locks at the top or bottom of each page? In which case using a hash function that depended on address bits above the page offset would be totally silly.)
Anyway, this depends on a lot of assumptions about how computers work that are true on all normal CPUs, but which C++ doesn't make. If the implementation you care about is on a mainstream CPU like x86 or ARM under a normal OS, then this testing method should be fairly accurate and might be an alternative to just reading the asm. It's not something that's very practical to do automatically at compile time, but it would be possible to automate a test like this and put it into a build script, unlike reading the asm.
Footnote 1: 16-byte atomics on x86
(Update: Intel recently documented that the AVX feature bit implies 16-byte atomicity for aligned loads/stores, such as with movaps. At least on Intel CPUs specifically; AMD CPUs with AVX in practice seem to be like that too, but AMD hasn't yet documented it officially. The rest of this answer was written before that, but GCC's libatomic does use vmovdqa [mem], xmm / mfence for atomic 16-byte stores on CPUs where that's guaranteed atomic.)
No x86 hardware documents support for 16-byte atomic load/store with SSE instructions. In practice many modern CPUs do have atomic movaps load/store, but there are no guarantees of this in Intel/AMD manuals the way there are for 8-byte x87/MMX/SSE loads/stores on Pentium and later. And no way to detect which CPUs do/don't have atomic 128-bit ops (other than lock cmpxchg16b), so compiler writers can't safely use them.
See SSE instructions: which CPUs can do atomic 16B memory operations? for a nasty corner case: testing on K10 shows that aligned xmm load/store shows no tearing between threads on the same socket, but threads on different sockets experience rare tearing because HyperTransport apparently only gives the minimum x86 atomicity guarantee of 8 byte objects. (IDK if lock cmpxchg16b is more expensive on a system like that.)
Without published guarantees from vendors, we can never be sure about weird microarchitectural corner cases, either. Lack of tearing in a simple test with one thread writing patterns and the other reading is pretty good evidence, but it's always possible that something could be different in some special case the CPU designers decided to handle a different way than normal.
A pointer + counter struct where read-only access only needs the pointer can be cheap, but current compilers need union hacks to get them to do an 8-byte atomic load of just the first half of the object. How can I implement ABA counter with c++11 CAS?. For an ABA counter, you'd normally update it with a CAS anyway, so lack of a 16-byte atomic pure store is not a problem.
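For concreteness, a minimal sketch of the pointer+counter pattern being discussed (the names are made up; whether std::atomic<Head> is actually lock-free on your implementation is exactly the question at hand):

#include <atomic>
#include <cstdint>

struct Node;                           // whatever the list nodes are
struct alignas(16) Head {
    Node* ptr;
    std::uintptr_t count;              // bumped on every update to defeat ABA
};

std::atomic<Head> list_head{Head{nullptr, 0}};   // check is_lock_free() / the asm!

void publish_top(Node* new_top) {
    Head expected = list_head.load(std::memory_order_relaxed);
    Head desired;
    do {
        desired.ptr = new_top;
        desired.count = expected.count + 1;
    } while (!list_head.compare_exchange_weak(expected, desired,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
}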
An ILP32 ABI (32-bit pointers) in 64-bit mode (like Linux's x32 ABI, or AArch64's ILP32 ABI) means pointer+integer can fit in only 8 bytes, but integer registers are still 8 bytes wide. This makes it much more efficient to use a pointer+counter atomic object than in full 64-bit mode where a pointer is 8 bytes.
Footnote 2: address-free
I think the term "address-free" is a separate claim from not depending on any per-process state. As I understand it, it means that correctness doesn't depend on both threads using the same address for the same memory location. But if correctness also depends on them sharing the same global hash table (IDK why storing the address of an object in the object itself would ever help), that would only matter if it was possible to have multiple addresses for the same object within the same process. That is possible on something like x86's real-mode segmentation model, where a 20-bit linear address space is addressed with 32-bit segment:offset. (Actual C implementations for 16-bit x86 exposed segmentation to the programmer; hiding it behind C's rules would be possible but not high performance.)
It's also possible with virtual memory: two mappings of the same physical page to different virtual addresses within the same process is possible but weird. That might or might not use the same lock, depending on whether the hash function uses any address bits above the page offset.
(The low bits of an address, that represent the offset within a page, are the same for every mapping. i.e. virtual to physical translation for those bits is a no-op, which is why VIPT caches are usually designed to take advantage of that to get speed without aliasing.)
So a non-lock-free object might be address-free within a single process, even if it uses a separate global hash table instead of adding a mutex to the atomic object. But this would be a very unusual situation; it's extremely rare to use virtual memory tricks to create two addresses for the same variable within the same process that shares all of its address-space between threads. Much more common would be atomic objects in shared memory between processes. (I may be misunderstanding the meaning of "address-free"; possibly it means "address-space free", i.e. lack of dependency on other addresses being shared.)
I think you are really just trying to detect this special case specific to gcc where is_lock_free reports false, but the underlying implementation (hidden behind a libatomic function call) is still using cmpxchg16b. You want to know about this, since you consider such an implementation genuinely lock free.
In that case, as a practical matter, I would just write your detection function to hardcode the gcc version range you know operates in this manner. Currently, all versions after the one in which the change to stop inlining cmpxchg16b was made apparently still use a lock-free implementation under the covers, so a check today would be "open ended" (i.e., all versions after X). Prior to this point is_lock_free returns true (which you consider correct). After some hypothetical future change to gcc which makes the library call use locks, the is_lock_free() == false answer will become genuinely true, and you'll close your check by recording the version in which it occurred.
So something like this should be a good start:
// Compare combined major.minor numbers so the range check doesn't misfire
// across major versions (e.g. gcc 8.0 against a 7.x lower bound).
#define LF16_GCC_VER   (__GNUC__ * 100 + __GNUC_MINOR__)
#define LF16_VER_FIRST (LF16_MAJOR_FIRST * 100 + LF16_MINOR_FIRST)
#define LF16_VER_LAST  (LF16_MAJOR_LAST  * 100 + LF16_MINOR_LAST)

template <typename T>
bool is_genuinely_lock_free(std::atomic<T>& t) {
#if LF16_GCC_VER >= LF16_VER_FIRST && LF16_GCC_VER <= LF16_VER_LAST
    return sizeof(T) == 16 || t.is_lock_free();
#else
    return t.is_lock_free();
#endif
}
Here the LF16 macros define the version range where gcc returns the "wrong" answer for is_lock_free for 16-byte objects. Note that since the second half of this change (making __atomic_load_16 and friends use locks) hasn't happened yet, you only need the first half of the check today. You need to determine the exact version when is_lock_free() started returning false for 16-byte objects: the links Peter provides discussing this issue are a good start, and you can do some checking in godbolt - although the latter doesn't provide everything you need since it doesn't show the disassembly of library functions like __atomic_load_16: you may need to dig into the libatomic source for that. It's also possible that the macro check should be tied to the libstdc++ or libatomic version instead of the compiler version (although AFAIK in typical installs the versions of all of those are bound together). You'll probably want to add a few more checks to the #if to limit it to 64-bit x86 platforms as well.
I think this approach is valid since the concept of genuinely lock-free isn't really well-defined: you have decided in this case you want to consider the cmpxchg16b implementation in gcc lock-free, but if other grey areas occur in other future implementations you'll want to make another judgment call about whether you consider them lock-free. So the hardcoding approach seems approximately as robust for the non-gcc cases as some type of detection, since in either case unknown future implementations may trigger the wrong answer. For the gcc case it seems more robust and definitely simpler.
The basis for this idea is that getting the answer wrong is not going to be a world-destroying functional problem, but rather a performance issue: I'm guessing you are trying to do this detection to select between alternate implementations, one of which is faster on a "genuinely" lock-free system, and the other more suitable when std::atomic is lock-based.
If your requirements are stronger, and you really want to be more robust, why not combine approaches: use this simple version detection approach and combine it with a runtime/compile-time detection approach which examines tearing behavior or decompilation as suggested in Peter's answer. If both approaches agree, use it as your answer; if they disagree, however, surface the error and do further investigation. This will also help you catch the point, if ever, at which gcc changes the implementation to make 16-byte objects lock-full.
Related
I know that atomic will apply a lock on a type "T" variable when multiple threads are reading and writing it, making sure only one of them is doing the R/W at a time.
But in a multi-core computer, threads can run on different cores, and different cores have their own L1 and L2 caches while sharing the L3 cache. We know the C++ compiler will sometimes optimize a variable to be stored in a register, so if a variable is not stored in memory, there's no memory synchronization between the different cores' caches for that variable.
So my worry/question is: if an atomic variable is optimized into a register by the compiler, then it's not stored in memory, and when one core writes its value, another core could read out a stale value, right? Is there any guarantee on this data consistency?
Thanks.
Atomic doesn't "solve" things the way you vaguely describe. It provides certain very specific guarantees involving consistency of memory based on ordering.
Various compilers implement these guarantees in different ways on different platforms.
On x86/64, no locks are used for atomic integers and pointers up to a reasonable size. And the hardware provides stronger guarantees than the standard requires, making some of the more esoteric options equivalent to full consistency.
I won't be able to fully answer your question but I can point you in the right direction; the topic you need to learn about is "the C++ memory model".
That being said, atomics exist in order to avoid the exact problem you describe. If you ask for full memory order consistency, and thread A modifies X then Y, no other thread can see Y modified but not X. How that guarantee is provided is not specified by the C++ standard; cache line invalidation, using special instructions for access, barring certain register-based optimizations by the compiler, etc are all the kind of thing that compilers do.
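A minimal illustration of that guarantee, using the default seq_cst ordering (the names are just for the example):

#include <atomic>
#include <cassert>

std::atomic<int> X{0}, Y{0};

void thread_A() {
    X.store(1);                  // default memory_order_seq_cst
    Y.store(1);
}

void thread_B() {
    if (Y.load() == 1)           // if B sees the write to Y ...
        assert(X.load() == 1);   // ... it must also see the write to X
}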
Note that the C++ memory model was refined, bugfixed and polished for C++17 in order to describe the behaviour of the new parallel algorithms and permit their efficient implementation on GPU hardware (among other spots) with the right flags, and in turn it influenced the guarantees that new GPU hardware provides. So people talking about memory models may be excited and talk about more modern issues than your mainly C++11 concerns.
This is a big complex topic. It is really easy to write code you think is portable, yet only works on a specific platform, or only usually works on the platform you tested it on. But that is just because threading is hard.
You may be looking for this:
[intro.progress]/18 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
As a hobby project I'm toying around with creating a programming language with garbage collection.
The language will be compiled to (preferably portable) C++ and supports threads.
The question is:
Suppose two threads "simultaneously" write different values to the same (pointer-sized and aligned) memory location.
Is it then possible for any thread to read a mix between the two values?
For example on a 32 bit platform:
Thread 1 writes: AAAAAAAA
Thread 2 writes: BBBBBBBB
Will any thread always read AAAAAAAA or BBBBBBBB or could they read AAAABBBB or some other "mix" between the two?
I don't care about the ordering and what ends up being the final value. The important thing is just that no invalid value can ever be read from the location.
I realize that this may depend on the platform and C++ may not provide any promises for it.
Would it be guaranteed for some platforms and would that involve using inline assembler to achieve it?
PS: I believe std::atomic would make such guarantees, but I think it would be far too much overhead to use for all load/store operations for object references.
C++ makes no such guarantees, it depends on the hardware.
On typical hardware / processors such as ARM, x86 and amd64, as long as the accesses are 32-bit aligned, 32-bit read and write operations will be atomic.
If you read/write 32 bits a byte at a time (such as with strcpy, memcpy, etc.), all bets are off - it depends very much on the implementation of those functions (they tend to get lots of optimizations).
It gets more complicated on some platforms when there are multiple memory locations.
Say you have
#include <cstdint>

extern std::int32_t a;
extern std::int32_t b;

void thread1() {
    a = 0x12345678;   // each plain aligned 32-bit store is atomic on its own
    b = 0x87654321;   // ...but nothing orders it relative to the store to a
}
Now, individually, a and b are written to atomically by thread 1, but an observer, thread 2, may "see" the value of b change before a.
This can happen due to hardware and software.
The software (the C++ compiler / optimizer), may rearrange your code if it thinks it would be better. (Or, the compiler may even avoid writing the values to a and b in some cases).
The hardware can also rearrange memory reads/writes at runtime - which is visible when thread 1 and thread 2 are running on different cores, and until core 1 does something to synchronize its internal memory pipeline with the rest of the system, core 2 may see something different. IA-64 is pretty aggressive about these sorts of optimizations; x86 is not so much (as it would break too much legacy code, I presume).
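To make the a/b example above well-defined, here's a sketch using std::atomic with release/acquire ordering, which constrains both the compiler and the hardware:

#include <atomic>
#include <cstdint>

std::atomic<std::int32_t> atomic_a{0};
std::atomic<std::int32_t> atomic_b{0};

void writer_thread() {
    atomic_a.store(0x12345678, std::memory_order_relaxed);
    atomic_b.store(0x87654321, std::memory_order_release);   // publishes a as well
}

std::int32_t reader_thread() {
    if (atomic_b.load(std::memory_order_acquire) == 0x87654321)
        return atomic_a.load(std::memory_order_relaxed);      // guaranteed to be 0x12345678
    return 0;
}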
In C/C++, "volatile" basically lets you tell the compiler to be less aggressive with optimizations around this variable -- though exactly what it does depends on the implementation. Usually what it means is that the compiler won't optimize away reads/writes to volatile variables and generally won't rearrange accesses to them either.
This doesn't change what the processor might do at runtime though.
For that, you need to use the special "memory barrier" intrinsics / operations.
The details of these are complex, and are usually hidden behind such things as "atomic".
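For the object-reference use case in the question, a sketch of what I'd reach for first: on mainstream targets, relaxed loads/stores of an aligned pointer-sized std::atomic compile to plain load/store instructions, so the overhead feared in the PS is mostly about lost compiler optimizations rather than locking. (This is a sketch, not a drop-in GC design.)

#include <atomic>

// Hypothetical GC reference slot: no locks, no fences, just no tearing
// and no compiler-invented partial writes.
struct ObjRef {
    std::atomic<void*> slot{nullptr};

    void* get() const  { return slot.load(std::memory_order_relaxed); }
    void  set(void* p) { slot.store(p, std::memory_order_relaxed); }
};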
Oh, also, most systems have magical memory -- certain addresses which are reserved by the hardware for special purposes. Typically unless you are writing device drivers you won't run into this.
I thought only string and (certain) streaming instructions require memory barriers on Intel x86? For all other instructions the Intel strong memory ordering model ensures that consistency is achieved?
Assuming the above is correct, why do we need to use C++ atomics (excluding compare-and-exchange) when our code is only executing on Intel x86?
I really am getting myself confused where we need to use atomics and whether using atomics inhibits the out-of-order-execution due to memory barriers and then the whole MESI protocol.
MESI just ensures the caches are consistent across all processors?
Memory barriers are useful on other architectures because they flush the CPU store buffers to the caches, to allow MESI to ensure consistency?
When do we need to use atomics on Intel X86 CPUs?
While x86 is cache-coherent, that doesn't mean it gives you all the guarantees you expect to find. There are different instructions for atomic stores and regular stores, and they behave differently.
Also, and equally importantly, atomic variables prevent 'destructive' compiler optimizations. Without them, the compiler will happily optimize your code according to a single-threaded execution model, and your program will misbehave.
You need to forget about the specific architecture that you compile against. You're programming C++, and that means you can only rely on the guarantees that your specific C++ implementation gives you. This should of course include whatever the Standard guarantees. In addition you might get some additional guarantees from your specific C++ implementation.
The reason is that the C++ implementation is allowed to do pretty much everything, as long as it doesn't break the rules of the C++ Standard. That means it may "remove" or "destroy" some guarantees that the platform you're compiling against would "normally" provide (meaning: when you program it in assembler). E.g. it might use string/MMX/SSE/SSE2/... instructions where you don't expect it, reorder instructions, coalesce writes, place non-atomic data at addresses that aren't suitably aligned for atomicity etc.
That being said, there is at least one C++ implementation that gives you rather strong additional guarantees about memory ordering, and that is Visual C++. It guarantees that volatile loads always have acquire semantics and volatile stores always have release semantics. (At least on x86/amd64 with default compiler settings.) It also guarantees that properly aligned volatile reads and writes are atomic. See MSDN for details.
I thought only string and (certain) streaming instructions require memory barriers on Intel x86? For all other instructions the Intel strong memory ordering model ensures that consistency is achieved?
This is true at the assembly language level.
Assuming the above is correct, why do we need to use C++ atomics (excluding compare-and-exchange) when our code is only executing on Intel x86?
How else would you perform atomic operations like compare and exchange? Atomicity and memory visibility have almost nothing to do with each other.
I really am getting myself confused where we need to use atomics and whether using atomics inhibits the out-of-order-execution due to memory barriers and then the whole MESI protocol.
We need to use atomics where we need atomicity. We need to use memory barriers where we need memory visibility because even though the x86's assembly language memory model might not require them, there's no guarantee this will "pass through" the compiler seamlessly to the C++ level.
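A concrete example of that "pass through the compiler" problem: even on x86, with a plain int the compiler is allowed to hoist the load out of this loop (it sees no legal way for the value to change), turning it into an infinite loop. Declaring it std::atomic forces a real load every iteration:

#include <atomic>

std::atomic<int> ready{0};   // with a plain 'int' this wait might never end

void wait_for_ready() {
    while (ready.load(std::memory_order_acquire) == 0) {
        // spin; a real program would back off or block instead
    }
}

void publish() {
    // ... prepare data ...
    ready.store(1, std::memory_order_release);
}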
If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e. It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
The consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also insert the relevant memory barriers.
Mutexes work the same way: the mutex implementations (e.g. pthreads or WinAPI or whatnot) internally also insert memory barriers.
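A sketch of what that buys you in practice - plain shared data protected only by a mutex, no atomics needed:

#include <mutex>

std::mutex m;
int shared_value = 0;            // plain int, not std::atomic

void writer() {
    std::lock_guard<std::mutex> g(m);
    shared_value = 42;           // the unlock at scope exit is a release operation
}

int reader() {
    std::lock_guard<std::mutex> g(m);
    return shared_value;         // the lock is an acquire, so the 42 is visible
}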
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And like syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and also force the write out to memory (or at least into cache) when, for example, a variable is held in a register because of compiler optimizations.
If I am accessing a single integer type (e.g. long, int, bool, etc...) in multiple threads, do I need to use a synchronisation mechanism such as a mutex to lock them? My understanding is that as atomic types, I don't need to lock access to a single thread, but I see a lot of code out there that does use locking. Profiling such code shows that there is a significant performance hit for using locks, so I'd rather not. So if the item I'm accessing corresponds to a bus-width integer (e.g. 4 bytes on a 32-bit processor), do I need to lock access to it when it is being used across multiple threads? Put another way, if thread A is writing to integer variable X at the same time as thread B is reading from the same variable, is it possible that thread B could end up reading a few bytes of the previous value mixed in with a few bytes of the value being written? Is this architecture-dependent, e.g. OK for 4-byte integers on 32-bit systems but unsafe for 8-byte integers on 64-bit systems?
Edit: Just saw this related post which helps a fair bit.
You are never locking a value - you are locking an operation ON a value.
C & C++ (prior to C11/C++11) do not explicitly mention threads or atomic operations - so operations that look like they could or should be atomic are not guaranteed by the language specification to be atomic.
It would admittedly be a pretty deviant compiler that managed a non-atomic read on an int: if you have an operation that reads a value, there's probably no need to guard it. However, it might be non-atomic if it spans a machine word boundary.
Operations as simple as m_counter++ involve a fetch, an increment, and a store - a race condition: another thread can change the value after the fetch but before the store - and hence it needs to be protected by a mutex - OR you can use your compiler's support for interlocked operations. MSVC has functions like _InterlockedIncrement() that will safely increment a memory location as long as all other writers similarly use interlocked APIs to update the memory location - which is orders of magnitude more lightweight than entering even a critical section.
GCC has intrinsic functions like __sync_add_and_fetch which can also be used to perform interlocked operations on machine-word values.
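The portable C++11 equivalent of those intrinsics is std::atomic's fetch_add / operator++, which compiles down to the same kind of lock-prefixed (x86) or LL/SC (ARM) sequence:

#include <atomic>

std::atomic<long> m_counter{0};

void increment() {
    ++m_counter;   // atomic read-modify-write, no mutex
    // equivalently: m_counter.fetch_add(1, std::memory_order_relaxed);
    // if you don't need the ordering that the default seq_cst provides
}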
If you're on a machine with more than one core, you need to do things properly even though writes of an integer are atomic. The issues are two-fold:
You need to stop the compiler from optimizing out the actual write! (Somewhat important this. ;-))
You need memory barriers (not things modeled in C) to make sure the other cores take notice of the fact that you've changed things. Otherwise you'll be tangled up in caches between all the processors and other dirty details like that.
If it was just the first thing, you'd be OK with marking the variable volatile, but the second is really the killer and you will only really see the difference on a multicore machine. Which happens to be an architecture that is becoming far more common than it used to be… Oops! Time to stop being sloppy; use the correct mutex (or synchronization or whatever) code for your platform and all the details of how to make memory work like you believe it to will go away.
In 99.99% of the cases, you must lock, even if it's access to seemingly atomic variables. Since the C++ compiler is not aware of multi-threading at the language level, it can do a lot of non-trivial reorderings.
Case in point: I was bitten by a spin lock implementation where unlock is simply assigning zero to a volatile integer variable. The compiler was reordering the unlock operation before the actual operation under the lock, unsurprisingly leading to mysterious crashes.
See:
Lock-Free Code: A False Sense of Security
Threads Cannot be Implemented as a Library
There's no support for atomic variables in (pre-C++11) C++, so you do need locking. Without locking you can only speculate about what exact instructions will be used for data manipulation and whether those instructions will guarantee atomic access - that's not how you develop reliable software.
Yes it would be better to use synchronization. Any data accessed by multiple threads must be synchronized.
If it is a Windows platform, you can also check here: Interlocked Variable Access.
Multithreading is hard and complex. The number of hard-to-diagnose problems that can come up is quite big. In particular, on Intel architectures reads and writes of aligned 32-bit integers are guaranteed to be atomic by the processor, but that does not mean it is safe to rely on that in multithreaded environments.
Without proper guards, the compiler and/or the processor can reorder the instructions in your block of code. It can cache variables in registers and they will not be visible in other threads...
Locking is expensive, and there are different implementations of lock-less data structures to optimize for high performance, but it is hard to do correctly. And the problem is that concurrency bugs are usually obscure and difficult to debug.
Yes. If you are on Windows you can take a look at Interlocked functions/variables and if you are of the Boost persuasion then you can look at their implementation of atomic variables.
If Boost is too heavyweight, putting "atomic c++" into your favourite search engine will give you plenty of food for thought.