Thread Synchronisation 101 - C++

Previously I've written some very simple multithreaded code, and I've always been aware that at any time there could be a context switch right in the middle of what I'm doing, so I've always guarded access to shared variables through a CCriticalSection class that enters the critical section on construction and leaves it on destruction. I know this is fairly aggressive, and I enter and leave critical sections quite frequently and sometimes egregiously (e.g. at the start of a function when I could put the CCriticalSection inside a tighter code block), but my code doesn't crash and it runs fast enough.
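For illustration, a minimal sketch of what such a guard might look like, assuming Win32 (the real CCriticalSection isn't shown in this post, so its exact shape is an assumption):

#include <windows.h>

// Illustrative RAII guard: enters the critical section on construction,
// leaves it on destruction (assumed shape of the CCriticalSection class).
class CCriticalSection
{
public:
    explicit CCriticalSection(CRITICAL_SECTION& crit) : m_crit(crit)
    {
        EnterCriticalSection(&m_crit);
    }
    ~CCriticalSection()
    {
        LeaveCriticalSection(&m_crit);
    }
private:
    CCriticalSection(const CCriticalSection&);            // non-copyable
    CCriticalSection& operator=(const CCriticalSection&); // (pre-C++11 style)
    CRITICAL_SECTION& m_crit;
};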
At work my multithreaded code needs to be tighter, only locking/synchronising at the lowest level needed.
At work I was trying to debug some multithreaded code, and I came across this:
EnterCriticalSection(&m_Crit4);
m_bSomeVariable = true;
LeaveCriticalSection(&m_Crit4);
Now, m_bSomeVariable is a Win32 BOOL (not volatile), which as far as I know is defined to be an int, and on x86 reading and writing these values is a single instruction. Since context switches occur on an instruction boundary, I reasoned there's no need to synchronise this operation with a critical section.
I did some more research online to see whether this operation did not need synchronisation, and I came up with two scenarios where it did:
The CPU implements out-of-order execution, or the second thread is running on a different core and the updated value is not written into RAM for the other core to see; and
The int is not 4-byte aligned.
I believe number 1 can be solved using the "volatile" keyword. In VS2005 and later the C++ compiler surrounds access to this variable using memory barriers, ensuring that the variable is always completely written/read to the main system memory before using it.
Number 2 I cannot verify; I don't know why the byte alignment would make a difference. I don't know the x86 instruction set, but does mov need to be given a 4-byte-aligned address? If not, do you need to use a combination of instructions? That would introduce the problem.
So...
QUESTION 1: Does using the "volatile" keyword (implicitly using memory barriers and hinting to the compiler not to optimise this code) absolve a programmer of the need to synchronise a 4-byte/8-byte variable on x86/x64 between read/write operations?
QUESTION 2: Is there an explicit requirement that the variable be 4-byte/8-byte aligned?
I did some more digging into our code and the variables defined in the class:
class CExample
{
private:
    CRITICAL_SECTION m_Crit1; // Protects variable a
    CRITICAL_SECTION m_Crit2; // Protects variable b
    CRITICAL_SECTION m_Crit3; // Protects variable c
    CRITICAL_SECTION m_Crit4; // Protects variable d
    // ...
};
Now, to me this seems excessive. I thought critical sections synchronised threads within a process, so if you've got one you can enter it and no other thread in that process can execute. If so, there is no need for a critical section for each variable you want to protect: if you're in a critical section then nothing else can interrupt you.
I think the only thing that can change the variables from outside a critical section is if the process shares a memory page with another process (can you do that?) and the other process starts to change the values. Mutexes would also help here: named mutexes are shared across processes, or is it only processes of the same name?
QUESTION 3: Is my analysis of critical sections correct, and should this code be rewritten to use mutexes? I have had a look at other synchronisation objects (semaphores and spinlocks); are they better suited here?
QUESTION 4: Where are critical sections/mutexes/semaphores/spinlocks best suited? That is, which synchronisation problem should each be applied to? Is there a vast performance penalty for choosing one over the other?
And while we're on it, I read that spinlocks should not be used in a single-core multithreaded environment, only a multi-core multithreaded environment. So, QUESTION 5: Is this wrong, or if not, why is it right?
Thanks in advance for any responses :)

1) No. volatile just says to re-load the value from memory each time; it is STILL possible for it to be half updated.
2) Windows provides some atomic functions. Look up the "Interlocked" functions.
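For example, the flag from the question could be handled this way; a sketch (assuming the flag is changed to a LONG, since the Interlocked functions operate on LONGs):

#include <windows.h>

LONG m_lSomeVariable = 0; // illustrative stand-in for the BOOL member

void SetFlag()
{
    InterlockedExchange(&m_lSomeVariable, TRUE); // atomic write with a full barrier
}

LONG GetFlag()
{
    // A no-op compare-exchange is one way to get an atomic read with a barrier.
    return InterlockedCompareExchange(&m_lSomeVariable, 0, 0);
}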
Edit: The comments led me to do a bit more reading. If you read through the Intel System Programming Guide, you can see that aligned reads and writes ARE atomic:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Accesses to cacheable memory that are split across bus widths, cache lines, and
page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel
Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and
Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M,
Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that
permit external memory subsystems to make split accesses atomic; however,
nonaligned data accesses will seriously impact the performance of the processor and
should be avoided.
An x87 instruction or an SSE instruction that accesses data larger than a quadword
may be implemented using multiple memory accesses. If such an instruction stores
to memory, some of the accesses may complete (writing to memory) while another
causes the operation to fault for architectural reasons (e.g. due to a page-table entry
that is marked “not present”). In this case, the effects of the completed accesses
may be visible to software even though the overall instruction caused a fault. If TLB
invalidation has been delayed (see Section 4.10.3.4), such page faults may occur
even if all accesses are to the same page.
So basically, yes: if you do an 8-bit read/write from any address, a 16-bit read/write from a 16-bit-aligned address, etc. etc., you ARE getting atomic operations. It's also interesting to note that you can do unaligned memory reads/writes within a cache line on a modern machine. The rules seem quite complex though, so I wouldn't rely on them if I were you. Cheers to the commenters; that's a good learning experience for me, that one :)
3) A critical section will spin on its lock a few times and then fall back to waiting on a kernel object, like a mutex. Spin locking can suck CPU power doing nothing, and a mutex can take a while to do its stuff. Critical sections are a good choice if you can't use the interlocked functions.
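For what it's worth, Win32 lets you control that spinning explicitly; a sketch (the count of 4000 is just an illustrative value):

#include <windows.h>

CRITICAL_SECTION g_cs;

void InitLock()
{
    // Spin up to 4000 iterations before falling back to a kernel wait.
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}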
4) There are performance penalties for choosing one over another. It's a pretty big ask to go through the benefits of everything here. The MSDN help has lots of good information on each of these. I suggest reading them.
Semaphores
Critical Sections & Spin locks
Events
Mutexes
5) You can use a spin lock in a single-core environment, but it's not usually sensible, as on a single core you can't have 2 processors accessing the same data simultaneously. It just isn't possible.

1: Volatile in itself is practically useless for multithreading. It guarantees that the read/write will be executed, rather than storing the value in a register, and it guarantees that the read/write won't be reordered with respect to other volatile reads/writes. But it may still be reordered with respect to non-volatile ones, which is basically 99.9% of your code. Microsoft have redefined volatile to also wrap all accesses in memory barriers, but that is not guaranteed to be the case in general. It will just silently break on any compiler which defines volatile as the standard does. (The code will compile and run, it just won't be thread-safe any longer)
Apart from that, reads/writes to integer-sized objects are atomic on x86 as long as the object is well aligned. (You have no guarantee of when the write will occur though. The compiler and CPU may reorder it, so it's atomic, but not thread-safe)
2: Yes, the object has to be aligned for the read/write to be atomic.
3: Not really. Only one thread can execute code inside a given critical section at a time. Other threads can still execute other code. So you can have four variables each protected by a different critical section. If they all shared the same critical section, I'd be unable to manipulate object 1 while you're manipulating object 2, which is inefficient and constrains parallelism more than necessary. If they are protected by different critical sections, we just can't both manipulate the same object simultaneously.
4: Spinlocks are rarely a good idea. They are useful if you expect a thread to have to wait only a very short time before being able to acquire the lock, and you absolutely need minimal latency. A spinlock avoids the OS context switch, which is a relatively slow operation. Instead, the thread just sits in a loop constantly polling a variable. So higher CPU usage (the core isn't freed up to run another thread while waiting for the spinlock), but the thread will be able to continue as soon as the lock is released.
As for the others, the performance characteristics are pretty much the same: just use whichever has the semantics best suited for your needs. Typically critical sections are most convenient for protecting shared variables, and mutexes can be easily used to set a "flag" allowing other threads to proceed.
As for not using spinlocks in a single-core environment, remember that the spinlock doesn't actually yield. Thread A waiting on a spinlock isn't actually put on hold allowing the OS to schedule thread B to run. But since A is waiting on this spinlock, some other thread is going to have to release that lock. If you only have a single core, then that other thread will only be able to run when A is switched out. With a sane OS, that's going to happen sooner or later anyway as part of the regular context switching. But since we know that A won't be able to get the lock until B has had time to execute and release the lock, we'd be better off if A just yielded immediately, was put in a wait queue by the OS, and restarted when B has released the lock. And that's what all other lock types do.
A spinlock will still work in a single-core environment (assuming an OS with preemptive multitasking); it'll just be very, very inefficient.
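To make the spinning concrete, here's a naive spinlock sketch built on the Interlocked functions; illustrative only, not production code:

#include <windows.h>

volatile LONG g_lock = 0; // 0 = free, 1 = held (illustrative)

void SpinLockAcquire()
{
    // Loop until we atomically swap 0 -> 1; burns CPU the whole time.
    while (InterlockedCompareExchange(&g_lock, 1, 0) != 0)
    {
        YieldProcessor(); // CPU pause hint; never yields to the scheduler
    }
}

void SpinLockRelease()
{
    InterlockedExchange(&g_lock, 0); // release with a full barrier
}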

Q1: Using the "volatile" keyword
In VS2005 and later the C++ compiler surrounds access to this variable using memory barriers, ensuring that the variable is always completely written/read to the main system memory before using it.
Exactly. If you are not creating portable code, Visual Studio implements it exactly this way. If you want to be portable, your options are currently "limited". Until C++0x there is no portable way to specify atomic operations with guaranteed read/write ordering, and you need to implement per-platform solutions. That said, boost has already done the dirty job for you, and you can use its atomic primitives.
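A sketch of what that looks like (std::atomic as specified for C++0x; boost's atomic primitives expose essentially the same interface):

#include <atomic>

std::atomic<bool> g_ready(false);

// Writer thread: publish with release semantics.
void Publish() { g_ready.store(true, std::memory_order_release); }

// Reader thread: the acquire load pairs with the release store above.
bool IsReady() { return g_ready.load(std::memory_order_acquire); }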
Q2: Variable needs to be 4-byte/8-byte aligned?
If you do keep them aligned, you are safe. If you do not, the rules are complicated (cache lines, ...), so the safest way is to keep them aligned, as this is easy to achieve.
Q3: Should this code be rewritten to use mutexes?
A critical section is a lightweight mutex. Unless you need to synchronize between processes, use critical sections.
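If you do need to go cross-process, a named Win32 mutex is the tool; a sketch (the mutex name is illustrative):

#include <windows.h>

void TouchSharedResource()
{
    // Any process that opens the same name shares this lock.
    HANDLE hMutex = CreateMutexW(NULL, FALSE, L"Global\\MyAppSharedLock");
    WaitForSingleObject(hMutex, INFINITE); // acquire
    // ... touch the cross-process resource ...
    ReleaseMutex(hMutex);                  // release
    CloseHandle(hMutex);
}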
Q4: Where are critical sections/mutexes/semaphores/spinlocks best suited?
Critical sections can even do spin waits for you.
Q5: Spinlocks should not be used in a single-core
A spin lock uses the fact that while the waiting CPU is spinning, another CPU may release the lock. This cannot happen with one CPU only, therefore it is only a waste of time there. On multi-CPU systems spin locks can be a good idea, but it depends on how often the spin wait will be successful. The idea is that waiting for a short while is a lot faster than doing a context switch there and back again; therefore, if the wait is likely to be short, it is better to wait.

Don't use volatile. It has virtually nothing to do with thread-safety. See here for the low-down.
The assignment to BOOL doesn't need any synchronisation primitives. It'll work fine without any special effort on your part.
If you want to set the variable and then make sure that another thread sees the new value, you need to establish some kind of communication between the two threads. Just locking immediately before assigning achieves nothing because the other thread might have come and gone before you acquired the lock.
One final word of caution: threading is extremely hard to get right. The most experienced programmers tend to be the least comfortable with using threads, which should set alarm bells ringing for anyone who is inexperienced with their use. I strongly suggest you use some higher-level primitives to implement concurrency in your app. Passing immutable data structures via synchronised queues is one approach that substantially reduces the danger.
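A minimal sketch of such a synchronised queue, assuming C++11 (the names are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class SyncQueue
{
public:
    void Push(T value)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(value);
        }
        m_cond.notify_one(); // wake one waiting consumer
    }

    T Pop() // blocks until an item is available
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_queue.empty(); });
        T value = m_queue.front();
        m_queue.pop();
        return value;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<T> m_queue;
};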

Volatile does not imply memory barriers.
It only means that it will be part of the perceived state of the memory model. The implication of this is that the compiler cannot optimize the variable away, nor can it perform operations on the variable only in CPU registers (it will actually load and store to memory).
As there are no memory barriers implied, the compiler can reorder instructions at will. The only guarantee is that the order in which different volatile variables are read/written will be the same as in the code:
void test()
{
    volatile int a;
    volatile int b;
    int c;

    c = 1;
    a = 5;
    b = 3;
}
With the code above (assuming that c is not optimized away), the update to c can happen before the updates to a and b, between them, or after them, giving 3 possible outcomes. The a and b updates are guaranteed to be performed in order. c can be optimized away easily by any compiler. With enough information, the compiler can even optimize away a and b, if it can be proven that no other threads read the variables and that they are not bound to a hardware address; in that case they can in fact be removed. Note that the standard does not require a specific behavior, but rather a perceivable state under the as-if rule.

Question 3: CRITICAL_SECTIONs and mutexes work, pretty much, the same way. A Win32 mutex is a kernel object, so it can be shared between processes, and waited on with WaitForMultipleObjects, which you can't do with a CRITICAL_SECTION. On the other hand, a CRITICAL_SECTION is lighter-weight and therefore faster. But the logic of the code should be unaffected by which you use.
You also commented that "there is no need for a critical section for each variable you want to protect, if you're in a critical section then nothing else can interrupt you." This is true, but the tradeoff is that accessing any of the variables would require you to hold that lock. If the variables can meaningfully be updated independently, you are losing an opportunity for parallelising those operations. (Since these are members of the same object, though, I would think hard before concluding that they can really be accessed independently of each other.)

Related

How to directly read the value of the `std::atomic_int64_t` without atomic operation?

I have an std::atomic_int64_t that can be read by multiple threads but written by only one thread. In the one thread that writes the atomic, I want to read it directly without any atomic-related instruction since there won't be concurrent writing. How should I do that in C++?
It's hard to tell for sure without knowing your use case if what you're trying to do is reasonable, but there's about a 99.95% chance that it's a bad idea. The reason for this is not obvious, so let me have a go.
Complex runtime environments
Atomics, despite the name, are not just about atomic access to a variable, they're about ordering of effects. To understand what this means, we have to understand a little bit about two things:
modern CPUs, caches, and memory
compiler optimizations
For point 1, consider that a modern CPU consists mostly of memory management. Very little silicon is actually devoted to calculating; most of it is concerned with keeping the calculation units fed with data. When you store a value into memory, chances are it's not going to show up in main memory immediately; instead it'll go to the active CPU core's store buffer, which will be flushed out at some point in the future, at which point it may become visible to the other CPU core on which your other thread runs. In the name of performance, we have turned our CPUs into highly asynchronous beasts.
For point 2, consider that your compiler will take apart the code you give it, analyze it for data dependencies, and reorder your instructions in such a way that they'll have the same results but run faster. Consider that the compiler can only do this for the code of one thread at a time. It cannot know that another thread depends on what the first one is doing being done in a particular order (or indeed at all), and so it'll run roughshod over the assumptions of that other, unknown thread. Another thread will change a variable, so that'll need to be re-read from time to time? Well, the compiler doesn't know that and has the value in a register, so it'll generate an endless loop and call it a day. This sort of behaviour needs to be inhibited when you have multiple threads.
Atomics are about synchronization
The main point of atomics is synchronization. An atomic write ensures that store buffers are flushed and things become visible in main memory, and prevents the compiler from reordering instructions across it (possibly one-way, depending on the precise barrier used). Similarly, an atomic read ensures that the values from main memory become visible in cache and prevents the compiler from reordering across it.
So if your reader threads don't use atomic reads, we have a situation: the writer thread does a proper atomic store. This ensures that the compiler does not reorder operations in the writer thread across the atomic boundary, and the generated code ensures that the CPU core's store buffer is flushed. If the reader threads read the atomic variable with an atomic read and see the new value, they'll also see everything else the writer thread was instructed to do up to that point.
So, if the reader does not use an atomic read to read the atomic variable, what could go wrong? Well, basically two things:
the CPU core might not see the need to update its cache; and
the compiler could reorder operations across the non-atomic read, or not see a need to re-read a value from memory that the optimizer thought it already knew and had in a register.
In effect, what this means is that the reader thread might work under the assumption that things the writer thread did before the write have already happened, but it ends up "not seeing" those new data. Hilarity will (almost) inevitably ensue.
TL;DR
Atomics are about synchronization. You have multiple threads that you need to synchronize. Use atomic reads in the reader thread, or you're not synchronizing.
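If the goal is merely to avoid paying for synchronisation on the writer's own reads, the sanctioned way to express that is a relaxed atomic load rather than a plain read; a sketch:

#include <atomic>
#include <cstdint>

std::atomic<std::int64_t> g_counter(0);

// Writer thread: publish normally.
void Publish(std::int64_t v) { g_counter.store(v, std::memory_order_release); }

// Writer thread reading its own variable: still an atomic operation,
// but a relaxed load compiles to a plain load on x86/x64 anyway.
std::int64_t ReadOwn() { return g_counter.load(std::memory_order_relaxed); }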

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e. It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
The consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also insert the relevant memory barriers.
Mutexes work the same way: the mutex implementations (eg. pthreads or WinAPI or what not) internally also insert memory barriers.
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And like syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and also force the write into memory (or at least into cache), when for example a variable is held in a register because of compiler optimizations.
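Putting that together, this is why a plain int protected by a mutex is safe; a sketch using C++11 names:

#include <mutex>

std::mutex g_mutex;
int g_shared = 0; // plain int; the mutex supplies the barriers

void Writer()
{
    std::lock_guard<std::mutex> lock(g_mutex);
    g_shared = 42; // the release barrier on unlock makes this visible
}

int Reader()
{
    std::lock_guard<std::mutex> lock(g_mutex);
    return g_shared; // the acquire barrier on lock sees the latest write
}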

Do I need to use locking with integers in c++ threads

If I am accessing a single integer type (e.g. long, int, bool, etc...) in multiple threads, do I need to use a synchronisation mechanism such as a mutex to lock them? My understanding is that as atomic types, I don't need to lock access, but I see a lot of code out there that does use locking. Profiling such code shows that there is a significant performance hit for using locks, so I'd rather not. So if the item I'm accessing corresponds to a bus-width integer (e.g. 4 bytes on a 32-bit processor), do I need to lock access to it when it is being used across multiple threads? Put another way, if thread A is writing to integer variable X at the same time as thread B is reading from the same variable, is it possible that thread B could end up with a few bytes of the previous value mixed in with a few bytes of the value being written? Is this architecture-dependent, e.g. OK for 4-byte integers on 32-bit systems but unsafe for 8-byte integers on 64-bit systems?
Edit: Just saw this related post which helps a fair bit.
You are never locking a value - you are locking an operation ON a value.
C & C++ do not explicitly mention threads or atomic operations, so operations that look like they could or should be atomic are not guaranteed by the language specification to be atomic.
It would admittedly be a pretty deviant compiler that managed a non-atomic read on an int: if you have an operation that reads a value, there's probably no need to guard it. However, it might be non-atomic if it spans a machine word boundary.
An operation as simple as m_counter++ involves a fetch, an increment, and a store - a race condition: another thread can change the value after the fetch but before the store. It hence needs to be protected by a mutex - OR by your compiler's support for interlocked operations. MSVC has functions like _InterlockedIncrement() that will safely increment a memory location as long as all other writers similarly use interlocked APIs to update it - which is orders of magnitude more lightweight than even a critical section.
GCC has intrinsic functions like __sync_add_and_fetch which also can be used to perform interlocked operations on machine word values.
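A sketch combining the two, so the same helper works under both compilers (illustrative):

#ifdef _MSC_VER
#include <intrin.h>
#endif

// Atomically increments *p and returns the new value.
long AtomicIncrement(volatile long* p)
{
#ifdef _MSC_VER
    return _InterlockedIncrement(p);   // MSVC intrinsic
#else
    return __sync_add_and_fetch(p, 1); // GCC builtin
#endif
}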
If you're on a machine with more than one core, you need to do things properly even though writes of an integer are atomic. The issues are two-fold:
You need to stop the compiler from optimizing out the actual write! (Somewhat important this. ;-))
You need memory barriers (not things modeled in C) to make sure the other cores take notice of the fact that you've changed things. Otherwise you'll be tangled up in caches between all the processors and other dirty details like that.
If it was just the first thing, you'd be OK with marking the variable volatile, but the second is really the killer and you will only really see the difference on a multicore machine. Which happens to be an architecture that is becoming far more common than it used to be… Oops! Time to stop being sloppy; use the correct mutex (or synchronization or whatever) code for your platform and all the details of how to make memory work like you believe it to will go away.
In 99.99% of the cases, you must lock, even if it's access to seemingly atomic variables. Since the C++ compiler is not aware of multi-threading at the language level, it can do a lot of non-trivial reorderings.
Case in point: I was bitten by a spin lock implementation where unlock was simply assigning zero to a volatile integer variable. The compiler was reordering the unlock operation to before the actual operation under the lock, unsurprisingly leading to mysterious crashes.
See:
Lock-Free Code: A False Sense of Security
Threads Cannot be Implemented as a Library
There's no support for atomic variables in C++, so you do need locking. Without locking you can only speculate about what exact instructions will be used for data manipulation and whether those instructions will guarantee atomic access - that's not how you develop reliable software.
Yes, it would be better to use synchronization. Any data accessed by multiple threads must be synchronized.
If it is the Windows platform, you can also check here: Interlocked Variable Access.
Multithreading is hard and complex. The number of hard-to-diagnose problems that can come up is quite big. In particular, on Intel architectures reads and writes of aligned 32-bit integers are guaranteed to be atomic in the processor, but that does not mean that it is safe to rely on this in multithreaded environments.
Without proper guards, the compiler and/or the processor can reorder the instructions in your block of code. It can cache variables in registers and they will not be visible in other threads...
Locking is expensive, and there are different implementations of lock-less data structures to optimize for high performance, but it is hard to do correctly. And the problem is that concurrency bugs are usually obscure and difficult to debug.
Yes. If you are on Windows you can take a look at Interlocked functions/variables, and if you are of the Boost persuasion then you can look at their implementation of atomic variables.
If boost is too heavyweight, putting "atomic c++" into your favourite search engine will give you plenty of food for thought.

Overhead of a Memory Barrier / Fence

I'm currently writing C++ code and use a lot of memory barriers / fences in my code. I know that an MB tells the compiler and the hardware not to reorder writes/reads around it. But I don't know how complex this operation is for the processor at runtime.
My question is: what is the runtime overhead of such a barrier? I didn't find any useful answer with Google...
Is the overhead negligible? Or does heavy usage of MBs lead to serious performance problems?
Best regards.
Compared to arithmetic and "normal" instructions I understand these to be very costly, but I do not have numbers to back up that statement. I like jalf's answer describing the effects of the instructions, and would like to add a bit.
There are in general a few different types of barriers, so understanding the differences could be helpful. A barrier like the one jalf mentioned is required, for example, in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in-progress memory operations can be executed.
Another type of barrier is the sort that you'd use in a mutex implementation when acquiring a lock (examples: isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:
if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
    foo( pSharedMem->somethingElse );
}
Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flag bit is complete.
There is a third type of barrier, generally less used, which is required to enforce store-load ordering. Examples of instructions that enforce such an ordering are sync on ppc (heavyweight sync), mf on ia64, and membar #StoreLoad on sparc (required even for TSO).
Using ia64-like pseudocode to illustrate, suppose one had:
    st4.rel
    ld4.acq
without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores, if non-dependent?) could sneak in, completing earlier, since nothing prevents that otherwise.
Because mutex implementations very likely only use acquire and release barriers, I'd expect that an observable effect of this is that memory access following lock release may actually sometimes occur while "still in the critical section".
Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).
Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense that work would have to be done without the barrier as well; it could just be hidden by executing some other instructions simultaneously, so you didn't feel the cost so much.
It also limits the CPU's (and compilers) ability to reschedule instructions, so there may be an indirect cost as well in that nearby instructions can't be interleaved which might otherwise yield a more efficient execution schedule.

Is it safe to read an integer variable that's being concurrently modified without locking?

Suppose that I have an integer variable in a class, and this variable may be concurrently modified by other threads. Writes are protected by a mutex. Do I need to protect reads too? I've heard that there are some hardware architectures on which, if one thread modifies a variable, and another thread reads it, then the read result will be garbage; in this case I do need to protect reads. I've never seen such architectures though.
This question assumes that a single transaction only consists of updating a single integer variable so I'm not worried about the states of any other variables that might also be involved in a transaction.
atomic read
As said before, it's platform-dependent. On x86, the value must be aligned on a 4-byte boundary. Generally, for most platforms, the read must execute in a single CPU instruction.
optimizer caching
The optimizer doesn't know you are reading a value modified by a different thread. Declaring the value volatile helps with that: the optimizer will issue a memory read / write for every access, instead of trying to keep the value cached in a register.
CPU cache
Still, you might read a stale value, since on modern architectures you have multiple cores with individual cache that is not kept in sync automatically. You need a read memory barrier, usually a platform-specific instruction.
On Wintel, thread synchronization functions will automatically add a full memory barrier, or you can use the InterlockedXxxx functions.
MSDN: Memory and Synchronization issues, MemoryBarrier Macro
[edit] please also see drhirsch's comments.
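A sketch of the combination described above (Win32; illustrative):

#include <windows.h>

volatile LONG g_value = 0; // volatile: the optimizer won't cache it in a register

LONG ReadShared()
{
    MemoryBarrier(); // full barrier so the read isn't satisfied with stale data
    return g_value;
}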
You ask a question about reading a variable and later you talk about updating a variable, which implies a read-modify-write operation.
Assuming you really mean the former, the read is safe if it is an atomic operation. For almost all architectures this is true for integers.
There are a few (and rare) exceptions:
The read is misaligned, for example accessing a 4-byte int at an odd address. Usually you need to force the compiler with special attributes to get such misalignment.
The size of an int is bigger than the natural size of instructions, for example using 16-bit ints on an 8-bit architecture.
Some architectures have an artificially limited bus width. I only know of very old and outdated ones, like a 386sx or a 68008.
I'd recommend not relying on any particular compiler or architecture in this case.
Whenever you have a mix of readers and writers (as opposed to just readers or just writers) you'd better sync them all. Imagine your code running someone's artificial heart: you don't really want it to read wrong values, and you surely don't want a power plant in your city to go 'boooom' because someone decided not to use that mutex. Save yourself some sleepless nights in the long run; sync 'em.
If you only have one thread reading, you're good to go with just that one mutex; however, if you're planning for multiple readers and multiple writers you'd need a sophisticated piece of code to sync that. A nice implementation of a read/write lock that is also 'fair' is yet to be seen by me.
Imagine that you're reading the variable in one thread, that thread gets interrupted while reading and the variable is changed by a writing thread. Now what is the value of the read integer after the reading thread resumes?
Unless reading the variable is an atomic operation (in this case, one that takes only a single assembly instruction), you cannot ensure that the above situation cannot happen.
(The variable could be written to memory, and retrieving the value would take more than one instruction)
The consensus is that you should encapsulate/lock all writes individually, while reads can be executed concurrently with (only) other reads.
Suppose that I have an integer variable in a class, and this variable may be concurrently modified by other threads. Writes are protected by a mutex. Do I need to protect reads too? I've heard that there are some hardware architectures on which, if one thread modifies a variable, and another thread reads it, then the read result will be garbage; in this case I do need to protect reads. I've never seen such architectures though.
In the general case, that is potentially every architecture. Every architecture has cases where reading concurrently with a write will result in garbage.
However, almost every architecture also has exceptions to this rule.
It is common that word-sized variables are read and written atomically, so synchronization is not needed when reading or writing. The proper value will be written atomically as a single operation, and threads will read the current value as a single atomic operation as well, even if another thread is writing. So for integers, you're safe on most architectures. Some will extend this guarantee to a few other sizes as well, but that's obviously hardware-dependent.
For non-word-sized variables both reading and writing will typically be non-atomic, and will have to be synchronized by other means.
If you don't use the previous value of this variable when writing a new one, then:
You can read and write an integer variable without using a mutex. This is because an integer is a native type on a 32-bit architecture, and every read/write of the value is done in one operation.
But if you are doing something such as an increment:
myvar++;
then you need to use a mutex, because this construct expands to myvar = myvar + 1, and between the read of myvar and the write of the incremented value, myvar can be modified. In that case you will get a bad value.
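A sketch of the guarded increment, using C++11 names for brevity:

#include <mutex>

std::mutex g_mutex;
int myvar = 0;

void SafeIncrement()
{
    std::lock_guard<std::mutex> lock(g_mutex);
    myvar++; // the whole read-modify-write happens under the lock
}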
While it would probably be safe to read ints on 32-bit systems without synchronization, I would not risk it. Multiple concurrent reads are not a problem, but I do not like writes to happen at the same time as reads.
I would recommend placing the reads in the critical section too, and then stress-testing your application on multiple cores to see if this is causing too much contention. Finding concurrency bugs is a nightmare I prefer to avoid. What happens if in the future someone decides to change the int to a long long or a double, so it can hold larger numbers?
If you have a nice thread library like boost.thread or zthread then you should have read/writer locks. These would be ideal for your situation as they allow multiple reads while protecting writes.
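A sketch with boost.thread's read/write lock (later C++ offers std::shared_mutex with the same idea):

#include <boost/thread.hpp>

boost::shared_mutex g_rwlock;
int g_value = 0;

int ReadValue()
{
    boost::shared_lock<boost::shared_mutex> lock(g_rwlock); // many readers at once
    return g_value;
}

void WriteValue(int v)
{
    boost::unique_lock<boost::shared_mutex> lock(g_rwlock); // exclusive writer
    g_value = v;
}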
This may happen on 8-bit systems which use 16-bit integers.
If you want to avoid locking you can, under suitable circumstances, get away with reading multiple times until you get two equal consecutive values. For example, I've used this approach to read the 64-bit clock on a 32-bit embedded target, where the clock tick was implemented as an interrupt routine. In that case reading three times suffices, because the clock can only tick once in the short time the reading routine runs.
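A sketch of that technique, using the high/low words of a 64-bit tick count as the example (illustrative; it assumes the counter can tick at most once while the loop runs):

#include <stdint.h>

volatile uint32_t g_ticks_hi = 0, g_ticks_lo = 0; // updated by the tick interrupt

uint64_t ReadClock()
{
    uint32_t hi1, lo, hi2;
    do {
        hi1 = g_ticks_hi;
        lo  = g_ticks_lo;
        hi2 = g_ticks_hi;
    } while (hi1 != hi2); // retry if the high word changed mid-read
    return ((uint64_t)hi1 << 32) | lo;
}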
In general, each machine instruction goes through several hardware stages when executing. As most current CPUs are multi-core or hyper-threaded, that means that reading a variable may start it moving through the instruction pipeline, but it doesn't stop another CPU core or hyper-thread from concurrently executing a store instruction to the same address. The two concurrently executing instructions, read and store, might "cross paths", meaning that the read will receive the old value just before the new value is stored.
To sum up: you do need the mutex for both reads and writes.
Both reads of and writes to variables accessed concurrently must be protected by a critical section (not a mutex), unless you want to waste your whole day debugging.
Critical sections are platform-specific, I believe. On Win32, a critical section is very efficient: when no contention occurs, entering a critical section is almost free and does not affect overall performance. When contention occurs, it is still more efficient than a mutex, because it implements a series of checks before suspending the thread.
Depends on your platform. Most modern platforms offer atomic operations for integers: Windows has InterlockedIncrement, InterlockedDecrement, InterlockedCompareExchange, etc. These operations are usually supported by the underlying hardware (read: the CPU) and they are usually cheaper than using a critical section or other synchronization mechanisms.
See MSDN: InterlockedCompareExchange
I believe Linux (and modern Unix variants) support similar operations in the pthreads package but I don't claim to be an expert there.
Marking a variable with the volatile keyword stops the compiler from caching it in a register (and on MSVC adds ordering guarantees), but it does not by itself make accesses atomic, and it has many, many other implications in terms of what the compiler does and how it behaves, so it shouldn't just be used for this purpose.
Read up on what volatile does before you blindly start using it: http://msdn.microsoft.com/en-us/library/12a04hfd(VS.80).aspx