Multithreading: do I need protect my variable in read-only method? - c++

I have few questions about using lock to protect my shared data structure. I am using C/C++/ObjC/Objc++
For example I have a counter class that used in multi-thread environment
class MyCounter {
private:
int counter;
std::mutex m;
public:
int getCount() const {
return counter;
}
void increase() {
std::lock_guard<std::mutex> lk(m);
counter++;
}
};
Do I need to use std::lock_guard<std::mutex> lk(m); in getCount() method to make it thread-safe?
What happen if there is only two threads: a reader thread and a writer thread then do I have to protect it at all? Because there is only one thread is modifying the variable so I think no lost update will happen.
If there are multiple writer/reader for a shared primitive type variable (e.g. int) what disaster may happen if I only lock in write method but not read method? Will 8bits type make any difference compare to 64bits type?
Is any primitive type are atomic by default? For example write to a char is always atomic? (I know this is true in Java but don't know about c++ and I am using llvm compiler on Mac if platform matters)

Yes, unless you can guarantee that changes to the underlying variable counter are atomic, you need the mutex.
Classic example, say counter is a two-byte value that's incremented in (non-atomic) stages:
(a) add 1 to lower byte
if lower byte is 0:
(b) add 1 to upper byte
and the initial value is 255.
If another thread comes in anywhere between the lower byte change a and the upper byte change b, it will read 0 rather than the correct 255 (pre-increment) or 256 (post-increment).
In terms of what data types are atomic, the latest C++ standard defines them in the <atomic> header.
If you don't have C++11 capabilities, then it's down to the implementation what types are atomic.

Yes, you would need to lock the read as well in this case.
There are several alternatives -- a lock is quite heavy here. Atomic operations are the most obvious (lock-free). There are also other approaches to locking in this design -- the read write lock is one example.

Yes, I believe that you do need to lock the read as well. But since you are using C++11 features, why don't you use std::atomic<int> counter; instead?

As a rule of thumb, you should lock the read too.
Read and write to int is atomic on most architecture (and since int is guaranted to be the machine's word size, you should almost never experience corrupted int)
Yet, the answer from #paxdiablo is correct, and will happen if you have someone doing this:
#pragma pack(push, 1)
struct MyObj
{
char a;
MyCounter cnt;
};
#pragma pack(pop)
In that specific case, cnt will not be aligned to a word boundary, and the int MyCounter::counter will/might be emulated in multiple operations in CPU supporting unaligned access (like x86). Thus, you could get this sequence of operations:
Thread A: [...] set counter to 255 (counter is 0x000000FF)
getCount() => CPU reads low byte: lo:255
<interrupted here>
Thread B: increase() => counter is incremented, leading to counter = 256 = 0x00000100)
<interrupted here>
Thread A: CPU read high bytes: 0x000001, concatenate: 0x000001FF, returns 511 !
Now, let's say you never use unaligned access. Yet, if you are doing something like this:
ThreadA.cpp:
int g = clientCounter.getCount();
while (g > 0)
{
processFirstClient();
g = clientCounter.getCount();
}
ThreadB.cpp:
if (acceptClient()) clientCounter.increase();
The compiler is completely allowed to replace the loop in Thread A by this:
if (clientCounter.getCount())
while(true) processFirstClient();
Why ? That's because for each instruction, the compiler will evaluate side-effects of such expression. The getCount() is so simple that the compiler will deduce: it's a read of a single variable, and it's not modified anywhere in ThreadA.cpp, thus, it's constant. Because it's constant, let's simplify this.
If you add a mutex, the mutex code will insert a memory barrier telling the compiler "hey, don't expect anything after this barrier is crossed".
Thus, the "optimization" above can not happen since getCount might have been modified.
Sure, you could have written volatile int counter instead of counter, and the compiler would have avoided this optimization too.
In the end, if you have to write a ton of code just to avoid a mutex, you're doing it wrong (and probably will get wrong results).

You cant gaurantee that multiple threads wont modify your variable at the same time. and if such a situation occurs your variable will be garbled or program might crash. In order to avoid such cases its always better and safer to make the program thread safe.
You can use the synchronization techinques available like: Mutex, Lock, Synchronization attribute(available for MS c++)

Related

Why does this cppreference excerpt seem to wrongly suggest that atomics can protect critical sections?

int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto job = [&] {
int asdf = bar.load();
// std::lock_guard lg(mx);
foo.emplace_back(1);
bar.store(foo.size());
};
std::thread t1(job);
std::thread t2(job);
t1.join();
t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That is, once
the atomic load is completed, thread B is guaranteed to see everything
thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So does a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault and would surely do so if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomic would synchronize in a way.
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.
The quoted passage means that when B loads the value that A stored, then by observing that the store happened, it can also be assured that everything that B did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto jobA = [&] {
foo.emplace_back(1);
bar.store(foo.size());
};
auto jobB = [&] {
while (bar.load() == 0) /* spin */ ;
foo.emplace_back(1);
};
std::thread t1(jobA);
std::thread t2(jobB);
t1.join();
t2.join();
}
Setting aside the elephant in the room that none of the C++ containers are thread safe without employing locking of some sort (so forget about using emplace_back without implementing locking), and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will, see the atomic object's instantly flip to its new value. When? Eventually. That's all that you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads will still have the old value cached, in some form or fashion, and will continue to see the atomic object's old value, and won't "see" the atomic object's new value until some intermediate time passes (if ever).
Sequencing are rules that specify when objects' new values are visible in other execution threads. The simplest way to get both atomicity and easy to deal with sequencing, in one fell swoop, is to use mutexes and condition variables which handle all the hard details for you. You can still use atomics and with a careful logic use lock/release fence instructions to implement proper sequencing. But it's very easy to get it wrong, and the worst of it you won't know that it's wrong until your code starts going off the rails due to improper sequencing and it'll be nearly impossible to accurately reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks mutexes and condition variables is the most simplest solution to proper inter-thread sequencing.
The idea is that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification
Yes all writes that done by foo.emplace_back(1); would be guaranteed when bar.store(foo.size()); is executed. But who guaranteed you that foo.emplace_back(1); from thread 1 would see any/all non partial consistent state from foo.emplace_back(1); executed in thread 2 and vice versa? They both read and modify internal state of std::vector and there is no memory barrier before code reaches atomic store. And even if all variables would be read/modified atomically std::vector state consists of multiple variables - size, capacity, pointer to the data at least. Changes to all of them must be synchronized as well and memory barrier is not enough for that.
To explain little more let's create simplified example:
int a = 0;
int b = 0;
std::atomic<int> at;
// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);
// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have 2 problems:
There is no guarantee that when tmp2 value is 2
tmp1 value would be 1
as you read a and b before atomic operation.
There is no guarantee that when at.store(b)
is executed that either a == b == 0 or a == 1 and b == 2,
it could be a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();
// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0 or tmp1 == 1 and tmp2 == 2, do you see the difference?

Implementing 64 bit atomic counter with 32 bit atomics

I would like to cobble together a uint64 atomic counter from atomic uint32s. The counter has a single writer and multiple readers. The writer is a signal handler so it must not block.
My idea is to use a generation count with the low bit as a read lock. The reader retries until the generation count is stable across the read, and the low bit is unset.
Is the following code correct in design and use of memory ordering? Is there a better way?
using namespace std;
class counter {
atomic<uint32_t> lo_{};
atomic<uint32_t> hi_{};
atomic<uint32_t> gen_{};
uint64_t read() const {
auto acquire = memory_order_acquire;
uint32_t lo, hi, gen1, gen2;
do {
gen1 = gen_.load(acquire);
lo = lo_.load(acquire);
hi = hi_.load(acquire);
gen2 = gen_.load(acquire);
} while (gen1 != gen2 || (gen1 & 1));
return (uint64_t(hi) << 32) | lo;
}
void increment() {
auto release = memory_order_release;
gen_.fetch_add(1, release);
uint32_t newlo = 1 + lo_.fetch_add(1, release);
if (newlo == 0) {
hi_.fetch_add(1, release);
}
gen_.fetch_add(1, release);
}
};
edit: whoops, fixed auto acquire = memory_order_release;
This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there's only one writer so no extra support for excluding simultaneous writers is needed.) It's not lock-free; a writer sleeping at the wrong time will leave readers spinning until the writer finishes. But in the common case where that doesn't happen, it has excellent performance with no contention between readers which are truly read-only.
You don't need or want the increment of the payload to use atomic RMW operations. (Unless you're on a system that can cheaply do a 64-bit atomic add or load, then do that instead of a SeqLock).
You can just load both halves with atomic 32-bit loads, increment it, and atomically store the result. (With cheap relaxed or release memory order for the payload, and using a release store for the 2nd sequence counter update, what you're calling the "generation" counter).
Similarly the sequence counter also doesn't need to be an atomic RMW. (Unless you're using it as a spinlock with multiple writers)
The single writer only needs pure loads and pure stores with only release ordering, which are (much) cheaper than atomic RMW, or stores with seq_cst ordering:
load the counter and the value in any order
store a new counter (old+1)
store the new value (or just update the low half if you want to branch on no carry)
store the final counter.
The ordering of the stores in those 3 bullet points are the only thing that matters. A write fence after the first store could be good, because we don't really want the cost of making both stores of both halves of the value release, on CPUs where that's more expensive than relaxed.
Unfortunately to satisfy C++ rules, the value has to be atomic<T>, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARM ldrd or ldp / stp load-pair aren't guaranteed atomic until ARMv8.4a, but that doesn't matter. (And compilers often won't optimize two separate atomic 32-bit loads into one wider load.)
Values other threads read while the sequence-counter is odd are irrelevant, but we'd like to avoid undefined behaviour. Maybe we could use a union of a volatile uint64_t and an atomic<uint64_t>
I wrote this C++ SeqLock<class T> template for another question I didn't finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store).
This tries to check if the target already supports lock-free atomic operations on atomic<T> to stop you from using this when it's pointless. (Disable that for testing purposed by defining IGNORE_SIZECHECK.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using a static_assert.
I provided an inc() function for T that supports a ++ operator. TODO would be an apply() that accepts a lambda to do something to a T, and store the result between sequence counter updates.
// **UNTESTED**
#include <atomic>
#ifdef UNIPROCESSOR
// all readers and writers run on the same core (or same software thread)
// ordering instructions at compile time is all that's necessary
#define ATOMIC_FENCE std::atomic_signal_fence
#else
// A reader can be running on another core while writing.
// Memory barriers or ARMv8 acquire / release loads / store are needed
#define ATOMIC_FENCE std::atomic_thread_fence
#endif
// using fences instead of .store(std::memory_order_release) will stop the compiler
// from taking advantage of a release-store instruction instead of separate fence, like on AArch64
// But fences allow it to be optimized away to just compile-time ordering for the single thread or unirprocessor case.
// SINGLE WRITER only.
// uses volatile + barriers for the data itself, like pre-C++11
template <class T>
class SeqLocked
{
#ifndef IGNORE_SIZECHECK
// sizeof(T) > sizeof(unsigned)
static_assert(!std::atomic<T>::is_always_lock_free, "A Seq Lock with a type small enough to be atomic on its own is totally pointless, and we don't have a specialization that replaces it with a straight wrapper for atomic<T>");
#endif
// C++17 doesn't have a good way to express a load that doesn't care about tearing
// without explicitly writing it as multiple small parts and thus gimping the compiler if it can use larger loads
volatile T data; // volatile should be fine on any implementation where pre-C++11 lockless code was possible with volatile,
// even though Data Race UB does apply to volatile variables in ISO C++11 and later.
// even non-volatile normally works in practice, being ordered by compiler barriers.
std::atomic<unsigned> seqcount{0}; // Even means valid, odd means modification in progress.
// unsigned definitely wraps around at a power of 2 on overflow
public:
T get() const {
unsigned c0, c1;
T tmp;
// READER RETRY LOOP
do {
c0 = seqcount.load(std::memory_order_acquire); // or for your signal-handler use-case, relaxed load followed by ATOMIC_FENCE(std::memory_order_acquire);
tmp = (T)data; // load
ATOMIC_FENCE(std::memory_order_acquire); // LoadLoad barrier
c1 = seqcount.load(std::memory_order_relaxed);
} while(c0&1 || c0 != c1); // retry if the counter changed or is odd
return tmp;
}
// TODO: a version of this that takes a lambda for the operation on tmp
T inc() // WRITER
{
unsigned orig_count = seqcount.load(std::memory_order_relaxed);
// we're the only writer, avoid an atomic RMW.
seqcount.store(orig_count+1, std::memory_order_relaxed);
ATOMIC_FENCE(std::memory_order_release); // 2-way barrier *after* the store, not like a release store. Or like making data=tmp a release operation.
// make sure the counter becomes odd *before* any data change
T tmp = data; // load into a non-volatile temporary
++tmp; // make any change to it
data = tmp; // store
seqcount.store(orig_count+2, std::memory_order_release); // or use ATOMIC_FENCE(std::memory_order_release); *before* this, so the UNIPROCESSOR case can just do compile-time ordering
return tmp;
}
void set(T newval) {
unsigned orig_count = seqcount.load(std::memory_order_relaxed);
seqcount.store(orig_count+1, std::memory_order_relaxed);
ATOMIC_FENCE(std::memory_order_release);
// make sure the data stores appear after the first counter update.
data = newval; // store
ATOMIC_FENCE(std::memory_order_release);
seqcount.store(orig_count+2, std::memory_order_relaxed); // Or use mo_release here, better on AArch64
}
};
/***** test callers *******/
#include <stdint.h>
struct sixteenbyte {
//unsigned arr[4];
unsigned long a,b,c,d;
sixteenbyte() = default;
sixteenbyte(const volatile sixteenbyte &old)
: a(old.a), b(old.b), c(old.c), d(old.d) {}
//arr(old.arr) {}
};
void test_inc(SeqLocked<uint64_t> &obj) { obj.inc(); }
sixteenbyte test_get(SeqLocked<sixteenbyte> &obj) { return obj.get(); }
//void test_set(SeqLocked<sixteenbyte> &obj, sixteenbyte val) { obj.set(val); }
uint64_t test_get(SeqLocked<uint64_t> &obj) {
return obj.get();
}
// void atomic_inc_u64_seq_cst(std::atomic<uint64_t> &a) { ++a; }
uint64_t u64_inc_relaxed(std::atomic<uint64_t> &a) {
// same but without dmb barriers
return 1 + a.fetch_add(1, std::memory_order_relaxed);
}
uint64_t u64_load_relaxed(std::atomic<uint64_t> &a) {
// gcc uses LDREXD, not just LDRD?
return a.load(std::memory_order_relaxed);
}
void u64_store_relaxed(std::atomic<uint64_t> &a, uint64_t val) {
// gcc uses a LL/SC retry loop even for a pure store?
a.store(val, std::memory_order_relaxed);
}
It compiles to the asm we want on the Godbolt compiler explorer for ARM, and other ISAs. At least for int64_t; larger struct types may be copied less efficiently because of cumbersome volatile rules.
It uses non-atomic volatile T data for the shared data. This is technically data-race undefined behaviour, but all compilers we use in practice were fine with pre-C++11 multi-threaded access to volatile objects. And pre-C++11, people even depended on atomicity for some sizes. We do not, we check the counter and only use the value we read if there were no concurrent writes. (That's the whole point of a SeqLock.)
One problem with volatile T data is that in ISO C++, T foo = data won't compile for struct objects unless you provide a copy-constructor from a volatile object, like
sixteenbyte(const volatile sixteenbyte &old)
: a(old.a), b(old.b), c(old.c), d(old.d) {}
This is really annoying for us, because we don't care about the details of how memory is read, just that multiple reads aren't optimized into one.
volatile is really the wrong tool here, and plain T data with sufficient fencing to ensure that the read actually happens between the reads of the atomic counter would be better. e.g. we could do that in GNU C with a asm("":::"memory"); compiler barrier against reordering before/after the accesses. That would let the compiler copy larger objects with SIMD vectors, or whatever, which it won't do with separate volatile accesses.
I think std::atomic_thread_fence(mo_acquire) would also be a sufficient barrier, but I'm not 100% sure.
In ISO C, you can copy a volatile aggregate (struct), and the compiler will emit whatever asm it normally would to copy that many bytes. But in C++, we can't have nice things apparently.
Related: single-core systems with a writer in an interrupt handler
In an embedded system with one core, and some variables that are only updated by interrupt handlers, you may have a writer that can interrupt the reader but not vice versa. That allows some cheaper variations that use the value itself to detect torn reads.
See Reading a 64 bit variable that is updated by an ISR, especially for a monotonic counter Brendan's suggestion of reading the most significant-half first, then the low half, then the most-significant half again. If it matches, your read wasn't torn in a way that matters. (A write that didn't change the high half isn't a problem even if it interrupted the reader to change the low half right before or after the reader read it.)
Or in general, re-read the whole value until you see the same value twice in a row.
Neither of these techniques are SMP-safe: the read retry only guards against torn reads, not torn writes if the writer stored the halves separately. That's why a SeqLock uses a 3rd atomic integer as a sequence counter. They would work in any case where the writer is atomic wrt. the reader, but the reader isn't atomic. Interrupt handler vs. main code is one such case, or signal handler is equivalent.
You could potentially use the low half of a monotonic counter as a sequence number, if you don't mind incrementing by 2 instead of 1. (Perhaps requiring readers to do a 64-bit right shift by 1 to recover the actual number. So that's not good.)

about spin lock

i have some questions in boost spinlock code :
class spinlock
{
public:
spinlock()
: v_(0)
{
}
bool try_lock()
{
long r = InterlockedExchange(&v_, 1);
_ReadWriteBarrier(); // 1. what this mean
return r == 0;
}
void lock()
{
for (unsigned k = 0; !try_lock(); ++k)
{
yield(k);
}
}
void unlock()
{
_ReadWriteBarrier();
*const_cast<long volatile*>(&v_) = 0;
// 2. Why don't need to use InterlockedExchange(&v_, 0);
}
private:
long v_;
};
A ReadWriteBarrier() is a "memory barrier" (in this case for both reads and writes), a special instruction to the processor to ensure that any instructions resulting in memory operations have completed (load & store operations - or in for example x86 processors, any opertion which has a memory operand at either side). In this particular case, to make sure that the InterlockedExchange(&v_,1) has completed before we continue.
Because an InterlockedExchange would be less efficient (takes more interaction with any other cores in the machine to ensure all other processor cores have 'let go' of the value - which makes no sense, since most likely (in correctly working code) we only unlock if we actually hold the lock, so no other processor will have a different value cached than what we're writing over anyway), and a volatile write to the memory will be just as good.
The barriers are there to ensure memory synchronization; without
them, different threads may see modifications of memory in
different orders.
And the InterlockedExchange isn't necessary in the second case
because we're not interested in the previous value. The role of
InterlockedExchange is doubtlessly to set the value and return
the previous value. (And why v_ would be long, when it can
only take values 0 and 1, is beyond me.)
There are three issues with atomic access to variables. First, ensuring that there is no thread switch in the middle of reading or writing a value; if this happens it's called "tearing"; the second thread can see a partly written value, which will usually be nonsensical. Second, ensuring that all processors see the change that is being made with a write, or that the processor reading a value sees any previous changes to that value; this is called "cache coherency". Third, ensuring that the compiler doesn't move code across the read or write; this is called "code motion". InterlockedExchange does the first two; although the MSDN documentation is rather muddled, _ReadWriteBarrier does the third, and possibly the second.

Does std::atomic prevent reordering of nonatomic variables over the atomic variables

question is rather simple Q:
If I have
settings[N_STNGS];//used by many threads
std::atomic<size_t> current_settings(0);
void updateSettings()//called by single thread , always the same thread if that is important
{
auto new_settings = (current_settings+1)%N_STNGS;
settings[new_settings].loadFromFileSystem(); //line A
current_settings=new_settings; //line B
}
does standard guarantee that line A wont be reordered after line B? Also will users of STNGS always see consistent(commited-as in memory visibility visible) data?
Edit: for multiple reader threads and nontrivial settings is this worth the trouble compared to simple mutexing?
Given the definition
int settings[N_STNGS];
std::atomic<size_t> current_settings(0);
and Thread 1 executing:
settings[new_settings] = somevalue; // line A
current_settings=new_settings; // line B
and Thread 2 executing:
int cur_settings = current_settings; // line X
int setting_value = settings[cur_settings]; // line Y
then yes, if Thread 2 at line X reads new_settings written by Thread 1 in line B, and there are no other modifications to settings[new_settings] (by some code we don't see), Thread 2 is bound to read somevalue and no undefined behavior occurs. This is because all the operations are (by default) memory_order_seq_cst and a release-write (line B) synchronizes with an acquire-read (line X). Note that you need two statements in Thread 2 to get a sequenced-before relationship between the atomic read of the index and the read of the value (a memory_order_consume operation would do instead).
I'd certainly implement it with rw-mutexes for start.
The general answer is no. If you are careful and you use only functions which have a memory_order parameter and pass them the right value for it depending on what you are doing, then it may be yes.
(And as other have pointed out, your code has problems. For instance, returning by value an atomic<> type doesn't make sense for me.)

Should I protect operations on primitive types with mutexes for being thread-safe in C++?

What is the best approach to achieve thread-safety for rather simple operations?
Consider a pair of functions:
void setVal(int val)
{
this->_val = val;
}
int getVal() {
return this->_val;
}
Since even assignments of primitive types aren't guaranteed to be atomic, should I modify every getter and setter in the program in the following way to be thread-safe?
void setVal(int val)
{
this->_mutex.lock();
this->_val = val;
this->_mutex.unlock();
}
int getVal() {
this->_mutex.lock();
int result = this->_val;
this->_mutex.unlock();
return result;
}
Are you using _val in multiple threads? If not, then no, you don't need to synchronize access to it.
If it is used from multiple threads, then yes, you need to synchronize access, either using a mutex or by using an atomic type (like std::atomic<T> in C++0x, though other threading libraries have nonstandard atomic types as well).
Mutexes are very costly, as they are able to be shared across processes. If the state that you're limiting access to is only to be constrained to threads within your current process then go for something much less heavy, such as a Critical Section or Semaphore.
On 32-bit x86 platforms, reads and writes of 32-bit values aligned on 4-byte boundary are atomic. On 64-bit platforms you can also rely on 64-bit loads and stores of 8-byte aligned values to be atomic as well. SPARC and POWER CPUs also work like that.
C++ doesn't make any guarantees like that, but in practice no compiler is going to mess with it, since every non-trivial multi-threaded program relies on this behaviour.
int getVal() {
this->_mutex.lock();
int result = this->_val;
this->_mutex.unlock();
return result;
}
What exactly are you hoping to accomplish with this? Sure, you've stopped this->_val from changing before you saved into result but it still may change before result is returned, -- or between the return and the assignment to whatever you assigned it -- or a microsecond later. Regardless of what you do, you are just going to get a snapshot of a moving target. Deal with it.
void setVal(int val)
{
this->_mutex.lock();
this->_val = val;
this->_mutex.unlock();
}
Similarly, what is this buying you? If you call setVal(-5) and setVal(17) from separate threads at the same time, what value should be there after both complete? You've gone to some trouble to make sure that the first to start is also the first to finish, but how is that help to get the "right" value set?