Is updating a double an atomic operation? - C++

In Java, updating a double or long variable may not be atomic, because a double/long may be treated as two separate 32-bit values.
http://java.sun.com/docs/books/jls/second_edition/html/memory.doc.html#28733
In C++, if I am using a 32-bit Intel processor and the Microsoft Visual C++ compiler, is updating a double (8 bytes) atomic?
I cannot find much in the specifications about this behavior.
When I say "atomic variable", here is what I mean :
Thread A tries to write 1 to variable x.
Thread B tries to write 2 to variable x.
We should read either 1 or 2 from variable x, never a torn or undefined value.
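For reference, here is a minimal sketch (assuming a C++11 compiler is acceptable) of how that requirement can be expressed portably with std::atomic<double>, rather than relying on what the hardware happens to promise for plain double stores:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<double> x{0.0};

int main() {
    std::thread a([] { x.store(1.0, std::memory_order_relaxed); });
    std::thread b([] { x.store(2.0, std::memory_order_relaxed); });
    a.join();
    b.join();
    std::printf("%f\n", x.load());  // prints 1.000000 or 2.000000, never a mix of both writes
}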

This is hardware specific and depends on the architecture. For x86 and x86_64, 8-byte reads and writes are guaranteed to be atomic if they are aligned. Quoting from the Intel Architecture Memory Ordering White Paper:
Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type:
- Instructions that read or write a single byte.
- Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
- Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
- Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
All locked instructions (the implicitly locked xchg instruction and other read-modify-write instructions with a lock prefix) are an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of memory type and alignment.
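For illustration, a small sketch (assuming C++11 and a typical x86/x86_64 ABI) of requesting and checking the 8-byte alignment that the quadword guarantee above depends on:

#include <cstdint>
#include <cstdio>

// Without 8-byte alignment, the quadword guarantee quoted above does not apply.
alignas(8) double shared_value = 0.0;

int main() {
    bool aligned = reinterpret_cast<std::uintptr_t>(&shared_value) % 8 == 0;
    std::printf("8-byte aligned: %s\n", aligned ? "yes" : "no");
}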

It's safe to assume that updating a double is never atomic, even if its size is the same as an int with an atomicity guarantee. The reason is that it has a different processing path, since it's a non-critical and expensive data type. For example, even data barriers usually mention that they don't apply to floating-point data/operations in general.
Visual C++ will align primitive types (see article), and while that should guarantee that its bits won't get garbled while writing to memory (an 8-byte-aligned value always lies within one 64- or 128-byte cache line), the rest depends on how the CPU handles non-atomic data in its cache and whether reading/flushing a cache line is interruptible. So if you dig through the Intel docs for the kind of core you are using and they give you that guarantee, then you are safe to go.
The reason the Java spec is so conservative is that it's supposed to run the same way on an old 386 and on a Core i7. Which is of course delusional, but a promise is a promise, therefore it promises less :-)
The reason I'm saying that you have to look up the CPU doc is that your CPU might be an old 386, or alike :-)) Don't forget that on a 32-bit CPU your 8-byte block takes 2 "rounds" to access, so you are at the mercy of the mechanics of the cache access.
Cache line flushing, which gives a much higher data-consistency guarantee, applies only to reasonably recent CPUs with the Intel-ian guarantee (automatic cache consistency).

I wouldn't think that on any architecture a thread/context switch would interrupt a register update halfway, leaving you with, for example, 18 bits updated out of the 32 it was going to update. The same goes for updating a memory location (provided that it's a basic access unit: 8, 16, 32, 64 bits, etc.).

So has this question been answered? I ran a simple test program changing a double:
#include <stdio.h>

int main(int argc, char** argv)
{
    double i = 3.14159265358979323;
    i += 84626.433;
}
I compiled it without optimizations (gcc -O0), and all assignment operations are performed with single assembler instructions such as fldl .LC0 and faddp %st, %st(1). (i += 84626.433 is of course done in two operations, faddp and fstpl.)
Can a thread really get interrupted inside a single instruction such as faddp?

On a multicore system, besides atomicity, you also have to worry about visibility across caches, so that the reading thread sees the new value once the writer has updated it.
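A hedged sketch (assuming C++11 std::atomic is available) of making that visibility explicit with release/acquire, rather than reasoning about caches directly:

#include <atomic>

std::atomic<double> shared{0.0};
std::atomic<bool> ready{false};

void writer() {
    shared.store(3.14, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);   // publish
}

double reader() {
    while (!ready.load(std::memory_order_acquire)) {}  // wait for the publish
    return shared.load(std::memory_order_relaxed);     // guaranteed to see 3.14
}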

Related

std::atomic in a union with another character

I recently read some code that had an atomic and a character in the same union. Something like this
union U {
    std::atomic<char> atomic;
    char character;
};
I am not entirely sure of the rules here, but the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte. And the character only uses those last few bits.
Is this allowed? Can we overlay an atomic integer over a character and have them both be active? If so, what happens when one thread is trying to load the value from the atomic integer and another goes and writes to the character (only the last few bits)? Will the char write be an atomic write? What happens there? Will the cache have to be flushed for the thread that is trying to load the atomic integer?
(This code looks smelly to me too, and I am not advocating using this. Just want to understand what parts of the above scheme can be defined and under what circumstances)
As requested the code is doing something like this
// thread using the atomic
while (!atomic.compare_exchange_weak(old_value, old_value | mask, ...)) { ... }

// thread using the character
character |= 0b1; // set the 1st bit or something
the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte.
Those comments are wrong. char-can-alias-anything doesn't stop this from being a data race on a non-atomic variable, so it's not allowed in theory, and worse, it's actually broken when compiled by any normal compiler (like gcc, clang, or MSVC) for any normal CPU (like x86).
The unit of atomicity is the memory location, not bits within a memory location. The ISO C++11 standard defines "memory location" carefully, so adjacent elements in a char[] array or a struct are separate locations (and thus it's not a race if two threads write c[0] and c[1] without synchronization). But adjacent bit-fields in a struct are not separate memory locations, and using |= on a non-atomic char aliased to the same address as an atomic<char> is definitely the same memory location, regardless of which bits are set in the right-hand side of the |=.
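A quick sketch of that memory-location rule (a hypothetical example, not code from the question): two threads writing adjacent char array elements with no synchronization is not a race, because each element is a separate memory location.

#include <thread>

char c[2];

int main() {
    std::thread t0([] { c[0] = 'a'; });  // writes memory location c[0]
    std::thread t1([] { c[1] = 'b'; });  // writes the separate location c[1]: no data race
    t0.join();
    t1.join();
}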
For a program to be free of data-race UB, if a memory location is written by any thread, all other threads that access that memory location (potentially) simultaneously must do so with atomic operations. (And probably also through the exact same object, i.e. changing the middle byte of an atomic<int> by type-punning to atomic<char> isn't guaranteed to be safe either. On most implementations on hardware similar to "normal" modern CPUs, type-punning to a different atomic type might happen to still be atomic if atomic<int/char> are both lock-free, but memory-ordering semantics might actually be broken, especially if it isn't perfectly overlapping.)
Also, union type-punning in general is not allowed in ISO C++. I think you actually need to pointer-cast to char*, not make unions with char. Union type-punning is allowed in ISO C99, and as a GNU extension in GNU C89 and GNU C++, and also in some other C++ implementations.
So that takes care of the theory, but does any of this work on current CPUs? No, it's totally unsafe in practice, too.
character |= 1 will (on normal computers) compile to asm that loads the whole char, modifies the temporary, then stores the value back. On x86 this can all happen within one memory-destination or instruction if the compiler chooses to do that (which it won't if it also wants the value later). But even so, it's still a non-atomic RMW which can step on modifications to other bits.
Atomicity is expensive and optional for read-modify-write operations, and the only way to set some bits in a byte without affecting others is a read-modify-write on current CPUs. Compilers only emit asm that does it atomically if you specifically request it. (Unlike pure stores or pure loads, which are often naturally atomic. But always use std::atomic to get the other semantics you want...)
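As a sketch of what "specifically request it" looks like in C++11 (hypothetical variable names):

#include <atomic>

std::atomic<char> flags{0};

// Atomic RMW: compiles to lock or / a lock cmpxchg loop, cannot lose other bits.
void set_bit_atomically() {
    flags.fetch_or(0b1, std::memory_order_relaxed);
}

// Plain load/modify/store: NOT atomic; concurrent writers can step on each other's bits.
void set_bit_racy(char &c) {
    c |= 0b1;
}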
Consider this sequence of events:
thread A              | thread B
----------------------|---------------------------------------------
read tmp = c = 0000   |
                      | c |= 0b1100   # atomic RMW, leaving c = 1100
tmp |= 1   # tmp = 1  |
store c = tmp         |
Leaving c = 0001, not the 1101 you're hoping for, i.e. the non-atomic load/store of the high bits stepped on the modification by thread B.
We get asm that can do exactly that, from compiling the source snippets from the question (on the Godbolt compiler explorer):
void t1(U &v, unsigned mask) {
    // thread using the atomic
    char old_value = v.atomic.load(std::memory_order_relaxed);
    // with memory_order_seq_cst as the default for CAS
    while (!v.atomic.compare_exchange_weak(old_value, old_value | mask)) {}
    // v.atomic |= mask;  // would have been easier and more efficient than CAS
}

t1(U&, unsigned int):
    movzx   eax, BYTE PTR [rdi]         # atomic load of the old value
.L2:
    mov     edx, eax
    or      edx, esi                    # esi = mask (register arg)
    lock cmpxchg BYTE PTR [rdi], dl     # atomic CAS; uses AL implicitly as the expected value, same semantics as C++11 comp_exg seq_cst
    jne     .L2
    ret

void t2(U &v) {
    // thread using the character
    v.character |= 0b1;  // set the 1st bit or something
}

t2(U&):
    or      BYTE PTR [rdi], 1           # NON-ATOMIC RMW of the whole byte.
    ret
It would be straightforward to write a program that ran v.character |= 1 in one thread, and an atomic v.atomic ^= 0b1100000 (or the equivalent with a CAS loop) in another thread.
If this code was safe, you'd always find that an even number of XOR operations modifying only the high bits left them zero. But you wouldn't find that, because the non-atomic or in the other thread might have stepped on an odd number of XOR operations. Or to make the problem easier to see, use addition with 0x10 or something, so instead of a 50% chance of being right by accident, you only have a 1 in 16 chance of the upper 4 bits being right.
This is pretty much exactly the same problem as losing counts when one of the increment operations is non-atomic.
Will the cache have to be flushed for the thread that is trying to load the atomic integer?
No, that's not how atomicity works. The problem isn't cache, it's that unless the CPU does something special, nothing stops other CPUs from reading or writing the location between when you load the old value and when you store the updated value. You'd have the same problem on a multi-core system with no cache.
Of course, all systems do use cache, but cache is coherent so there's a hardware protocol (MESI) that stops different cores from simultaneously having conflicting values. When a store commits to L1D cache, it becomes globally visible. See Can num++ be atomic for 'int num'? for the details.
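For completeness, a hedged sketch of the well-defined alternative to the union trick: give both threads the same std::atomic<char> and use atomic RMW operations so neither group of bits can be lost (iteration counts are arbitrary):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<char> c{0};

int main() {
    std::thread low([] {
        for (int i = 0; i < 100000; ++i)
            c.fetch_or(0b1, std::memory_order_relaxed);         // set the low bit atomically
    });
    std::thread high([] {
        for (int i = 0; i < 100000; ++i)
            c.fetch_xor(0b1100000, std::memory_order_relaxed);  // toggle high bits atomically
    });
    low.join();
    high.join();
    // Even number of XORs => high bits end up 0; the OR'd low bit is 1, so c == 0x1.
    std::printf("c = %#x\n", (unsigned)(unsigned char)c.load());
}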

Are Reads and Writes of an int in C++ Atomic on x86-64 multi-core machine

I've read this, my question is quite similar yet somewhat different.
Note, I know C++0x does not guarantee that but I'm asking particularly for a multi-core machine like x86-64.
Let's say we have 2 threads (pinned to 2 physical cores) running the following code:
#include <cstdio>

// I know people may declare volatile useless, but here I do NOT care about memory
// reordering nor synchronization. I just want to suppress the compiler
// optimization of keeping n in a register.
volatile int n;

void thread1() {
    for (;;) {
        n = 0xABCD1234;
        // NOTE: I know ++n is not atomic, but I do NOT care here.
        // What I care about is whether n can be 0x00001234, i.e. in the middle of the
        // update from core-1's cache lines to main memory, will core-2 see an
        // incomplete value (like the first 2 bytes lost)?
        ++n;
    }
}

void thread2() {
    while (true) {
        printf("%d", n);
    }
}
Is it possible for thread 2 to see n as something like 0x00001234, i.e. in the middle of the update from core-1's cache lines to main memory, will core-2 see an incomplete value?
I know a single 4-byte int definitely fits into a typical 128-byte-long cache line, and if that int is stored inside one cache line then I believe there are no issues here... yet what if it crosses the cache line boundary? i.e. could some char already sit inside that cache line, which puts the first part of n in one cache line and the other part in the next line? If that is the case, then core-2 may have a chance of seeing an incomplete value, right?
Also, I think unless every char or short or other less-than-4-byte type is padded to be 4 bytes long, one can never guarantee that a single int does not cross a cache line boundary, right?
If so, would that suggest that generally even setting a single int is not guaranteed to be atomic on an x86-64 multi-core machine?
I got this question because as I researched this topic, various people in various posts seem to agree that as long as the machine architecture is proper (e.g. x86-64), setting an int should be atomic. But as I argued above, that does not hold, right?
UPDATE
I'd like to give some background for my question. I'm dealing with a real-time system which is sampling some signal and putting the result into one global int; this is of course done in one thread. And in yet another thread I read this value and process it.
I do not care about the ordering of set and get; all I need is a complete (vs. a corrupted) integer value.
x86 guarantees this. C++ doesn't. If you write x86 assembly you will be fine. If you write C++ it is undefined behavior. Since you can't reason about undefined behavior (it is undefined after all) you have to go lower and look at the assembler instructions that were generated. If they do what you want then this is fine. Note, however, that compilers tend to change generated assembly when you change compilers, compiler versions, compiler flags or any code which might change the optimizer's behavior, so you will constantly have to check the assembler code to make sure it is still correct.
The easier way is to use std::atomic<int> which will guarantee that the correct assembler instructions are generated so you don't have to constantly check.
The other question talks about variables "properly aligned". If it crosses a cache-line, the variable is not properly aligned. An int will not do that unless you specifically ask the compiler to pack a struct, for example.
You also assume that using volatile int is better than atomic<int>. If volatile int is the perfect way to sync variables on your platform, surely the library implementer would also know that and store a volatile x inside atomic<x>.
There is no requirement that atomic<int> has to be extra slow just because it is standard. :-)
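As a small illustration of the "pack a struct" caveat mentioned above (using the GNU packed attribute; MSVC spells it #pragma pack; the structs are hypothetical), packing strips the natural alignment that the atomicity guarantee relies on:

#include <cstddef>

struct __attribute__((packed)) Packed {
    char c;
    int  n;   // offset 1: not 4-byte aligned, may straddle a cache-line or 8-byte boundary
};

struct Normal {
    char c;
    int  n;   // offset 4: naturally aligned by the usual x86 ABIs
};

static_assert(alignof(Normal) == alignof(int), "int is naturally aligned inside the struct");
static_assert(offsetof(Packed, n) == 1, "packing removed the padding before n");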
Why worry so much?
Rely on your implementation. std::atomic<int> will reduce to an int if int is atomic on your platform (and in x86-64 they are, if properly aligned).
I'd also be concerned about the possibility of int overflow with your code (which is undefined behaviour), if I were you.
In other words std::atomic<unsigned> is the appropriate type here.
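A quick way to confirm that assumption on your toolchain (is_always_lock_free is C++17; with C++11 you can call is_lock_free() at run time instead):

#include <atomic>

// If this compiles, atomic<unsigned> is always lock-free on the target, i.e. it
// reduces to plain aligned loads and stores with no hidden locking.
static_assert(std::atomic<unsigned>::is_always_lock_free,
              "atomic<unsigned> is not lock-free on this target");

std::atomic<unsigned> sample_count{0};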
If you're looking for atomicity guarantee, std::atomic<> is your friend. Don't rely on volatile qualifier.
The question is almost a duplicate of Why is integer assignment on a naturally aligned variable atomic on x86?. The answer there does answer everything you ask, but this question is more focused on the ABI / compiler question of whether an int (or other type?) will be sufficiently-aligned, rather than what happens when it is. There's other stuff in this question that's worth answering specifically, too.
Yes, they almost invariably will be on machines where an int fits in a single register (e.g. not AVR: an 8-bit RISC), because compilers typically choose not to use multiple store instructions when they could use 1.
Normal x86 ABIs will align an int to a 4B boundary, even inside structs (unless you use GNU C __attribute__((packed)) or the equivalent for other dialects). But beware that the i386 System V ABI only aligns double to 4 bytes; it's only outside structs that modern compilers can go beyond that and give it natural alignment, making load/store atomic.
But nothing you can legally do in C++ can ever depend on this fact (because by definition it will involve a data race on a non-atomic type so it's Undefined Behaviour). Fortunately, there are efficient ways to get the same result (i.e. about the same compiler-generated asm, without mfence instructions or other slow stuff) that don't cause undefined behaviour.
You should use atomic instead of volatile or hoping that the compiler doesn't optimize away stores or loads on a non-volatile int, because the assumption of async modification is one of the ways that volatile and atomic overlap.
I'm dealing with a real-time system, which is sampling some signal and putting the result into one global int, this is of course done in one thread. And in yet another thread I read this value and process it.
std::atomic with .store(val, std::memory_order_relaxed) and .load(std::memory_order_relaxed) will give you exactly what you want here. The HW-access thread runs free and does plain ordinary x86 store instructions into the shared variable, while the reader thread does plain ordinary x86 load instructions.
This is the C++11 way to express that this is what you want, and you should expect it to compile to the same asm as with volatile. (With maybe a couple instructions difference if you use clang, but nothing important.) If there was any case where volatile int wouldn't have sufficient alignment, or any other corner cases, atomic<int> will work (barring compiler bugs). Except maybe in a packed struct; IDK if compilers stop you from breaking atomicity by packing atomic types in structs.
In theory, you might want to use volatile std::atomic<int> to make sure the compiler doesn't optimize out multiple stores to the same variable. See Why don't compilers merge redundant std::atomic writes?. But for now, compilers don't do that kind of optimization. (volatile std::atomic<int> should still compile to the same light-weight asm.)
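A minimal sketch of that pattern (the variable name is made up for illustration):

#include <atomic>

std::atomic<int> latest_sample{0};

// HW-access thread: compiles to a plain x86 mov store of the aligned int.
void publish_sample(int raw) {
    latest_sample.store(raw, std::memory_order_relaxed);
}

// Processing thread: compiles to a plain x86 mov load.
int read_sample() {
    return latest_sample.load(std::memory_order_relaxed);
}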
I know a single 4-byte int definitely fits into a typically 128-byte-long cache line, and if that int does store inside one cache line then I believe no issues here...
Cache lines are 64B on all mainstream x86 CPUs since PentiumIII; before that 32B lines were typical. (Well AMD Geode still uses 32B lines...) Pentium4 uses 64B lines, although it prefers to transfer them in pairs or something? Still, I think it's accurate to say that it really does use 64B lines, not 128B. This page lists it as 64B per line.
AFAIK, there are no x86 microarchitectures that used 128B lines in any level of cache.
Also, only Intel CPUs guarantee that cached unaligned stores / loads are atomic if they don't cross a cache-line boundary. The baseline atomicity guarantee for x86 in general (AMD/Intel/other) is don't cross an 8-byte boundary. See Why is integer assignment on a naturally aligned variable atomic on x86? for quotes from Intel/AMD manuals.
Natural alignment works on pretty much any ISA (not just x86) up to the maximum guaranteed-atomic width.
The code in your question wants a non-atomic read-modify write where the load and store are separately atomic, and impose no ordering on surrounding loads/stores.
As everyone has said, the right way to do this is with atomic<int>, but nobody has pointed out exactly how. If you just n++ on atomic_int n, you will get (for x86-64) lock add [n], 1, which will be much slower than what you get with volatile, because it makes the entire RMW operation atomic. (Perhaps this is why you were avoiding std::atomic<>?)
#include <atomic>

volatile int vcount;
std::atomic<int> acount;
static_assert(alignof(vcount) == sizeof(vcount), "under-aligned volatile counter");

void inc_volatile() {
    while (1) vcount++;
}

void inc_separately_atomic() {
    while (1) {
        int t = acount.load(std::memory_order_relaxed);
        t++;
        acount.store(t, std::memory_order_relaxed);
    }
}
asm output from the Godbolt compiler explorer with gcc7.2 and clang5.0
Unsurprisingly, they both compile to equivalent asm with gcc/clang for x86-32 and x86-64. gcc makes identical asm for both, except for the address to increment:
# x86-64 gcc -O3
inc_volatile():
.L2:
    mov     eax, DWORD PTR vcount[rip]
    add     eax, 1
    mov     DWORD PTR vcount[rip], eax
    jmp     .L2

inc_separately_atomic():
.L5:
    mov     eax, DWORD PTR acount[rip]
    add     eax, 1
    mov     DWORD PTR acount[rip], eax
    jmp     .L5
clang optimizes better, and uses
inc_separately_atomic():
.LBB1_1:
    add     dword ptr [rip + acount], 1
    jmp     .LBB1_1
Note the lack of a lock prefix, so inside the CPU this decodes to separate load, ALU add, and store uops. (See Can num++ be atomic for 'int num'?).
Besides smaller code-size, some of these uops can be micro-fused when they come from the same instruction, reducing front-end bottlenecks. (Totally irrelevant here; the loop bottlenecks on the 5 or 6 cycle latency of a store/reload. But if used as part of a larger loop, it would be relevant.) Unlike with a register operand, add [mem], 1 is better than inc [mem] on Intel CPUs because it micro-fuses even more: INC instruction vs ADD 1: Does it matter?.
It's interesting that clang uses the less efficient inc dword ptr [rip + vcount] for inc_volatile().
And how does an actual atomic RMW compile?
void inc_atomic_rmw() {
    while (1) acount++;
}

# both gcc and clang do this:
.L7:
    lock add DWORD PTR acount[rip], 1
    jmp     .L7
Alignment inside structs:
#include <stdint.h>

struct foo {
    int a;
    volatile double vdouble;
};

// will fail with -m32, in the SysV ABI:
static_assert(alignof(foo) == sizeof(double), "under-aligned volatile counter");
But atomic<double> or atomic<unsigned long long> will guarantee atomicity.
For 64-bit integer load/store on 32-bit machines, gcc uses SSE2 instructions. Some other compilers unfortunately use lock cmpxchg8b, which is far less efficient for separate stores or loads. volatile long long wouldn't give you that.
volatile double would normally be atomic to load/store when aligned correctly, because the normal code-gen already uses single 8B load/store instructions.
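A short sketch of that last point (hypothetical function names): std::atomic<double> makes the tear-free 8-byte load/store guarantee explicit even with -m32, which volatile double does not.

#include <atomic>

std::atomic<double> adouble{0.0};

// Relaxed store/load: still guaranteed tear-free, with no fences on x86.
void publish(double v) { adouble.store(v, std::memory_order_relaxed); }
double consume() { return adouble.load(std::memory_order_relaxed); }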

Is a write to variable on a single core CPU atomic?

I have a single core CPU (ARM Cortex M3, 32bit) with two threads. Assuming the following situation:
// Thread 1:
int16_t a = 1;
double b = 1.0;
// Do some other fancy stuff including starting Thread 2
for (;;) {std::cout << a << "," <<b;}
// Thread 2:
a = 2;
b = 2.0;
I can handle the following outputs:
1,1
1,2
2,1
2,2
Can I be certain that the output will always be one of those (1/2) without using mutex or other locking mechanisms? And more specifically, is this compiler dependent? And is the behavior different for int16 and double?
It depends on the CPU, mostly, though in theory anything involving multiple threads on pre-C11 is at best implementation defined and at worst undefined behavior, so the compiler might do just about anything.
If you can ignore crazy compilers that do silly things, and assume that the compiler will use the CPU's facilities in a reasonable way, it depends mostly on what the CPU supports.
Cortex-M3 is a 32-bit CPU with a 32-bit bus and no FPU. So reads and writes of 32-bit and smaller values will generally be atomic. double, however, is 64 bits, so any read/write of a double will involve two instructions and be non-atomic. Thus if one thread reads while the other is writing you might get half from one value and half from the other.
Now in your specific example, the values 1.0 and 2.0 are both 0 for their lower half, so the 'mix' would be innocuous, but other values will not have that behavior.
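A hedged sketch of how std::atomic expresses this on such a target (assuming a C++11 toolchain for the Cortex-M3):

#include <atomic>
#include <cstdint>

// atomic<int16_t> fits in a single 32-bit bus access and is normally lock-free
// on Cortex-M3; atomic<double> is 8 bytes on a 32-bit bus, so the library will
// typically fall back to an internal lock to keep it tear-free.
std::atomic<int16_t> a{1};
std::atomic<double>  b{1.0};

void thread2() {
    a.store(2, std::memory_order_relaxed);
    b.store(2.0, std::memory_order_relaxed);  // may take an internal lock on M3
}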
The evaluation order of the operations before the ; is not guaranteed to be left to right, and even if it were, the accesses are not atomic; a read that overlaps a write can observe a torn value (the accesses can and do take multiple cycles to perform, and a context switch can interrupt them).
On ARM in particular, reads and writes go into a queue on the CPU where they are free to be reordered (except across memory barriers); even on a CPU that doesn't reorder memory, the compiler is also free to reorder them. There is nothing stopping your assignment and read from being moved forward or back, so you cannot guarantee the state of any of the values or their ordering.

do integer reads need to be critical section protected?

I have come across some C++03 code that takes this form:
struct Foo {
    int a;
    int b;
    CRITICAL_SECTION cs;
};

// DoFoo::Foo foo_;

void DoFoo::Foolish()
{
    if( foo_.a == 4 )
    {
        PerformSomeTask();

        EnterCriticalSection(&foo_.cs);
        foo_.b = 7;
        LeaveCriticalSection(&foo_.cs);
    }
}
Does the read from foo_.a need to be protected? e.g.:
void DoFoo::Foolish()
{
    EnterCriticalSection(&foo_.cs);
    int a = foo_.a;
    LeaveCriticalSection(&foo_.cs);

    if( a == 4 )
    {
        PerformSomeTask();

        EnterCriticalSection(&foo_.cs);
        foo_.b = 7;
        LeaveCriticalSection(&foo_.cs);
    }
}
If so, why?
Please assume the integers are 32-bit aligned. The platform is ARM.
Technically yes, but no on many platforms. First, let us assume that int is 32 bits (which is pretty common, but not nearly universal).
It is possible that the two words (16 bit parts) of a 32 bit int will be read or written to separately. On some systems, they will be read separately if the int isn't aligned properly.
Imagine a system where you can only do 32-bit aligned 32 bit reads and writes (and 16-bit aligned 16 bit reads and writes), and an int that straddles such a boundary. Initially the int is zero (ie, 0x00000000)
One thread writes 0xBAADF00D to the int, the other reads it "at the same time".
The writing thread first writes 0xBAAD to the high word of the int. The reader thread then reads the entire int (both high and low) getting 0xBAAD0000 -- which is a state that the int was never put into on purpose!
The writer thread then writes the low word 0xF00D.
As noted, on some platforms all 32 bit reads/writes are atomic, so this isn't a concern. There are other concerns, however.
Most lock/unlock code includes instructions to the compiler to prevent reordering across the lock. Without that prevention of reordering, the compiler is free to reorder things so long as it behaves "as-if" in a single threaded context it would have worked that way. So if you read a then b in code, the compiler could read b before it reads a, so long as it doesn't see an in-thread opportunity for b to be modified in that interval.
So possibly the code you are reading is using these locks to make sure that the read of the variable happens in the order written in the code.
Other issues are raised in the comments below, but I don't feel competent to address them: cache issues, and visibility.
Looking at this, it seems that ARM has quite a relaxed memory model, so you need a form of memory barrier to ensure that writes in one thread are visible when you'd expect them in another thread. So what you are doing, or else using std::atomic, seems likely necessary on your platform. Unless you take this into account, you can see updates out of order in different threads, which would break your example.
I think you can use C++11 to ensure that integer reads are atomic, using (for example) std::atomic<int>.
The C++ standard says that there's a data race if one thread writes to a variable at the same time as another thread reads from that variable, or if two threads write to the same variable at the same time. It further says that a data race produces undefined behavior. So, formally, you must synchronize those reads and writes.
There are three separate issues when one thread reads data that was written by another thread. First, there is tearing: if writing requires more than a single bus cycle, it's possible for a thread switch to occur in the middle of the operation, and another thread could see a half-written value; there's an analogous problem if a read requires more than a single bus cycle. Second, there's visibility: each processor has its own local copy of the data that it's been working on recently, and writing to one processor's cache does not necessarily update another processor's cache. Third, there's compiler optimizations that reorder reads and writes in ways that would be okay within a single thread, but will break multi-threaded code. Thread-safe code has to deal with all three problems. That's the job of synchronization primitives: mutexes, condition variables, and atomics.
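A hedged C++11 sketch of the std::atomic alternative mentioned above: making a atomic removes the need for a critical section around its read while still addressing tearing, visibility, and compiler reordering (std::mutex stands in for the Windows CRITICAL_SECTION, and PerformSomeTask is the question's placeholder):

#include <atomic>
#include <mutex>

struct Foo {
    std::atomic<int> a{0};
    int b{0};
    std::mutex cs;
};

Foo foo_;
void PerformSomeTask() {}

void Foolish() {
    if (foo_.a.load(std::memory_order_acquire) == 4) {  // no lock needed for the read
        PerformSomeTask();
        std::lock_guard<std::mutex> lock(foo_.cs);      // b is still protected as before
        foo_.b = 7;
    }
}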
Although the integer read/write operation indeed will most likely be atomic, the compiler optimizations and processor cache will still give you problems if you don't do it properly.
To explain - the compiler will normally assume that the code is single-threaded and make many optimizations that rely on that. For example, it might change the order of instructions. Or, if it sees that the variable is only written and never read, it might optimize it away entirely.
The CPU will also cache that integer, so if one thread writes it, the other one might not get to see it until a lot later.
There are two things you can do. One is to wrap it in a critical section like in your original code. The other is to mark the variable as volatile. That will signal to the compiler that this variable will be accessed by multiple threads and will disable a range of optimizations, as well as placing special cache-sync instructions (aka "memory barriers") around accesses to the variable (or so I understand). Apparently this is wrong.
Added: Also, as noted by another answer, Windows has Interlocked APIs that can be used to avoid these issues for non-volatile variables.

Does setting a bit collide with concurrent sets of other bits on the same word?

Say I have a bitmap, and several threads (running on several CPUs) are setting bits on it. No synchronization is used, and no atomic operations. Also, no resets are done. To my understanding, when two threads are trying to set two bits on the same word, only one operation would eventually stick. The reason is that for a bit to be set, the whole word should be read and written back, and so when both reads are done at the same time, when writing back one operation would override the other. Is that correct?
If the above is true, is it always so for byte operations as well? Namely, if a word is 2 bytes, and each thread tries to set a different byte to 1, will they too override each other when done concurrently, or do some systems support writing back the results to only a part of a word?
The reason for asking is that I'm trying to figure out how much space I have to give up in order to omit synchronization in bit/byte/word-map operations.
In short, it's very CPU and compiler dependent.
Say you have a 32-bit value containing zero, and thread A wants to set bit 0 and thread B wants to set bit 1.
As you describe, these are read-modify-write operations, and the synchronization issue is 'what happens if they collide'.
The case you need to avoid is this:
A: Reads (gets 0)
B: Reads (also gets 0)
A: Logical-OR bit 0, result = 1
A: Writes 1
B: Logical-OR bit 1, result = 2
B: Writes 2 - oops, should have been 3
... when the correct result is this...
A: Reads (gets 0)
A: Logical-OR bit 0, result = 1
A: Writes 1
B: Reads (gets 1)
B: Logical-OR bit 1, result = 3
B: Writes 3 - correct
On some processors, the read-modify write will be three separate instructions, so you WILL need synchronization. On others, it will be a single atomic instruction. On multiple Core/CPU systems it will be a single instruction BUT other cores/CPUs may be able to access, so again you will need synchronization.
Doing it with bytes can be the same. In some processor memory architectures, you can only write a 32-bit memory value, so byte updates require a read-modify-write as before.
Update for X86 architecture (and windows, specifically)
Windows provides a set of atomic "Interlocked" operations on 32-bit values, including logical OR. These could be a big help to you in avoiding critical sections. But beware, because as Raymond Chen points out, they don't solve everything. Keep reading that post until you understand it!
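A portable C++11 sketch of the same idea as InterlockedOr (names and sizes are illustrative only): store the bitmap as atomic words and set bits with an atomic OR, so concurrent setters on the same word cannot lose each other's updates.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

std::array<std::atomic<std::uint32_t>, 32> bitmap{};  // 32 words = 1024 bits

void set_bit(std::size_t i) {
    // Atomic RMW on the word containing bit i; other bits in the word are preserved.
    bitmap[i / 32].fetch_or(std::uint32_t{1} << (i % 32), std::memory_order_relaxed);
}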
The specifics will be system-dependent, and possibly compiler-dependent. I imagine you might have to go all the way to a 32-bit integer before you are free from the effects you fear.
I believe this is true, for the reasons you specified.
The way I see it, if your bitmap is stored as a char[], and if your architecture is byte addressable (it's possible to read and write an individual byte in memory without having to read an entire word), then the compiler may generate a single byte store, which won't disturb the neighboring bytes. Even so, it's completely implementation-defined, so you can't rely on it.