Is imageStore atomic? - opengl

When using imageStore in OpenGL, is it atomic?
Or in other words, Assume I have one compute shader invocation that writes "82" to a location, and another invocation that writes "42" to the same location.
When I have a third invocation that reads from the same location: Am I guaranteed to get the initial value or 42 or 82? Or can I get an undefined value because they could both write at the same time? (I know its not sure which one I will get).
Would the answer to above questions change if they both write the same value, instead of different values?

The specification is kinda unclear on this.
The specification talks a lot about the ordering of invocations, as well as visibility of store operations. But at no point does it say what happens if you have two invocations that race on a write to the same memory location. It does not say that the value will be undefined, or that it would be one of several possibilities.
The specification seems to have a hole in it in this regard. As such, I would not do anything that would assume that such writes are genuinely "atomic" in this regard.

Related

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
Its the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? ie. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call could would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.
disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
you can write it in the code and compile, but it will probably yield undefined behaviour.
when talking about atomics, it is important to understand what kind o problems do they solve.
As you might know, what we call in shortly "memory" is multi-layered set of entities which are capable to hold memory.
first we have the RAM, then the cache lines , then the registers.
on mono-core processors, we don't have any synchronization problem. on multi-core processors we have all of them. every core has it own set of registers and cache lines.
this casues few problems.
First one of them is memory reordering - the CPU may decide on runtime to scrumble some reading/writing instructions to make the code run faster. this may yield some strange results that are completly invisible on the high-level code that brought this set of instruction. the most classic example of this phenomanon is the "two threads - two integer" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
logically, the result "00" cannot be. either a ends first, the result may be "01", either b ends first, the result may be "10". if both of them ends in the same time, the result may be "11". yet, if you build small program which imitates this situtation and run it in a loop, very quicly you will see the result "00"
another problem is memory invisibility. like I mentioned before, the variable's value may be cached in one of the cache lines, or be stored in one of the registered. when the CPU updates a variables value - it may delay the writing of the new value back to the RAM. it may keep the value in the cache/regiter because it was told (by the compiler optimizations) that that value will be updated again soon, so in order to make the program faster - update the value again and only then write it back to the RAM. it may cause undefined behaviour if other CPU (and consequently a thread or a process) depends on the new value.
for example, look at this psuedo code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
the character 'a' may be printed infinitly, because b may be cached and never be updated.
there are many more problems when dealing with paralelism.
atomics solves these kind of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from the RAM correctly without doing un-wanted scrumbling (read about memory orders). a memory order may force the cpu to write it's values back to the RAM, or read the valuse from the RAM even if they are cached.
So, although you can mix non atomics actions with atomic ones, you only doing part of the job.
for example let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread - > b = (non atomicly) false.
so although one thread re-read the value of b from the RAM again and again but the other thread may not write false back to the RAM.
So although you can mix these kind of operations in the code, it will yield underfined behavior.
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operationsin parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. That is that the sizes of the atomic and non atomic types are the same.
I definitely would not attempt to mix both types of access to an address in a parallel context, that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context, i think that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

Is it necessary to lock an array that is *only written to* from one thread and *only read from* another?

I have two threads running. They share an array. One of the threads adds new elements to the array (and removes them) and the other uses this array (read operations only).
Is it necessary for me to lock the array before I add/remove to/from it or read from it?
Further details:
I will need to keep iterating over the entire array in the other thread. No write operations over there as previously mentioned. "Just scanning something like a fixed-size circular buffer"
The easy thing to do in such cases is to use a lock. However locks can be very slow. I did not want to use locks if their use can be avoided. Also, as it came out from the discussions, it might not be necessary (it actually isn't) to lock all operations on the array. Just locking the management of an iterator for the array (count variable that will be used by the other thread) is enough
I don't think the question is "too broad". If it still comes out to be so, please let me know. I know the question isn't perfect. I had to combine at least 3 answers in order to be able to solve the question - which suggests most people were not able to fully understand all the issues and were forced to do some guess work. But most of it came out through the comments which I have tried to incorporate in the question. The answers helped me solve my problem quite objectively and I think the answers provided here are quite a helpful resource for someone starting out with multithreading.
If two threads perform an operation on the same memory location, and at least one operation is a write operation, you have a so-called data race. According to C11 and C++11, the behaviour of programs with data races is undefined.
So, you have to use some kind of synchronization mechanism, for example:
std::atomic
std::mutex
If you are writing and reading from the same location from multiple threads you will need to to perform locking or use atomics. We can see this by looking at the C11 draft standard(The C++11 standard looks almost identical, the equivalent section would be 1.10) says the following in section 5.1.2.4 Multi-threaded executions and data races:
Two expression evaluations conflict if one of them modifies a memory
location and the other one reads or modifies the same memory location.
and:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
and:
Compiler transformations that introduce assignments to a potentially
shared memory location that would not be modified by the abstract
machine are generally precluded by this standard, since such an
assignment might overwrite another assignment by a different thread in
cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
If you were just adding data to the array then in the C++ world a std::atomic index would be sufficient since you can add more elements and then atomically increment the index. But since you want to grow and shrink the array then you will need to use a mutex, in the C++ world std::lock_guard would be a typical choice.
To answer your question: maybe.
Simply put, the way that the question is framed doesn't provide enough information about whether or not a lock is required.
In most standard use cases, the answer would be yes. And most of the answers here are covering that case pretty well.
I'll cover the other case.
When would you not need a lock given the information you have provided?
There are some other questions here that would help better define whether you need a lock, whether you can use a lock-free synchronization method, or whether or not you can get away with no explicit synchronization.
Will writing data ever be non-atomic? Meaning, will writing data ever result in "torn data"? If your data is a single 32 bit value on an x86 system, and your data is aligned, then you would have a case where writing your data is already atomic. It's safe to assume that if your data is of any size larger than the size of a pointer (4 bytes on x86, 8 on x64), then your writes cannot be atomic without a lock.
Will the size of your array ever change in a way that requires reallocation? If your reader is walking through your data, will the data suddenly be "gone" (memory has been "delete"d)? Unless your reader takes this into account (unlikely), you'll need a lock if reallocation is possible.
When you write data to your array, is it ok if the reader "sees" old data?
If your data can be written atomically, your array won't suddenly not be there, and it's ok for the reader to see old data... then you won't need a lock. Even with those conditions being met, it would be appropriate to use the built in atomic functions for reading and storing. But, that's a case where you wouldn't need a lock :)
Probably safest to use a lock since you were unsure enough to ask this question. But, if you want to play around with the edge case of where you don't need a lock... there you go :)
One of the threads adds new elements to the array [...] and the other [reads] this array
In order to add and remove elements to/from an array, you will need an index that specifies the last place of the array where the valid data is stored. Such index is necessary, because arrays cannot be resized without potential reallocation (which is a different story altogether). You may also need a second index to mark the initial location from which the reading is allowed.
If you have an index or two like this, and assuming that you never re-allocate the array, it is not necessary to lock when you write to the array itself, as long as you lock the writes of valid indexes.
int lastValid = 0;
int shared[MAX];
...
int count = toAddCount;
// Add the new data
for (int i = lastValid ; count != 0 ; count--, i++) {
shared[i] = new_data(...);
}
// Lock a mutex before modifying lastValid
// You need to use the same mutex to protect the read of lastValid variable
lock_mutex(lastValid_mutex);
lastValid += toAddCount;
unlock_mutex(lastValid_mutex);
The reason this works is that when you perform writes to shared[] outside the locked region, the reader does not "look" past the lastValid index. Once the writing is complete, you lock the mutex, which normally causes a flush of the CPU cache, so the writes to shared[] would be complete before the reader is allowed to see the data.
Lock? No. But you do need some synchronization mechanism.
What you're describing sounds an awful like a "SPSC" (Single Producer Single Consumer) queue, of which there are tons of lockfree implementations out there including one in the Boost.Lockfree
The general way these work is that underneath the covers you have a circular buffer containing your objects and an index. The writer knows the last index it wrote to, and if it needs to write new data it (1) writes to the next slot, (2) updates the index by setting the index to the previous slot + 1, and then (3) signals the reader. The reader then reads until it hits an index that doesn't contain the index it expects and waits for the next signal. Deletes are implicit since new items in the buffer overwrite previous ones.
You need a way to atomically update the index, which is provided by atomic<> and has direct hardware support. You need a way for a writer to signal the reader. You also might need memory fences depending on the platform s.t. (1-3) occur in order. You don't need anything as heavy as a lock.
"Classical" POSIX would indeed need a lock for such a situation, but this is overkill. You just have to ensure that the reads and writes are atomic. C and C++ have that in the language since their 2011 versions of their standards. Compilers start to implement it, at least the latest versions of Clang and GCC have it.
It depends. One situation where it could be bad is if you are removing an item in one thread then reading the last item by its index in your read thread. That read thread would throw an OOB error.
As far as I know, this is exactly the usecase for a lock. Two threads which access one array concurrently must ensure that one thread is ready with its work.
Thread B might read unfinished data if thread A did not finish work.
If it's a fixed-size array, and you don't need to communicate anything extra like indices written/updated, then you can avoid mutual exclusion with the caveat that the reader may see:
no updates at all
If your memory ordering is relaxed enough that this happens, you need a store fence in the writer and a load fence in the consumer to fix it
partial writes
if the stored type is not atomic on your platform (int generally should be)
or your values are un-aligned, and especially if they may span cache lines
This is all dependent on your platform though - hardware, OS and compiler can all affect it. You haven't told us what they are.
The portable C++11 solution is to use an array of atomic<int>. You still need to decide what memory ordering constraints you require, and what that means for correctness and performance on your platform.
If you use e.g. vector for your array (so that it can dynamically grow), then reallocation may occur during the writes, you lose.
If you use data entries larger than is always written and read atomically (virtually any complex data type), you lose.
If the compiler / optimizer decides to keep certain things in registers (such as the counter holding the number of valid entries in the array) during some operations, you lose.
Or even if the compiler / optimizer decides to switch order of execution for your array element assignments and counter increments/decrements, you lose.
So you certianly do need some sort of synchronization. What is the best way to do so (for example it may be worth while to lock only parts of the array), depends on your specifics (how often and in what pattern do the threads access the array).

CRITICAL_SECTION for set and get single bool value

now writing complicated class and felt that I use to much CRITICAL_SECTION.
As far as I know there are atomic operations for some types, that are always executed without any hardware or software interrupt.
I want to check if I understand everything correctly.
To set or get atomic value we don't need CRITICAL_SECTION because doing that there won't be interrupts.
bool is atomic.
So there are my statements, want to ask, if they are correct, also if they are correct, what types variables may be also set or get without CRITICAL_SECTION?
P. S. I'm talking about getting or setting one single value per method, not two, not five, but one.
You don't need locks round atomic data, but internally they might lock. Note for example, C++11's std::atomic has a is_lock_free function.
bool may not be atomic. See here and here
Note: This answer applies to Windows and says nothing of other platforms.
There are no InterlockedRead or InterlockedWrite functions; simple reads and writes with the correct integer size (and alignment) are atomic on Windows ("Simple reads and writes to properly-aligned 32-bit variables are atomic operations.").
(and there are no cache problems since a properly-aligned variable is always on a single cache line).
However, reading and modifying such variables (or any other variable) are not atomic:
Read a bool? Fine. Test-And-Set a bool? Better use
InterlockedCompareExchange.
Overwrite an integer? great! Add to
it? Critical section.
Here this can be found:
Simple reads and writes to properly aligned 64-bit variables are
atomic on 64-bit Windows. Reads and writes to 64-bit values are not
guaranteed to be atomic on 32-bit Windows. Reads and writes to
variables of other sizes are not guaranteed to be atomic on any
platform.
Result should be correct but in programming it is better not to trust should. There still remains small possibility of failure because of CPU cache.
You cannot guarantee for all implementations/platforms/compilers that bool, or any other type, or most operations, are atomic. So, no, I don't believe your statements are correct. You can retool your logic or use other means of establishing atomicity, but you probably can't get away with just removing CRITICAL_SECTION usage if you rely on it.

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I at point where he desribes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the
more complex parts
It scares me a little bit, because I don't fully understand several things:
How dependency-ordered-before differs from synchronizes-with? They both create happens-before relationship. What is exact difference?
I am confused about following example:
int global_data[]={ … };
std::atomic<int> index;
void f()
{
int i=index.load(std::memory_order_consume);
do_something_with(global_data[std::kill_dependency(i)]);
}
What does kill_dependency exactly do? Which dependency it kills? Between which entities? And how compiler can exploit that knowladge?
Can all occurancies of memory_order_consume be safely replaced with memory_order_acquire? I.e. is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5]
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire and release by some examples with mans in cubicles. Are there some similar simple descriptions of seq_cst and consume?
As to the next to last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another.
the compiler might move code around, on the assumption that there is no other thread looking at the data that's involved.
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2, and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3. The code still handles 1 and 2. In that case, you'd use acquire and release to ensure proper memory ordering.
How dependency-ordered-before differs from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurancies of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before all non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments. Kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be synchronized. In your example, do_something_with is allowed to assume any cached value of globalldata[i] is safe to use, but i itself must actually be the correct atomic value.
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.

Is concurrently overwriting a variable with the same value safe?

I have the following situation (caused by a defect in the code):
There's a shared variable of primitive type (let it be int) that is initialized during program startup from strictly one thread to value N (let it be 0). Then (strictly after the variable is initialized) during the program runtime various threads are started and they in some random order either read that variable or overwrite it with the very same value N (0 in this example). There's no synchronization around accessing the variable.
Can this situation cause unexpected behavior in the program?
It's incredibly unlikely but not impossible according to the standard.
There's nothing stating what the underlying representation of an integer is, not does the standard specify how the values are loaded.
I can envisage, however weird, an implementation where the underlying bit pattern for 0 is 10101010 and the architecture only supports loading data into memory by bit-shifting it over eight cycles but reading it as a single unit in one cycle.
If another thread reads the value while the bit pattern is being shifted in (e.g., 00000001, 00000010,00000101 and so on), you will have a problem.
The chances of anyone designing such a bizarre architecture is so close to zero as to be negligible. But, unfortunately, it's not zero. All I'm trying to get across is that you shouldn't rely on assumptions at all when it comes to standards compliance.
And please, before you vote me down, feel free to quote the part of the standard that states this is not possible :-)
Since C++ does not currently have a standard concurrency model, it would depend entirely on your threading implementation and whatever guarantees it gives. It is all but certainly unsafe in the general case, however, because of the potential for torn reads. There might be specific cases where it would "work" or at least "appear to work."
In C++0x (which does have a standard concurrency model), your scenario would formally result in undefined behavior. There is a long, detailed, hard-to-read specification of the concurrency model in the C++0x Final Committee Draft §1.10, but it basically boils down to this:
Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location (§1.10/3).
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior (§1.10/14).
Your expression evaluations clearly conflict because they modify and read the same memory location, and since the object is not atomic and access is not synchronized using a lock, you have undefined behavior.
No. Of course, you could end up with a data race if one of the threads later tries to change the value. You will also end up with a little cache contention, but I doubt this will have noticable effect.
You can not really rely on it. For primitive types you should be fine, and if the operation is atomic (eg a correctly aligned int on most platforms) then writing and reading different values is safe (note by this I mean like "x = 5;", not "x += 5;" which is never atomic and is not thread safe).
For non-primitive types even if its the same value all bets are off since there may be a copy constructor that does something un-safe (like allocating memory).
Yes it is possible for unexpected behavior to happen in this scenario. Consider the case where the initial value of the variable was not 0. It is possible for one thread to start the set to 0 and another thread see the variable with only some of the bytes set.
For types int this is very unlikely as most processor's will have atomic assignment of word sized values. However once you hit 8 bit numeric values (long on some platforms) or large structs, this begins to be an issue.
If no other thread (and this includes the main thread) can change the value of the 0 to anything else (lets say 1) while those threads are initializing then it you will not have problems. But if any other thread had the potential to change the value during the start-up phase you could have a problem. You are playing a dangerous game and I would recommend locking before reading the value.