When I write CUDA code, I use an atomic operation to force a global synchronization at the last step.
Now I also have to implement the same task in OpenCL, and I wonder whether there is an operation in OpenCL similar to CUDA's atomic operations that I can use. My device is an FPGA board.
barrier() may be something similar to what you are looking for, but it can only force a "join" on threads in the same workgroup.
See this post. You may be able to use CLK_GLOBAL_MEM_FENCE to get the results you are looking for.
Stack overflow: Barriers in OpenCL
There is no kernel-level global synchronization in OpenCL or CUDA, since entire workgroups may finish before others have even been started. Only workgroup-level synchronization is available inside a kernel. For global synchronization you must use multiple kernels.
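For illustration, here is a minimal host-side sketch of the multiple-kernel approach, assuming two already-built kernels (the names kernel1/kernel2 and run_two_pass are made up) and an in-order command queue; the boundary between the two enqueues is the global synchronization point. Error checking is omitted.

```cpp
#include <CL/cl.h>

// Hypothetical sketch: kernel1 writes partial results, kernel2 consumes them.
// On an in-order queue the second kernel cannot start until the first has
// finished, which gives you the kernel-wide synchronization point.
void run_two_pass(cl_command_queue queue,
                  cl_kernel kernel1, cl_kernel kernel2,
                  size_t global_size, size_t local_size)
{
    clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    // All work-groups of kernel1 complete before kernel2 begins.
    clEnqueueNDRangeKernel(queue, kernel2, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);   // wait on the host for everything to complete
}
```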
According to your comment, it seems like you want atomic operations on float values.
Please check out this link: atomic operation and floats in opencl
The idea is to use the built-in atom_cmpxchg operation to try to swap the old value of a floating-point variable with a new value, which could be its sum with another value, or the result of a multiplication, division, subtraction, etc.
The swap only succeeds if the value in memory is still the old value that was read (that's where the cmp comes into play); otherwise, it tries again in a while loop.
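Here is a rough sketch of that compare-and-swap loop, written in C++ with std::atomic for illustration; in an OpenCL kernel the same pattern would use atom_cmpxchg on the 32-bit integer reinterpretation of the float, as the linked post shows. The function name atomic_add_float is made up for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Illustrative sketch of the CAS loop described above.
void atomic_add_float(std::atomic<std::uint32_t>& bits, float val)
{
    std::uint32_t expected = bits.load();
    for (;;) {
        float old_val;
        std::memcpy(&old_val, &expected, sizeof old_val);   // bits -> float
        float new_val = old_val + val;                       // the desired update
        std::uint32_t desired;
        std::memcpy(&desired, &new_val, sizeof desired);     // float -> bits
        // Succeeds only if the stored value is still `expected`; on failure
        // `expected` is refreshed with the current value and we retry.
        if (bits.compare_exchange_weak(expected, desired))
            break;
    }
}
```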
Notice that this atomic operation could be quite slow if many threads are doing this operation on a single value.
Related
I have an application where multiple threads access and write to a shared memory location of one word (16 bits).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation? So I don't need mutex protection of the shared memory/variable?
The target is an embedded device running VxWorks.
EDIT: There's only one CPU and it is an old one (>7 years) - I am not exactly sure about the architecture and model, but I am more interested in the general way that "most" CPUs work. If it is a 16-bit CPU, would it, in most cases, be fair to expect that it will read/write a 16-bit variable in one operation? Or should I always use mutex protection, just in case? And let's say that I don't care about portability, and we are talking about C++98.
All processors will read and write aligned machine words atomically, in the sense that you won't get half the bits of the old value and half the bits of the new value if read by another processor.
To achieve good speed, modern processors will NOT synchronize read-modify-write operations to a particular location unless you actually ask for it - since nearly all reads and writes go to "non-shared" locations.
So, if the value is, say, a counter of how many times we've encountered a particular condition, or some other "if we read/write an old value, it'll go wrong" situation, then you need to ensure that two processors don't simultaneously update the value. This can typically be done with atomic instructions (or some other form of atomic updates) - this will ensure that one, and only one, processor touches the value at any given time, and that all the other processors DO NOT hold a copy of the value that they think is accurate and up to date when another has just made an update. See the C++11 std::atomic set of functions.
Note the distinction between atomically reading or writing the machine word's value and atomically performing the whole update.
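As a small illustration of that distinction (the counter name is arbitrary, not from the question): a plain store of an aligned machine word is already torn-free on such hardware, but the whole read-modify-write has to be requested explicitly.

```cpp
#include <atomic>

std::atomic<int> counter{0};   // illustrative shared counter

void hit() {
    // A plain `int` store would already be torn-free on most hardware,
    // but the full read-modify-write must be made atomic explicitly:
    counter.fetch_add(1, std::memory_order_relaxed);
    // With a plain int, `counter = counter + 1;` would let two CPUs read
    // the same old value and lose one of the increments.
}
```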
The problem is not the atomicity of the access (which you can usually assume unless you are using an 8-bit microcontroller), but the missing synchronization, which leads to undefined behavior.
If you want to write portable code, use atomics instead. If you want to achieve maximum performance for your specific platform, read the documentation of your OS and compiler very carefully and see what additional mechanisms or guarantees they provide for multithreaded programs (but I really doubt you will find anything more efficient than std::atomic that gives you sufficient guarantees).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes.
So I don't need mutex protection of the shared memory/variable?
No. Consider:
++i;
Even if the read and write are atomic, two threads doing this at the same time can each read, each increment, and then each write, resulting in only one increment where two are needed.
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes, if the data's properly aligned and no bigger than a machine word, most CPU instructions will operate on it atomically in the sense you describe.
So I don't need mutex protection of the shared memory/variable?
You do need some synchronisation - whether a mutex or atomic operations a la std::atomic.
The reasons for this include:
if your variable is not volatile, the compiler might not even emit read and write instructions for the memory address nominally holding that variable at the places you might expect, instead reusing values read or set earlier that are saved in CPU registers or known at compile time
if you use a mutex or std::atomic type you do not need to use volatile as well
further, even if the data is written towards memory, it may not leave the CPU caches and be written to actual RAM where other cores and CPUs can see it unless you explicitly use a memory barrier (std::mutex and std::atomic types do that for you)
finally, delays between reading and writing values can cause unexpected results, so operations like ++x can fail as explained by David Schwartz.
When writing lock-free code using the Compare-and-Swap (CAS) technique there is a problem called the ABA problem:
http://en.wikipedia.org/wiki/ABA_problem
whereby comparing just on the value "A" is problematic because a write could still have occurred between the two observations. I read on and found this solution:
http://en.wikipedia.org/wiki/LL/SC
In computer science, load-link and store-conditional (LL/SC) are a pair of instructions used in multithreading to achieve synchronization. Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. Together, this implements a lock-free atomic read-modify-write operation.
How would a typical C++ lock-free CAS technique be modified to use the above solution? Would somebody be able to show a small example?
I don't mind whether it's C++11/x86 or x86-64 Linux-specific (preferably no Win32 answers).
LL/SC are instructions implemented by some architectures (e.g. ARM, PowerPC, MIPS) to form the foundation of higher-level atomic operations. On x86 you have the LOCK prefix instead to accomplish a similar goal.
To avoid the ABA problem on x86 with LOCK you have to provide your own protection against intervening stores. One way to do this is to store a generation number (just an increasing integer) adjacent to the memory in question. Each updater does an atomic compare/exchange wide enough to encompass both the data and the serial number. The update only succeeds if it finds the right data and the right number. At the same time, it updates the number so that other threads see the change.
You'll note that x86 has always (?) offered a CMPXCHG instruction that is twice as wide as a machine word (see CMPXCHG8B and later CMPXCHG16B), which can be used for this purpose.
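For illustration, a minimal C++11-style sketch of the generation-counter idea using a 16-byte compare-and-swap. The names (TaggedPtr, head, push/pop) are made up, and whether std::atomic on this struct is actually lock-free (i.e. compiles to CMPXCHG16B rather than falling back to a library lock) depends on your compiler and flags (e.g. -mcx16 on GCC/Clang, possibly linking libatomic).

```cpp
#include <atomic>
#include <cstdint>

// The pointer and a version number are swapped together in one wide CAS,
// so a thread that observed an old (pointer, tag) pair cannot succeed even
// if the pointer value itself has since been recycled (the ABA case).
struct Node { Node* next; int value; };

struct TaggedPtr {
    Node*         ptr;
    std::uint64_t tag;   // generation number, incremented on every update
};

std::atomic<TaggedPtr> head{TaggedPtr{nullptr, 0}};

void push(Node* n) {
    TaggedPtr old_head = head.load();
    TaggedPtr new_head;
    do {
        n->next      = old_head.ptr;
        new_head.ptr = n;
        new_head.tag = old_head.tag + 1;          // bump the generation
    } while (!head.compare_exchange_weak(old_head, new_head));
}

Node* pop() {
    TaggedPtr old_head = head.load();
    TaggedPtr new_head;
    do {
        if (old_head.ptr == nullptr) return nullptr;
        new_head.ptr = old_head.ptr->next;        // reclamation of popped
        new_head.tag = old_head.tag + 1;          // nodes is a separate
    } while (!head.compare_exchange_weak(old_head, new_head));  // problem,
    return old_head.ptr;                          // omitted here
}
```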
If I have several threads trying to write the same value to a single location in memory, is it possible to have a race condition? Can the data somehow get corrupted during the writes? There are no preceding reads or test conditions, only the write...
EDIT: To clarify, I'm computing a dot product on a GPU. I'm using several threads to calculate the individual products (one thread per row/column element) and saving them to a temporary location in memory. I need to then sum those intermediate products and save the result.
I was thinking about having all threads individually perform this sum/store operation since branching on a GPU can hurt performance. (You would think it should take the same amount of time for the sum/store whether it's done by a single thread or all threads, but I've tested this and there is a small performance hit.) All threads will get the same sum, but I'm concerned about a race condition when they each try to write their answer to the same location in memory. In the limited testing I've done, everything seems fine, but I'm still nervous...
Under most threading standards on most platforms, this is simply prohibited or undefined. That is, you are not allowed to do it and if you do, anything can happen.
High-level language compilers like those for C and C++ are free to optimize code based on the assumption that you will not do anything you are not allowed to do. So a "write-only" operation may turn out to be no such thing. If you write i = 1; in C or C++, the compiler is free to generate the same code as if you wrote i = 0; i++;. Similarly confounding optimizations really do occur in the real world.
Instead, follow the rules for whatever threading model you are using to use appropriate synchronization primitives. If your platform provides them, use appropriate atomic operations.
There is no problem having multiple threads writing a single (presumably shared or global) memory location in CUDA, even "simultaneously" i.e. from the same line of code.
If you care about the order of the writes, then this is a problem, as CUDA makes no guarantee of ordering for multiple threads executing the same write operation to the same memory location. If this is an issue, you should use atomics or some other method of refactoring your code to sort it out. (It doesn't sound like this is an issue for you.)
Presumably, as another responder has stated, you care about the result at some point. Therefore it's necessary to have a barrier of some sort, either explicit (e.g. __syncthreads(), for multiple threads within a block using shared memory) or implicit (e.g. the end of a kernel, for multiple threads writing to a location in global memory), before you read that location and expect a sensible result. Note these are not the only possible barrier methods that could give you sane results, just two examples. Warp-synchronous behavior or other clever coding techniques could be leveraged to ensure the sanity of a read following a collection of writes.
Although on the surface the answer would seem to be "no, there are no race conditions", the answer is a bit more nuanced. Boris is right that on some 32-bit architectures, storing a 64-bit long or address may take two operations and the value may therefore be read in an invalid state. This is probably pretty difficult to reproduce, since memory pages are what typically get updated and a long would never span a memory page.
However, the more important issue is that you need to realize that without memory synchronization there are no guarantees around when a thread would see the updated value. A thread could run for a long period of time reading an out-of-date value from memory. It wouldn't be an invalid value but it would not be the most recent one written. That may not specifically cause a "race-condition" but it might cause your program to perform in an unexpected manner.
Also, although you say it is "write-only", obviously someone is reading the value otherwise there would be no reason to perform the update. The details of what portion of the code is reading the value will better inform us as to whether the write-only without synchronization is truly safe.
If write-only operations are not atomic, there will obviously be a moment when another thread may observe the data in a corrupted state.
For example, consider writing to a 64-bit integer that is stored as a pair of 32-bit integers.
Thread A has just finished writing the high-order word, and thread B has just finished writing the low-order word and is about to set the high-order word;
thread C may then see an integer consisting of the low-order word written by thread B and the high-order word written by thread A.
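A small sketch of that tearing, assuming a 64-bit value kept as two 32-bit halves (the names Split64/write_split/read_split are illustrative); std::atomic<std::uint64_t> or a mutex would avoid the problem.

```cpp
#include <cstdint>

// The 64-bit value is stored as two 32-bit halves written with two
// separate instructions; a reader running between the two stores can
// observe a mix of old and new halves.
struct Split64 {
    std::uint32_t lo;
    std::uint32_t hi;
};

void write_split(Split64& v, std::uint64_t value) {
    v.lo = static_cast<std::uint32_t>(value);        // store 1
    // <-- a thread reading here sees the new lo with the old hi
    v.hi = static_cast<std::uint32_t>(value >> 32);  // store 2
}

std::uint64_t read_split(const Split64& v) {
    // Without synchronization this can return a torn value.
    return (static_cast<std::uint64_t>(v.hi) << 32) | v.lo;
}
```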
P.S. This question is very generic; the actual results will depend on the memory model of the environment (language) and the underlying processor architecture (hardware).
I'm creating a simple server which stores several variables globally. On occasion these variables will update, and during this time the variables are locked from other threads. Each client who accesses the server is granted their own thread, but has no way to change these variables; they are essentially read-only. My question to the internet is: do I need to worry about either a) two threads reading the same variable at the same time (not changing it), or b) the write to the variable interrupting the read?
I know that in most cases writing a double is not an atomic operation because it is usually more than one register, but can the reading operation be interrupted?
Thanks
My first guess is that this has nothing to do with Linux as an OS.
It is certainly related to the CPU in use, as some can load/store a double from/to memory in one operation. The x86 series has such an opcode for the FPU.
It may also depend on whether the compiler makes use of these CPU abilities to load/store doubles in one operation. I don't know what gcc does.
[Edit] Apologies, I was out of it when I read this question originally and gave an incorrect answer which jalf kindly pointed out.
I know that in most cases writing a double is not an atomic operation because it is usually more than one register, but can the reading operation be interrupted?
Yes. Imagine implementing your own IEEE double-precision floating-point type using two WORD-sized variables. We cannot read both of these atomically, as they are two distinct parts; one could be in the process of being modified at the same time we are trying to read.
do I need to worry about either a) two threads reading the same variable at the same time (not changing it), or b) the write to the variable interrupting the read?
a: no
b: yes
You'll either need to use a synchronization mechanism for the readers (in addition to the writer) or, if you're like me, just use a WORD-sized single-precision float for the shared data (reads of which are often atomic on modern systems, though you should verify this), stick to atomic operations to modify it, and avoid the headaches.
If you are only reading, you should not have to worry about atomicity. If you are reading and writing, you will need to use a mutex lock on both your "server" and "client" threads. Locking only when writing gets only half of the job done.
Now, with a single double you may be in some luck depending on how your compiler works and on your exact hardware architecture. See this answer.
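A minimal sketch of the mutex approach, assuming one shared double updated by the server and read by client threads; the names (g_value, g_mutex, update_value, read_value) are illustrative. Both the writer and every reader take the same mutex, so a reader can never observe a half-written double.

```cpp
#include <mutex>

double     g_value = 0.0;   // the shared, "read-only for clients" variable
std::mutex g_mutex;

void update_value(double v) {          // called from the server thread
    std::lock_guard<std::mutex> lock(g_mutex);
    g_value = v;
}

double read_value() {                  // called from the client threads
    std::lock_guard<std::mutex> lock(g_mutex);
    return g_value;
}
```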
If I am accessing a single integer type (e.g. long, int, bool, etc...) from multiple threads, do I need to use a synchronisation mechanism such as a mutex to lock it? My understanding is that, as atomic types, I don't need to lock access to them, but I see a lot of code out there that does use locking. Profiling such code shows that there is a significant performance hit for using locks, so I'd rather not. So if the item I'm accessing corresponds to a bus-width integer (e.g. 4 bytes on a 32-bit processor), do I need to lock access to it when it is being used across multiple threads? Put another way, if thread A is writing to integer variable X at the same time as thread B is reading from the same variable, is it possible that thread B could end up with a few bytes of the previous value mixed in with a few bytes of the value being written? Is this architecture dependent, e.g. OK for 4-byte integers on 32-bit systems but unsafe for 8-byte integers on 64-bit systems?
Edit: Just saw this related post which helps a fair bit.
You are never locking a value - you are locking an operation ON a value.
C & C++ do not explicitly mention threads or atomic operations - so operations that look like they could or should be atomic are not guaranteed by the language specification to be atomic.
It would admittedly be a pretty deviant compiler that managed a non-atomic read on an int: if you have an operation that only reads a value, there's probably no need to guard it. However, it might be non-atomic if it spans a machine-word boundary.
Operations as simple as m_counter++ involve a fetch, increment, and store - a race condition: another thread can change the value after the fetch but before the store - and hence the operation needs to be protected by a mutex, OR you can use your compiler's support for interlocked operations. MSVC has functions like _InterlockedIncrement() that will safely increment a memory location as long as all other writes similarly use interlocked APIs to update it - which is orders of magnitude more lightweight than entering even a critical section.
GCC has intrinsic functions like __sync_add_and_fetch which can also be used to perform interlocked operations on machine-word values.
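A tiny sketch of the interlocked-increment idea, assuming GCC/Clang (the __sync builtins); m_counter is an illustrative name. On MSVC the equivalent would be _InterlockedIncrement on a volatile LONG, and C++11 std::atomic<long> covers both portably.

```cpp
long m_counter = 0;   // shared counter

void increment_counter() {
    // Atomically performs m_counter += 1 and returns the new value,
    // so no fetch/increment/store race is possible.
    __sync_add_and_fetch(&m_counter, 1);
}
```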
If you're on a machine with more than one core, you need to do things properly even though writes of an integer are atomic. The issues are two-fold:
You need to stop the compiler from optimizing out the actual write! (Somewhat important this. ;-))
You need memory barriers (not things modeled in C) to make sure the other cores take notice of the fact that you've changed things. Otherwise you'll be tangled up in caches between all the processors and other dirty details like that.
If it were just the first thing, you'd be OK with marking the variable volatile, but the second is really the killer and you will only really see the difference on a multicore machine. Which happens to be an architecture that is becoming far more common than it used to be… Oops! Time to stop being sloppy; use the correct mutex (or synchronization or whatever) code for your platform, and all the details of making memory behave the way you believe it does will go away.
In 99.99% of cases, you must lock, even if it's access to seemingly atomic variables. Since the C++ compiler is not aware of multithreading at the language level, it can do a lot of non-trivial reorderings.
Case in point: I was bitten by a spin lock implementation where unlock simply assigned zero to a volatile integer variable. The compiler was reordering the unlock operation before the actual operation under the lock, unsurprisingly leading to mysterious crashes.
See:
Lock-Free Code: A False Sense of Security
Threads Cannot be Implemented as a Library
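For contrast, here is a minimal C++11 sketch of a spinlock written with std::atomic_flag rather than a plain volatile int; the class name is illustrative, not code from the original post. The acquire/release semantics are what stop the compiler (and CPU) from moving the protected operations outside the critical section.

```cpp
#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin until unlock() clears the flag
        }
    }
    void unlock() {
        // A release store, not a plain "flag = 0" on a volatile int, so
        // writes made inside the critical section become visible to the
        // next locker before the lock appears free.
        flag.clear(std::memory_order_release);
    }
};
```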
There's no support for atomic variables in C++ (prior to C++11), so you do need locking. Without locking you can only speculate about what exact instructions will be used for data manipulation and whether those instructions will guarantee atomic access - and that's not how you develop reliable software.
Yes it would be better to use synchronization. Any data accessed by multiple threads must be synchronized.
If it is a Windows platform you can also check here: Interlocked Variable Access.
Multithreading is hard and complex. The number of hard-to-diagnose problems that can come up is quite big. In particular, on Intel architectures reads and writes of aligned 32-bit integers are guaranteed to be atomic by the processor, but that does not mean it is safe to rely on this in multithreaded environments.
Without proper guards, the compiler and/or the processor can reorder the instructions in your block of code. It can cache variables in registers and they will not be visible in other threads...
Locking is expensive, and there are various implementations of lock-free data structures to optimize for high performance, but they are hard to get right. And the problem is that concurrency bugs are usually obscure and difficult to debug.
Yes. If you are on Windows you can take a look at the Interlocked functions/variables, and if you are of the Boost persuasion you can look at their implementation of atomic variables.
If Boost is too heavyweight, putting "atomic c++" into your favourite search engine will give you plenty of food for thought.