CUDA boolean variable for host loop - c++

A very simplified version of my code looks like:
do {
//reset loop variable b to 0/false
b = 0;
// execute kernel
kernel<<<...>>>(b);
// use the value of b for while condition
} while(b);
Boolean variable b can be set to true by any thread in kernel and it tells us whether we continue running our loop.
Using cudaMalloc, cudaMemset, and cudaMemcpy we can create/set/copy device memory to implement this. However, I just found out about pinned memory. Using cudaMallocHost to allocate b, with a call to cudaDeviceSynchronize right after the kernel, gave quite a speed-up (~50%) in a simple test program.
Is pinned memory the best option for this boolean variable b or is there a better option?
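For reference, a minimal sketch of the pinned-memory setup I'm describing (the kernel body and launch configuration are placeholders, and I'm assuming a 64-bit system with unified virtual addressing so the pinned host pointer can be used directly in device code):

// Placeholder kernel: any thread may set *again to request another iteration.
__global__ void kernel(volatile bool *again)
{
    bool found_more_work = false;   // stands in for the real per-thread condition
    // ... real work would go here ...
    if (found_more_work)
        *again = true;              // any thread may flip the flag
}

int main()
{
    bool *b = nullptr;
    cudaMallocHost(&b, sizeof(bool));   // pinned (page-locked), host-accessible flag
    do {
        *b = false;                     // reset the flag on the host
        kernel<<<128, 256>>>(b);        // illustrative launch configuration
        cudaDeviceSynchronize();        // make the kernel's write to *b visible before testing it
    } while (*b);
    cudaFreeHost(b);
    return 0;
}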

You haven't shown your initial code or the modified code, so nobody can really assess the details of the improvement you are reporting.
The answer to your question varies depending on:
Whether b is only written, or also read, inside the GPU kernel. Reads may need to fetch the actual value directly from the host side if b is not found in the cache, resulting in latency. On the other hand, the latency of writes can be hidden if there are further operations that can keep the threads busy.
How frequently you modify the value. If you access it frequently in your program, the GPU can probably keep the variable in L2, avoiding host-side accesses.
The number of memory operations between accesses to b. If there are many memory transactions between accesses to b, it is more likely that b is evicted from the cache and replaced with other content. As a result, when accessed again, b will not be found in the cache and a time-consuming host access is necessary.
In cases where keeping b on the host side causes many host memory transactions, it makes sense to keep it in GPU global memory and transfer it back at the end of each loop iteration. You can do that quite quickly with an asynchronous copy in the same stream as the kernel, followed by a host synchronization (see the sketch at the end of this answer).
All the above applies to cache-enabled devices. If your device is pre-Fermi (CC < 2.0), the story is different.
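A minimal sketch of that device-resident alternative, assuming the same placeholder kernel as in the question and an explicitly created stream (names and launch configuration are illustrative):

bool h_b = false;                // host copy of the flag
bool *d_b = nullptr;             // device-resident flag
cudaMalloc(&d_b, sizeof(bool));

cudaStream_t s;
cudaStreamCreate(&s);

do {
    cudaMemsetAsync(d_b, 0, sizeof(bool), s);           // reset the flag on the device
    kernel<<<128, 256, 0, s>>>(d_b);                     // kernel sets the flag in global memory
    cudaMemcpyAsync(&h_b, d_b, sizeof(bool),
                    cudaMemcpyDeviceToHost, s);          // queued after the kernel in the same stream
    cudaStreamSynchronize(s);                            // wait, then test the host copy
    // (for a truly asynchronous copy, h_b itself would need to be pinned)
} while (h_b);

cudaStreamDestroy(s);
cudaFree(d_b);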

Related

What exactly is the problem that memory barriers deal with?

I'm trying to wrap my head around the issue of memory barriers right now. I've been reading and watching videos about the subject, and I want to make sure I understand it correctly, as well as ask a question or two.
I'll start by making sure I understand the problem accurately. Let's take the following classic example as the basis for the discussion: suppose we have 2 threads running on 2 different cores.
This is pseudo-code!
We start with int f = 0; int x = 0; and then run those threads:
# Thread 1
while(f == 0);
print(x)
# Thread 2
x = 42;
f = 1;
Of course, the desired result of this program is that thread 1 will print 42.
NOTE: I leave "compile-time reordering" out of this discussion, I only want to focus on what happens in runtime, so ignore all kinds of optimizations that the compiler might do.
Ok so from what I understand, the problem here is what is called "memory reordering": the CPU is free to reorder memory operations as long as the end result is what the program expects it to be. In this case, within thread 2, f = 1 may be executed before x = 42. If that happens, thread 1 may print 0, which is not what the programmer wants.
At this point, Wikipedia points at another possible scenario that may occur:
Similarly, thread #1's load operations may be executed out-of-order and it is possible for x to be read before f is checked
Since we're talking right now about "out-of-order execution", let's ignore the cores' caches for a moment. So let's analyze what happens here. Start with thread 2 - the compiled instructions will look (in pseudo-assembly) something like:
1 put 42 into register1
2 write register1 to memory location of x
3 put 1 into register2
4 write register2 to memory location of f
Ok so I understand that 3-4 may be executed before 1-2. But I don't understand the equivalent in thread 1:
Let's say the instructions of thread 1 will be something like:
1 load f to register1
2 if f is 0 - jump to 1
3 load x to register2
4 print register2
What exactly may be out of order here? 3 can be before 1-2?
Let's go on: Up until now we talked about out-of-order execution, which brings me to my primary confusion:
In this great post the author describes the problem as follows: each core has its own cache, and the core performs memory operations against the cache, not against main memory. The movement of data from the core-specific caches to main memory (or a shared cache) happens at unpredictable times and in unpredictable order. So in our example, even if thread 2 executes its instructions in order - the write of x=42 occurring before f=1 - that happens only in core 2's cache. The movement of these values to shared memory may be in the opposite order, and hence the problem.
So I don't understand - when we talk about "memory reordering" - do we talk about Out-of-order execution, or are we talking about the movement of data across caches?
when we talk about "memory reordering" - do we talk about Out-of-order execution, or are we talking about the movement of data across caches?
When a thread observes changes of values in a particular order, then from the programmer's perspective it is indistinguishable whether that was due to out-of-order execution of loads, a store buffer delaying stores relative to loads and possibly letting them commit out of order (regardless of execution order), or (hypothetically in a CPU without coherent cache) cache synchronization.
Or even by forwarding store data between logical cores without going through cache, before it commits to cache and becomes visible to all cores. Some POWER CPUs can do this in real life but few if any others.
Real CPUs have coherent caches; once a value commits to cache, it's visible to all cores; it can't happen until other copies are already invalidated, so this is not the mechanism for reading "stale" data. Memory reordering on real-world CPUs is something that happens within a core, with reads and writes of coherent cache possibly happening in a different order than program order. Cache doesn't re-sync after getting out of sync; it maintains coherency in the first place.
The important effect, regardless of mechanism, is that another thread observing the same variables you're reading/writing, can see the effects happen in a different order than assembly program-order.
The two main questions you have both have the same answer (Yes!), but for different reasons.
First let's look at this particular piece of pseudo-machine-code
Let's say the instructions of thread 1 will be something like:
1 load f to register1
2 if f is 0 - jump to 1
3 load x to register2
4 print register2
What exactly may be out of order here? 3 can be before 1-2?
To answer your question: this is a resounding "YES!". Since the contents of register1 are not tied in any way to the contents of register2, the CPU may happily (and correctly, for that matter) preload register2, so that when the 1-2 loop finally breaks, it can go immediately to 4.
For a practical example, register1 might be an I/O peripheral register tied to a polled serial clock, and the CPU is just waiting for the clock to transition to low so that it can bit-bang the next value onto the data output lines. Doing it that way, for one, saves precious time on the data fetch and, more importantly, may avoid contention on the peripheral data bus.
So, yes, this kind of reordering is perfectly fine and allowed, even with optimizations turned off and happening on a single threaded, single core CPU. The only way to make sure that register2 is definitely read, after the loop breaks, is to insert a barrier.
The second question is about cache coherence. And again, the answer to the need for memory barriers is "yes! you need them". Cache coherence is an issue because modern CPUs don't talk to the system memory directly, but through their caches. As long as you're dealing with only a single CPU core and a single cache, coherence is not an issue, since all the threads running on the same core work against the same cache. However, the moment you have multiple cores with independent caches, their individual views of the system memory contents may differ, and some form of memory consistency model is required, either through explicit insertion of memory barriers or at the hardware level.
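As an illustration (a sketch of my own, not quoted from anywhere), the flag/data example from the question fixed with acquire/release ordering on the flag, which provides exactly the barrier discussed above:

#include <atomic>
#include <iostream>
#include <thread>

int x = 0;
std::atomic<int> f{0};

void producer()
{
    x = 42;                                   // plain write to the data
    f.store(1, std::memory_order_release);    // release: x=42 cannot be reordered after this store
}

void consumer()
{
    while (f.load(std::memory_order_acquire) == 0)   // acquire: later reads cannot move before this load
        ;                                             // spin until the flag is set
    std::cout << x << std::endl;                      // guaranteed to print 42
}

int main()
{
    std::thread t2(producer), t1(consumer);
    t1.join();
    t2.join();
}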
From my point of view, you missed the most important thing!
As the compiler does not see that the change of x or f has any side effect, it is also allowed to optimize all of that away. And the loop with condition f==0 will also turn into "nothing": since the compiler only sees that you assigned the constant 0 to f beforehand, it can assume that f==0 will always be true and optimize the loop away.
To prevent that, you have to tell the compiler that something will happen which is not visible from the given flow of code. That can be something like a call to some semaphore/mutex/... or other IPC functionality, or the use of atomic variables.
If you compile your code as written, I assume you get more or less "nothing", as neither code part has any visible effect and the compiler does not see that the variables are used from two thread contexts, so it optimizes everything away.
If we implement the code as in the following example, we see that it fails and prints 0 on my system.
#include <iostream>
#include <thread>

int main()
{
    int f = 0;    // plain int: no ordering or visibility guarantees
    int x = 0;
    std::thread s( [&f,&x](){ x = 42; f = 1; } );
    while( f == 0 )   // may spin forever, or observe f==1 before x==42
        ;
    std::cout << x << std::endl;
    s.join();
}
and if we change int f = 0; to std::atomic<int> f{0}; we get the expected result.

Why don't memory barriers only block instructions per specific memory address?

As I understand it, a memory barrier will "separate" loads/stores (depending on what type of barrier is used) regardless of the memory address associated with the "fenced" instruction. So if we had an atomic increment, surrounded by loads and stores:
LOAD A
STORE B
LOAD C
LOCK ADD D ; Assume full fence here
LOAD E
STORE F
the instructions operating on A, B and C would have to complete before D; and E and F may not start until after D.
However, as the LOCK is only applied to address D, why restrict the other instructions? Is it too complicated to implement in circuitry? Or is there another reason?
The basic reason is that the intent of a fence is to enforce ordering, so if the fence affected only reads/writes of the specific item to which it was applied, it wouldn't do its job.
For example, you fairly frequently have patterns like:
prepare some data
signal that the data is ready
and:
consume some data
signal that the memory used for the data is now free
In such cases, the memory location used as the "signal" is what you're probably going to protect with the fence--but it's not the only thing that really needs to be protected.
In the first case, I have to ensure that all the code that writes the data gets executed, and only after it is all done does the signal get set.
Another thread can then see that the signal is set. Based on that, it knows that it can read all the data associated with the signal, not just the signal itself. If the fence affected only the signal itself, it would mean that the other code that was writing the data might still execute after the signal--and then we'd get a collision between that code writing the data, and the other code trying to read the data.
In theory, we could get around that by using a fence around each individual piece of data being written. In reality, we almost certainly want to avoid that--a fence is fairly expensive, so we'd usually prefer to write a significant amount of data, then use a single fence to indicate that the entire "chunk" of memory is ready.
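As a sketch of that prepare/signal pattern (my own illustration with made-up names, using explicit fences so the ordering role is visible):

#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>

int buffer[1024];                    // the "chunk" of data being handed off
std::atomic<int> ready{0};           // the signal

void producer()
{
    for (std::size_t i = 0; i < 1024; ++i)
        buffer[i] = (int)i;                               // prepare some data (plain writes)
    std::atomic_thread_fence(std::memory_order_release);  // orders ALL the writes above...
    ready.store(1, std::memory_order_relaxed);            // ...before the signal becomes visible
}

void consumer()
{
    while (ready.load(std::memory_order_relaxed) == 0)
        ;                                                 // wait for the signal
    std::atomic_thread_fence(std::memory_order_acquire);  // orders ALL the reads below after the signal
    std::cout << buffer[0] + buffer[1023] << std::endl;   // safe: the whole chunk is visible
}

int main()
{
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}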

If we use memory fences to enforce consistency, how does "thread-thrashing" ever occur?

Before I knew of the CPU's store buffer, I thought thread-thrashing simply occurred when two threads wanted to write to the same cache line: one would prevent the other from writing. However, this seems pretty synchronous. I later learnt that there is a store buffer, which temporarily holds the writes and is forced to flush by the SFENCE instruction, kind of implying there is no synchronous prevention of multiple cores accessing the same cache line...
I am totally confused about how thread-thrashing occurs if we have to be careful and use SFENCEs. Thread-thrashing implies blocking, whereas SFENCE implies the writes are done asynchronously and the programmer must manually flush the writes??
(My understanding of SFENCE may be confused too, because I also read that the Intel memory model is "strong" and therefore memory fences are only required for x86 string instructions.)
Could somebody please remove my confusion?
"Thrashing" meaning multiple cores retrieving the same cpu cacheline and this causing latency overhead for other cores competing for the same cacheline.
So, at least in my vocabulary, thread-thrashing happens when you have something like this:
// global variables
int x;
bool done = false;   // assumed to be set elsewhere to stop the threads

// Thread 1
void thread1_code()
{
    while(!done)
        x++;
}

// Thread 2
void thread2_code()
{
    while(!done)
        x++;
}
(This code is of course total nonsense - I'm making it ridiculously simple and pointless so that we don't need complicated code to explain what is going on in the thread itself.)
For simplicity, we'll assume thread 1 always runs on processor 1, and thread 2 always runs on processor 2. [1]
If you run these two threads on an SMP system - and we've JUST started this code [both threads start, by magic, at almost exactly the same time, not like in a real system, many thousand clock-cycles apart], thread one will read the value of x, update it, and write it back. By now, thread 2 is also running, and it will also read the value of x, update it, and write it back. To do that, it needs to actually ask the other processor(s) "do you have (new value for) x in your cache, if so, can you please give me a copy". And of course, processor 1 will have a new value because it has just stored back the value of x. Now, that cache-line is "shared" (our two threads both have a copy of the value). Thread two updates the value and writes it back to memory. When it does so, another signal is sent from this processor saying "If anyone is holding a value of x, please get rid of it, because I've just updated the value".
Of course, it's entirely possible that BOTH threads read the same value of x, update to the same new value, and write it back as the same new modified value. And sooner or later one processor will write back a value that is lower than the value written by the other processor, because it's fallen behind a bit...
A fence operation will help ensure that the data written to memory has actually got all the way to cache before the next operation happens, because as you say, there are write-buffers to hold memory updates before they actually reach memory. If you don't have a fence instruction, your processors will probably get seriously out of phase, and update the value more than once before the other has had time to say "do you have a new value for x?" - however, it doesn't really help prevent processor 1 asking for the data from processor 2 and processor 2 immediately asking for it "back", thus ping-ponging the cache-content back and forth as quickly as the system can achieve.
To ensure that ONLY ONE processor updates some shared value, it is required that you use a so-called atomic instruction. These special instructions are designed to operate in conjunction with write buffers and caches, such that they ensure that ONLY one processor actually holds an up-to-date value for the cache-line that is being updated, and NO OTHER processor is able to update the value until this processor has completed the update. So you never get "read the same value of x and write back the same value of x" or any similar thing.
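As a concrete illustration (my own sketch, not part of the original answer), the increment from the nonsense example above done with an atomic read-modify-write, so each update is performed by exactly one owner of the cache line at a time:

#include <atomic>

std::atomic<int>  x{0};
std::atomic<bool> done{false};

void thread_code()
{
    while (!done.load(std::memory_order_relaxed))
        x.fetch_add(1, std::memory_order_relaxed);   // on x86 this compiles to a LOCKed add;
                                                     // no two cores can interleave the read-modify-write
}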
Since caches don't work on single bytes or single integer sized things, you can also have "false sharing". For example:
int x, y;            // adjacent globals: very likely in the same cache-line
bool done = false;

void thread1_code()
{
    while(!done) x++;
}

void thread2_code()
{
    while(!done) y++;
}
Now, x and y are not actually THE same variable, but they are (quite plausibly, though we can't know for 100% sure) located within the same cache-line of 16, 32, 64 or 128 bytes (depending on processor architecture). So although x and y are distinct, when one processor says "I've just updated x, please get rid of any copies", the other processor will get rid of its (still correct) value of y at the same time as getting rid of x. I had such an example where some code was doing:
struct {
int x[num_threads];
... lots more stuff in the same way
} global_var;
void thread_code()
{
...
global_var.x[my_thread_number]++;
...
}
Of course, two threads would then update values right next to each other, and the performance was RUBBISH - about 6x slower than when we fixed it by doing:
struct
{
int x;
... more stuff here ...
} global_var[num_threads];
void thread_code()
{
...
global_var[my_thread_number].x++;
...
}
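Another common fix (my addition, not from the original answer) is to align each per-thread element to a full cache line, so that neighbouring counters can never share one:

#include <cstddef>

constexpr std::size_t kCacheLine = 64;     // assumed cache-line size

struct alignas(kCacheLine) PerThread
{
    int x;                                 // the alignment pads each element to its own cache line
};

PerThread global_var[16];                  // one slot per thread; 16 is illustrative

void thread_code(int my_thread_number)
{
    global_var[my_thread_number].x++;      // no false sharing with the neighbouring slots
}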
Edit to clarify:
A fence does not (as my recent edit explains) "help" against ping-ponging the cache content between threads. It also doesn't, in and of itself, prevent data from being updated out of sync between the processors - it does, however, ensure that the processor performing the fence operation doesn't continue doing OTHER memory operations until this particular operation's memory content has got "out of" the processor core itself. Since there are various pipeline stages, and most modern CPUs have multiple execution units, one unit may well be "ahead" of another that is technically "behind" in the execution stream. A fence will ensure that "everything has been done here". It's a bit like the man with the big stop-board in Formula 1 racing, who ensures that the driver doesn't drive off from the tyre change until ALL new tyres are securely on the car (if everyone does what they should).
The MESI or MOESI protocol is a state-machine system that ensures that operations between different processors are done correctly. A processor can have a Modified value (in which case a signal is sent to all other processors to "stop using the old value"), a processor may "Own" the value (it is the holder of this data, and may modify it), a processor may have an "Exclusive" value (it is the ONLY holder of the value; everyone else has got rid of their copy), the value may be "Shared" (more than one processor has a copy, but this processor should not update the value - it is not the "owner" of the data), or Invalid (the data is not present in the cache). MESI doesn't have the "Owned" state, which means a little more traffic on the snoop bus ("snoop" meaning "Do you have a copy of x?", "please get rid of your copy of x", etc.).
[1] Yes, processor numbers usually start with zero, but I can't be bothered to go back and rename thread1 to thread0 and thread2 to thread1 by the time I wrote this additional paragraph.

is using cudaHostAlloc good for my case

I have a kernel that is launched several times, until a solution is found. The solution will be found by at least one block.
Therefore, when a block finds the solution, it should inform the CPU, so the CPU can print the solution provided by that block.
So what I am currently doing is the following:
__global__ void kernel(int *sol)
{
    // do some computations
    if (the block found a solution)
        atomicExch(sol, blockIdx.x);   // record the winning block id atomically
}
Now on every call to the kernel I copy sol back to host memory and check its value. If it's set to 3, for example, I know that block 3 found the solution, so I know where the index of the solution starts and can copy the solution back to the host.
In this case, would using cudaHostAlloc be a better option? Moreover, would copying the value of a single integer on every kernel call slow down my program?
Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that if you choose to send 1 byte or 1KB, that won't make much of a difference. In this case bandwidth is not a problem, but latency.
But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself I wouldn't spend too much time on that single, small transfer.
Do note, if you choose to use mapped memory instead of cudaMemcpy, you will need to explicitly put a cudaDeviceSynchronize (or cudaThreadSynchronize with older CUDA) barrier (as opposed to the implicit barrier of cudaMemcpy) before reading the status. Otherwise, your host code may go ahead and read an old value stored in your pinned memory, before the kernel overwrites it.
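A minimal sketch of that mapped-memory variant (the kernel is a stand-in for the one in the question, and the launch configuration and "found" condition are illustrative only):

#include <cstdio>

__global__ void kernel(int *sol)
{
    // ... do some computations ...
    if (threadIdx.x == 0 && blockIdx.x == 3)   // stand-in for the real "found a solution" test
        atomicExch(sol, blockIdx.x);
}

int main()
{
    int *h_sol = nullptr, *d_sol = nullptr;

    // Pinned, mapped host allocation plus its device-side alias.
    cudaHostAlloc(&h_sol, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_sol, h_sol, 0);

    *h_sol = -1;                          // -1 means "no solution yet"
    do {
        kernel<<<64, 256>>>(d_sol);       // illustrative launch configuration
        cudaDeviceSynchronize();          // explicit barrier before reading the mapped value
    } while (*h_sol == -1);

    printf("solution found by block %d\n", *h_sol);
    cudaFreeHost(h_sol);
    return 0;
}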

Concurrent writes in the same global memory location

I have several blocks, each having some integers in a shared memory array of size 512. How can I check if the array in every block contains a zero as an element?
What I am doing is creating an array that resides in global memory. The size of this array depends on the number of blocks, and it is initialized to 0. Every block then writes a[blockid] = 1 if its shared memory array contains a zero.
My problem is when I have several threads in a single block writing at the same time. That is, if the array in the shared memory contains more than one zero, then several threads will write a[blockid] = 1. Would this generate any problem?
In other words, would it be a problem if 2 threads write the exact same value to the exact same array element in global memory?
For a CUDA program, if multiple threads in a warp write to the same location then the location will be updated but it is undefined how many times the location is updated (i.e. how many actual writes occur in series) and it is undefined which thread will write last (i.e. which thread will win the race).
For devices of compute capability 2.x, if multiple threads in a warp write to the same address then only one thread will actually perform the write, which thread is undefined.
From the CUDA C Programming Guide section F.4.2:
If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
See also section 4.1 of the guide for more info.
In other words, if all threads writing to a given location write the same value, then it is safe.
In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.
If this is a problem, then a better approach (from a correctness point of view) would be to have only one thread from each block do the global write. You can either use a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be: the more zeros there are, the more attractive the reduction will be. CUDA includes warp-level __any() and __all() operators which can be built into a very efficient boolean reduction in a few lines of code (see the sketch below).
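As an illustration (my own sketch, not part of the answer above), a block-wide vote with __syncthreads_or() lets exactly one thread per block perform the global write; the kernel name and the assumption that blockDim.x == 512 are mine:

__global__ void check_for_zero(const int *in, int *a)   // a[] has one slot per block
{
    __shared__ int s[512];
    int tid = threadIdx.x;

    s[tid] = in[blockIdx.x * blockDim.x + tid];   // fill the shared array (512 threads per block)
    __syncthreads();

    // __syncthreads_or() returns non-zero in every thread if the predicate
    // is non-zero for any thread of the block (compute capability 2.0+).
    int block_has_zero = __syncthreads_or(s[tid] == 0);

    if (tid == 0)                                 // exactly one global write per block
        a[blockIdx.x] = block_has_zero ? 1 : 0;
}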
Yes, it will be a problem, known as a race condition.
You should consider synchronizing access to the global data, for example through semaphores.
While not a mutex or semaphore, CUDA does contain a synchronization primitive you can use to serialize access to a given code segment or memory location. Through the __syncthreads() function, you can create a barrier so that any given thread blocks at the point of the call until all the threads in the block have executed the __syncthreads() command. That way you can hopefully serialize access to your memory location and avoid a situation where two threads need to write to the same memory location at the same time. The only warning is that all the threads have to execute __syncthreads() at some point, or else you will end up with a deadlock situation. So don't place the call inside a conditional if-statement where some threads may never execute the command. If you do approach your problem like this, there will need to be some provision made for the threads that don't initially call __syncthreads() to call the function later in order to avoid deadlock.