How are registers allocated to threads inside a GPU?

How is the number of registers per thread decided inside the GPU? I want to understand this: if the GPU has 65536 registers per SM that it can allocate among the threads, do these registers all get allocated to the active thread block running on the SM? Right now, I have a CUDA program with 1024 threads per thread block and 65536 available registers per block. My confusion is that the profiler says each thread only gets 40 registers, while each thread actually makes use of exactly 64 registers in its assembly code, which suggests the performance could have been better if each thread had been assigned that number of registers. Why doesn't it get 64? Who makes this decision? Is it decided at compile time per compute capability, at runtime, or somewhere else?
Edit:
Here is the sample code and its assembly. I'm looking at %f64 near the end of the code as the basis for the observation above.
https://godbolt.org/z/eMzW8dY19

How is the number of registers per thread decided inside the GPU?
Actual (non-PTX-virtual) register assignments are determined at the point of running the ptxas tool on your code (part of the nvcc compiler driver toolchain), or the equivalent tool as part of the driver API loader or the NVRTC mechanism.
ptxas is the tool that converts PTX to SASS (machine code). SASS is the thing that actually runs on a GPU, PTX is not. PTX must first be converted to SASS.
PTX and the virtual register system in PTX are not useful for understanding these concepts. There is essentially no limit to the number of virtual registers that can be defined in PTX, and the number of virtual registers defined in PTX tells you nothing at all about how actual registers will be used in the GPU hardware, so PTX is not useful for this sort of study.
The register assignments are entirely statically determined at this point. You can get some evidence of this by passing the -Xptxas=-v compile switch to nvcc when your nvcc compile command specifies a valid SASS target. There is no runtime variability (ignoring the "variability" that would come about via the CUDA JIT PTX->SASS conversion mechanism; the item in focus here is SASS, not PTX. Once the SASS is defined, there is no runtime variability).
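For concreteness, here is a hedged sketch of where that static decision is visible and how it can be influenced (the kernel name and the sm_70 target are placeholders, not taken from the question):

__global__ void __launch_bounds__(1024)   // promise ptxas at most 1024 threads per block,
mykernel(const float *in, float *out)     // which caps it at 65536/1024 = 64 registers per thread
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;                // placeholder work
}

// Compile with a real SASS target and ask ptxas to report its choice:
//   nvcc -arch=sm_70 -Xptxas=-v kernel.cu
// The "Used NN registers" line is ptxas's static decision; -maxrregcount=N or
// __launch_bounds__ can push it down, but nothing changes it at runtime.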
do these registers all get allocated to the active thread block running on the SM?
The number of registers allocated will be determined by the registers per thread, some granularity/rounding effects, and the number of threads per threadblock (i.e. the product of these). This quantity of registers will be "carved out" of the total available in the SM, at the point at which a threadblock is "deposited" on that SM, by the CUDA Work Distributor (CWD or CUDA block scheduler). The CWD will not deposit a block until a sufficient number of registers are available to be allocated.
The entire complement of registers (e.g. 65536, or whatever the SM capacity is) is not automatically or always allocated for a single threadblock. It will depend on the actual needs of that threadblock. Remaining/unallocated registers can be used in the future if the CWD decides to deposit another threadblock on that SM. CUDA SMs have the ability to support multiple threadblocks simultaneously, with registers allocated for each. Unless unallocated registers are available in sufficient quantity to meet the needs of a prospective threadblock, the CWD will not deposit a new threadblock on that SM.
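As a hedged illustration of that arithmetic (the kernel name is a placeholder; the occupancy call is the standard CUDA runtime API), you can query the per-thread register count ptxas chose and ask how many blocks of a given size fit on one SM:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;     // placeholder work
}

int main()
{
    const int threadsPerBlock = 1024;

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, mykernel);                 // attr.numRegs = registers/thread chosen by ptxas
    printf("regs/thread = %d, regs/block >= %d\n",
           attr.numRegs, attr.numRegs * threadsPerBlock);   // before granularity/rounding

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mykernel, threadsPerBlock, 0);
    printf("resident blocks per SM = %d\n", blocksPerSM);   // all limiters (registers, threads, ...) considered
    return 0;
}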
My confusion is, the profiler says each thread only gets 40 registers. Another observation is that each thread actually makes use of exactly 64 registers in its assembly code,
The profiler-reported number is correct (and it includes the granularity/rounding effects, which may or may not be included in the -Xptxas=-v output). Your confusion is that you are attempting to understand what is happening via the PTX. Do not do that. It is irrelevant for this discussion.

Related

Can more than one Load/Store instruction be executed at the same instant of time in a multiprocessor environment?

I believe that in single-processor systems more than one store will happen one after the other, but what is the case for multiprocessor systems?
Adding to the question: if the machine is 32-bit and we try to write a long int (64-bit) value to memory, how will the load/store instructions behave?
The reason for the above two questions is: if someone tries to read the same memory (memory of size 32 bit/64 bit, on a 32-bit system) from another thread, will this be safe, or do I need to consider using locks?
Added:
I want to use as few locks as possible, since ours is a time-critical execution. Hence I want to understand whether there is ever a possibility of two store/load instructions to the same memory location executing at the same instant of time in a multiprocessor environment.
You are looking at this too narrowly if you only consider the load/store CPU instructions.
The compiler, your OS, and the CPU can:
change execution order to optimize the code
hold values in separate caches
keep data in CPU registers without accessing the cache or other memory
optimize the access away completely
... and a lot more, I believe!
If you want to access the same variable from different threads, you must use a synchronization mechanism provided by your language or by a library that fits your OS. Nothing else gives you any guarantee that it will work.
The problem is not the physical access to any kind of memory. You must ensure that your code contains the memory barriers needed by the underlying libraries and OS. If there are no barriers between multi-threaded accesses, you may never see a change written in one thread when reading it from a second one.
This is also a problem on a single-core CPU, because the compiler has no idea that you are modifying a variable from two threads if you don't use any kind of synchronization.
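To make the advice concrete, here is a minimal sketch (plain C++11, names made up for illustration) of sharing a 64-bit value between threads by making it atomic; the atomic type supplies both the indivisible access and the ordering discussed above:

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<uint64_t> shared{0};   // may fall back to an internal lock on 32-bit targets; check is_lock_free()

int main()
{
    std::thread writer([] {
        shared.store(0x1122334455667788ULL, std::memory_order_release);
    });
    std::thread reader([] {
        uint64_t v = shared.load(std::memory_order_acquire);   // sees 0 or the whole value, never half of it
        printf("%llx\n", (unsigned long long)v);
    });
    writer.join();
    reader.join();
    return 0;
}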
On your addition:
You simply have no control over any kind of memory access without writing your code in assembler. And if you write it in assembler, you have to deal with registers, L1/L2/Lx caching, memory mapping, inter-CPU communication, and so on. Forget about the load/store instructions alone; they are only 1% of the job!
If you have time-critical jobs:
try to pin the thread to a fixed core (see the detailed description for threading libraries such as POSIX pthreads, or whatever library you are running on)
it can be much faster to run a single process with a single thread and program it in a cooperative fashion: no locks, no memory barriers, no IPC. But you have to deal with all the thread-like problems yourself. It is fast, though!
often it is much faster to split your problem into several processes, each with only one thread, and keep the IPC minimal. This needs a deep understanding of how your algorithms scale.
often a very simple 8/16-bit CPU runs much faster in specialized environments than a fat 8-core CPU with a fat OS on it.
But you haven't told us the rest of your environment and requirements, so no answer here can fully address your real problem. Keep in mind: load/store is only a small part of the picture.
This cannot be answered generically. You have to know which model of which processor design it is. An AMD Opteron will be different from an Intel Pentium, which is different from an Intel Core2, and all of those are different from an ARMv7 design. [They are probably fairly similar, but there are details you may care about if you REALLY want to rely on these operations being performed in a specific way]. And of course, if you share memory between, say, a GPU (graphics processing unit) and a CPU, you have even more possible scenarios of "different design".
There are single-core "superscalar" (more than one execution unit) and "out-of-order execution" (instruction-reordering) processors, so there can be more than one execution unit (including more than one load/store unit), and thus more than one instruction (including a load or store) can be performed at the same time.
Obviously, once the processor determines that the memory operation needs to go "outside" (that is, the value is not available in the cache), it has to be serialized, but there is no guarantee that a load or store as sequenced by you or the compiler won't be re-ordered between loads and stores. If the processor has instructions to support "data wider than the bus" (e.g. 32-bit processor loading 64-bit word), these are typically atomic to that processor. If the processor does not in itself support 64-bit words, then the load of a 64-bit value would encompass two 32-bit loads.
[When I write "load", the same applies for "store"]
In case of multiprocessor or multicore architectures, it becomes a system architecture question, which makes it even more complicated than "we can't answer this without understanding the processor design", since there are now more components involved: memory design (one lump of memory shared between processors, several lumps of memory that are not directly shared, etc).
In general, if you have multiple threads, you will need to use atomic operations - most processors have a way to say "I want this to happen without someone else interfering" - in the old days, it would be a "lock" pin on the processor(s) that would be wired to anything else that could access the memory bus, and when that pin was active, all other devices had to wait for it to become inactive before accessing the memory bus. These days, it's a fair bit more sophisticated, since there are caches involved. Most systems use an "exclusive cache content" method: the processor signals all its peers that "I want this address to be exclusive in my cache", at which point all other processors will "flush and invalidate" that particular address in their caches. Then the atomic operation is performed in the cache, and the results are available to be read by other processors only when the atomic operation is completed. This is a pretty simplified view of how it works - modern processors are very complex, and there is a lot of work involved with such seemingly simple things as "make sure this value gets updated in a way that doesn't get interrupted by some other processor writing to the same thing".
If there isn't support in the processor for "atomic" operations, then there have to be proper locks (and any processor designed for use in a multicore/multi-CPU environment will have operations to support locks in some way), where the lock is taken before updating something, and then released after the update. This is clearly more complex than having built-in atomic operations, but it makes the design of the processor simpler. Also, for more complex updates (where more than one 32- or 64-bit value needs updating) this sort of locking is still required - for example, if we have a "queue" where you have a "where we're writing" and "elements in queue" that both need to be updated on a write, you can't do that in a single operation [without being VERY clever about it, at least].
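A small illustration of that last point (names invented for the example): two fields that must change together cannot be updated with plain stores, so a lock has to cover both updates.

#include <cstddef>
#include <mutex>

struct Queue {
    int         data[256];
    std::size_t writeIndex = 0;   // "where we're writing"
    std::size_t count      = 0;   // "elements in queue"
    std::mutex  m;

    void push(int value)
    {
        std::lock_guard<std::mutex> lock(m);   // both fields appear to change atomically to other threads
        data[writeIndex % 256] = value;
        ++writeIndex;
        ++count;
    }
};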
In heterogeneous systems, such as GPU + CPU combinations, you can't do atomics between different devices, because the cache of one device doesn't "understand the language" of the other device. So when the CPU says "I want this as exclusive", the GPU sees "Hurdi gurdi meatballs" and thinks "I have no idea what that is about, I'll just ignore it" [or something like that]. In this case, there has to be some other way to access shared data, but it's typically never atomic; you have to send commands (via means other than the interprocessor signalling system) to the GPU to say "flush your cache, and tell me when you're done with that", and when the CPU has written something the GPU needs, the CPU will flush its cache before telling the GPU that it can use the data. This can get pretty messy, and takes a fair amount of time.
I believe that in single-processor systems more than one store will happen one after the other,
False. Most machines are set up like that, but for performance reasons many CPUs can be configured to have a much more relaxed store ordering. This is almost never a problem for an application on a single CPU (because the CPU will make it look like you expect) but it's really critical to understand when talking to hardware.
Here's a wikipedia article: http://en.wikipedia.org/wiki/Memory_ordering
This gets doubly complicated on CPUs with non-coherent local caches. Because then you can have strong ordering as seen from this CPU while other CPUs will see totally different results depending on the cache flush order.
Adding to the question: if the machine is 32-bit and we try to write a long int (64-bit) value to memory, how will the load/store instructions behave?
Some 32 bit CPUs have instructions to do atomic 64 bit writes, others don't. Those that don't will do two separate writes where a partial result can be seen by other CPUs or threads (if you get unlucky with context switching) or signal handlers or interrupt handlers.
The reason for the above two questions is: if someone tries to read the same memory (memory of size 32 bit/64 bit, on a 32-bit system) from another thread, will this be safe, or do I need to consider using locks?
Yes, no, maybe. If it's just one value and it doesn't tell the other thread that some other memory might be in a certain state, then yes, it can be safe in certain circumstances. You're not guaranteed that the other thread will see the changed value in memory for a long time, but eventually it should see it.
Generally, you can't reason about the behavior of access to shared memory in a threaded environment without strictly following the documentation of the threading model you're using. And most of those say something like: without locks the behavior is undefined; with locks, everything that happened before the lock is guaranteed to happen before the lock, and everything that happens after the lock is guaranteed to happen after it. This is not only because of differences between CPUs, but also because the operating system can do something funny, and the locking code needs to be designed to convince the compiler not to do something funny either (which is surprisingly hard with modern compilers).

Cuda Stream Processing for multiple kernels Disambiguation

Hi, a few questions regarding CUDA stream processing for multiple kernels.
Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32.
The kernel uses a dev_input array of size n and a dev_output array of size s*n.
The kernel reads data from the input array, stores its value in a register, manipulates it, and writes its result back to dev_output at position s*n + tid.
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on whether and how any of the following affects concurrency, please?
1. dev_input and dev_output are not pinned;
2. dev_input as it is vs. dev_input of size s*n, where each kernel reads unique data (no read conflicts);
3. kernels read data from constant memory;
4. 10 KB of shared memory are allocated per block;
5. the kernel uses 60 registers.
Any good comments will be appreciated...!!!
cheers,
Thanasio
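For reference, a sketch of the launch pattern being described (the kernel signature, names, and the placeholder work are assumptions, not the actual code):

#include <cuda_runtime.h>

__global__ void kernel(const float *dev_input, float *dev_output, int n, int slice)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dev_output[slice * n + tid] = dev_input[tid] * 2.0f;   // placeholder manipulation
}

void launch_all(const float *dev_input, float *dev_output, int n, int s)
{
    cudaStream_t streams[32];
    for (int i = 0; i < s; ++i)
        cudaStreamCreate(&streams[i]);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int i = 0; i < s; ++i)                                 // same kernel, s times, one stream each
        kernel<<<blocks, threads, 0, streams[i]>>>(dev_input, dev_output, n, i);

    cudaDeviceSynchronize();
    for (int i = 0; i < s; ++i)
        cudaStreamDestroy(streams[i]);
}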
Robert,
thanks a lot for your detailed answer. It has been very helpful. I edited 4; it is 10 KB per block. So in my situation, I launch grids of 61 blocks and 256 threads. The kernels are rather computationally bound. I launch 8 streams of the same kernel, profile them, and then I see a very good overlap between the first two, and then it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the rest have a 3 ms distance between them. Regarding 5, I use a K20, which allows up to 255 registers per thread, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for the GK110.
Please take a look at the following link. There is an image called kF.png; it shows the profiler output for the streams.
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, specifying launch parameters and statistics (such as register usage, shared mem usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together. Compute 3.5 has some advantages there.
You should also review the answer to this question.
When global memory used by the kernel is not pinned, it affects the speed of data transfer, and also the ability to overlap copy and compute but does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
There shouldn't be "read conflicts", I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads there should be no conflict, and no particular effect on concurrency.
Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on a SM. But answering this question would be much crisper also if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
EDIT: in response to new information posted in the question. I've looked at the kF.png
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is what is the setting of your L1/shared split? I believe it defaults to 48KB of shared memory per SM, but if you've changed this setting that will be pretty important. Regardless, according to your statement each block scheduled will use 10KB of shared memory. This means we would max out around 4 blocks scheduled per SM, or 4*13 total blocks = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10kb/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, so nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled, this may well be at the 50% point as indicated in the timeline.
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The C programming guide has a section on Asynchronous Concurrent Execution.
GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page
My experience with HyperQ so far is 2-3 (3.5) times parallelization of my kernels, as the kernels usually are larger for slightly more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated.
This is also answered by Nvidia in their CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization.
But still, GK110 has a great advantage just by allowing this.

Is memory reordering visible to other threads on a uniprocessor?

It's common for modern CPU architectures to employ performance optimizations that can result in out-of-order execution. In single-threaded applications memory reordering may also occur, but it's invisible to the programmer, as if memory were accessed in program order. And for SMP, memory barriers come to the rescue; they are used to enforce some sort of memory ordering.
What I'm not sure about is multi-threading on a uniprocessor. Consider the following example: when thread 1 runs, the store to f could take place before the store to x. Let's say a context switch happens after f is written, and right before x is written. Now thread 2 starts to run; it exits the loop and prints 0, which is undesirable of course.
// Both x, f are initialized w/ 0.
// Thread 1
x = 42;
f = 1;
// Thread 2
while (f == 0)
;
print x;
Is the scenario described above possible? Or is there a guarantee that physical memory is committed during thread context switch?
According to this wiki,
When a program runs on a single-CPU machine, the hardware performs
the necessary bookkeeping to ensure that the program execute as if all
memory operations were performed in the order specified by the
programmer (program order), so memory barriers are not necessary.
Although it didn't explicitly mention uniprocessor multi-threaded applications, it includes this case.
I'm not sure whether it's correct/complete. Note that this may highly depend on the hardware (weak/strong memory model), so you may want to mention the hardware you know in your answers. Thanks.
PS. device I/O, etc are not my concern here. And it's a single-core uniprocessor.
Edit: Thanks Nitsan for the reminder; we assume no compiler reordering here (just hardware reordering), and the loop in thread 2 is not optimized away. Again, the devil is in the details.
As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.
[I'll assume that with "uniprocessor" machine, you mean processors with a single core and hardware thread.]
You now say that you want to assume that compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.
So essentially thread1 has two store instructions in the order store-to-x store-to-f, while thread2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.
On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.
One reason is that, if instructions processed by a single processors are not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.
The only event that could interfere here, is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline state upon an interrupt and restores it upon return from the interrupt, could produce a different result, but such a machine is impractical and afaik does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.
And there is no memory model issue to worry about. The weaker memory models originate from the separate buffers and caches that separate the separate hardware processors from the main memory or nth level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts, if there aren't separate processing resources (processors/hardware threads) to keep these resources busy.
Strong memory ordering executes memory access instructions in exactly the order defined by the program; this is often referred to as "program ordering".
Weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; this is often referred to as "processor ordering".
AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, 8.2 Memory Ordering):
writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH and string operations.
To illustrate the rule, it gives an example similar to this:
memory locations x, y, initialized to 0;
thread 1:
mov [x], 1
mov [y], 1
thread 2:
mov r1, [y]
mov r2, [x]
r1 == 1 and r2 == 0 is not allowed
In your example, thread 1 cannot store f before storing x.
@Eric, in response to your comments:
The fast string store instruction "stosd" may store the string out of order within its own operation. In a multiprocessor environment, when one processor stores a string "str", another processor may observe str[1] being written before str[0], while the logical order presumes str[0] is written before str[1].
But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of stosd, the implementation may choose to delay it, so that all out-of-order sub-stores (not necessarily the whole stosd instruction) commit before the context switch.
Edited to address the claims that this should be treated as a C++ question:
Even if this is considered in the context of C++, as I understand it a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.
§1.9/14:
Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.
This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.
Allowing that assumption for the sake of argument, note that the loop may anyway never exit unless you either:
give the compiler some reason to believe f may change (eg, by passing its address to some non-inlineable function which could modify it)
mark it volatile, or
make it an explicitly atomic type and request acquire semantics
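As a sketch of that last option (C++11 atomics; a minimal version of the example above): the release store to f cannot be observed before the store to x, and the acquire load keeps both the compiler and the hardware from hoisting the read of x.

#include <atomic>
#include <cstdio>

int x = 0;
std::atomic<int> f{0};

void thread1()
{
    x = 42;
    f.store(1, std::memory_order_release);   // x = 42 is visible before f becomes 1
}

void thread2()
{
    while (f.load(std::memory_order_acquire) == 0)
        ;                                     // cannot be optimized away: f is atomic
    printf("%d\n", x);                        // guaranteed to print 42
}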
On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.
Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:
the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
or the load can peek into the queue and fetch x's value without waiting for the write
Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.
The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.
You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):
SMP memory barriers are reduced to compiler barriers on uniprocessor
compiled systems because it is assumed that a CPU will appear to be
self-consistent, and will order overlapping accesses correctly with
respect to itself.
To expand on that, the reason why synchronization is not required at the hardware level is that:
All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and
Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.
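For illustration, here is what a compiler barrier looks like in GCC/Clang syntax (the Linux kernel's barrier() macro expands to essentially this); it emits no instruction, it only stops the compiler from reordering or caching memory accesses across it:

#define compiler_barrier() asm volatile("" ::: "memory")

int x = 0, f = 0;

void thread1_uniprocessor()
{
    x = 42;
    compiler_barrier();   // keep the store to x ahead of the store to f in the emitted code;
                          // on a uniprocessor this, plus the pipeline flush on context switch, is enough
    f = 1;
}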
This code may well never finish (in thread 2), as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag which is not volatile).
That said, you need to worry about two types of reordering here: compiler and CPU; both are free to move the stores around. See http://preshing.com/20120515/memory-reordering-caught-in-the-act for an example. At this point the code you describe above is at the mercy of the compiler, compiler flags, and the particular architecture. The wiki quoted is misleading, as it may suggest that internal reordering is not at the mercy of the CPU/compiler, which is not the case.
As far as the x86 is concerned, the out-of-order-stores are made consistent from the viewpoint of the executing code with regards to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow so the consistency is maintained across threads.
A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.
If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e. an interrupt) takes place; otherwise they would get lost and would not be stored by the context-switch mechanism. This is independent of hardware reordering.
In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, then since the contents of the pipeline and of the cache are not part of the "context", they would be lost.
In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.
From my point of view, the processor fetches instructions one by one.
In your case, if "f = 1" was speculatively executed before "x = 42", that means both instructions are already in the processor's pipeline. The only possible way to schedule the current thread out is an interrupt, but the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt.
So there is no need to worry about reordering on a uniprocessor.

Using Nsight to determine bank conflicts and coalescing

How can I know the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Moreover, what should I look at when I use Nsight as a profiler? What are the important fields that may indicate what is causing my program to slow down?
I don't use NSight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll pay attention to your GPU's occupancy.
Other interesting values are where the compiler has placed your local variables: in registers or in local memory.
Finally, you'll check the time spent transferring data to and from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescence <-- basically you just need to watch Global Memory Loads/Stores - Coalesced/Uncoalesced and flag the Uncoalesced.
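To make the bank-conflict part concrete, here is a hedged CUDA sketch (a 32x32 tile transpose; names are illustrative): the column-wise read from shared memory is exactly what the warp-serialize counter flags, and padding each row by one element removes the conflict.

__global__ void transpose_tile(const float *in, float *out)
{
    __shared__ float tile[32][32 + 1];   // +1 padding: column accesses hit 32 different banks
                                         // (declare [32][32] instead to see warp serialization)
    int x = threadIdx.x, y = threadIdx.y;

    tile[y][x] = in[y * 32 + x];         // coalesced load from global memory
    __syncthreads();
    out[y * 32 + x] = tile[x][y];        // coalesced store; the tile[x][y] read is the column access
}

// launched with a dim3(32, 32) block for one 32x32 tile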
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
For the question of what important fields/things to look at (when using the Nsight profiler) that may cause a program to slow down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring, but your application threads (Thread State) are green.
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory Row. If you are spending a lot of time in memory copies then you should consider using CUDA streams to pipeline your application. This can allow you to overlap memory transfers and kernels. Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.
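As a hedged sketch of the stream pipelining suggested in (b) (the kernel name and chunking are assumptions; the host buffer must come from cudaMallocHost/cudaHostAlloc for the async copies to actually overlap):

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // placeholder work
}

void pipelined(float *h_pinned, int n, int chunks)
{
    int chunk = n / chunks;    // assume it divides evenly for the sketch
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        float *h = h_pinned + c * chunk;
        float *d = d_data   + c * chunk;
        cudaMemcpyAsync(d, h, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);   // overlaps the other stream's copies
        cudaMemcpyAsync(h, d, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_data);
}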

How to use DSP to speed-up a code on OMAP?

I'm working on a video codec for the OMAP3430. I already have code written in C++, and I am trying to modify/port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, includes an additional DSP).
I tried to port a small for loop which runs over a very small amount of data (~250 bytes), but about 2M times on different data. However, the overhead from the communication between the CPU and DSP is much more than the gain (if I have any).
I assume this task is much like optimizing code for GPUs in normal computers. My question is: what kinds of parts would be beneficial to port? How do GPU programmers take care of such tasks?
edit:
GPP application allocates a buffer of size 0x1000 bytes.
GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned along a 4K page boundary.
GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
GPP application prepares a message to notify the DSP execute phase of the base address of virtual address space, which have been mapped to a buffer allocated on the GPP. GPP application uses DSPNode_PutMessage to send the message to the DSP.
GPP invokes memcpy to copy the data to be processed into the shared memory.
GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.
GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and the DSP may now access the buffer. The message also contains the amount of data written to the buffer so that the DSP will know just how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP and then invokes DSPNode_GetMessage to wait to hear a message back from the DSP.
After these steps the execution of the DSP program starts, and the DSP notifies the GPP with a message when it finishes the processing. Just as a test, I don't put any processing inside the DSP program; I just send a "processing finished" message back to the GPP. And this still consumes a lot of time. Could that be because of the internal/external memory usage, or is it merely because of the communication overhead?
The OMAP3430 does not have an on-board DSP; it has an IVA2+ video/audio decode engine hooked to the system bus, and the Cortex core has DSP-like SIMD instructions. The GPU on the OMAP3430 is a PowerVR SGX-based unit. While it does have programmable shaders, I don't believe there is any support for general-purpose programming à la CUDA or OpenCL. I could be wrong, but I've never heard of such support.
If you're using the IVA2+ encode/decode engine that is on board, you need to use the proper libraries for this unit, and it only supports specific codecs as far as I know. Are you trying to write your own library for this module?
If you're using the Cortex's built-in DSP-like (SIMD) instructions, post some code.
If your dev board has some extra DSP on it, what is the DSP, and how is it connected to the OMAP?
As to the desktop GPU question, in the case of video decode you use the vendor-supplied function libraries to make calls to the hardware. There are several: VDPAU for Nvidia on Linux, and similar libraries on Windows (PureVideo HD, I think it's called). ATI also has both Linux and Windows libraries for their on-board decode engines; I don't know the names.
I don't know what time frame you're transferring data in, but I know the TMS320C64x, which is listed on the spec sheet for the SDK, has a very powerful DMA engine. (I'm assuming it's the original ZOOM OMAP34X MDK; it says it has a 64xx.) I would hope the OMAP has something similar; use it to its fullest advantage. I would recommend setting up "ping-pong" buffers in the internal RAM of the 64xx and using the SDRAM as shared memory, with the transfers handled by DMA. External RAM is going to be a bottleneck on any of the 6xxx series parts, so keep whatever you can locked into internal memory to improve performance. Typically these parts will have the ability to bus eight 32-bit words to the processor core once data is in internal memory, but that varies from part to part based on what level of cache it allows you to map as direct-access RAM. Cost-sensitive parts from TI move the "mappable memory" farther away than some of the other chips. Also, all the manuals for the parts are available from TI as free PDF downloads. They even gave me hard copies of the TMS320C6000 CPU and Instruction Set manual and many other books for free.
As far as programming is concerned, you may need to use some of the "processor intrinsics" or inline assembly to optimize any math you are doing. For the 64xx, favor integer operations when possible, because it doesn't have a built-in floating-point core. (Those are in the 67xx series.) If you look at the execution units and can map your calculations such that the different parts target different operations in a manner that can occur in a single cycle, then you will be able to achieve the best performance out of those parts. The instruction set manual lists the types of ops performed by each execution unit. If you can break your calculation up into dual data-flow sets and unwind the loops a bit, the compiler will be "nicer" to you when full optimization is on. This is because the processor is broken up into a left and a right side with nearly identical execution units on either side.
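A generic, hedged sketch of the ping-pong scheme described above; dma_start()/dma_wait() and the buffer placement are hypothetical placeholders for whatever the part's chip-support library actually provides, the point is only the structure:

#define CHUNK 256

static int bufA[CHUNK], bufB[CHUNK];     /* would be placed in fast internal RAM */

/* hypothetical placeholders for the part's DMA API */
extern void dma_start(int *dst, const int *src, int words);
extern void dma_wait(void);

static void compute(int *buf, int words)
{
    for (int i = 0; i < words; ++i)
        buf[i] *= 2;                     /* placeholder work */
}

void process_stream(const int *sdram_src, int nchunks)
{
    int *work = bufA, *fill = bufB;

    dma_start(fill, sdram_src, CHUNK);                        /* prime the pipeline */
    for (int c = 0; c < nchunks; ++c) {
        dma_wait();                                           /* chunk c has landed in 'fill' */
        int *t = work; work = fill; fill = t;                 /* swap ping and pong */
        if (c + 1 < nchunks)
            dma_start(fill, sdram_src + (c + 1) * CHUNK, CHUNK);
        compute(work, CHUNK);                                 /* crunch while the next chunk streams in */
    }
}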
Hope this helps.
From the measurements I did, one messaging cycle between the CPU and DSP takes about 160 us. I don't know whether this is because of the kernel I use or the bridge driver, but this is a very long time for a simple back-and-forth message.
It seems that it is only reasonable to port an algorithm to the DSP if the total computational load is comparable to the time required for messaging, and if the algorithm is suitable for simultaneous computation on the CPU and DSP.