I have a kernel that is launched several times, until a solution is found. The solution will be found by at least one block.
Therefore, when a block finds the solution, it should inform the CPU, so the CPU can print the solution provided by that block.
What I am currently doing is the following:
__global__ void kernel(int *sol)
{
    // do some computations
    if (/* this block found a solution */)
        atomicExch(sol, blockIdx.x);   // record the winning block atomically
}
Now on every call to the kernel I copy sol back to host memory and check its value. If it's set to 3, for example, I know that block 3 found the solution, so I know where the solution starts and can copy it back to the host.
In this case, would using cudaHostAlloc be a better option? Moreover, would copying the value of a single integer on every kernel call slow down my program?
Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that if you choose to send 1 byte or 1KB, that won't make much of a difference. In this case bandwidth is not a problem, but latency.
But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself I wouldn't spend too much time on that single, small transfer.
Do note, if you choose to use mapped memory instead of cudaMemcpy, you will need to explicitly put a cudaDeviceSynchronize (or cudaThreadSynchronize with older CUDA) barrier (as opposed to the implicit barrier of cudaMemcpy) before reading the status. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.
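For illustration, the mapped-memory variant might look roughly like the sketch below. The -1 sentinel, the launch geometry, and the h_sol/d_sol names are illustrative, and the kernel is assumed to take a pointer to the flag as in the corrected snippet above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *sol);                    // the search kernel from the question

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);           // allow mapped pinned memory

    int *h_sol;                                      // host view of the flag
    cudaHostAlloc((void**)&h_sol, sizeof(int), cudaHostAllocMapped);

    int *d_sol;                                      // device view of the same allocation
    cudaHostGetDevicePointer((void**)&d_sol, h_sol, 0);

    *h_sol = -1;                                     // -1 means "no solution yet"
    do {
        kernel<<<256, 128>>>(d_sol);                 // launch geometry is illustrative
        cudaDeviceSynchronize();                     // explicit barrier before reading the flag
    } while (*h_sol == -1);

    printf("solution found by block %d\n", *h_sol);
    cudaFreeHost(h_sol);
    return 0;
}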
Related
Assume I have to write a C or C++ computationally intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]); // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I need to be very careful about when to introduce it;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
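As a rough illustration (using intrinsics rather than inline assembly, and with made-up function and parameter names), keeping a four-float chunk of output[] in a register across all of its updates might look like this:
#include <xmmintrin.h>   // SSE intrinsics

// Sketch: the output chunk lives in an XMM register for the whole loop and
// touches memory exactly once at the end. n is assumed to be a multiple of 4.
void compute_output_chunk(const float *input1, const float *input2,
                          float *output_chunk, int n)
{
    __m128 out = _mm_setzero_ps();                   // partial result stays in the register
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(&input1[i]);
        __m128 b = _mm_loadu_ps(&input2[i]);
        out = _mm_add_ps(out, _mm_mul_ps(a, b));     // every update happens in the register
    }
    _mm_storeu_ps(output_chunk, out);                // written back exactly once
}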
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements of output[] are write-only. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache fills and could compete with the cache lines needed for the 'input' arrays.
Wouldn't you rather want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with the non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific cache way for NTA data, i.e., with an 8-way cache, 1/8th per thread.
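Combining both options, a rough sketch might look like this. The function name, the prefetch distance, and the assumption that the buffers are 16-byte aligned and n is a multiple of 4 are all illustrative.
#include <xmmintrin.h>   // _mm_stream_ps, _mm_prefetch, _mm_sfence

void compute(const float *input1, const float *input2, float *output, int n)
{
    for (int i = 0; i < n; i += 4) {
        // pull the inputs in with the non-temporal hint a little ahead of use
        _mm_prefetch((const char *)&input1[i + 16], _MM_HINT_NTA);
        _mm_prefetch((const char *)&input2[i + 16], _MM_HINT_NTA);
        __m128 a = _mm_load_ps(&input1[i]);
        __m128 b = _mm_load_ps(&input2[i]);
        _mm_stream_ps(&output[i], _mm_add_ps(a, b)); // streaming store, no cache fill for output
    }
    _mm_sfence();                                    // make the streamed stores globally visible
}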
I guess the solution to this is hidden in the algorithm employed, the L1 cache size, and the cache-line size.
Though I am not sure how much performance improvement we will see with this.
We can probably introduce artificial reads which cleverly dodge the compiler and, during execution, do not hurt the computations either. A single artificial read should fill as many cache lines as are needed to accommodate one page. Therefore, the algorithm should be modified to compute blocks of the output array, something like the blocking used in the multiplication of huge matrices on GPUs: they use blocks of the matrices for computation and for writing the result.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial read, we should initialize the output array at the right places, once in each block, probably with 0 or 1.
Before I knew of the CPU's store buffer, I thought thread-thrashing simply occurred when two threads wanted to write to the same cache line. One would prevent the other from writing. However, this seems pretty synchronous. I later learnt that there is a store buffer, which temporarily buffers the writes. It is forced to flush by the SFENCE instruction, kind of implying there is no synchronous prevention of multiple cores accessing the same cache line...
I am totally confused about how thread-thrashing occurs if we have to be careful and use SFENCEs. Thread-thrashing implies blocking, whereas SFENCE implies the writes are done asynchronously and the programmer must manually flush them?
(My understanding of SFENCE may be confused too, because I also read that the Intel memory model is "strong" and therefore memory fences are only required for x86 string instructions.)
Could somebody please remove my confusion?
"Thrashing" meaning multiple cores retrieving the same cpu cacheline and this causing latency overhead for other cores competing for the same cacheline.
So, at least in my vocabulary, thread-thrashing happens when you have something like this:
// global variable
int x;
// Thread 1
void thread1_code()
{
while(!done)
x++;
}
// Thread 2
void thread2_code()
{
while(!done)
x++;
}
(This code is of course total nonsense - I'm making it ridiculously simple and pointless so that we don't need complicated code to explain what is going on in the thread itself.)
For simplicity, we'll assume thread 1 always runs on processor 1, and thread 2 always runs on processor 2 [1]
If you run these two threads on an SMP system - and we've JUST started this code [both threads start, by magic, at almost exactly the same time, not like in a real system, many thousand clock-cycles apart], thread one will read the value of x, update it, and write it back. By now, thread 2 is also running, and it will also read the value of x, update it, and write it back. To do that, it needs to actually ask the other processor(s) "do you have (new value for) x in your cache, if so, can you please give me a copy". And of course, processor 1 will have a new value because it has just stored back the value of x. Now, that cache-line is "shared" (our two threads both have a copy of the value). Thread two updates the value and writes it back to memory. When it does so, another signal is sent from this processor saying "If anyone is holding a value of x, please get rid of it, because I've just updated the value".
Of course, it's entirely possible that BOTH threads read the same value of x, update to the same new value, and write it back as the same new modified value. And sooner or later one processor will write back a value that is lower than the value written by the other processor, because it's fallen behind a bit...
A fence operation will help ensure that the data written to memory has actually got all the way to cache before the next operation happens, because as you say, there are write-buffers to hold memory updates before they actually reach memory. If you don't have a fence instruction, your processors will probably get seriously out of phase, and update the value more than once before the other has had time to say "do you have a new value for x?" - however, it doesn't really help prevent processor 1 asking for the data from processor 2 and processor 2 immediately asking for it "back", thus ping-ponging the cache-content back and forth as quickly as the system can achieve.
To ensure that ONLY ONE processor updates some shared value, you must use a so-called atomic instruction. These special instructions are designed to operate in conjunction with write buffers and caches, such that they ensure that ONLY one processor actually holds an up-to-date value for the cache-line that is being updated, and NO OTHER processor is able to update the value until this processor has completed the update. So you never get "read the same value of x and write back the same value of x" or any similar thing.
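For completeness, here is the same toy loop written with an atomic increment (a sketch using C++11 std::atomic; the paragraph above talks about atomic instructions in general, and on x86 this compiles down to a locked read-modify-write):
#include <atomic>
#include <thread>

std::atomic<int>  x{0};
std::atomic<bool> done{false};

void thread_code()
{
    while (!done)
        x.fetch_add(1, std::memory_order_relaxed);  // one indivisible read-modify-write
}

int main()
{
    std::thread t1(thread_code), t2(thread_code);
    done = true;              // in real code something meaningful would set this
    t1.join();
    t2.join();
}
Note that the cache line holding x still ping-pongs between the cores; the atomic only guarantees that no increment is lost.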
Since caches don't work on single bytes or single integer sized things, you can also have "false sharing". For example:
int x, y;
void thread1_code()
{
while(!done) x++;
}
void thread2_code()
{
while(!done) y++;
}
Now, x and y are not actually THE same variable, but they are (quite plausibly, though we can't be 100% sure) located within the same cache-line of 16, 32, 64 or 128 bytes (depending on processor architecture). So although x and y are distinct, when one processor says "I've just updated x, please get rid of any copies", the other processor will get rid of its (still correct) value of y at the same time as getting rid of x. I had such an example where some code was doing:
struct {
int x[num_threads];
... lots more stuff in the same way
} global_var;
void thread_code()
{
...
global_var.x[my_thread_number]++;
...
}
Of course, two threads would then update values right next to each other, and the performance was RUBBISH - about 6x slower than after we fixed it by doing:
struct
{
int x;
... more stuff here ...
} global_var[num_threads];
void thread_code()
{
...
global_var[my_thread_number].x++;
...
}
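A more explicit way to get the same separation, assuming C++11 is available and a 64-byte cache line, is to align each per-thread element to its own cache line instead of relying on the "more stuff" in the struct as padding (the names and the thread count are illustrative):
constexpr int num_threads = 8;          // illustrative value

struct alignas(64) PerThread
{
    int x;
    // ... more per-thread stuff here ...
};

PerThread global_var[num_threads];

void thread_code(int my_thread_number)
{
    global_var[my_thread_number].x++;   // each element owns its own cache line
}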
Edit to clarify:
A fence does not (as my recent edit explains) "help" against ping-ponging the cache content between threads. It also doesn't, in and of itself, prevent data from being updated out of sync between the processors - it does, however, ensure that the processor performing the fence operation doesn't continue doing OTHER memory operations until this particular operation's memory content has got "out of" the processor core itself. Since there are various pipeline stages, and most modern CPUs have multiple execution units, one unit may well be "ahead" of another that is technically "behind" in the execution stream. A fence will ensure that "everything has been done here". It's a bit like the man with the big stop-board in Formula 1 racing, who ensures that the driver doesn't drive off from the tyre change until ALL new tyres are securely on the car (if everyone does what they should).
The MESI or MOESI protocol is a state-machine system that ensures that operations between different processors are done correctly. A processor can have a Modified value (in which case a signal is sent to all other processors to "stop using the old value"), a processor may "Own" the value (it is the holder of this data, and may modify the value), a processor may have an "Exclusive" value (it's the ONLY holder of the value, everyone else has got rid of their copy), it may be "Shared" (more than one processor has a copy, but this processor should not update the value - it is not the "owner" of the data), or Invalid (the data is not present in the cache). MESI doesn't have the "Owned" state, which means a little more traffic on the snoop bus ("snoop" means "Do you have a copy of x?", "Please get rid of your copy of x", etc.)
[1] Yes, processor numbers usually start with zero, but I can't be bothered to go back and rename thread1 to thread0 and thread2 to thread1 by the time I wrote this additional paragraph.
A very simplified version of my code looks like:
do {
//reset loop variable b to 0/false
b = 0;
// execute kernel
kernel<<<...>>>(b);
// use the value of b for while condition
} while(b);
Boolean variable b can be set to true by any thread in kernel and it tells us whether we continue running our loop.
Using cudaMalloc, cudaMemset, and cudaMemcpy we can create/set/copy device memory to implement this. However, I just found out that pinned memory exists. Using cudaMallocHost to allocate b and a call to cudaDeviceSynchronize right after the kernel gave quite a speed-up (~50%) in a simple test program.
Is pinned memory the best option for this boolean variable b or is there a better option?
You haven't shown your initial code and the modified code, therefore nobody can know the details of the improvement you are stating in your post.
The answer to your question varies depending on:
Whether b is read and written, or only written, inside the GPU kernel. Reads might need to fetch the actual value directly from the host side if b is not found in the cache, resulting in latency. On the other hand, the latency of writes can be covered if there are further operations that can keep the threads busy.
How frequently you modify the value. If you access the value frequently in your program, the GPU can probably keep the variable inside L2, avoiding host-side accesses.
The frequency of memory operations between accesses to b. If there are many memory transactions between accesses to b, it is more probable that b in the cache is replaced with some other content. As a result, when accessed again, b will not be found in the cache and a time-consuming host access is necessary.
In cases where having b on the host side causes many host memory transactions, it is logical to keep it inside GPU global memory and transfer it back at the end of each loop iteration. You can do this rather quickly with an asynchronous copy in the same stream as the kernel's, and synchronize with the host right after.
All of the above applies to cache-enabled devices. If your device is pre-Fermi (CC < 2.0), the story is different.
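A minimal sketch of that last approach follows: the flag stays in device global memory and is copied back asynchronously each iteration. The names d_b/h_b, the launch geometry, and the assumption that the kernel takes a device pointer to the flag are all illustrative; pinning h_b with cudaMallocHost would make the copy truly asynchronous.
__global__ void kernel(int *b);                  // sets *b to nonzero to request another iteration

void run_loop()
{
    int *d_b;                                    // device-resident flag
    int h_b;                                     // host copy of the flag
    cudaMalloc((void**)&d_b, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);

    do {
        cudaMemsetAsync(d_b, 0, sizeof(int), s); // reset the flag on the device
        kernel<<<256, 128, 0, s>>>(d_b);
        cudaMemcpyAsync(&h_b, d_b, sizeof(int), cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);                // wait before inspecting h_b
    } while (h_b);

    cudaStreamDestroy(s);
    cudaFree(d_b);
}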
I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.
The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, such that thread0 gets paths 0->61, thread1 gets 62->123, etc., and then runs the remaining files in serial at the end.
I have implemented timers throughout the code to try and see where it bottlenecks: run in serial, each file takes around 3.5s, while 8 files in parallel take around 21s (a reduction from the 28s of 8 x 3.5s in serial, but not by much).
The problematic section of my code is below
if (DIAG_timers) {readTimer = timerNow();}
for (yindex=0; yindex<ycells; yindex++)
{
for (xindex=0; xindex<xcells; xindex++)
{
getline(alphaFile, alphaStringValue);
convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
}
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}
Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed in ms. (The remaining arguments relate to it running in a parallel thread, to avoid outputting 8 lines for each loop etc, and a description of what the timer measures).
convertToNumber takes the value on alphaStringValue, and converts it to a double, which is then stored in the alphaValue array.
alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.
The files to be read are approximately 40MB each (just over 5,120,000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16GB of RAM, so copying all the files to memory would certainly be possible, since only 8 (1 per thread) should be open at once. I am unsure whether mmap would do this better? Several threads on Stack Overflow argue about the merits of mmap vs more straightforward read operations, in particular for sequential access, so I don't know whether that would be beneficial.
I tried surrounding the code block with a mutex so that only one thread could run the block at once, in case reading multiple files was leading to slow IO via vaguely random access, but that just reduced the process to roughly serial-speed times.
Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.
Edit:
template<class T> inline void convertToNumber(std::string const& s, T &result)
{
std::istringstream i(s);
T x;
if (!(i >> x))
throw BadConversion("convertToNumber(\"" + s + "\")");
result = x;
}
turns out to have been the slow section. I assume this is due to the creation of 5 million stringstreams per file, followed by the testing of 5 million if conditions? Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of (at least in this controlled case) unnecessary operations.
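For reference, one cheaper conversion (a sketch only, not necessarily the exact suggestion referenced above) avoids constructing a stringstream per line while keeping a basic error check; it reuses the BadConversion exception from the original code:
#include <cstdlib>   // std::strtod
#include <string>

template<class T> inline void convertToNumber(std::string const& s, T &result)
{
    char *end = 0;
    double x = std::strtod(s.c_str(), &end);     // no stringstream construction per call
    if (end == s.c_str())                        // nothing was parsed
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = static_cast<T>(x);
}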
The files to be read are approximately 40MB each (just over 5,120,000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16GB of RAM, so copying all the files to memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall-clock time, unless they were already read by another process shortly before.
since only 8 (1 per thread) should be open at once.
Since any files that were not loaded into memory before the process started will have to be loaded, and thus the loading will count towards the process's wall-clock time, it does not matter how many are open at once. Any that are not cached will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap is faster, but only because it saves the copy from the kernel buffer to the application buffer and some system-call overhead (with read you do a kernel entry for each page, while with mmap pages brought in by read-ahead won't cause further page faults). But it will not save you the time to read the files from disk if they are not already cached.
mmap does not load anything in memory. The kernel loads data from disk to internal buffers, the page cache. read copies the data from there to your application buffer while mmap exposes parts of the page cache directly in your address space. But in either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, next process will get them faster. But if it's first access after longer time, the data will have to be read and this will affect read and mmap exactly the same way.
Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. So you can optimize a bit more, and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.
You should be able to ask the system how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at the end of each thread to get data for that thread). So you can confirm how much time was spent on I/O.
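A minimal sketch of that measurement (Linux-specific, since the per-thread variant RUSAGE_THREAD needs _GNU_SOURCE; the function name and tid parameter are illustrative):
#define _GNU_SOURCE
#include <sys/resource.h>
#include <cstdio>

// Call at the end of each worker thread; the gap between this CPU time and the
// thread's wall-clock time is mostly time spent waiting for I/O.
void report_thread_cpu_time(int tid)
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) == 0) {
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        printf("thread %d: user %.3fs, system %.3fs\n", tid, user, sys);
    }
}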
mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.
It does, however, make the code slightly more complex, since you can't use the stdio file I/O functions directly on an mmapped region (and the main benefit is sort of lost if you use the "m" mode of the stdio functions, as you are then getting at least one copy again). From past experiments that I've made, mmap beats all other file-reading variants by some amount. How much depends on what proportion of the overall time is spent waiting for the disk, and how much time is spent actually processing the file content.
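For illustration, reading one of the value files through mmap might look roughly like the sketch below (POSIX-only, error handling trimmed; the function name, the out-parameters, and the assumption that the file ends in a newline so strtod never runs past the mapping are all made up for the example):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>   // std::strtod

void read_alpha_file_mmap(const char *path, double *values, long expected)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // parse straight out of the page cache: no per-line string copies are made
    const char *data = (const char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    const char *p = data;
    char *end;
    for (long i = 0; i < expected; ++i) {
        values[i] = std::strtod(p, &end);   // strtod skips leading whitespace, including '\n'
        p = end;
    }

    munmap((void *)data, st.st_size);
    close(fd);
}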
Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes, but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know whether its writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x] , value);
Thanks.
On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at the same time. If such a conflict occurs, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic... function.
__threadfence() delays the current thread (and only the current thread!) to ensure that any subsequent writes by the given thread do actually happen later.
As such, __threadfence() on its own, without any follow-up code is not very interesting.
For that reason, I don't think there is a point to compare the efficiency of those two. Maybe if you could show a bit more concrete use case I could relate...
Note that neither of those actually gives you any guarantees on the actual order of execution of the threads.
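For context, the place where __threadfence() typically earns its keep is a publish pattern like the sketch below (names and sizes are illustrative): the payload is written with ordinary coalesced stores, the fence orders those writes, and only then is the flag raised atomically, so anyone who sees the flag also sees the data.
__device__ int   ready_flag = 0;
__device__ float result[256];            // assumes blockDim.x <= 256

__global__ void producer(float value)
{
    result[threadIdx.x] = value;         // ordinary (coalesced) store of the payload

    __syncthreads();                     // wait until the whole block has written
    if (threadIdx.x == 0) {
        __threadfence();                 // make the payload visible device-wide first...
        atomicExch(&ready_flag, 1);      // ...then publish the flag
    }
}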