Concurrent execution of two processes sharing a Tesla K20 - concurrency

I have been experiencing strange behaviour when I launch two instances of a kernel so that they run at the same time while sharing the GPU's resources.
I have developed a CUDA kernel which is meant to run on a single SM (multiprocessor), where the threads perform an operation several times (in a loop).
The kernel launches only one block, and therefore should use only one SM.
simple.cu
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <helper_cuda.h>
using namespace std;

__global__ void increment(float *in, float *out)
{
    int it = 0, i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.8525852f;

    // Each thread repeats the same arithmetic many times to keep the SM busy.
    for (it = 0; it < 99999999; it++)
        out[i] += (in[i] + a) * a - (in[i] + a);
}

int main(int argc, char* argv[])
{
    int i;
    int nBlocks = 1;            // a single block, so only one SM is used
    int threadsPerBlock = 1024;
    float *A, *d_A, *d_B, *B;
    size_t size = 1024 * 13;

    A = (float *) malloc(size * sizeof(float));
    B = (float *) malloc(size * sizeof(float));
    for (i = 0; i < size; i++) {
        A[i] = 0.74;
        B[i] = 0.36;
    }

    cudaMalloc((void **) &d_A, size * sizeof(float));
    cudaMalloc((void **) &d_B, size * sizeof(float));
    cudaMemcpy(d_A, A, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size * sizeof(float), cudaMemcpyHostToDevice);

    increment<<<nBlocks, threadsPerBlock>>>(d_A, d_B);
    cudaDeviceSynchronize();

    cudaMemcpy(B, d_B, size * sizeof(float), cudaMemcpyDeviceToHost);

    free(A);
    free(B);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaDeviceReset();
    return (0);
}
So if I execute the kernel:
time ./simple
I get
real 0m36.659s
user 0m4.033s
sys 0m1.124s
If instead I execute two instances:
time ./simple & time ./simple
I get for each process:
real 1m12.417s
user 0m29.494s
sys 0m42.721s
real 1m12.440s
user 0m36.387s
sys 0m8.820s
As far as I know, the two executions should run concurrently and take about the same time as a single run (about 36 seconds). However, each lasts twice the base time. The GPU has 13 SMs, and each kernel creates only one block, so each instance should be able to occupy an SM of its own.
Are they being executed in the same SM?
Shouldn't they be running concurrently in different SMs?
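One way to double-check the SM count and whether the device reports support for concurrent kernels (a minimal query sketch, separate from the timing test itself):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// On a Tesla K20 this should report 13 SMs. Note that concurrentKernels
// refers to kernels launched from the same context, not separate processes.
printf("SMs: %d, concurrentKernels: %d\n",
       prop.multiProcessorCount, prop.concurrentKernels);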
EDITED
To make myself clearer, I am attaching the nvprof profiles of the concurrent execution:
Profile, first instance
Profile, second instance
Now I would like to show the behaviour of the same scenario, but this time running two instances of the matrixMul sample concurrently:
Profile, first instance
Profile, second instance
As you can see, in the first scenario one kernel waits for the other to finish, while in the second scenario (matrixMul) kernels from both contexts run at the same time.
Thank you.

When you run two separate processes using the same GPU, they each have their own context. CUDA doesn't support having multiple contexts on the same device simultaneously. Instead, each context competes for the device in an undefined manner, with driver level context switching. That is why the execution behaves as if the processes are serialised -- effectively they are, but at a driver rather than GPU level.
There are technologies available (MPS, Hyper-Q) which can do what you want, but the way you are trying to do this won't work.
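To illustrate the distinction, here is a minimal single-process sketch (my own, not the asker's code) that launches the same kind of one-block kernel into two streams of one context; in that situation the hardware is free to place the two blocks on different SMs:

// sketch.cu - hypothetical single-process variant: two one-block launches in
// separate streams of the SAME context, which Hyper-Q can run concurrently.
#include <cuda_runtime.h>

__global__ void increment(float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.8525852f;
    for (int it = 0; it < 99999999; it++)
        out[i] += (in[i] + a) * a - (in[i] + a);
}

int main()
{
    const size_t n = 1024;
    float *d_A, *d_B, *d_C, *d_D;
    // Buffers are left uninitialised; only the execution timing matters here.
    cudaMalloc((void **) &d_A, n * sizeof(float));
    cudaMalloc((void **) &d_B, n * sizeof(float));
    cudaMalloc((void **) &d_C, n * sizeof(float));
    cudaMalloc((void **) &d_D, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    increment<<<1, 1024, 0, s1>>>(d_A, d_B);
    increment<<<1, 1024, 0, s2>>>(d_C, d_D);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cudaFree(d_D);
    return 0;
}

Run inside one process, the two launches above can overlap; two separate processes cannot, unless they funnel their work through MPS.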
Edit to respond to the update in your question:
The example you have added using the MatrixMul sample doesn't show what you think it does. That application runs 300 short kernels and computes a performance number over the average of those 300 runs. Your profiling display has been set to a very coarse timescale resolution, so it looks like there is a single long-running kernel launch, when in fact it is a series of very short kernels.
To illustrate this, consider the following:
This is a normal profiling run for a single MatrixMul process running on a Kepler device. Note that there are many individual kernels running directly after one another.
These are the profiling traces of two simultaneous MatrixMul processes running on the same Kepler device:
Note that there are gaps in the profile traces of each process; this is where context switching between the two processes is occurring. The behaviour is identical to your original example, just at a much finer time granularity. As has been repeated a number of times by several different people in the course of this discussion -- CUDA doesn't support multiple contexts on the same device simultaneously using the standard runtime API. The MPS server does allow this by adding a daemon which reimplements the API with a large shared internal Hyper-Q pipeline, but you are not using this and it has no bearing on the results you have shown in this question.

Related

Cuda Multi-GPU Latency

I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48GB vRAM) in parallel. The issue I face is that for the simple block of code shown below, I would expect the overall block to complete in the same time regardless of the presence of the Device 2 code, as the copies are running asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of same length.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);

    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
But alas, when I run the block, the Device 1 code alone takes 80 ms, while adding the Device 2 code adds 20 ms, bringing the block's execution time to 100 ms. I tried profiling the above code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD copy of Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening because, as far as I can tell, these transfers are running independently, on different GPUs.
I realise that I haven't created any separate CPU threads to handle the separate devices, but I'm not sure if that would help. Could someone please help me understand why this elongation of duration happens when I add the Device 2 code?
EDIT:
Tried profiling the code and expected the execution durations to be independent per GPU, although I realise cudaMemcpyAsync involves the host as well, and perhaps the addition of Device 2 puts more stress on the CPU since it now has to handle additional transfers...?
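For what it's worth, here is a minimal sketch of the separate-host-thread idea mentioned above (my own illustration; it assumes the same pinned buffers, streams and N from the code block, and that stream1/stream2 were created on devices 0 and 1 respectively):

#include <cuda_runtime.h>
#include <thread>   // for the std::thread usage shown in the comment below

// Hypothetical per-device workers: each host thread binds to its own device
// and issues that device's copies, so neither serialises behind the other's
// API calls on the host.
void device0Work(float *aDest, float *bDest, const float *aHost, const float *bHost,
                 size_t N, cudaStream_t stream1)
{
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}

void device1Work(float *cDest, const float *cHost, size_t N, cudaStream_t stream2)
{
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);
    cudaStreamSynchronize(stream2);
}

// Per loop iteration:
//   std::thread t0(device0Work, aDest, bDest, aHost, bHost, N, stream1);
//   std::thread t1(device1Work, cDest, cHost, N, stream2);
//   t0.join(); t1.join();

Whether this helps depends on where the contention really is (host-side API serialisation, CPU load, or PCIe topology), so it is only a diagnostic experiment.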

Very long D2H time vs. H2D (CUDA)

I have an application that uses CUDA to process data. The basic flow is:
Transfer data H2D (this is around 1.5k integers)
Invoke several kernels that transform and reduce the data to a single int value
Copy the result D2H
Profiling with Nsight shows that the H2D and D2H transfers average around 13 µs and 70 µs respectively. This is weird to me, as the D2H is moving a tiny amount of data compared to the H2D.
Both input and output memory locations are pinned.
Is this difference in transfer duration expected, or am I doing something wrong?
//allocating the memory locations for IO
cudaMallocHost((void**)&gpu_permutation_data, size_t(rowsPerThread) * size_t(permutation_size) * sizeof(keyEntry));
cudaMallocHost((void**)&gpu_constant_maxima, sizeof(keyEntry));
//H2D
cudaMemcpy(gpu_permutation_data, input.data(), size_t(permutation_size) * size_t(rowsPerThread) * sizeof(keyEntry), cudaMemcpyHostToDevice);
// kernels go here
//D2H
cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
As Robert pointed out, Nsight displays the time from API start to finish, so the time between when the copy API is called and when the copy actually starts (after the previous kernels are done) is included.
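A small sketch (my own, assuming the kernels run in the default stream) of how to separate the wait-for-kernels time from the transfer itself:

// Hypothetical timing: synchronise first so the D2H copy no longer waits on
// kernel completion, then time just the copy with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// kernels go here ...

cudaDeviceSynchronize();                 // drain the pending kernel work first

cudaEventRecord(start);
cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // duration of the copy alone, in ms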

CUDA streams not running in parallel

Given this code:
void foo(cv::gpu::GpuMat const &src, cv::gpu::GpuMat *dst[], cv::Size const dst_size[], size_t numImages)
{
    cudaStream_t streams[numImages];
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamCreateWithFlags(&streams[image], cudaStreamNonBlocking);
        dim3 Threads(32, 16);
        dim3 Blocks((dst_size[image].width  + Threads.x - 1) / Threads.x,
                    (dst_size[image].height + Threads.y - 1) / Threads.y);
        myKernel<<<Blocks, Threads, 0, streams[image]>>>(src, dst[image], dst_size[image]);
    }
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamSynchronize(streams[image]);
        cudaStreamDestroy(streams[image]);
    }
}
Looking at the output of nvvp, I see almost perfectly serial execution, even though the first stream is a lengthy process that the others should be able to overlap with.
Note that my kernel uses 30 registers, and all report an "Achieved Occupancy" of around 0.87. For the smallest image, Grid Size is [10,15,1] and Block Size [32, 16,1].
The conditions describing the limits for concurrent kernel execution are given in the CUDA programming guide (link), but the gist of it is that your GPU can potentially run multiple kernels from different streams only if it has sufficient resources to do so.
In your usage case, you have said that you are running multiple launches of a kernel with 150 blocks of 512 threads each. Your GPU has 12 SMM (I think), and you could have up to 4 blocks per SMM running concurrently (4 * 512 = 2048 threads, which is the SMM limit). So your GPU can only run a maximum of 4 * 12 = 48 blocks concurrently. With multiple launches of 150 blocks sitting in the command pipeline, it would seem that there is little (perhaps even no) opportunity for concurrent kernel execution.
You might be able to encourage kernel execution overlap if you increase the scheduling granularity of your kernel by reducing the block size. Smaller blocks are more likely to find available resources and scheduling slots than larger blocks. Similarly, reducing the total block count per kernel launch (probably by increasing the parallel work per thread) might also help increase the potential for overlap or concurrent execution of multiple kernels, as in the sketch below.
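A hypothetical illustration of that last suggestion (not the asker's kernel): a grid-stride loop decouples the amount of work from the launch configuration, so the block count and block size can be shrunk to leave SM resources free for kernels in other streams.

// Hypothetical kernel: each thread strides over the whole image, so far fewer
// and smaller blocks are needed per launch.
__global__ void processPixels(const float *src, float *dst, int width, int height)
{
    int total  = width * height;
    int stride = gridDim.x * blockDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < total; idx += stride)
    {
        dst[idx] = src[idx];   // placeholder for the real per-pixel work
    }
}

// e.g. 32 blocks of 128 threads instead of 150 blocks of 512 threads:
// processPixels<<<32, 128, 0, streams[image]>>>(srcPtr, dstPtr, w, h);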

OpenCL SHA1 Throughput Optimisation

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorithm.
Ultimately my question is about maximizing my throughput rates. At present I'm seeing something like 56.1 MH/sec, which compares very badly to open source programs I've looked at, such as John the Ripper and OCLHashcat, which are giving 1,000 and 1,500 MH/sec respectively (heck, I'd be well-chuffed with a 3rd of that!).
So, what I'm doing
I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application to load data to the GPU (using CL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash in a threaded fashion on the CPU and loading this data onto the global GPU memory using the CL C++ call to enqueueWriteBuffer (using uchars to represent the bytes to hash):
errorCode = dispatchQueue->enqueueWriteBuffer(
    inputBuffer,
    CL_FALSE, //CL_TRUE,
    0,
    sizeof(cl_uchar) * inputBufferSize,
    passwordBuffer,
    NULL,
    &dispatchDelegate);
I'm enqueuing data using enqueueNDRangeKernel in the following manner (where the global worksize is a user-defined variable; at present I've set this to my GPU's maximum flattened global worksize of 16.777 million per run):
errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalWorkgroupSize, 1),
    NullRange,
    NULL,
    NULL);
This means that (per dispatch) I load 16.777 million items in a 1D array and index into this from my kernel using get_global_id(0).
My Kernel signature:
__kernel void sha1Crack(__global uchar* out, __global uchar* in,
                        __constant int* passLen, __constant int* targetHash,
                        __global bool* collisionFound)
{
    //Kernel Instance Global GPU Mem IO Mapping:
    __private int passwordLen = *passLen;
    __private int id = get_global_id(0);
    __private int inputIndexStart = id * passwordLen;
    __private uchar inputMem[64]; // working buffer (size assumed here; the real declaration was omitted with the SHA-1 code)

    //Select Password input key space:
    #pragma unroll
    for (int i = 0; i < passwordLen; i++)
    {
        inputMem[i] = in[inputIndexStart + i];
    }

    //SHA1 Code omitted for brevity...
}
So, given all this: am I doing something fundamentally wrong in the way I'm loading data? I.e. one call to enqueueNDRangeKernel for 16.7 million kernel executions over a 1D input vector? Should I be using a 2D space and sub-dividing it into local workgroup ranges? I tried playing with this but it didn't seem any quicker.
Or, perhaps as likely is my algorithm itself the source of slowness? I've spent a good while optimizing it and manually unrolling all of the loop stages using pre-processor directives.
I've read about memory coalescing on the hardware. Could that be my issue? :S
Any advice at all appreciated! If I've missed anything important please let me know and I'll update.
Thanks in advance! ;)
Update: 16,777,216 is the device's maximum reported work size; 256**3. The global array of boolean values is a single boolean. It's set to false at the start of the kernel enqueue, and a branching statement then sets it to true only if a collision is found - will that force a divergence? passwordLen is the length of the current input value, and targetHash is an int[4]-encoded hash to check against.
Your 'maximum flattened global worksize' should be multiplied by passwordLen. It is the number of kernels you can run, not the maximal length of an input array. You can most likely send much more data than this to the GPU.
Other potential issues: the 'generating blocks of candidate data to hash in a threaded fashion on the CPU' - try doing this in advance of the kernel iterations to see whether the delay is in the generation of the data blocks or in the processing of the kernels. Your SHA-1 algorithm is the other obvious potential issue. I'm not sure how much you've really optimised it by 'unrolling' the loops; usually the bigger optimisation issue is 'if' statements (if a single kernel instance within a workgroup tests true, then all of the lockstepped workgroup instances must follow that branch in parallel).
And DarkZeros is correct, you should manually play with the local workgroup size, making it the highest common multiple of the global size and the number of kernels which can be run at once on the card. The easiest way to do this is to round up the global work size to the next multiple of the card's capacity, and to use an external if{} statement in the kernel so that it only does its work when global_id is less than the actual number of kernels you want to run (see the sketch below).
Dave.
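For illustration, a minimal host-side sketch of that rounding-plus-guard pattern, using the same C++ wrapper calls as the question; actualItems and localSize are hypothetical names, and the guard itself lives in the kernel:

// Hypothetical: round the global work size up to a multiple of the chosen
// local size; the kernel then guards with
//     if (get_global_id(0) < actualItems) { ... }
// so the padded work-items simply do nothing.
size_t actualItems = 16777216;   // work-items you actually need
size_t localSize   = 256;        // chosen local workgroup size
size_t globalSize  = ((actualItems + localSize - 1) / localSize) * localSize;

errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalSize, 1),
    NDRange(localSize, 1),
    NULL,
    NULL);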

Linux AIO: Poor Scaling

I am writing a library that uses the Linux asynchronous I/O system calls, and would like to know why the io_submit function is exhibiting poor scaling on the ext4 file system. If possible, what can I do to get io_submit not to block for large IO request sizes? I already do the following (as described here):
Use O_DIRECT.
Align the IO buffer to a 512-byte boundary.
Set the buffer size to a multiple of the page size.
In order to observe how long the kernel spends in io_submit, I ran a test in which I created a 1 Gb test file using dd and /dev/urandom, and repeatedly dropped the system cache (sync; echo 1 > /proc/sys/vm/drop_caches) and read increasingly larger portions of the file. At each iteration, I printed the time taken by io_submit and the time spent waiting for the read request to finish. I ran the following experiment on an x86-64 system running Arch Linux, with kernel version 3.11. The machine has an SSD and a Core i7 CPU. The first graph plots the number of pages read against the time spent waiting for io_submit to finish. The second graph displays the time spent waiting for the read request to finish. The times are measured in seconds.
For comparison, I created a similar test that uses synchronous IO by means of pread. Here are the results:
It seems that the asynchronous IO works as expected up to request sizes of around 20,000 pages. After that, io_submit blocks. These observations lead to the following questions:
Why isn't the execution time of io_submit constant?
What is causing this poor scaling behavior?
Do I need to split up all read requests on ext4 file systems into multiple requests, each of size less than 20,000 pages?
Where does this "magic" value of 20,000 come from? If I run my program on another Linux system, how can I determine the largest IO request size to use without experiencing poor scaling behavior?
The code used to test the asynchronous IO follows below. I can add other source listings if you think they are relevant, but I tried to post only the details that I thought might be relevant.
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::atoi, std::free, EXIT_FAILURE
#include <cstring>
#include <chrono>
#include <iostream>
#include <memory>

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

// For `__NR_*` system call definitions.
#include <sys/syscall.h>
#include <linux/aio_abi.h>

static int
io_setup(unsigned n, aio_context_t* c)
{
    return syscall(__NR_io_setup, n, c);
}

static int
io_destroy(aio_context_t c)
{
    return syscall(__NR_io_destroy, c);
}

static int
io_submit(aio_context_t c, long n, iocb** b)
{
    return syscall(__NR_io_submit, c, n, b);
}

static int
io_getevents(aio_context_t c, long min, long max, io_event* e, timespec* t)
{
    return syscall(__NR_io_getevents, c, min, max, e, t);
}

int main(int argc, char** argv)
{
    using namespace std::chrono;
    const auto n = 4096 * size_t(std::atoi(argv[1]));

    // Initialize the file descriptor. If O_DIRECT is not used, the kernel
    // will block on `io_submit` until the job finishes, because non-direct
    // IO via the `aio` interface is not implemented (to my knowledge).
    auto fd = ::open("dat/test.dat", O_RDONLY | O_DIRECT | O_NOATIME);
    if (fd < 0) {
        ::perror("Error opening file");
        return EXIT_FAILURE;
    }

    char* p;
    auto r = ::posix_memalign((void**)&p, 512, n);
    if (r != 0) {
        std::cerr << "posix_memalign failed." << std::endl;
        return EXIT_FAILURE;
    }
    auto del = [](char* p) { std::free(p); };
    std::unique_ptr<char[], decltype(del)> buf{p, del};

    // Initialize the IO context.
    aio_context_t c{0};
    r = io_setup(4, &c);
    if (r < 0) {
        ::perror("Error invoking io_setup");
        return EXIT_FAILURE;
    }

    // Setup I/O control block.
    iocb b;
    std::memset(&b, 0, sizeof(b));
    b.aio_fildes = fd;
    b.aio_lio_opcode = IOCB_CMD_PREAD;

    // Command-specific options for `pread`.
    b.aio_buf = (uint64_t)buf.get();
    b.aio_offset = 0;
    b.aio_nbytes = n;
    iocb* bs[1] = {&b};

    auto t1 = high_resolution_clock::now();
    r = io_submit(c, 1, bs);
    if (r != 1) {
        if (r == -1) {
            ::perror("Error invoking io_submit");
        }
        else {
            std::cerr << "Could not submit request." << std::endl;
        }
        return EXIT_FAILURE;
    }
    auto t2 = high_resolution_clock::now();
    auto count = duration_cast<duration<double>>(t2 - t1).count();
    // Print the wait time.
    std::cout << count << " ";

    io_event e[1];
    t1 = high_resolution_clock::now();
    r = io_getevents(c, 1, 1, e, NULL);
    t2 = high_resolution_clock::now();
    count = duration_cast<duration<double>>(t2 - t1).count();
    // Print the read time.
    std::cout << count << std::endl;

    r = io_destroy(c);
    if (r < 0) {
        ::perror("Error invoking io_destroy");
        return EXIT_FAILURE;
    }
}
My understanding is that very few (if any) filesystems on Linux fully support AIO. Some filesystem operations still block, and sometimes io_submit() will, indirectly via filesystem operations, invoke such blocking calls.
My understanding is further that the main users of kernel AIO primarily care about AIO being truly asynchronous on raw block devices (i.e. no filesystem) - essentially database vendors.
Here's a relevant post from the linux-aio mailing list. (head of the thread)
A possibly useful recommendation:
Add more requests via /sys/block/xxx/queue/nr_requests and the problem
will get better.
Why isn't the execution time of io_submit constant?
Because you are submitting I/Os that are so big, the block layer has to split them up and then queue the resulting requests. This can then cause you to hit resource limitations that in turn cause io_submit() to behave as if it's blocking...
What is causing this poor scaling behavior?
The bigger the I/O is over the splitting threshold (see below), the more likely it becomes that the number of splits done to turn it into appropriately sized requests will also increase (presumably actually doing the splits costs a small amount of time too). With direct I/O, io_submit() does not return until all its requests have been allocated and queued at the block layer level. Further, the number of requests that can be queued by the block layer for a given disk is limited to /sys/block/[disk_device]/queue/nr_requests. Exceeding this limit leads to io_submit() blocking until enough request slots have been freed up that all its allocations can be satisfied (this is related to what Arvid was recommending).
Do I need to split up all read requests on ext4 file systems into multiple requests, each of size less than 20,000 pages?
Ideally you should split your requests into far smaller amounts than that - 20,000 pages (assuming a 4096-byte page, which is what is used on x86 platforms) is roughly 78 megabytes! This doesn't just apply when you're using ext4 - doing such large io_submit() I/O sizes to other filesystems, or even directly to block devices, is unlikely to perform well.
If you work out which disk device your filesystem is on and look at /sys/block/[disk_device]/queue/max_sectors_kb that will give you an upper bound but the bound at which splitting starts may be even smaller so you may want to limit the size of each I/O to /sys/block/[disk_device]/queue/max_segments * PAGE_SIZE instead.
Where does this "magic" value of 20,000 come from?
This is likely down to some combination of:
The maximum size each I/O can be before the block layer splits it (at most this will be /sys/block/[disk_device]/queue/max_sectors_kb but the observed split limit may be even lower)
The maximum number of I/Os that can be queued before blocking occurs (/sys/block/[disk_device]/queue/nr_requests)
Your hardware's command queue depth (/sys/block/[disk_device]/device/queue_depth)
How fast your disk is at completing requests. When the kernel can't queue any more I/Os to the real device (due to the hardware queue_depth being full and the kernel's additional queues being full) it becomes blocking on new requests until in-flight ones sent to the hardware have completed.
If I run my program on another Linux system, how can I determine the largest IO request size to use without experiencing poor scaling behavior?
Limit each request I/O to the lower of /sys/block/[disk_device]/queue/max_sectors_kb or /sys/block/[disk_device]/queue/max_segments * PAGE_SIZE. I would imagine I/Os no bigger than 524288 bytes should be safe but your hardware may be able to cope with a larger size and thus get a higher throughput but possibly at the expense of completion (as opposed to submission) latency.
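For illustration only (my own sketch, with a hypothetical device name - mapping a file to its underlying disk is left out), reading those two sysfs values and deriving a per-request cap could look like this:

// Hypothetical helper: derive a per-request I/O size cap from the block
// device's sysfs limits, as described above.
#include <algorithm>
#include <fstream>
#include <unistd.h>

static long read_sysfs_long(const char* path)
{
    std::ifstream in(path);
    long v = 0;
    in >> v;        // stays 0 if the file is missing or unreadable
    return v;
}

static long max_request_bytes()
{
    long max_sectors_kb = read_sysfs_long("/sys/block/sda/queue/max_sectors_kb");
    long max_segments   = read_sysfs_long("/sys/block/sda/queue/max_segments");
    long page_size      = ::sysconf(_SC_PAGESIZE);
    // Take the smaller of the two limits discussed above.
    return std::min(max_sectors_kb * 1024, max_segments * page_size);
}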
If possible, what can I do to get io_submit not to block for large IO request sizes?
There's going to be an upper "good" limit and if you surpass it there are going to be consequences which you can't escape.
Related questions
asynchronous IO io_submit latency in Ubuntu Linux
You are missing the purpose of using AIO in the first place. The referenced example shows a sequence of [fill-buffer], [write], [write], [write], ... [read], [read], [read], ... operations. In effect you are stuffing data down a pipe. Eventually the pipe fills up when you reach the I/O bandwidth limit of your storage. Now you busy-wait, which shows up as your linear performance-degradation behavior.
The performance gains for an AIO write is that the application fills a buffer and then tells the kernel to begin the write operation; control returns to the application immediately while the kernel still owns the data buffer and its content; until the kernel signals I/O complete, the application must not touch the data buffer because you don't know yet what part (if any) of the buffer has actually made it to the media: modify the buffer before the I/O is complete and you've corrupted the data going out to the media.
Conversely, the gain from an AIO read is when the application allocates an I/O buffer, and then tells the kernel to begin filling the buffer. Control returns to the application immediately and the application must leave the buffer alone until the kernel signifies it is finished with the buffer by posting the I/O completion event.
So the behavior you see is the example quickly filling a pipeline to the storage. Eventually data is generated faster than the storage can absorb it, and performance drops to linearity: the pipeline is refilled only as quickly as it is emptied.
The example program does use AIO calls but it's still a linear stop-and-wait program.