CUDA streams not running in parallel - c++

Given this code:
void foo(cv::gpu::GpuMat const &src, cv::gpu::GpuMat *dst[], cv::Size const dst_size[], size_t numImages)
{
    cudaStream_t streams[numImages];
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamCreateWithFlags(&streams[image], cudaStreamNonBlocking);
        dim3 Threads(32, 16);
        dim3 Blocks((dst_size[image].width  + Threads.x - 1) / Threads.x,
                    (dst_size[image].height + Threads.y - 1) / Threads.y);
        myKernel<<<Blocks, Threads, 0, streams[image]>>>(src, dst[image], dst_size[image]);
    }
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamSynchronize(streams[image]);
        cudaStreamDestroy(streams[image]);
    }
}
Looking at the output of nvvp, I see almost perfectly serial execution, even though the first stream is a lengthy process that the others should be able to overlap with.
Note that my kernel uses 30 registers, and all launches report an "Achieved Occupancy" of around 0.87. For the smallest image, the Grid Size is [10,15,1] and the Block Size [32,16,1].

The conditions governing concurrent kernel execution are given in the CUDA programming guide (link), but the gist of it is that your GPU can potentially run multiple kernels from different streams only if it has sufficient free resources to do so.
In your usage case, you have said that you are running multiple launches of a kernel of 150 blocks of 512 threads each. Your GPU has 12 SMMs (I think), and you could have up to 4 blocks per SMM running concurrently (4 * 512 = 2048 threads, which is the SMM limit). So your GPU can only run a maximum of 4 * 12 = 48 blocks concurrently. With multiple launches of 150 blocks sitting in the command pipeline, it would seem that there is little (perhaps even no) opportunity for concurrent kernel execution.
You might be able to encourage kernel execution overlap if you increase the scheduling granularity of your kernel by reducing the block size. Smaller blocks are more likely to find available resources and scheduling slots than larger blocks. Similarly, reducing the total block count per kernel launch (probably by increasing the parallel work per thread) might also help increase the potential for overlap or concurrent execution of multiple kernels.
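For example, a grid-stride loop lets the same image be covered by a small, fixed number of blocks. The sketch below is only an illustration: it assumes a plain float* interface rather than your actual myKernel signature, and the per-pixel work is a placeholder.
__global__ void myKernelStrided(const float *src, float *dst, int width, int height)
{
    int total = width * height;
    // Each thread strides through the image, so a launch with e.g. 24 blocks
    // still touches every pixel while leaving SMs free for other streams.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < total;
         idx += gridDim.x * blockDim.x)
    {
        dst[idx] = src[idx]; // placeholder for the real per-pixel work
    }
}

// Launched with a deliberately small block count, e.g.:
// myKernelStrided<<<24, 256, 0, streams[image]>>>(srcPtr, dstPtr, w, h);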

Related

Cuda Multi-GPU Latency

I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48GB vRAM) working in parallel. The issue I face is that, for the simple block of code shown below, I would expect the block to take the same amount of time regardless of the presence of the Device 2 code, as the copies are running asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of the same length.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
However, when I actually run the block, the Device 1 code alone takes 80 ms, while adding the Device 2 code adds another 20 ms, bringing the block's execution time to 100 ms. I profiled the code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD transfer of Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening, because as far as I can tell these operations run independently, on different GPUs.
I realise that I haven't created separate CPU threads to handle the separate devices, but I'm not sure whether that would help. Could someone please help me understand why this elongation of duration happens when I add the Device 2 code?
EDIT:
I profiled the code and expected the execution durations to be independent of the other GPU, although I realise cudaMemcpyAsync involves the host as well, and perhaps the addition of Device 2 puts more stress on the CPU as it now has to handle additional transfers...?
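One restructuring I am considering (untested, so purely a sketch) is to give each device its own host thread, so that one device's cudaMemcpyAsync and cudaStreamSynchronize calls are never issued behind the other device's API calls on the same CPU thread. The names aDest/aHost/bDest/bHost/cDest/cHost, N, stream1 and stream2 are the ones from the code above and are assumed to be set up as shown there.
#include <cuda_runtime.h>
#include <thread>

void copyLoopTwoThreads()
{
    for (int i = 0; i < 2; i++) {
        // Each lambda runs on its own host thread; cudaSetDevice binds that
        // thread to the given device before issuing the async copies.
        std::thread dev0([&] {
            cudaSetDevice(0);
            cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
            cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
            cudaStreamSynchronize(stream1);
        });
        std::thread dev1([&] {
            cudaSetDevice(1);
            cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);
            cudaStreamSynchronize(stream2);
        });
        dev0.join();
        dev1.join();
    }
}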

Very long D2H time vs. H2D (CUDA)

I have an application that uses CUDA to process data. The basic flow is:
Transfer data H2D (this is around 1.5k integers)
Invoke several kernels that transform and reduce the data to a single int value
Copy the result D2H
Profiling with Nsight shows that the H2D and D2H transfers average around 13 µs and 70 µs respectively. This is strange to me, as the D2H is moving a tiny amount of data compared to the H2D.
Both input and output memory locations are pinned.
Is this difference in transfer duration expected, or am I doing something wrong?
//allocating the memory locations for IO
cudaMallocHost((void**)&gpu_permutation_data, size_t(rowsPerThread) * size_t(permutation_size) * sizeof(keyEntry));
cudaMallocHost((void**)&gpu_constant_maxima, sizeof(keyEntry));
//H2D
cudaMemcpy(gpu_permutation_data, input.data(), size_t(permutation_size) * size_t(rowsPerThread) * sizeof(keyEntry), cudaMemcpyHostToDevice);
// kernels go here
//D2H
cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
As Robert pointed out, Nsight displays the time from API start to finish, so the time between when the copy API is called and when the copy actually starts (after the preceding kernels are done) is included.
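One way to confirm that the extra time is just the wait is to synchronize first and time the copy on its own with CUDA events; a minimal sketch, reusing result, gpu_constant_maxima and keyEntry from the code above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaDeviceSynchronize();                 // ensure all preceding kernels have finished
cudaEventRecord(start);
cudaMemcpy(&result, gpu_constant_maxima, sizeof(keyEntry), cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // duration of the copy alone, in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);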

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
I am using a process that reads data from the first file, processes the data and writes the results to the second file.
I have noticed a strong delay at times (the reason is beyond my knowledge) when reading from or writing to the mapped view with the memcpy function.
Mapped files are created this way:
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And the memory mapping is done this way:
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values match the allocation granularity from the system info.
The process reads about 300 KB, N times, stores that in a buffer, processes it, and then writes the processed contents of the previous buffer to the second file, again 300 KB at a time, N times.
I have two different memory views (created/moved with the MapViewOfFile function) with a default size of 10 MB.
For the memory view size, I tested 10 KB, 100 KB, 1 MB, 10 MB and 100 MB. Statistically there is no difference: 80% of the time the reading process behaves as described below (~200 ms), but the writing process is really slow.
Normally:
1/ Reading is done in ~200 ms.
2/ Processing is done in 2.9 seconds.
3/ Writing is done in ~200 ms.
However, 80% of the time either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example: for writing, I am using the code below.
for (unsigned int i = 0; i < N; i++) // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);

    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I see that the second memcpy call is always the one taking most of the time.
So far, I am seeing N calls to memcpy (N = 500 for the test) taking up to 10 seconds in the worst case when using this shared memory.
I wrote a temporary program that performed the same number of memcpy calls on the same amount of data and could not reproduce the problem.
For the tests I used the following conditions, which all show the same delay:
1/ I can see this on various computers, 32- or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization) for reading/writing.
3/ OS on a SATA disk or SSD, the software's memory-mapped files physically on a SATA disk or SSD, and when on an external drive, tests done through USB 1, USB 2 or USB 3.
Could you tell me what mistake I might be making for memcpy to be this slow?
Best regards.
I found a solution that works for me, but it might not for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces the mapped file to be updated.
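In rough form, with the handles and view pointer created above, the calls added after the for loop look like this:
// Flush the dirty pages of the mapped view back to the file, then ask the OS
// to write the file's buffered data to disk.
FlushViewOfFile(m_lpMapView, static_cast<SIZE_T>(m_i64ViewSize));
FlushFileBuffers(m_hFile);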
I no longer see the "random" delay, but instead of the expected 200 ms I now average 400 ms, which is good enough for my application.
After doing some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flushes should be used carefully.
Thanks.

Concurrent execution of two processes sharing a Tesla K20

I have been experiencing some strange behaviour when I launch 2 instances of a kernel so that they run at the same time while sharing the GPU's resources.
I have developed a CUDA kernel which is meant to run in a single SM (multiprocessor), where the threads perform an operation several times (with a loop).
The kernel is set up to create only one block, and therefore to use only one SM.
simple.cu
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <helper_cuda.h>
using namespace std;

__global__ void increment(float *in, float *out)
{
    int it = 0, i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.8525852f;

    for (it = 0; it < 99999999; it++)
        out[i] += (in[i] + a) * a - (in[i] + a);
}

int main(int argc, char* argv[])
{
    int i;
    int nBlocks = 1;
    int threadsPerBlock = 1024;
    float *A, *d_A, *d_B, *B;
    size_t size = 1024 * 13;

    A = (float *) malloc(size * sizeof(float));
    B = (float *) malloc(size * sizeof(float));
    for (i = 0; i < size; i++) {
        A[i] = 0.74;
        B[i] = 0.36;
    }

    cudaMalloc((void **) &d_A, size * sizeof(float));
    cudaMalloc((void **) &d_B, size * sizeof(float));
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    increment<<<nBlocks, threadsPerBlock>>>(d_A, d_B);
    cudaDeviceSynchronize();

    cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);

    free(A);
    free(B);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaDeviceReset();
    return (0);
}
So if I execute the kernel:
time ./simple
I get
real 0m36.659s
user 0m4.033s
sys 0m1.124s
Otherwise, if I execute two instances:
time ./simple & time ./simple
I get for each process:
real 1m12.417s
user 0m29.494s
sys 0m42.721s
real 1m12.440s
user 0m36.387s
sys 0m8.820s
As far as I know, the executions should run concurrently and take about as long as a single one (about 36 seconds). However, they last twice the base time. We know that the GPU has 13 SMs and that each kernel creates only 1 block, so each should execute on its own SM.
Are they being executed in the same SM?
Shouldn't they be running concurrently in different SMs?
EDITED
To make myself clearer, I attach the profiles of the concurrent execution, obtained from nvprof:
Profile, first instance
Profile, second instance
Now, I would like to show you the behavior of the same scenario but executing concurrently two instances of matrixMul sample:
Profile, first instance
Profile, second instance
As you can see, in the first scenario a kernel waits for the other to finish, while in the second scenario (matrixMul) kernels from both contexts run at the same time.
Thank you.
When you run two separate processes using the same GPU, they each have their own context. CUDA doesn't support concurrent execution from multiple contexts on the same device. Instead, each context competes for the device in an undefined manner, with driver-level context switching. That is why the execution behaves as if the processes are serialised -- effectively they are, but at a driver rather than GPU level.
There are technologies available (MPS, Hyper-Q) which can do what you want, but the way you are trying to do this won't work.
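For completeness: if both launches can live in a single process (and therefore a single context), plain streams already give you the overlap. A minimal sketch using your increment kernel, where d_A1/d_B1 and d_A2/d_B2 stand for two independently allocated pairs of device buffers (not part of your original code):
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Two single-block launches from the same context; Hyper-Q can place them
// on different SMs and run them concurrently.
increment<<<1, 1024, 0, s1>>>(d_A1, d_B1);
increment<<<1, 1024, 0, s2>>>(d_A2, d_B2);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);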
Edit to respond to the update in your question:
The example you have added using the MatrixMul sample doesn't show what you think it does. That application runs 300 short kernels and computes a performance number over the average of those 300 runs. Your profiling display is set to a very coarse timescale, so it looks like there is a single long-running kernel launch, when in fact it is a series of very short kernels.
To illustrate this, consider the following:
This is a normal profiling run for a single MatrixMul process running on a Kepler device (image). Note that there are many individual kernels running directly after one another.
These are the profiling traces of two simultaneous MatrixMul processes running on the same Kepler device (images):
Note that there are gaps in the profile traces of each process; this is where context switching between the two processes is occurring. The behaviour is identical to your original example, just at a much finer time granularity. As has been repeated a number of times by several different people in the course of this discussion, CUDA doesn't support multiple contexts on the same device simultaneously using the standard runtime API. The MPS server does allow this by adding a daemon which reimplements the API with a large shared internal Hyper-Q pipeline, but you are not using it, and it has no bearing on the results you have shown in this question.

Linux AIO: Poor Scaling

I am writing a library that uses the Linux asynchronous I/O system calls, and would like to know why the io_submit function is exhibiting poor scaling on the ext4 file system. If possible, what can I do to get io_submit not to block for large IO request sizes? I already do the following (as described here):
Use O_DIRECT.
Align the IO buffer to a 512-byte boundary.
Set the buffer size to a multiple of the page size.
In order to observe how long the kernel spends in io_submit, I ran a test in which I created a 1 GB test file using dd and /dev/urandom, repeatedly dropped the system cache (sync; echo 1 > /proc/sys/vm/drop_caches), and read increasingly larger portions of the file. At each iteration, I printed the time taken by io_submit and the time spent waiting for the read request to finish. I ran the following experiment on an x86-64 system running Arch Linux, with kernel version 3.11. The machine has an SSD and a Core i7 CPU. The first graph plots the number of pages read against the time spent waiting for io_submit to finish. The second graph displays the time spent waiting for the read request to finish. The times are measured in seconds.
For comparison, I created a similar test that uses synchronous IO by means of pread. Here are the results:
It seems that the asynchronous IO works as expected up to request sizes of around 20,000 pages. After that, io_submit blocks. These observations lead to the following questions:
Why isn't the execution time of io_submit constant?
What is causing this poor scaling behavior?
Do I need to split up all read requests on ext4 file systems into multiple requests, each of size less than 20,000 pages?
Where does this "magic" value of 20,000 come from? If I run my program on another Linux system, how can I determine the largest IO request size to use without experiencing poor scaling behavior?
The code used to test the asynchronous IO follows below. I can add other source listings if you think they are relevant, but I tried to post only the details that I thought might be relevant.
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <iostream>
#include <memory>

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

// For `__NR_*` system call definitions.
#include <sys/syscall.h>
#include <linux/aio_abi.h>

static int
io_setup(unsigned n, aio_context_t* c)
{
    return syscall(__NR_io_setup, n, c);
}

static int
io_destroy(aio_context_t c)
{
    return syscall(__NR_io_destroy, c);
}

static int
io_submit(aio_context_t c, long n, iocb** b)
{
    return syscall(__NR_io_submit, c, n, b);
}

static int
io_getevents(aio_context_t c, long min, long max, io_event* e, timespec* t)
{
    return syscall(__NR_io_getevents, c, min, max, e, t);
}

int main(int argc, char** argv)
{
    using namespace std::chrono;
    const auto n = 4096 * size_t(std::atoi(argv[1]));

    // Initialize the file descriptor. If O_DIRECT is not used, the kernel
    // will block on `io_submit` until the job finishes, because non-direct
    // IO via the `aio` interface is not implemented (to my knowledge).
    auto fd = ::open("dat/test.dat", O_RDONLY | O_DIRECT | O_NOATIME);
    if (fd < 0) {
        ::perror("Error opening file");
        return EXIT_FAILURE;
    }

    char* p;
    auto r = ::posix_memalign((void**)&p, 512, n);
    if (r != 0) {
        std::cerr << "posix_memalign failed." << std::endl;
        return EXIT_FAILURE;
    }
    auto del = [](char* p) { std::free(p); };
    std::unique_ptr<char[], decltype(del)> buf{p, del};

    // Initialize the IO context.
    aio_context_t c{0};
    r = io_setup(4, &c);
    if (r < 0) {
        ::perror("Error invoking io_setup");
        return EXIT_FAILURE;
    }

    // Set up the I/O control block.
    iocb b;
    std::memset(&b, 0, sizeof(b));
    b.aio_fildes = fd;
    b.aio_lio_opcode = IOCB_CMD_PREAD;

    // Command-specific options for `pread`.
    b.aio_buf = (uint64_t)buf.get();
    b.aio_offset = 0;
    b.aio_nbytes = n;
    iocb* bs[1] = {&b};

    auto t1 = high_resolution_clock::now();
    r = io_submit(c, 1, bs);
    if (r != 1) {
        if (r == -1) {
            ::perror("Error invoking io_submit");
        }
        else {
            std::cerr << "Could not submit request." << std::endl;
        }
        return EXIT_FAILURE;
    }
    auto t2 = high_resolution_clock::now();
    auto count = duration_cast<duration<double>>(t2 - t1).count();
    // Print the wait time.
    std::cout << count << " ";

    io_event e[1];
    t1 = high_resolution_clock::now();
    r = io_getevents(c, 1, 1, e, NULL);
    t2 = high_resolution_clock::now();
    count = duration_cast<duration<double>>(t2 - t1).count();
    // Print the read time.
    std::cout << count << std::endl;

    r = io_destroy(c);
    if (r < 0) {
        ::perror("Error invoking io_destroy");
        return EXIT_FAILURE;
    }
}
My understanding is that very few (if any) filesystems on Linux fully support AIO. Some filesystem operations still block, and sometimes io_submit() will, indirectly via filesystem operations, invoke such blocking calls.
My understanding is further that the main users of kernel AIO primarily care about AIO being truly asynchronous on raw block devices (i.e. no filesystem): essentially database vendors.
Here's a relevant post from the linux-aio mailing list (head of the thread).
A possibly useful recommendation:
Add more requests via /sys/block/xxx/queue/nr_requests and the problem
will get better.
Why isn't the execution time of io_submit constant?
Because you are submitting I/Os that are so big, the block layer has to split them up and then queue the resulting requests. This can then cause you to hit resource limitations that in turn cause io_submit() to behave as if it's blocking...
What is causing this poor scaling behavior?
The further an I/O is over the splitting threshold (see below), the more splits will be done to turn it into appropriately sized requests (and presumably actually doing the splits costs a small amount of time too). With direct I/O, io_submit() does not return until all its requests have been allocated and queued at the block-layer level. Further, the number of requests that can be queued by the block layer for a given disk is limited to /sys/block/[disk_device]/queue/nr_requests. Exceeding this limit leads to io_submit() blocking until enough request slots have been freed up that all its allocations can be satisfied (this is related to what Arvid was recommending).
Do I need to split up all read requests on ext4 file systems into multiple requests, each of size less than 20,000 pages?
Ideally you should split your requests into far smaller amounts than that - 20,000 pages (assuming a 4096-byte page, which is what is used on x86 platforms) is roughly 78 megabytes! This doesn't just apply to ext4 - submitting such large io_submit() I/O sizes to other filesystems, or even directly to block devices, is unlikely to perform well.
If you work out which disk device your filesystem is on and look at /sys/block/[disk_device]/queue/max_sectors_kb, that will give you an upper bound, but the bound at which splitting starts may be even smaller, so you may want to limit the size of each I/O to /sys/block/[disk_device]/queue/max_segments * PAGE_SIZE instead.
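If it helps, here is a rough sketch of deriving that cap at runtime. The device name sda is only an example, and the paths assume your filesystem sits directly on that block device (no LVM/MD layer in between):
#include <algorithm>
#include <fstream>
#include <iostream>
#include <unistd.h>

// Read a single numeric value from a sysfs file (0 if it can't be read).
static size_t read_sysfs_value(const char* path)
{
    std::ifstream f(path);
    size_t v = 0;
    f >> v;
    return v;
}

int main()
{
    size_t max_sectors_kb = read_sysfs_value("/sys/block/sda/queue/max_sectors_kb");
    size_t max_segments   = read_sysfs_value("/sys/block/sda/queue/max_segments");
    size_t page_size      = static_cast<size_t>(sysconf(_SC_PAGESIZE));

    // Cap each submitted I/O at the smaller of the two limits discussed above.
    size_t cap = std::min(max_sectors_kb * 1024, max_segments * page_size);
    std::cout << "per-I/O size cap: " << cap << " bytes\n";
}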
Where does this "magic" value of 20,000 come from?
This is likely down to some combination of:
The maximum size each I/O can be before the block layer splits it (at most this will be /sys/block/[disk_device]/queue/max_sectors_kb but the observed split limit may be even lower)
The maximum number of I/Os that can be queued before blocking occurs (/sys/block/[disk_device]/queue/nr_requests)
Your hardware's command queue depth (/sys/block/[disk_device]/device/queue_depth)
How fast your disk is at completing requests. When the kernel can't queue any more I/Os to the real device (because the hardware queue_depth is full and the kernel's additional queues are full), it starts blocking on new requests until in-flight ones sent to the hardware have completed.
If I run my program on another Linux system, how can I determine the largest IO request size to use without experiencing poor scaling behavior?
Limit each I/O request to the lower of /sys/block/[disk_device]/queue/max_sectors_kb and /sys/block/[disk_device]/queue/max_segments * PAGE_SIZE. I would imagine I/Os no bigger than 524288 bytes should be safe, but your hardware may be able to cope with a larger size and thus get higher throughput, possibly at the expense of completion (as opposed to submission) latency.
If possible, what can I do to get io_submit not to block for large IO request sizes?
There's going to be an upper "good" limit and if you surpass it there are going to be consequences which you can't escape.
Related questions
asynchronous IO io_submit latency in Ubuntu Linux
You are missing the purpose of using AIO in the first place. The referenced example shows a sequence of [fill-buffer], [write], [write], [write], ... [read], [read], [read], ... operations. In effect you are stuffing data down a pipe. Eventually the pipe fills up when you reach the I/O bandwidth limit of your storage. Now you busy-wait, which shows up as the linear performance-degradation behaviour you observe.
The performance gain for an AIO write is that the application fills a buffer and then tells the kernel to begin the write operation; control returns to the application immediately while the kernel still owns the data buffer and its contents. Until the kernel signals I/O completion, the application must not touch the data buffer, because you don't know yet what part (if any) of the buffer has actually made it to the media: modify the buffer before the I/O is complete and you've corrupted the data going out to the media.
Conversely, the gain from an AIO read is that the application allocates an I/O buffer and then tells the kernel to begin filling the buffer. Control returns to the application immediately, and the application must leave the buffer alone until the kernel signals it is finished with the buffer by posting the I/O completion event.
So the behaviour you see is the example quickly filling the pipeline to the storage. Eventually data are generated faster than the storage can absorb them, and performance drops to a linear regime in which the pipeline is refilled only as quickly as it is emptied.
The example program does use AIO calls but it's still a linear stop-and-wait program.
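To illustrate the contract (submit, leave the buffers alone, then harvest completions before touching the data), here is a rough sketch that keeps several smaller reads in flight at once. It reuses the wrapper functions, the open fd, the context c and the buffer buf from the test program earlier in this thread; the chunk size and queue depth are illustrative, and the buffer is assumed to be at least depth * chunk bytes.
// Submit a batch of smaller reads in one call, then wait for all completions.
const size_t chunk = 1 << 20;                 // 1 MiB per request (illustrative)
const int depth = 4;                          // matches io_setup(4, &c) above
iocb blocks[depth];
iocb* ptrs[depth];
for (int i = 0; i < depth; ++i) {
    std::memset(&blocks[i], 0, sizeof(iocb));
    blocks[i].aio_fildes = fd;
    blocks[i].aio_lio_opcode = IOCB_CMD_PREAD;
    blocks[i].aio_buf = (uint64_t)(buf.get() + i * chunk);
    blocks[i].aio_offset = i * chunk;
    blocks[i].aio_nbytes = chunk;
    ptrs[i] = &blocks[i];
}
io_submit(c, depth, ptrs);                    // all four reads queued together
// The buffer must not be read or modified here; the kernel still owns it.
io_event events[depth];
io_getevents(c, depth, depth, events, NULL);  // only now is the data safe to use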