std::thread and OpenMP GPU Offloading - C++

I’ve been refactoring some code that uses OpenMP to offload parts of a larger function to an NVIDIA A100. Problem is, the section I’m trying to offload sits inside a larger function that is being threaded via std::thread in C++.
Specifically, each std::thread starts a function, and parts of that function are offloaded to the GPU via OpenMP. The OpenMP construct is the typical one, e.g. “#pragma omp target teams distribute parallel for”…
This seems to be causing the following runtime error:
> libgomp: cuLaunchKernel error: invalid resource handle
If I get rid of the concurrency (remove the std::thread usage) but keep the OpenMP offloading, it runs fine.
Any ideas what might be causing this? I’m unsure about the thread-safety of OpenMP GPU offloading.

The blog post CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs recommends always explicitly setting the device id on a freshly spawned CPU thread with cudaSetDevice().
Setting the device before spawning the threads only takes effect for the master thread that does the spawning; the new worker threads will by default try to use the default device, i.e. device id 0.

Related

C++ Using OpenMP #pragma and std::thread?

OpenMP's tasks and parallel for loops seem like a good choice for concurrent processing within a specific process, but I'm not sure they cover all use-cases.
What if I just want to spawn off an asynchronous task that doesn't need to return anything and that I don't want to wait for; I just want it to run in the background and end when done. Does OpenMP have this ability? And if not, can I safely use std::thread whilst also using OpenMP pragmas for other things?
And what if the spawned thread itself uses an OpenMP task group, whilst the parent thread is also using another OpenMP task group for something else? Is this going to cause issues?

Correct place to use cudaSetDeviceFlags?

Win10 x64, CUDA 8.0, VS2015, 6-core CPU (12 logical cores), 2 GTX580 GPUs.
In general, I'm working on a multithreaded application that launches 2 threads that are associated with 2 GPUs available, these threads are stored in a thread pool.
Each thread does the following initialization procedure upon its launch (i.e. this is done only once during the runtime of each thread):
::cudaSetDevice(0 or 1, as we have only two GPUs);
::cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
::cudaSetDeviceFlags(cudaDeviceMapHost | cudaDeviceScheduleBlockingSync);
Then, from other worker threads (12 more threads that do not touch the GPUs at all), I begin feeding these 2 GPU-associated worker threads with data. It works perfectly as long as the number of GPU threads launched is equal to the number of physical GPUs available.
Now I want to launch 4 GPU threads (i.e 2 threads per GPU) and make each one work via separate CUDA stream. I know the requirements that are essential for proper CUDA streams usage, I meet all of them. What I'm failing on is the initialization procedure mentioned above.
As soon as this procedure is executed twice from different GPU threads for the same GPU, ::cudaSetDeviceFlags(...) starts failing with a "cannot set while device is active in this process" error message.
I have looked into the manual and I think I see why this happens; what I can't understand is how to use ::cudaSetDeviceFlags(...) properly for my setup.
I can comment out this ::cudaSetDeviceFlags(...) line and the program will work fine even with 8 threads per GPU, but I need the cudaDeviceMapHost flag to be set in order to use streams; pinned memory won't be available otherwise.
EDIT Extra info to consider #1:
If ::cudaSetDeviceFlags is called before ::cudaSetDevice, then no error occurs.
Each GPU thread allocates a chunk of pinned memory via the ::VirtualAlloc -> ::cudaHostRegister approach upon thread launch (works just fine no matter how many GPU threads are launched) and deallocates it upon thread termination (via ::cudaHostUnregister -> ::VirtualFree). ::cudaHostUnregister fails with "pointer does not correspond to a registered memory region" for half the threads if the number of threads per GPU is greater than 1.
Well, the highly sophisticated method of try-this-try-that-see-what-happens-try-again finally did the trick, as always.
Here is the excerpt from the documentation on ::cudaSetDeviceFlags():
> Records flags as the flags to use when initializing the current device. If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.
Consequently, in the GPU worker thread it is necessary to call ::cudaSetDeviceFlags() before ::cudaSetDevice().
I have implemented something like this in the GPU thread initialization code in order to make sure that the device flags set before the device selection are actually applied:
// Set flags first, then make the device current, then verify the flags stuck.
bse__throw_CUDAHOST_FAILED(::cudaSetDeviceFlags(nFlagsOfDesire));
bse__throw_CUDAHOST_FAILED(::cudaSetDevice(nDevice));
unsigned int nDeviceFlagsActual = 0;
bse__throw_CUDAHOST_FAILED(::cudaGetDeviceFlags(&nDeviceFlagsActual));
bse__throw_IF(nFlagsOfDesire != nDeviceFlagsActual);
Also, the comment of talonmies showed the way to resolve the ::cudaHostUnregister errors.

Hybrid Parallelism: MPI and TBB

In TBB, the task_scheduler_init() method is often (and, it seems, should be) invoked internally; this is a deliberate design decision.
However, if we mix TBB and MPI, is it guaranteed to be thread-safe without controlling the number of threads of each MPI process?
For example, say we have 7 cores (with no hyper-threading) and 2 MPI processes. If each process spawns an individual TBB task using 4 threads simultaneously, then there is a conflict which might cause the program to crash at runtime.
I'm a newbie to TBB.
Looking forward to your comments and suggestions!
From the Intel TBB runtime's perspective, it does not matter whether the process is an MPI process or not. So if you have two processes, you will have two independent instances of the Intel TBB runtime. I am not sure I understand the question about thread-safety, but it should work correctly. However, you may get oversubscription, which can lead to performance issues.
In addition, you should check your MPI implementation's documentation if you call MPI routines concurrently (from multiple threads), because that can cause issues.
Generally speaking, this is a two-step tango:
MPI binds each task to some resources;
the thread runtime (TBB; the same applies to OpenMP) is generally smart enough to bind its threads within the previously provided resources.
Bottom line: if MPI binds its tasks to non-overlapping resources, then there should be no conflict caused by the TBB runtime.
A typical scenario is to run 2 MPI tasks with 8 OpenMP threads per task on a dual-socket octo-core box. As long as MPI binds each task to a socket, and the OpenMP runtime is told to bind its threads to cores, performance will be optimal.
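That scenario can be expressed with launcher options. A hedged example, assuming Open MPI's mpirun flags and the standard OpenMP environment variables (for TBB you would instead size and place threads in code, e.g. with tbb::task_arena):

```shell
# 2 ranks, one per socket; each rank's 8 OpenMP threads pinned to that
# socket's cores. The binding flags are Open MPI specific; adjust for
# your MPI implementation. "./app" is a placeholder binary name.
OMP_NUM_THREADS=8 OMP_PROC_BIND=close OMP_PLACES=cores \
  mpirun -np 2 --map-by socket --bind-to socket ./app
```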

Calls to GPU kernel from a multithreaded C++ application?

I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads: just call cudaSetDevice() to specify which CUDA device the thread should submit commands to.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
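A sketch of that pattern (not compiled here, since it needs the CUDA toolkit): each CPU thread selects the device, creates its own stream, and submits work to it. The kernel, names, and sizes are hypothetical.

```cpp
#include <cuda_runtime.h>
#include <thread>

// Hypothetical kernel: doubles each element.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void gpu_worker(int device, float* dev_data, int n) {
    cudaSetDevice(device);        // per-thread device selection (CUDA 4.0+)
    cudaStream_t stream;
    cudaStreamCreate(&stream);    // one stream per CPU thread
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(dev_data, n);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```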
Perhaps CUDA streams are the solution to your problem. Try invoking kernels from a different stream in each thread. However, I don't see how this will help, as I think your kernel executions will be serialized even though they are invoked in parallel. In fact, CUDA kernel invocations, even on the same stream, are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.

Handling GUI thread in a program using OpenMP

I have a C++ program that performs some lengthy computation in parallel using OpenMP. Now that program also has to respond to user input and update some graphics. So far I've been starting my computations from the main / GUI thread, carefully balancing the workload so that it is neither too short to mask the OpenMP threading overhead nor too long so the GUI becomes unresponsive.
Clearly I'd like to fix that by running everything concurrently. As far as I can tell, OpenMP 2.5 doesn't provide a good mechanism for doing this. I assume it wasn't intended for this type of problem. I also wouldn't want to dedicate an entire core to the GUI thread, it just needs <10% of one for its work.
I thought maybe separating the computation into a separate pthread which launches the parallel constructs would be a good way of solving this. I coded this up but had OpenMP crash when invoked from the pthread, similar to this bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36242 . Note that I was not trying to launch parallel constructs from more than one thread at a time, OpenMP was only used in one pthread throughout the program.
It seems I can neither use OpenMP to schedule my GUI work concurrently nor use pthreads to have the parallel constructs run concurrently. I was thinking of just handling my GUI work in a separate thread, but that happens to be rather ugly in my case and might actually not work due to various libraries I use.
What's the textbook solution here? I'm sure others have used OpenMP in a program that needs to concurrently deal with a GUI / networking etc., but I haven't been able to find any information using Google or the OpenMP forum.
Thanks!
There is no textbook solution. The textbook application for OpenMP is non-interactive programs that read input files, do heavy computation, and write output files, all using the same thread pool of size ~ #CPUs in your supercomputer. It was not designed for concurrent execution of interactive and computation code and I don't think interop with any threads library is guaranteed by the spec.
Leaving theory aside, you seem to have encountered a bug in the GCC implementation of OpenMP. Please file a bug report with the GCC maintainers and, in the meantime, either look for a different compiler or run your GUI code in a separate process, communicating with the OpenMP program over some IPC mechanism (e.g., async I/O over sockets).