I have some CUDA kernels I want to run in individual pthreads.
I basically have to have each pthread execute, say, 3 CUDA kernels, and they must be executed sequentially.
I thought I would try to pass each pthread a reference to a stream, so that those 3 CUDA kernels would execute sequentially in the same stream.
I could get this working with a different context for each pthread, which would then execute the kernels as normal, but that seems to incur a lot of overhead.
So how do I make each pthread work in the same context, concurrently with the other pthreads?
Thanks
Before CUDA 4.0, the way to access a given context from different CPU threads was to use cuCtxPopCurrent()/cuCtxPushCurrent(). A context could only be current to one CPU thread at a time.
In CUDA 4.0, you can call cudaSetDevice() in each pthread, and the context can be current to more than one thread at a time.
The kernel invocations will be serialized by the context in the order received, but you may have to perform CPU thread synchronization to make sure the work is submitted in the order desired.
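As a minimal sketch of that pattern (the kernels kernelA/kernelB/kernelC and the launch configuration are hypothetical placeholders, assuming CUDA 4.0+): each pthread sets the shared device, creates its own stream, and launches its three kernels into it, so they run in order relative to each other:

#include <cuda_runtime.h>
#include <pthread.h>

// Hypothetical kernels standing in for the three kernels each thread must run.
__global__ void kernelA(float *d) { /* ... */ }
__global__ void kernelB(float *d) { /* ... */ }
__global__ void kernelC(float *d) { /* ... */ }

void *worker(void *arg)
{
    float *d_data = (float *)arg;

    cudaSetDevice(0);                 // every thread shares device 0's context

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Launches into the same stream are ordered: A, B, C run sequentially,
    // while other threads' streams may overlap with this one.
    kernelA<<<128, 256, 0, stream>>>(d_data);
    kernelB<<<128, 256, 0, stream>>>(d_data);
    kernelC<<<128, 256, 0, stream>>>(d_data);

    cudaStreamSynchronize(stream);    // wait only for this thread's work
    cudaStreamDestroy(stream);
    return NULL;
}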
I have just started coding a device driver and am new to threading. I went through many documents to get an idea about threads, but I still have some doubts:
What is a kernel thread?
How does it differ from a user thread?
What is the relationship between the two?
How can I implement kernel threads?
Where can I see the output of the implementation?
Can anyone help me?
Thanks.
A kernel thread is a task_struct with no userspace components.
Besides the lack of userspace, it has a different ancestry (the kthreadd kernel thread instead of the init process) and is created by a kernel-only API instead of the fork()/exec()/clone() sequences used for userspace processes.
Any two kernel threads both have kthreadd as a parent. Apart from that, kernel threads enjoy the same "independence" from one another as userspace processes do.
Use the kthread_run() function/macro from the kthread.h header. You will most probably have to write a kernel module in order to call this function, so you should take a look at the Linux Device Drivers book.
If you are referring to the text output of your implementation (via printk calls), you can see this output in the kernel log using the dmesg command.
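A minimal sketch of such a module (the module and thread names are arbitrary): kthread_run() starts the thread, which prints to the kernel log until kthread_stop() is called, and the output can then be read with dmesg:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    /* Runs entirely in kernel mode; there is no userspace address space. */
    while (!kthread_should_stop()) {
        printk(KERN_INFO "worker_fn: still alive\n"); /* visible via dmesg */
        msleep(1000);
    }
    return 0;
}

static int __init demo_init(void)
{
    worker = kthread_run(worker_fn, NULL, "demo-worker");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(worker); /* makes kthread_should_stop() return true, then waits */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");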
A kernel thread is a kernel task running only in kernel mode; it usually has not been created by fork() or clone() system calls. An example is kworker or kswapd.
You probably should not implement kernel threads if you don't know what they are.
Google gives many pages about kernel threads, e.g. Frey's page.
User threads & stack:
Each thread has its own stack so that it can use its own local variables; threads share global variables, which are part of the .data or .bss sections of a Linux executable.
Because threads share global variables, we use synchronization mechanisms such as mutexes when we want to access or modify globals in a multithreaded application. Local variables are part of each thread's individual stack, so they need no synchronization.
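A minimal pthread sketch of that point (names are illustrative): the global counter lives in .data and is shared, so it is guarded by a mutex, while each thread's loop index lives on that thread's own stack and needs no locking:

#include <pthread.h>
#include <stdio.h>

long counter = 0;                                  /* shared: .data section */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    for (int i = 0; i < 100000; i++) {             /* i: per-thread stack */
        pthread_mutex_lock(&lock);
        counter++;                                 /* shared, needs the lock */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* always 200000 */
    return 0;
}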
Kernel threads
Kernel threads have emerged from the need to run kernel code in process context. Kernel threads are the basis of the workqueue mechanism. Essentially, a kernel thread is a thread that only runs in kernel mode and has no user address space or other user attributes.
To create a kernel thread, use kthread_create():
#include <linux/kthread.h>
struct task_struct *kthread_create(int (*threadfn)(void *data),
                                   void *data, const char namefmt[], ...);
Kernel threads & stack:
Kernel threads are used to do background processing tasks for the kernel, such as the pdflush threads, workqueue threads, etc.
Kernel threads are basically new processes without a user address space (they can be created using the clone() call with the required flags), which means they cannot switch to user space. Kernel threads are schedulable and preemptible like normal processes.
Kernel threads have their own stacks, which they use to manage local state.
More about kernel stacks:
https://www.kernel.org/doc/Documentation/x86/kernel-stacks
Since you're comparing kernel threads with user[land] threads, I assume you mean something like the following.
The normal way of implementing threads nowadays is to do it in the kernel, so those can be considered "normal" threads. It's however also possible to do it in userland, using signals such as SIGALRM, whose handler saves the current process state (registers, mostly) and restores a previously saved one for another thread. Several OSes used this to implement threads before they got proper kernel thread support. Such threads can be faster, since you don't have to switch into kernel mode, but in practice they've faded away.
There's also cooperative userland threads, where one thread runs until it calls a special function (usually called yield), which then switches to another thread in a similar way as with SIGALRM above. The advantage here is that the program is in total control, which can be useful when you have timing concerns (a game for example). You also don't have to care much about thread safety. The big disadvantage is that only one thread can run at a time, and therefore this method is also uncommon now that processors have multiple cores.
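As a toy sketch of the cooperative variant (using the POSIX ucontext API, which is deprecated but still available on Linux; all names here are illustrative), each "thread" runs until it explicitly swaps to the other, so exactly one runs at a time:

#include <ucontext.h>
#include <stdio.h>

static ucontext_t ctx_main, ctx_a, ctx_b;
static char stack_a[64 * 1024], stack_b[64 * 1024];

static void task_a(void)
{
    for (int i = 0; i < 3; i++) {
        printf("A runs\n");
        swapcontext(&ctx_a, &ctx_b);    /* explicit yield to B */
    }
}

static void task_b(void)
{
    for (int i = 0; i < 3; i++) {
        printf("B runs\n");
        swapcontext(&ctx_b, &ctx_a);    /* explicit yield back to A */
    }
}

int main(void)
{
    getcontext(&ctx_a);
    ctx_a.uc_stack.ss_sp = stack_a;
    ctx_a.uc_stack.ss_size = sizeof stack_a;
    ctx_a.uc_link = &ctx_main;          /* return to main when A finishes */
    makecontext(&ctx_a, task_a, 0);

    getcontext(&ctx_b);
    ctx_b.uc_stack.ss_sp = stack_b;
    ctx_b.uc_stack.ss_size = sizeof stack_b;
    ctx_b.uc_link = &ctx_main;
    makecontext(&ctx_b, task_b, 0);

    swapcontext(&ctx_main, &ctx_a);     /* start cooperative scheduling at A */
    return 0;
}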
Kernel threads are implemented in the kernel. Perhaps you meant how to use them? The most common way is to call pthread_create.
Win10 x64, CUDA 8.0, VS2015, 6-core CPU (12 logical cores), 2 GTX580 GPUs.
In general, I'm working on a multithreaded application that launches 2 threads associated with the 2 available GPUs; these threads are stored in a thread pool.
Each thread does the following initialization procedure upon its launch (i.e. this is done only once during the runtime of each thread):
::cudaSetDevice(0 or 1, as we have only two GPUs);
::cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
::cudaSetDeviceFlags(cudaDeviceMapHost | cudaDeviceScheduleBlockingSync);
Then, from other worker threads (12 more threads that do not touch the GPUs at all), I begin feeding these 2 GPU-associated worker threads with data. It works perfectly as long as the number of GPU threads launched is equal to the number of physical GPUs available.
Now I want to launch 4 GPU threads (i.e. 2 threads per GPU) and make each one work via a separate CUDA stream. I know the requirements that are essential for proper CUDA stream usage, and I meet all of them. What I'm failing on is the initialization procedure mentioned above.
As soon as this procedure is executed a second time, from a different GPU thread but for the same GPU, ::cudaSetDeviceFlags(...) starts failing with the "cannot set while device is active in this process" error message.
I have looked into the manual, and I think I see why this happens; what I can't understand is how to use ::cudaSetDeviceFlags(...) properly for my setup.
I can comment out this ::cudaSetDeviceFlags(...) line and the program will work fine even with 8 threads per GPU, but I need the cudaDeviceMapHost flag to be set in order to use streams; pinned memory won't be available otherwise.
EDIT Extra info to consider:
1. If ::cudaSetDeviceFlags is called before ::cudaSetDevice, then no error occurs.
2. Each GPU thread allocates a chunk of pinned memory via the ::VirtualAlloc -> ::cudaHostRegister approach upon thread launch (this works just fine no matter how many GPU threads are launched) and deallocates it upon thread termination (via ::cudaHostUnregister -> ::VirtualFree). ::cudaHostUnregister fails with "pointer does not correspond to a registered memory region" for half the threads if the number of threads per GPU is greater than 1.
Well, the highly sophisticated method of try-this, try-that, see-what-happens, try-again finally did the trick, as always.
Here is the excerpt from the documentation on ::cudaSetDeviceFlags():
Records flags as the flags to use when initializing the current device. If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.
Consequently, in the GPU worker thread it is necessary to call ::cudaSetDeviceFlags() before ::cudaSetDevice().
I have implemented something like this in the GPU thread initialization code in order to make sure that device flags set before the device is selected are actually applied:
bse__throw_CUDAHOST_FAILED(::cudaSetDeviceFlags(nFlagsOfDesire));
bse__throw_CUDAHOST_FAILED(::cudaSetDevice(nDevice));
unsigned int nDeviceFlagsActual = 0;
bse__throw_CUDAHOST_FAILED(::cudaGetDeviceFlags(&nDeviceFlagsActual));
bse__throw_IF(nFlagsOfDesire != nDeviceFlagsActual);
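For reference, the same set-then-verify order in plain CUDA runtime calls, without the project-specific bse__throw_* macros (error handling reduced to bare checks; device is the thread's GPU index):

unsigned int desiredFlags = cudaDeviceMapHost | cudaDeviceScheduleBlockingSync;

// Set the flags BEFORE making the device current in this thread.
if (cudaSetDeviceFlags(desiredFlags) != cudaSuccess) { /* handle error */ }
if (cudaSetDevice(device) != cudaSuccess)            { /* handle error */ }

// Verify that the flags actually took effect for this device.
unsigned int actualFlags = 0;
if (cudaGetDeviceFlags(&actualFlags) != cudaSuccess) { /* handle error */ }
if (actualFlags != desiredFlags) { /* another thread fixed the flags first */ }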
Also, the comment by talonmies showed the way to resolve the ::cudaHostUnregister errors.
I don't know how to use threads in C++, and I don't just want to know that: is there a way I can force a thread onto a different core? Also, how would I find out how many cores the user has?
Binding a thread to a particular CPU is called setting its affinity. It's a platform-dependent operation.
For Windows: SetProcessAffinityMask
For pthreads: pthread_attr_setaffinity_np(3) and pthread_setaffinity_np(3)
For Boost you can use native_handle() to get platform-specific thread handle to use them with functions above.
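A minimal Linux sketch covering both halves of the question, counting online cores with sysconf() and pinning the calling thread to core 0 with pthread_setaffinity_np() (GNU-specific, so it needs _GNU_SOURCE):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* How many cores are online right now. */
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", ncores);

    /* Pin the current thread to core 0. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);

    return 0;
}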
I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads: just call cudaSetDevice() to specify which CUDA device you want the thread to submit commands to.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
Perhaps CUDA streams are the solution to your problem. Try to invoke kernels from a different stream in each thread. However, I don't see how this will help, as I think your kernel executions will be serialized even though they are invoked in parallel. In fact, CUDA kernel invocations, even on the same stream, are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.
I was wondering if there is a way to run a thread on a separate core of my choosing, instead of just starting a thread and letting the system pick the core?
Thanks
If you create a thread, you have by default no control over which core it will run on. The operating system's scheduling algorithm takes care of that, and it is pretty good at its job. However, you can use the SetThreadAffinityMask WinAPI function to specify the logical cores a thread is allowed to run on.
Don't do that unless you have very good reasons. Quoting MSDN:
Setting an affinity mask for a process or thread can result in threads receiving less processor time, as the system is restricted from running the threads on certain processors. In most cases, it is better to let the system select an available processor.
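For completeness, a minimal Win32 sketch of the API mentioned above (pinning the current thread to logical core 0; keep the MSDN caveat in mind):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Allow this thread to run only on logical processor 0 (bit 0 of the mask). */
    DWORD_PTR prevMask = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (prevMask == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

    /* Number of logical processors, useful for building a valid mask. */
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    printf("logical processors: %lu\n", si.dwNumberOfProcessors);

    return 0;
}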