CUDA Segmentation fault in threads with no CUDA code - c++

I have this code:
#include <mutex>      // std::mutex, std::lock_guard
#include <unistd.h>   // usleep

std::mutex mutexCudaExecution;

__global__ void testCuda() {}

void runCuda();  // defined below

void wrapperLock()
{
    std::lock_guard<std::mutex> lock(mutexCudaExecution);
    // changing this value to 20000 does NOT trigger "Segmentation fault"
    usleep(5000);
    runCuda();
}

void runCuda()
{
    testCuda<<<1, 1>>>();
    cudaDeviceSynchronize();
}
When these functions are executed from around 20 threads, I get a segmentation fault. As written in the comment, changing the value in usleep() to 20000 makes it work fine.
Is there an issue with CUDA and threads?
It looks to me as if CUDA needs a bit of time to recover after an execution has finished, even when there was nothing to do.

When multiple host threads share a single CUDA context, they should either delegate their CUDA work to a dedicated context-owner thread (similar to a worker thread), or each bind the context themselves with cuCtxSetCurrent (driver API) or cudaSetDevice (runtime API), so that they do not overwrite each other's context resources.
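As a minimal sketch of the second option (assuming a single device 0 and reusing the mutex from the question), each thread binds the device before touching anything CUDA:

#include <cuda_runtime.h>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mutexCudaExecution;

__global__ void testCuda() {}

void threadBody()
{
    cudaSetDevice(0); // bind this host thread to device 0 before any other CUDA call
    std::lock_guard<std::mutex> lock(mutexCudaExecution);
    testCuda<<<1, 1>>>();
    cudaDeviceSynchronize();
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 20; ++i)
        threads.emplace_back(threadBody);
    for (auto &t : threads)
        t.join();
    return 0;
}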

UPDATE:
According to http://docs.nvidia.com/cuda/cuda-c-programming-guide/#um-gpu-exclusive, the problem was concurrent access to the Unified Memory I am using. I had to wrap the CUDA kernel calls and the accesses to Unified Memory with a std::lock_guard, and now the program has run for 4 days under heavy thread load without any problems.
I also have to call cudaSetDevice in each thread - as suggested by Marco & Robert - otherwise it crashes again.
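A hedged sketch of that fix (the managed pointer and the kernel are made up for illustration; the real code allocates with cudaMallocManaged once at startup):

__global__ void increment(int *data) { ++(*data); }

std::mutex mutexCudaExecution;
int *managed; // from cudaMallocManaged(&managed, sizeof(int)) at startup

void touchUnifiedMemory()
{
    cudaSetDevice(0); // per-thread binding, as suggested by Marco & Robert
    std::lock_guard<std::mutex> lock(mutexCudaExecution);
    increment<<<1, 1>>>(managed); // the kernel launch touching Unified Memory ...
    cudaDeviceSynchronize();
    int hostCopy = *managed;      // ... and the host access, both under the same lock
    (void)hostCopy;
}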

Multithreaded C++ Program Not Running In Parallel Using vector<thread> and .join()

Note: This is the first post I have made on this site, but I have searched extensively and was not able to find a solution to my problem.
I have written a program which essentially tests all permutations of a vector of numbers to find an optimal sequence as defined by me. Of course, computing permutations of numbers is very time consuming even for small inputs, so I am trying to speed things up by using multithreading.
Here is a small sample which replicates the problem:
class TaskObject {
public:
    void operator()() {
        recursiveFunc();
    }
private:
    Solution *bestSolution; // shared by every TaskObject, but only one may access it at a time
    void recursiveFunc() {
        if (base_case) {
            // only part where the shared object is accessed
            // (base_case is rarely reached)
            return;
        }
        recursiveFunc();
    }
};
void runSolutionWithThreads() {
    vector<thread> threads(std::thread::hardware_concurrency());
    vector<TaskObject> tasks_vector(std::thread::hardware_concurrency());
    updateTasks(); // sets parameters that initialize the first call to recursiveFunc
    for (int q = 0; q < (int)tasks_vector.size(); ++q) {
        threads[q] = std::thread(tasks_vector[q]);
    }
    for (int i = 0; i < (int)threads.size(); ++i) {
        threads[i].join();
    }
}
I imagined that this would let all the threads run in parallel, but both the performance profiler in Visual Studio and the advanced view of the Windows Task Manager show that only 1 thread is running at a time. On a system with 4 hardware threads, the CPU is capped at 25%. I get correct output every time I run, so there is no issue with the algorithm logic. Work is spread out as evenly as possible among all task objects, and collisions on the shared data rarely occur. An earlier implementation using a thread pool always ran at nearly 100%.
The objects submitted to the threads don't print to cout, and each has its own copy of the data required to perform its work, except for one shared object that they all reference by pointer:
private:
    Solution* bestSolution;
This shared data is not susceptible to a data race, since I use a std::lock_guard (from <mutex>) so that only one thread can update bestSolution at a time.
In other words, why isn't my CPU running at nearly 100% for my multithreaded program which uses as many threads as there are available in the system?
I can readily update this post with more information if needed.
When debugging your application, use the debugger to "break all" threads, then examine each thread in the debug thread window to see where it is executing. Most likely you will find that only one thread is executing code while the rest are all blocked on the mutex that the one running thread is holding.
If you show a more complete example of the code, it will greatly assist.
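One common cause of exactly this symptom, shown as a hedged sketch (the mutex and helper names are hypothetical, not taken from the question): if the lock is taken around the whole recursive descent instead of only around the rare shared update, the mutex serializes everything and only one thread runs at a time.

// serialized: the lock is held for the entire recursion
void recursiveFunc() {
    std::lock_guard<std::mutex> lock(bestSolutionMutex); // hypothetical mutex guarding bestSolution
    if (base_case) { updateBestSolution(); return; }     // hypothetical update helper
    recursiveFunc();
}

// parallel: the lock is held only for the rare update in the base case
void recursiveFunc() {
    if (base_case) {
        std::lock_guard<std::mutex> lock(bestSolutionMutex);
        updateBestSolution();
        return;
    }
    recursiveFunc();
}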

How are user-level threads scheduled/created, and how are kernel level threads created?

Apologies if this question is stupid. I tried to find an answer online for quite some time, but couldn't, and hence I'm asking here. I am learning threads, and I've been going through this link and this Linux Plumbers Conference 2013 video about kernel-level and user-level threads. As far as I understood, using pthreads creates threads in userspace, and the kernel is not aware of them and views them as a single process, unaware of how many threads are inside. In such a case,
who decides the scheduling of these user threads during the timeslice the process gets, given that the kernel sees it as a single process and is unaware of the threads, and how is the scheduling done?
If pthreads creates user-level threads, how are kernel-level or OS threads created from userspace programs, if required?
According to the above link, the operating system kernel provides system calls to create and manage threads. So does a clone() system call create a kernel-level thread or a user-level thread?
If it creates a kernel-level thread, then strace of a simple pthreads program also shows clone() being used during execution, but then why would these be considered user-level threads?
If it doesn't create a kernel-level thread, then how are kernel threads created from userspace programs?
According to the link, "It require a full thread control block (TCB) for each thread to maintain information about threads. As a result there is significant overhead and increased in kernel complexity." So with kernel-level threads, is only the heap shared, with everything else individual to each thread?
Edit:
I was asking about user-level thread creation and scheduling because here there is a reference to the Many-to-One model, where many user-level threads are mapped to one kernel-level thread and thread management is done in user space by the thread library. I've only seen references to using pthreads, but I am unsure whether it creates user-level or kernel-level threads.
This is prefaced by the top comments.
The documentation you're reading is generic [not linux specific] and a bit outdated. And, more to the point, it is using different terminology. That is, I believe, the source of the confusion. So, read on ...
What it calls a "user-level" thread is what I'm calling an [outdated] LWP thread. What it calls a "kernel-level" thread is what is called a native thread in linux. Under linux, what is called a "kernel" thread is something else altogether [See below].
using pthreads create threads in the userspace, and the kernel is not aware about this and view it as a single process only, unaware of how many threads are inside.
This was how userspace threads were done prior to the NPTL (native posix threads library). It is also what SunOS/Solaris called an LWP, a lightweight process.
There was one process that multiplexed itself and created threads. IIRC, it was called the thread master process [or some such]. The kernel was not aware of this; it didn't yet understand or provide support for threads.
But, because these "lightweight" threads were switched by code in the userspace-based thread master (aka the "lightweight process scheduler") [just a special user program/process], context switches between them were slow.
Also, before the advent of "native" threads, you might have 10 processes. Each process gets 10% of the CPU. If one of the processes was an LWP that had 10 threads, these threads had to share that 10% and, thus, got only 1% of the CPU each.
All this was replaced by the "native" threads that the kernel's scheduler is aware of. This changeover was done 10-15 years ago.
Now, with the above example, we have 20 threads/processes that each get 5% of the CPU. And, the context switch is much faster.
It is still possible to have an LWP system under a native thread, but, now, that is a design choice, rather than a necessity.
Further, LWP works great if each thread "cooperates". That is, each thread's loop periodically makes an explicit call to a "context switch" function, voluntarily relinquishing the processor slot so another LWP can run.
However, the pre-NPTL implementation in glibc also had to [forcibly] preempt LWP threads (i.e. implement timeslicing). I can't remember the exact mechanism used, but here's an example: the thread master had to set an alarm, go to sleep, wake up, and then send the active thread a signal. The signal handler would effect the context switch. This was messy, ugly, and somewhat unreliable.
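As a self-contained sketch of just that alarm mechanism (not the actual glibc code), with the "context switch" reduced to a flag for brevity:

#include <csignal>
#include <cstdio>
#include <unistd.h>

static volatile sig_atomic_t quantum_expired = 0;

static void on_alarm(int) { quantum_expired = 1; }

int main() {
    std::signal(SIGALRM, on_alarm);
    alarm(1);                        // ask the kernel for a SIGALRM in one second
    while (!quantum_expired) {
        // the "active thread" runs until the signal marks its quantum as spent
    }
    std::puts("quantum expired: an LWP scheduler would switch threads here");
    return 0;
}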
Joachim mentioned pthread_create function creates a kernel thread
It is [technically] incorrect to call it a kernel thread. pthread_create creates a native thread. This runs in userspace and vies for timeslices on an equal footing with processes. Once created, there is little difference between a thread and a process.
The primary difference is that a process has its own unique address space, whereas a thread shares its address space with the other processes/threads that are part of the same thread group.
If it doesn't create a kernel level thread, then how are kernel threads created from userspace programs?
Kernel threads are not userspace threads, NPTL, native, or otherwise. They are created by the kernel via the kernel_thread function. They run as part of the kernel and are not associated with any userspace program/process/thread. They have full access to the machine: devices, the MMU, etc. Kernel threads run at the highest privilege level (ring 0), and in the kernel's address space rather than in the address space of any user process/thread.
A userspace program/process cannot create a kernel thread. Remember, it creates a native thread using pthread_create, which invokes the clone syscall to do so.
Threads are useful to do things, even for the kernel. So, it runs some of its code in various threads. You can see these threads by doing ps ax. Look and you'll see kthreadd, ksoftirqd, kworker, rcu_sched, rcu_bh, watchdog, migration, etc. These are kernel threads and not programs/processes.
UPDATE:
You mentioned that kernel doesn't know about user threads.
Remember that, as mentioned above, there are two "eras".
(1) Before the kernel got thread support (circa 2004?). This used the thread master (which, here, I'll call the LWP scheduler). The kernel just had the fork syscall.
(2) All kernels after that, which do understand threads. There is no thread master; instead, we have pthreads and the clone syscall. Now fork is implemented as clone. clone is similar to fork but takes some arguments, notably a flags argument and a child_stack argument.
More on this below ...
then, how is it possible for user level threads to have individual stacks?
There is nothing "magic" about a processor stack. I'll confine the discussion [mostly] to x86, but it is applicable to any architecture, even those that don't have a stack register (e.g. 1970s-era IBM mainframes, such as the IBM System/370).
Under x86, the stack pointer is %rsp. The x86 has push and pop instructions. We use these to save and restore things: push %rcx and [later] pop %rcx.
But, suppose the x86 did not have %rsp or push/pop instructions? Could we still have a stack? Sure, by convention. We [as programmers] agree that (e.g.) %rbx is the stack pointer.
In that case, a "push" of %rcx would be [using AT&T assembler]:
subq $8,%rbx
movq %rcx,0(%rbx)
And, a "pop" of %rcx would be:
movq 0(%rbx),%rcx
addq $8,%rbx
To make it easier, I'm going to switch to C "pseudo code". Here are the above push/pop in pseudo code:
// push %rcx
%rbx -= 8;
0(%rbx) = %rcx;
// pop %rcx
%rcx = 0(%rbx);
%rbx += 8;
To create a thread, the LWP scheduler had to create a stack area using malloc, save that pointer in a per-thread struct, and then kick off the child LWP. The actual code is a bit tricky; assume we have an (e.g.) LWP_create function that is similar to pthread_create:
typedef void *(*LWP_func)(void *);

typedef unsigned long long u64;         // register-sized save slot

// per-thread control
typedef struct tsk tsk_t;
struct tsk {
    tsk_t *tsk_next;                    // forward link
    tsk_t *tsk_prev;                    // backward link
    void *tsk_stack;                    // stack base (from malloc)
    u64 tsk_regsave[16];
};

// list of tasks
typedef struct tsklist tsklist_t;
struct tsklist {
    tsk_t *tsk_next;                    // forward link
    tsk_t *tsk_prev;                    // backward link
};

tsklist_t tsklist;                      // list of tasks
tsk_t *tskcur;                          // current thread

// LWP_switch -- switch from one task to another
void
LWP_switch(tsk_t *to)
{
    // NOTE: we use (i.e. burn) register values as we do our work. in a real
    // implementation, we'd have to push/pop these in a special way. so, just
    // pretend that we do that ...

    // save all registers into tskcur->tsk_regsave
    tskcur->tsk_regsave[RAX] = %rax;
    // ...

    tskcur = to;

    // restore most registers from tskcur->tsk_regsave
    %rax = tskcur->tsk_regsave[RAX];
    // ...

    // set stack pointer to new task's stack
    %rsp = tskcur->tsk_regsave[RSP];

    // set resume address for task
    push(%rsp,tskcur->tsk_regsave[RIP]);

    // issue "ret" instruction
    ret();
}

// LWP_create -- start a new LWP
tsk_t *
LWP_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack -- the stack grows downward, so the initial
    // stack pointer must be the _top_ of the allocated area
    tsknew->tsk_stack = malloc(0x100000);
    tsknew->tsk_regsave[RSP] = (u64) tsknew->tsk_stack + 0x100000;

    // give task its argument
    tsknew->tsk_regsave[RDI] = (u64) arg;

    // switch to new task
    LWP_switch(tsknew);

    return tsknew;
}

// LWP_destroy -- destroy an LWP
void
LWP_destroy(tsk_t *tsk)
{
    // free the task's stack
    free(tsk->tsk_stack);
    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
With a kernel that understands threads, we use pthread_create and clone, but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The clone syscall accepts a child_stack argument. Thus, pthread_create must allocate a stack for the new thread and pass that to clone:
// pthread_create -- start a new native thread
tsk_t *
pthread_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000);

    // start up thread -- clone wants a pointer to the _top_ of the new
    // stack, and a real implementation passes more flags (CLONE_VM et al.)
    clone(start_routine,(char *) tsknew->tsk_stack + 0x100000,CLONE_THREAD,arg);

    return tsknew;
}

// pthread_join -- wait for a native thread to die, then clean up
void
pthread_join(tsk_t *tsk)
{
    // wait for thread to die ...

    // free the task's stack
    free(tsk->tsk_stack);
    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
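For the concrete syscall behind this sketch, here is a runnable toy (assuming Linux/glibc; the names are mine, not from any library). It only illustrates the child_stack argument; CLONE_VM shares the address space the way threads do:

// g++ defines _GNU_SOURCE by default, which exposes clone() in <sched.h>
#include <sched.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>

static int shared_counter = 0;

static int child_fn(void *)
{
    shared_counter = 42;             // visible to the parent because of CLONE_VM
    return 0;
}

int main()
{
    const size_t stack_size = 0x100000;
    char *stack = (char *) malloc(stack_size);

    // clone() takes a pointer to the TOP of the child's stack (x86 stacks grow down)
    pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, nullptr);
    waitpid(pid, nullptr, 0);

    printf("shared_counter = %d\n", shared_counter); // prints 42
    free(stack);
    return 0;
}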
Only a process or main thread is assigned its initial stack by the kernel, usually at a high memory address. So, if the process does not use threads, normally, it just uses that pre-assigned stack.
But, if a thread is created, either an LWP or a native one, the starting process/thread must pre-allocate the area for the proposed thread with malloc. Side note: Using malloc is the normal way, but the thread creator could just have a large pool of global memory: char stack_area[MAXTASK][0x100000]; if it wished to do it that way.
An ordinary program that does not use threads [of any type] may still wish to "override" the default stack it has been given.
Such a process could decide to use malloc and the above assembler trickery to create a much larger stack if it were doing a hugely recursive function.
See my answer here: What is the difference between user defined stack and built in stack in use of memory?
User-level threads are usually coroutines, in one form or another: the context switches between flows of execution happen in user mode, with no kernel involvement. From the kernel's POV, it is all one thread. What that thread actually does is controlled in user mode, and user-mode code can suspend, switch, and resume logical flows of execution (i.e. coroutines). It all happens during the quanta scheduled for the actual thread. The kernel can, and will, unceremoniously interrupt the actual thread (the kernel thread) and give control of the processor to another thread.
User-mode coroutines require cooperative multitasking: user-mode threads must periodically relinquish control to other user-mode threads (the execution changes context to the new user-mode thread, without the kernel thread ever noticing anything). Usually the code knows a whole lot better than the kernel when it wants to release control. A poorly coded coroutine can hog control and starve all the other coroutines.
The historical implementation used setcontext, but that is now deprecated. Boost.Context offers a replacement for it, though it is not fully portable:
Boost.Context is a foundational library that provides a sort of cooperative multitasking on a single thread. By providing an abstraction of the current execution state in the current thread, including the stack (with local variables) and stack pointer, all registers and CPU flags, and the instruction pointer, an execution_context represents a specific point in the application's execution path.
Not surprisingly, Boost.Coroutine is based on Boost.Context.
Windows provides Fibers, and the .NET runtime has Tasks and async/await.
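As a minimal sketch of the cooperative switching described above, using the deprecated (but still widely available) ucontext API that setcontext belongs to; everything here is illustrative:

#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void co_body()
{
    std::puts("coroutine: step 1");
    swapcontext(&co_ctx, &main_ctx); // voluntarily yield back to main
    std::puts("coroutine: step 2");
}                                    // returning resumes uc_link (main)

int main()
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;           // the coroutine gets its own stack ...
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;                 // ... and resumes main when done
    makecontext(&co_ctx, co_body, 0);

    swapcontext(&main_ctx, &co_ctx); // run the coroutine until it yields
    std::puts("main: coroutine yielded");
    swapcontext(&main_ctx, &co_ctx); // resume it; it runs to completion
    return 0;
}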
LinuxThreads follows the so-called "one-to-one" model: each thread is actually a separate process in the kernel. The kernel scheduler takes care of scheduling the threads, just like it schedules regular processes. The threads are created with the Linux clone() system call, which is a generalization of fork() allowing the new process to share the memory space, file descriptors, and signal handlers of the parent.
Source - interview with Xavier Leroy (the person who created LinuxThreads):
http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#K

OpenCV in multithreaded environment (OpenMP) causes segmentation fault

I have OpenCV 3.0.0 installed. My code is multithreaded using OpenMP.
Each thread accesses the same OpenCV function ("convertTo").
This causes a segmentation fault.
The error does not occur
if I print a simple statement using std::cout at the beginning of each thread, or
if I use only a single thread.
Can anyone help with what the reason might be?
Many OpenCV functions and data structures use the same memory addresses for different variables. For example, if you have a matrix Mat A and you do Mat B = A, the data of matrix B is stored in the same memory locations as A's. When you use OpenMP, you must make sure that a given memory location is written from only a single thread, otherwise you will get an error at runtime.
When you use a single thread there is no problem, since only one thread ever writes or reads a memory location.
On the other hand, when you use functions that print to the screen, such as printf() or std::cout, the threads may be delayed: while one thread prints, another writes to the memory locations, so the probability of an error at runtime declines, but that does not mean it cannot still occur.
The solution when you use OpenMP in a loop is to protect the writes to shared memory locations from different threads:
#pragma omp critical
{
    // code here is executed by only one thread at a time
}
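As a hedged, self-contained sketch of both points (per-thread local matrices, plus a critical section around the shared result; the sizes and types are arbitrary):

#include <opencv2/core.hpp>

int main()
{
    cv::Mat src = cv::Mat::ones(512, 512, CV_8UC1);
    cv::Mat sum = cv::Mat::zeros(512, 512, CV_32FC1);

    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        // reading the shared src is fine; each thread writes only its own local matrix
        cv::Mat converted;
        src.convertTo(converted, CV_32FC1);

        #pragma omp critical
        {
            sum += converted; // only one thread updates the shared result at a time
        }
    }
    return 0;
}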

Does executing an int 3 interrupt stop the entire process on Linux or just the current thread?

Suppose the architecture is x86 and the OS is Linux based. Given a multithreaded process in which a single thread executes an int 3 instruction, does the interrupt handler stop the entire process or just the thread that executed the int 3 instruction?
Since the question is Linux specific, let's dive into kernel sources! We know int 3 will generate a SIGTRAP, as we can see in do_int3. The default behaviour of SIGTRAP is to terminate the process and dump core.
do_int3 calls do_trap which, after a lot of indirection, calls complete_signal, where most of the magic happens. Following the comments, it's quite clear to see what is happening without much need for explanation:
A thread is found to deliver the signal to. The main thread is given first crack, but any thread can get it unless it has explicitly stated that it doesn't want to.
SIGTRAP is fatal (and we've assumed we want to establish the default behaviour) and must dump core, so it is fatal to the whole group.
The loop at line 1003 wakes up all threads and delivers the signal.
EDIT: To answer the comment:
When the process is being ptraced, the behaviour is pretty well documented in the manual page (see "Signal-delivery-stop"). Basically, after the kernel selects an arbitrary thread which handles the signal, if the selected thread is traced, it enters signal-delivery-stop -- this means the signal is not yet delivered to the process, and can be suppressed by the tracer process. This is the case with a debugger: a dead process is of no use to us when debugging (that's not entirely true, but let's consider the live-debugging scenario, which is the only one that makes sense in this context), so by default we block SIGTRAP unless the user specifies otherwise. In this case it is irrelevant how the traced process handles SIGTRAP (SIG_IGN, SIG_DFL, or a custom handler) because it will never know the signal occurred.
Note that in the case of SIGTRAP, the tracer process must account for various scenarios other than the process being stopped, as also detailed in the man page under each ptrace action.
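A minimal sketch of that signal-delivery-stop, assuming Linux (the printed strings are mine): the tracer sees the child stop on SIGTRAP and then suppresses the signal entirely by continuing with signal 0.

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>

int main()
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr); // child: ask to be traced
        asm("int $3");                               // enters signal-delivery-stop
        std::puts("child: still alive");             // reached only because the tracer suppressed SIGTRAP
        return 0;
    }
    int status;
    waitpid(pid, &status, 0);                        // observe the stop
    if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)
        std::puts("tracer: child stopped on SIGTRAP");
    ptrace(PTRACE_CONT, pid, nullptr, 0);            // continue, delivering "no signal"
    waitpid(pid, &status, 0);                        // reap the child
    return 0;
}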
Easy enough to test:
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

void f(int v) {
    std::this_thread::sleep_for(std::chrono::seconds(2));
    if (v == 2) asm("int $3");
    std::this_thread::sleep_for(std::chrono::seconds(1));
    printf("%d\n", v); // no sync here to keep it simple
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++) threads.emplace_back(f, i);
    for (auto& thread : threads) thread.join();
    return 0;
}
If only that thread were stopped, the program would still print the messages from the threads other than 2, but that is not the case: the entire process stops before printing anything (or a breakpoint is triggered when debugging). On Ubuntu the message you get is:
Trace/breakpoint trap (core dumped)
int 3 is a trap instruction: executing it in userspace raises a breakpoint exception that traps into the kernel.
The kernel then sends a SIGTRAP signal to your process, and the default action for a SIGTRAP signal is to terminate the entire process.
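To see that it is the signal's default disposition, rather than the instruction itself, that kills the process, here is a hedged sketch: with a handler installed, only the trapping thread runs the handler and execution simply continues.

#include <csignal>
#include <cstdio>

static void on_trap(int)
{
    // runs on the thread that executed int 3; any other threads keep running
}

int main()
{
    std::signal(SIGTRAP, on_trap); // override the fatal default action
    asm("int $3");                 // execution resumes right after the instruction
    std::puts("survived the trap");
    return 0;
}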
The answer is really neither. int 3 is used to trigger a breakpoint. The interrupt handler is tiny, and neither the interrupt nor its handler stops any threads.
If there is no debugger loaded, the handler will either ignore it or call the OS to take some kind of error action, such as raising a signal (perhaps SIGTRAP). No threads are harmed.
If there is an in-process debugger, the breakpoint ISR transfers control to it. The breakpoint does not stop any threads except the one that breaks; the debugger may try to suspend the others.
If there is an out-of-process debugger, the handler will invoke it, but this has to be mediated through the OS in order to do a suitable context switch. As part of that switch the OS will suspend the debuggee, which means all of its threads will stop.

Boost thread seems to block when going out of scope

I have a strange problem with Boost 1.54 threads, which seem to block when the thread object goes out of scope.
Background: I'm working on a real-time application that uses external hardware through API calls. Some of these API calls block until execution completes. That's why I want to issue them in separate threads, to avoid blocking my main thread. The simplified structure looks as follows:
#include <boost/thread.hpp>

void blocking_call(); // wraps the blocking hardware API call

void some_func(){
    //t2
    boost::thread t(&blocking_call);
    //t3
}

int main(){
    //t1
    some_func();
    //t4
    return 0;
}
Luckily, the external hardware has an onboard clock, so I was able to time the execution of my program precisely.
What I observed: the timestamps t1, t2 and t3 are - as expected - only a tiny bit apart, but t4 always comes shortly after the API call has finished, which is a lot later (and unfortunately even too late for me). It seems as if the thread object were calling join() when going out of scope, although I thought it should just get detached and finish its work on its own.
Any hints what might be the issue?
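One hedged way to take the destructor's behaviour out of the equation (regardless of what this Boost version does by default) is to state the intent explicitly:

void some_func(){
    //t2
    boost::thread t(&blocking_call);
    t.detach(); // explicitly detach, so the destructor has nothing left to decide
    //t3
}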