Is it safe to fork from within a thread? - c++

Let me explain: I have been developing an application on Linux that forks and execs an external binary and waits for it to finish. Results are communicated back through shm files that are unique to each fork + process pair. The entire code is encapsulated within a class.
Now I am considering threading the process in order to speed things up: many different instances of the class would fork and execute the binary concurrently (with different parameters) and communicate results through their own unique shm files.
Is this thread safe? If I fork within a thread, apart from being safe, is there something I have to watch for? Any advice or help is much appreciated!

The problem is that fork() only copies the calling thread, and any mutexes held by other threads will be forever locked in the forked child. The pthreads solution is the pthread_atfork() handlers. The idea is that you can register three handlers: a prefork handler, a parent handler, and a child handler. When fork() happens, the prefork handler is called prior to the fork and is expected to obtain all application mutexes. The parent and child handlers must then release all those mutexes in the parent and child processes respectively.
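To make that concrete, here is a minimal sketch of how such handlers are registered (the mutex here is a made-up, application-level one, not anything from the question):

#include <pthread.h>

/* Illustrative application-wide mutex; real code would cover every mutex it owns. */
static pthread_mutex_t app_mutex = PTHREAD_MUTEX_INITIALIZER;

static void prepare_handler(void) { pthread_mutex_lock(&app_mutex); }   /* runs before fork() */
static void parent_handler(void)  { pthread_mutex_unlock(&app_mutex); } /* runs after fork(), in the parent */
static void child_handler(void)   { pthread_mutex_unlock(&app_mutex); } /* runs after fork(), in the child */

int register_fork_handlers(void)
{
    /* Returns 0 on success, or an errno value on failure. */
    return pthread_atfork(prepare_handler, parent_handler, child_handler);
}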
This isn't the end of the story, though. Libraries call pthread_atfork to register handlers for library-specific mutexes; libc does this, for example. This is a good thing: the application can't possibly know about the mutexes held by 3rd party libraries, so each library must call pthread_atfork to ensure its own mutexes are cleaned up in the event of a fork().
The problem is that the order that pthread_atfork handlers are called for unrelated libraries is undefined (it depends on the order that the libraries are loaded by the program). So this means that technically a deadlock can happen inside of a prefork handler because of a race condition.
For example, consider this sequence:
Thread T1 calls fork()
libc prefork handlers are called in T1 (so T1 now holds all libc locks)
Next, in Thread T2, a 3rd party library A acquires its own mutex AM, and then makes a libc call which requires a mutex. This blocks, because libc mutexes are held by T1.
Thread T1 runs the prefork handler for library A, which blocks waiting to obtain AM, which is held by T2.
There's your deadlock, and it's unrelated to your own mutexes or code.
This actually happened on a project I once worked on. The advice I had found at that time was to choose fork or threads but not both. But for some applications that's probably not practical.

It's safe to fork in a multithreaded program as long as you are very careful about the code between fork and exec. In that span you can make only async-signal-safe (reentrant) calls. In theory, you are not allowed to malloc or free there, although in practice the default Linux allocator is safe, and Linux libraries have come to rely on that. The end result is that if you do this, you must use the default allocator.
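For illustration, a minimal sketch of what such a fork/exec span can look like, assuming the program to run is just a placeholder like /usr/bin/true and all error handling is left out:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t run_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: from here until execve()/_exit(), stick to async-signal-safe calls. */
        char *const argv[] = { (char *)"/usr/bin/true", (char *)0 };
        char *const envp[] = { (char *)0 };
        execve("/usr/bin/true", argv, envp);
        _exit(127);   /* exec failed; _exit(), not exit(), so no atexit handlers run */
    }
    return pid;       /* parent: reap later with waitpid(pid, ...) */
}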

Back at the Dawn of Time, we called threads "lightweight processes" because while they act a lot like processes, they're not identical. The biggest distinction is that threads by definition live in the same address space of one process. This has advantages: switching from thread to thread is fast, they inherently share memory so inter-thread communications are fast, and creating and disposing of threads is fast.
The distinction here is with "heavyweight processes", which are complete address spaces. A new heavyweight process is created by fork(2). As virtual memory came into the UNIX world, that was augmented with vfork(2) and some others.
A fork(2) copies the entire address space of the process, including all the registers, and puts that process under the control of the operating system scheduler; the next time the scheduler comes around, the instruction counter picks up at the next instruction -- the forked child process is a clone of the parent. (If you want to run another program, say because you're writing a shell, you follow the fork with an exec(2) call, which loads that new address space with a new program, replacing the one that was cloned.)
Basically, your answer is buried in that explanation: when you have a process with many LWPs (threads) and you fork the process, you will have two independent processes with many threads, running concurrently.
This trick is even useful: in many programs, you have a parent process that may have many threads, some of which fork new child processes. (For example, an HTTP server might do that: each connection to port 80 is handled by a thread, and then a child process for something like a CGI program could be forked; exec(2) would then be called to run the CGI program in place of the forked copy of the parent process.)

While you can use Linux's NPTL pthreads(7) support for your program, threads are an awkward fit on Unix systems, as you've discovered with your fork(2) question.
Since fork(2) is a very cheap operation on modern systems, you might do better to just fork(2) your process when you have more handling to perform. It depends upon how much data you intend to move back and forth: the share-nothing philosophy of forked processes is good for reducing shared-data bugs, but it does mean you either need to create pipes to move data between processes or use shared memory (shmget(2) or shm_open(3)).
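As a rough sketch of the pipe route (the one-line result format is invented purely for illustration):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                          /* child: do the work, write the result */
        close(fds[0]);
        const char result[] = "42\n";
        write(fds[1], result, sizeof result - 1);
        _exit(0);
    }

    close(fds[1]);                           /* parent: read the child's result */
    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof buf - 1);
    if (n > 0) { buf[n] = '\0'; printf("child said: %s", buf); }
    waitpid(pid, 0, 0);
    return 0;
}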
But if you choose to use threading, you can fork(2) a new process, with the following hints from the fork(2) manpage:
* The child process is created with a single thread — the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.

Provided you quickly either call exec() or _exit() in the forked child process, you're ok in practice.
You might want to use posix_spawn() instead which will probably do the Right Thing.
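A minimal sketch of the posix_spawn() route, assuming a placeholder command (/bin/echo) and no file actions or attributes:

#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int spawn_and_wait(void)
{
    pid_t pid;
    char *const argv[] = { (char *)"/bin/echo", (char *)"hello", (char *)0 };

    /* NULL file actions and attributes: inherit everything from the caller. */
    int err = posix_spawn(&pid, "/bin/echo", 0, 0, argv, environ);
    if (err != 0) return err;                /* posix_spawn returns an errno value, not -1 */

    int status = 0;
    waitpid(pid, &status, 0);
    return status;
}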

My experience of fork()'ing within threads is really bad. The software generally fails pretty quickly.
I've found several solutions to the matter; you may not like them much, but I think they are generally the best way to avoid nearly undebuggable errors.
Fork first
Assuming you know the number of external processes you need at the start, you can create them up front and just have them sit there waiting for an event (e.g. a read from a blocking pipe, a wait on a semaphore, etc.)
Once you have forked enough children, you are free to use threads and communicate with those forked processes via your pipes, semaphores, etc. From the moment you create your first thread, you cannot call fork anymore. Keep in mind that if you're using 3rd party libraries which may create threads, those have to be used/initialized after the fork() calls have happened.
Note that you can then start using threads within the main and fork()'ed processes.
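A bare-bones sketch of that layout might look like this (the one-byte command protocol and the worker body are placeholders, not anything from the question):

#include <sys/types.h>
#include <unistd.h>
#include <vector>

struct Worker { pid_t pid; int to_child; };

/* Call before any threads exist. Each child blocks on its own pipe waiting for work. */
std::vector<Worker> prefork_workers(int count)
{
    std::vector<Worker> workers;
    for (int i = 0; i < count; ++i) {
        int fds[2];
        if (pipe(fds) != 0) break;
        pid_t pid = fork();                       /* still single-threaded here */
        if (pid == 0) {
            close(fds[1]);
            char cmd;
            while (read(fds[0], &cmd, 1) == 1) {
                /* ... run the external binary, write results to this worker's shm file ... */
            }
            _exit(0);
        }
        close(fds[0]);
        workers.push_back(Worker{pid, fds[1]});
    }
    return workers;   /* threads may safely be created only after this returns */
}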
Know your state
In some circumstances, it may be possible for you to stop all of your threads, start the new process, and then restart your threads. This is somewhat similar to point (1) in the sense that you do not want threads running at the time you call fork(), although it requires a way for you to know about all the threads currently running in your software (something that is not always possible with 3rd party libraries).
Remember that "stopping a thread" using a wait is not going to work. You have to join with the thread so it is fully exited, because a wait require a mutex and those need to be unlocked when you call fork(). You just cannot know when the wait is going to unlock/re-lock the mutex and that's usually where you get stuck.
Choose one or the other
The other obvious possibility is to pick either fork() or threads and stick with it, so you never have to worry about the two interfering with each other. This is by far the simplest method, if it is at all possible in your software.
Create Threads only when Necessary
In some software, one creates one or more threads in a function, uses those threads, then joins all of them when exiting the function. This is somewhat equivalent to point (2) above, only you (micro-)manage threads as required instead of keeping threads that sit around until needed. This will work too; just keep in mind that creating a thread is a costly call: it has to allocate a new task with a stack and its own set of registers. However, this makes it easy to know when you have threads running, and except from within those functions, you are free to call fork().
In my programming, I have used all of these solutions. I used point (2) because I was using the threaded version of log4cplus and needed to use fork() for some parts of my software.
As mentioned by others, if you are using fork() only to then call execve(), the idea is to do as little as possible between the two calls. That is likely to work 99.999% of the time (many people use system() or popen() with fairly good success too, and these do similar things). The fact is that if you do not hit any of the mutexes held by the other threads, this will work without issue.
On the other hand, if, like me, you want to do a fork() and never call execve(), then it's not likely to work right while any thread is running.
What is actually happening?
The issue is that fork() creates a separate copy of only the current task (a process under Linux is called a task in the kernel).
Each time you create a new thread (pthread_create()), you also create a new task, but within the same process (i.e. the new task shares the process space: memory, file descriptors, ownership, etc.). However, a fork() ignores those extra tasks when duplicating the currently running task.
+-----------------------------------------------+
| Process A                                     |
|                                               |
|  +----------+  +----------+  +----------+     |
|  | thread 1 |  | thread 2 |  | thread 3 |     |
|  +----------+  +----+-----+  +----------+     |
|                     |                         |
+---------------------|-------------------------+
                      | fork()
                      |
+---------------------|-------------------------+
|                     v              Process B  |
|                +----------+                   |
|                | thread 1 |                   |
|                +----------+                   |
|                                               |
+-----------------------------------------------+
So in Process B, we lose thread 1 and thread 3 from Process A. This means that if either of them holds a lock on a mutex or something similar, then Process B is going to lock up quickly. The locks are the worst, but any resources that either thread still holds at the time the fork() happens are leaked in the child (socket connections, memory allocations, device handles, etc.). This is where point (2) above comes in. You need to know your state before the fork(). If you have a very small number of threads, or worker threads defined in one place that you can easily stop, then it will be easy enough.

If you are using the Unix fork() system call, then you are not technically using threads, you are using processes: they will have their own memory space and therefore cannot interfere with each other.
As long as each process uses different files, there should not be any issue.

Related

What is the point of the process fork creates being a copy of the parent?

I know the answer to "why is it this way" is because the language was invented so, but it seems like a lot of wasted effort that fork() spawns a copy of the process that called it. Perhaps it is useful sometimes, but surely the majority of time someone wants to start a new process its not to be a duplicate of the calling one? Why does fork create an identical process and not an empty one or one defined by passing an argument?
From yolinux:
The fork() system call will spawn a new child process which is an identical process to the parent except that it has a new system process ID.
In other words when is it useful to start with a copy of the parent process?
One big advantage of having the parent process duplicated in the child is that it allows the parent program to make customizations to the child process' environment before executing it. For example, the parent might want to read the child process' stdout, in which case it needs to set up the pipes in order to allow it to read that before execing the new program.
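A rough sketch of that setup, with a placeholder command (ls -l) and error handling omitted:

#include <stdio.h>
#include <unistd.h>

/* Returns a stream from which the parent can read the child's stdout, or NULL on error. */
FILE *spawn_with_stdout(void)
{
    int fds[2];
    if (pipe(fds) != 0) return 0;

    pid_t pid = fork();
    if (pid == 0) {                       /* child: customize, then exec */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);      /* route stdout into the pipe */
        close(fds[1]);
        execlp("ls", "ls", "-l", (char *)0);
        _exit(127);
    }
    close(fds[1]);
    return pid > 0 ? fdopen(fds[0], "r") : 0;
}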
It's also not as bad as it sounds, efficiency-wise. The whole thing is implemented on Linux using copy-on-write semantics for the process' memory (except in the special cases noted in the man page):
Under Linux (and in most unices since Version 7, parent of all unices alive now), fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables (which can also be copy-on-write), and to create a unique task structure for the child.
There are some very legitimate uses of the fork system call. Here are a few examples:
Memory saving. Because fork on any modern UNIX/Linux system shares memory between the child and parent (via copy-on-write semantics), a parent process can load some static data which can be instantly shared to a child process. The zygote process on Android does this: it preloads the Java (Dalvik) runtime and many classes, then simply forks to create new application processes on demand (which inherit a copy of the parent's runtime and loaded classes).
Time saving. A process can perform some expensive initialization procedure (such as Apache loading configuration files and modules), then fork off workers to perform tasks which use the preloaded initialization data.
Arbitrary process customization. On systems that have direct process creation methods (e.g. Windows with CreateProcess, QNX with spawn, etc.), these direct process creation APIs tend to be very complex, since every possible customization of the process has to be specified in the function call itself. By contrast, with fork/exec, a process can just fork, perform customizations via standard system calls (close, signal, dup, etc.) and then exec when it's ready. fork/exec is consequently one of the simplest process creation APIs in existence, yet simultaneously one of the most powerful and flexible.
To be fair, fork also has its fair share of problems. For example, it doesn't play nice with multithreaded programs: only one thread is created in the new process, and locks are not correctly released (leading to the necessity of atfork handlers to reset lock states across a fork).
Contrary to all expectations, it's mainly fork that makes process creation so incredibly fast on Unices.
AFAIK, on Linux, the actual process memory is not copied upon fork, the child starts with the same virtual memory mapping as the parent, and pages are copied only where and when the child makes changes. The majority of pages are read-only code anyway, so they are never copied. This is called copy-on-write.
Use cases where copying the parent process is useful:
Shells
When you say cat foo >bar, the shell forks, and the child process (still the shell) prepares the redirection and then execs cat foo. The executed program runs under the same PID as the child shell and inherits all open file descriptors. You would not believe how easy it is to write a basic Unix shell.
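For instance, the child's side of cat foo >bar boils down to roughly this (file names taken from the example above, error handling omitted):

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int run_cat_foo_to_bar(void)
{
    pid_t pid = fork();
    if (pid == 0) {                                   /* child: still "the shell" */
        int fd = open("bar", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        dup2(fd, STDOUT_FILENO);                      /* this is the ">bar" part */
        close(fd);
        execlp("cat", "cat", "foo", (char *)0);       /* becomes "cat foo", same PID */
        _exit(127);
    }
    int status = 0;
    waitpid(pid, &status, 0);                         /* the shell waits for the command */
    return status;
}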
Daemons (services)
Daemons run in the background. Many of them fork after some initial preparation, the parent exits, and the child detaches from the terminal and remains running in the background.
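The classic (simplified) sequence looks roughly like this; the double fork and /dev/null redirection are the usual recipe, not anything specific to a particular daemon:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

void daemonize(void)
{
    if (fork() > 0) exit(0);        /* parent exits; the child keeps running */
    setsid();                       /* child becomes session leader, detaches from the tty */
    if (fork() > 0) exit(0);        /* second fork: the daemon can never reacquire a tty */

    chdir("/");
    int devnull = open("/dev/null", O_RDWR);
    dup2(devnull, STDIN_FILENO);
    dup2(devnull, STDOUT_FILENO);
    dup2(devnull, STDERR_FILENO);
    /* ... the daemon's real work continues here, in the background ... */
}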
Network servers
Many networking daemons have to handle multiple connections at the same time; sshd is an example. The main daemon runs as root and listens for new connections on port 22. When a new connection comes in, it forks a child. The child keeps only the new socket representing that connection, authenticates the user, drops privileges, and so on.
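In sketch form, that accept-then-fork loop is roughly the following (the listening socket is assumed to be already bound and listening; the per-connection work is a placeholder):

#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

void serve_forever(int listen_fd)                 /* listen_fd: already bound + listening */
{
    for (;;) {
        int conn = accept(listen_fd, 0, 0);
        if (conn < 0) continue;
        if (fork() == 0) {                        /* child: owns just this connection */
            close(listen_fd);
            /* ... authenticate, drop privileges, serve the connection ... */
            close(conn);
            _exit(0);
        }
        close(conn);                              /* parent: back to listening */
        while (waitpid(-1, 0, WNOHANG) > 0)       /* reap any finished children */
            ;
    }
}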
Etc
Why fork()? It had nothing to do with C; C itself was only coming into existence at the time. It's because of the way the original UNIX memory and process management worked: it was trivial to cause a process to be paged out and then paged back in at a different location, without unloading the first copy of the process.
In The Evolution of the Unix Time-sharing System (http://cm.bell-labs.com/cm/cs/who/dmr/hist.html), Dennis Ritchie says "In fact, the PDP-7's fork call required precisely 27 lines of assembly code." See the link for more.
Threads are evil. With threads, you essentially have a number of processes all with access to the same memory space, which can dance all over each other's values. There's no memory protection at all. See The Art of Unix Programming, Chapter 7 (http://www.faqs.org/docs/artu/ch07s03.html#id2923889) for a fuller explanation.

Fork and core dump with threads

Similar points to the one in this question have been raised before here and here, and I'm aware of the Google coredump library (which I've appraised and found lacking, though I might try and work on that if I understand the problem better).
I want to obtain a core dump of a running Linux process without interrupting the process. The natural approach is to say:
if (!fork()) { abort(); }
Since the forked process gets a fixed snapshot copy of the original process's memory, I should get a complete core dump, and since the copy uses copy-on-write, it should generally be cheap. However, a critical shortcoming of this approach is that fork() only forks the current thread, and all other threads of the original process won't exist in the forked copy.
My question is whether it is possible to somehow obtain the relevant data of the other, original threads. I'm not entirely sure how to approach this problem, but here are a couple of sub-questions I've come up with:
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
Is it possible to (quickly) enumerate all the running threads in the original process and store the addresses of the bases of their stacks? As I understand it, the base of a thread stack on Linux contains a pointer to the kernel's thread bookkeeping data, so...
with the stored thread base addresses, could you read out the relevant data for each of the original threads in the forked process?
If that is possible, perhaps it would only be a matter of appending the data of the other threads to the core dump. However, if that data is lost at the point of the fork already, then there doesn't seem to be any hope for this approach.
Are you familiar with process checkpoint-restart? In particular, CRIU? It seems to me it might provide an easy option for you.
I want to obtain a core dump of a running Linux process without interrupting the process [and] to somehow obtain the relevant data of the other, original threads.
Forget about not interrupting the process. If you think about it, a core dump has to interrupt the process for the duration of the dump; your true goal must therefore be to minimize the duration of this interruption. Your original idea of using fork() does interrupt the process, it just does so for a very short time.
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
No. The fork() only retains the thread that does the actual call, and the stacks for the rest of the threads are lost.
Here is the procedure I'd use, assuming CRIU is unsuitable:
Have a parent process that generates a core dump of the child process whenever the child is stopped. (Note that more than one consecutive stop event may be generated; only the first one until the next continue event should be acted on.)
You can detect the stop/continue events using waitpid(child, &status, WUNTRACED|WCONTINUED).
Optional: Use sched_setaffinity() to restrict the process to a single CPU, and sched_setscheduler() (and perhaps sched_setparam()) to drop the process priority to IDLE.
You can do this from the parent process, which only needs the CAP_SYS_NICE capability in both the effective and permitted sets (which you can grant with setcap 'cap_sys_nice=pe' parent-binary, provided you have filesystem capabilities enabled, as most current Linux distributions do).
The intent is to minimize the progress of the other threads between the moment a thread decides it wants a snapshot/dump, and the moment when all threads have been stopped. I have not tested how long it takes for the changes to take effect -- certainly they only happen at the end of their current timeslices at the very earliest. So, this step should probably be done a bit beforehand.
Personally, I don't bother. On my four-core machine, the following SIGSTOP alone yields similar latencies between threads as a mutex or a semaphore does, so I don't see any need to strive for even better synchronization.
When a thread in the child process decides it wants to take a snapshot of itself, it sends a SIGSTOP to itself (via kill(getpid(), SIGSTOP)). This stops all threads in the process.
The parent process will receive the notification that the child was stopped. It first examines /proc/PID/task/ to obtain the TIDs for each thread of the child process (and perhaps the /proc/PID/task/TID/ pseudofiles for other information), then attaches to each TID using ptrace(PTRACE_ATTACH, TID). Obviously, ptrace(PTRACE_GETREGS, TID, ...) will obtain the per-thread register states, which can be used in conjunction with /proc/PID/task/TID/smaps and /proc/PID/task/TID/mem to obtain the per-thread stack trace, and whatever other information you're interested in. (For example, you could create a debugger-compatible core file for each thread.)
When the parent process is done grabbing the dump, it lets the child process continue. I believe you need to send a separate SIGCONT signal to let the entire child process continue, instead of just relying on ptrace(PTRACE_CONT, TID), but I haven't checked this; do verify this, please.
I do believe that the above will yield a minimal delay in wall clock time between the threads in the process stopping. Quick tests on an AMD Athlon II X4 640 on Xubuntu with a 3.8.0-29-generic kernel indicate that tight loops incrementing a volatile variable in the other threads only advance the counters by a few thousand, depending on the number of threads (there's too much noise in the few tests I made to say anything more specific).
Limiting the process to a single CPU, and even to IDLE priority, will drastically reduce that delay even further. CAP_SYS_NICE capability allows the parent to not only reduce the priority of the child process, but also lift the priority back to original levels; filesystem capabilities mean the parent process does not even have to be setuid, as CAP_SYS_NICE alone suffices. (I think it'd be safe enough -- with some good checks in the parent program -- to be installed in e.g. university computers, where students are quite active in finding interesting ways to exploit the installed programs.)
It is possible to create a kernel patch (or module) that provides a boosted kill(getpid(), SIGSTOP) that also tries to kick off the other threads from running CPUs, and thus try to make the delay between the threads stopping even smaller. Personally, I wouldn't bother. Even without the CPU/priority manipulation I get sufficient synchronization (small enough delays between the times the threads are stopped).
Do you need some example code to illustrate my ideas above?
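For what it's worth, here is a rough, untested sketch of the parent's side of that loop (the child PID is assumed to be known, and error handling, the CPU/priority tweaks, and the actual core-file writing are all omitted):

#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

void dump_child_on_stop(pid_t child)
{
    int status;
    while (waitpid(child, &status, WUNTRACED | WCONTINUED) == child) {
        if (!WIFSTOPPED(status)) continue;          /* only act on stop events */

        char path[64];
        snprintf(path, sizeof path, "/proc/%d/task", (int)child);
        DIR *dir = opendir(path);
        struct dirent *ent;
        while (dir && (ent = readdir(dir)) != 0) {
            if (ent->d_name[0] == '.') continue;
            pid_t tid = (pid_t)atoi(ent->d_name);
            ptrace(PTRACE_ATTACH, tid, 0, 0);
            waitpid(tid, 0, __WALL);                /* __WALL: also wait on clone threads */
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, tid, 0, &regs);  /* per-thread register state */
            /* ... use regs with /proc/<child>/task/<tid>/mem and smaps for the dump ... */
            ptrace(PTRACE_DETACH, tid, 0, 0);
        }
        if (dir) closedir(dir);
        kill(child, SIGCONT);                       /* let the whole child continue */
    }
}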
When you fork, you get a full copy of the running process's memory. This includes all threads' stacks (after all, you could have valid pointers into them). But only the calling thread continues to execute in the child.
You can easily test this. Make a multithreaded program and run:
pid_t parent_pid = getpid();
if (!fork()) {
    /* Child: freeze the parent so its memory maps cannot change under us. */
    kill(parent_pid, SIGSTOP);
    char buffer[0x1000];
    pid_t child_pid = getpid();
    sprintf(buffer, "diff /proc/%d/maps /proc/%d/maps", parent_pid, child_pid);
    system(buffer);              /* prints nothing if the mappings are identical */
    kill(parent_pid, SIGTERM);
    return 0;
} else for (;;);                 /* parent: busy-loop in the meantime */
So all your memory is there, and when you create a core dump it will contain all the other threads' stacks (provided your maximum core file size permits it). The only pieces that will be missing are their register sets. If you need those, then you will have to ptrace your parent to obtain them.
You should keep in mind, though, that core dumps are not designed to contain runtime information for more than one thread - the one that caused the core dump.
To answer some of your other questions:
You can enumerate threads by going through /proc/[pid]/task, but you cannot identify their stack bases until you ptrace them.
Yes, you have full access to snapshots of the other threads' stacks (see above) from the forked process. It is not trivial to locate them, but they do get put into a core dump, provided the core file size permits it. Your best bet is to save their base addresses in some globally accessible structure upon thread creation, if you can.
If you intend to get the core file at a non-specific location, that is, just get a core image of the running process without killing it, then you can use gcore.
If you intend to get the core file at a specific location (condition) and still continue running the process, a crude approach is to execute gcore programmatically from that location.
A more classical, clean approach would be to check the API which gcore uses and embed it in your application, but that would be too much effort compared to the need most of the time.
HTH!
If your goal is to snapshot the entire process in order to understand the exact state of all threads at a specific point then I can't see any way to do this that doesn't require some kind of interrupt service routine. You must halt all processors and record off the current state of each thread.
I don't know of any system that provides this kind of full process core dump. The rough outlines of the process would be:
issue an interrupt across all CPUs (both logical and physical cores).
busy wait for all cores to synchronize (this shouldn't take long).
clone the desired process's memory space: duplicate the page tables and mark all pages as copy on write.
have each processor check whether its current thread is in the target process. If so record the current stack pointer for that thread.
for every other thread examine the thread data block for the current stack pointer and record it.
create a kernel thread to save off the copied memory spaces and the thread stack pointers
resume all cores.
This should capture the entire process state, including a snapshot of any threads that were actually running at the moment the inter-processor interrupt was issued. Because all threads are interrupted (either through the standard scheduler suspension process, or via our custom interrupt) all register states will be on a stack somewhere in the process memory. You then only need to know where the top of each thread's stack is. Using the copy-on-write mechanism to clone the page tables allows a transparent save-off while the original process is allowed to resume.
This is a pretty heavyweight option, since its main functionality requires suspending all processors for a significant amount of time (synchronize, clone, walk all threads). However, this should allow you to capture the exact status of all threads, as well as determine which threads were running (and on which CPUs) when the checkpoint was reached. I would assume some of the framework for doing this exists (in CRIU, for instance). Of course, resuming the process will result in a storm of page allocations as the copy-on-write mechanism protects the check-pointed system state.

Threads and fork(). How can I deal with that? [duplicate]

Possible Duplicate:
fork in multi-threaded program
If I have an application which employs fork() and might be developed as multithreaded, what are the rules of thumb/guidelines to consider in order to safely program this kind of application?
The basic rules of thumb, according to various internet articles such as ( http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them , fork in multi-threaded program ) are:
(Main) Process[0] Monothread --> fork() --> (Child) Process[1] Multithreaded: OK!
If Process[1] crashes or messes around with memory, it won't touch the address space of Process[0] (unless you use shared R/W memory... but that is another topic of its own). In Linux, by default, all fork()ed memory is copy-on-write. Given that Process[0] is monothreaded, when we invoke fork() all mutual exclusion primitives should generally be in an unlocked state.
(Main) Process[0] Multithreaded --> fork() --> (Child) Process[1] Mono/Multithread: BAD!
If you fork() a multithreaded process, your mutexes and many other thread synchronization primitives will likely be in an undefined state in Process[1]. You can work around this with pthread_atfork(), but if you use libraries you might as well roll a die and hope to be lucky, because generally you don't (want to) know the implementation details of libraries.
The advantage of fork() here is that you can manipulate/read/aggregate your data more quickly (in the Child process) without having to care about the stability of the process you fork() from (Main). This is useful if your main process holds a large dataset in memory and you don't want to duplicate/reload it to safely process the data in another process (Child). This way the original process is stable and independent from the data aggregation/manipulation process (the fork()ed one).
Of course this means that the original process will generally be slower than it might be if developed in multithreaded fashion. But again, this is the price you might want to be paying for more stability.
If instead your main process is multithreaded, refrain from using fork(). It's going to be a proper mess to implement it in a stable way.
Cheers
On Linux, threads are implemented in terms of processes. In other words, threads are really just a fork() with mostly shared memory, instead of completely copy-on-write memory. What this means is that when you use fork() in a thread (main or other), you end up copying the entire shared memory space of all of the threads, plus the thread-specific storage of the thread you call fork() from.
Now all of this sounds good, but that doesn't mean that this is what will happen or work well. If you want to make a cloned process, try to do a fork before starting any other threads, and then use read-only virtual memory to keep the forked process up to date with current memory values.
So although it may work, I just suggest testing, and try to find another way first. And be prepared for a lot of:
Segmentation fault

Boost, C++ how to kill thread opened by another thread?

So I have some main function. 24 times a second it opens a boost thread A with a function. That function takes in a buffer with data. It starts up a boost timer. It opens another thread B with a function, sending the buffer into it. I need thread A to kill thread B if it is executing for way too long. Of course, if thread B has finished in time, I do not need to kill it; it should end itself. What boost function can help me kill a created thread (not join it, but stop/kill it or something like that)?
BTW, I cannot affect the speed of the function I am executing in thread B; that's why I need to be capable of killing it when needed.
There's no clean way to kill a thread, so if you need to do something like this, your clean choices are to either use a function that includes some cancellation capability, or use a separate process for it, since you can kill a process cleanly.
Other than that, my immediate reaction is that instead of "opening" (do you mean creating?) thread A 24 times a second, you'd be better off with thread A reading a buffer, sending it on to thread B, then sleeping until it's ready to read another buffer. Creating and killing threads isn't terribly expensive, but doing it at a rate of 24 (or, apparently, 48) a second strikes me as a bit excessive.
The term you are looking for is "cancellation", as in pthread_cancel(3). Cancellation is troublesome, because the cancelled thread might not execute C++ destructors or release locks on the way out ... but then again it might; the uncertainty is actually worse than a definitive no.
Because of this, boost threads do not support cancellation (see for instance this older question) but they do support interruption, which you might be able to bend to fit. Interruption works by way of a regular C++ exception so it has predictable semantics.
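A small sketch of how interruption is used (the worker body and timings are illustrative only; typically built with -lboost_thread -lboost_system):

#include <boost/chrono.hpp>
#include <boost/thread.hpp>
#include <iostream>

void worker()
{
    try {
        for (;;) {
            // sleep_for is an interruption point: interrupt() makes it throw.
            boost::this_thread::sleep_for(boost::chrono::milliseconds(10));
            // ... process a chunk of the buffer here ...
        }
    } catch (boost::thread_interrupted const &) {
        // Unwind normally: destructors run and held locks are released.
        std::cout << "worker B interrupted\n";
    }
}

int main()
{
    boost::thread b(worker);                                  // plays the role of thread B
    boost::this_thread::sleep_for(boost::chrono::milliseconds(100));
    b.interrupt();    // A asks B to stop at its next interruption point
    b.join();         // still join: interruption is cooperative, not a kill
}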
Please don't kill threads at random unless you completely control their execution (and even then, just arrange proper signals for the threads to exit gracefully). You never know if the other thread is in some critical section of a library you've never heard of, and then your program will end up stalling on that critical section because it was never exited, or something like that.

Mystery pthread problem with fork()

I have a program which:
has a main thread (1) which starts a server thread (2) and another (4).
the server thread (2) does an accept(), then creates a new thread (3) to handle the connection.
At some point, thread (4) does a fork/exec to run another program which should connect to the socket that thread (2) is listening on. Occasionally this fails or takes an unreasonably long time, and it's extremely difficult to diagnose. If I strace the system, it appears that the fork/exec has worked, the accept has happened, and the new thread (3) has been created... but nothing happens in that thread (using strace -ff, the file for the relevant pid is blank).
Any ideas?
I came to the conclusion that it was probably this phenomenon:
http://kerneltrap.org/mailarchive/linux-kernel/2008/8/15/2950234/thread
as the bug is difficult to trigger on our development systems but is generally reported by users running on large shared machines; also, the forked application starts a JVM, which itself allocates a lot of threads. The problem is also associated with the machine being loaded and with extensive memory usage (we have a machine with 128 GB of RAM, and processes may be 10-100 GB in size).
I've been reading the O'Reilly pthreads book, which explains pthread_atfork(), and suggests the use of a "surrogate parent" process forked from the main process at startup from which subprocesses are run. It also suggests the use of a pre-created thread pool. Both of these seem like good ideas, so I'm going to implement at least one of them.
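In sketch form, the surrogate-parent idea looks roughly like this (the one-byte request protocol and the placeholder program are invented for illustration):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Call very early in main(), before any pthread_create(). Returns a pipe fd the
   (later multithreaded) main process writes to when it wants a program spawned. */
int start_surrogate_parent(void)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;

    if (fork() == 0) {                        /* helper: stays single-threaded forever */
        close(fds[1]);
        char cmd;
        while (read(fds[0], &cmd, 1) == 1) {
            if (fork() == 0) {
                execlp("/bin/true", "true", (char *)0);   /* placeholder program */
                _exit(127);
            }
            while (waitpid(-1, 0, WNOHANG) > 0)
                ;
        }
        _exit(0);
    }
    close(fds[0]);
    return fds[1];    /* main process: write one byte here per spawn request */
}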
It looks like a deadlock condition. Look at blocking functions, like accept(); the problem should be there.
Reduce the code to the smallest possible size that still exhibits the behavior and post it here. Either you will find the answer yourself, or we will be able to track it down.
BTW, per http://lists.samba.org/archive/linux/2002-February/002171.html it seems that pthread behavior across exec is not well defined and may depend on your OS.
Do you have any code between fork and exec? This may be a problem.
Be very careful with multiple threads and fork. Most of glibc/libstdc++ is thread safe. If a thread other than the forking thread is holding a lock when the fork executes, the forked process will inherit the mutexes in their current locked state. The new process will never see those mutexes unlocked. For more information, see man pthread_atfork.
I've just run into the same problems, and found that fork() duplicates all the threads. Now imagine what your program does after a fork() with all the threads running as double instances...
The following rules are from "A Mini-guide regarding fork() and Pthreads":
1- You DO NOT WANT to do that.
2- If you need to fork(), then whenever possible, fork() all your children prior to starting any threads.
Edit: tried, fork() does not duplicate threads.