C++11 Thread is joinable but join() raises an exception - c++

I have a strange issue with C++11 threads.
Unfortunately I cannot paste the full example (given the complexity) and I cannot replicate the issue on a simpler example.
So the problem is that I have a thread which is running (nor join nor detach has been called on it).
At some point another thread wants to stop this thread. The implementation simply set a boolean variable to false, and the call the join to wait for thread termination.
Well, the problem is the join.
I checked that the current thread (calling the join) is different from the joined thread and joinable() returns true.
Nevertheless this exception occurs:
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: thread::join failed: No such process
This happens on macOS 10.11 but I had a colleague of mine test it on linux and it does not occur.
Any clue?

This might happen if you call fork() after creating additional threads in parent process. One important thing that differs the child process from the parent is that the child has only one thread.
So all C++ code which thinks there's a thread will be fooled and join() will throw
"No such process". In this case because native call will return ESRCH.
You shouldn't create threads before calling fork().

Related

Why must one call join() or detach() before thread destruction?

I don't understand why when an std::thread is destructed it must be in join() or detach() state.
Join waits for the thread to finish, and detach doesn't.
It seems that there is some middle state which I'm not understanding.
Because my understanding is that join and detach are complementary: if I don't call join() than detach() is the default.
Put it this way, let's say you're writing a program that creates a thread and only later in the life of this thread you call join(), so up until you call join the thread was basically running as if it was detached, no?
Logically detach() should be the default behavior for threads because that is the definition of what threads are, they are parallelly executed irrespective of other threads.
So when the thread object gets destructed why is terminate() called? Why can't the standard simply treat the thread as being detached?
I'm not understanding the rationale behind terminating a program when either join() or detached() wasn't called before the thread was destructed. What is the purpose of this?
UPDATE:
I recently came across this. Anthony Williams states in his book, Concurrency In Action, "One of the proposals for C++17 was for a joining_thread class that would be similar to std::thread, except that it would automatically join in the destructor much like scoped_thread does. This didn’t get consensus in the committee, so it wasn’t accepted into the standard (though it’s still on track for C++20 as std::jthread)..."
Technically the answer is "because the spec says so" but that is an obtuse answer. We can't read the designers' minds, but here are some issues that may have contributed:
With POSIX pthreads, child threads must be joined after they have exited, or else they continue to occupy system resources (like a process table entry in the kernel). This is done via pthread_join().
Windows has a somewhat analogous issue if the process holds a HANDLE to the child thread; although Windows doesn't require a full join, the process must still call CloseHandle() to release its refcount on the thread.
Since std::thread is a cross-platform abstraction, it's constrained by the POSIX requirement which requires the join.
In theory the std::thread destructor could have called pthread_join() instead of throwing an exception, but that (subjectively) that may increase the risk of deadlock. Whereas a properly written program would know when to insert the join at a safe time.
See also:
https://en.wikipedia.org/wiki/Zombie_process
https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessa
https://learn.microsoft.com/en-us/windows/win32/procthread/terminating-a-process
You're getting confused because you're conflating the std::thread object with the thread of execution it refers to. A std::thread object is a C++ object (a bunch of bytes in memory) that acts as a reference to a thread of execution. When you call std::thread::detach what happens is that the std::thread object is "detached" from the thread of execution -- it no longer refers to (any) thread of execution, and the thread of execution continues running independently. But the std::thread object still exists, until it is destroyed.
When a thread of execution completes, it stores its exit info into the std::thread object that refers to it, if there is one (If it was detached, then there isn't one, so the exit info is just thrown away.) It has no other effect on the std::thread object -- in particular the std::thread object is not destroyed and continues to exist until someone else destroys it.
You might want a thread to completely clean up after itself when it's done leaving no traces. This would mean that you could start a thread and then forget about it.
But you might also want to be able to manage a thread while it was running and get any return value it had provided when it was done. In this case, if a thread cleaned up after itself when it was done, your attempt to manage it could cause a crash because you would be accessing a handle that might be invalid. And to check for the return value when the thread finishes, the return value has to be stored somewhere, which means the thread can't be fully cleaned up because the place where the return value is stored has to be left around.
In most frameworks, by default, you get the second option. You can manage the thread (by interrupting it, sending signals to it, joining it, or whatever) but it can't clean up after itself. If you prefer the first option, there's a function to get that behavior (detach) but that means that you may not be able to access the thread because it may or may not continue to exist.
When a thread handle for an active thread goes out of scope you have a couple of options:
join
detach
kill thread
kill program
Each one of these options is terrible. No matter which one you pick it will be surprising, confusing and not what you wanted in most situations.
Arguably the joining thread you mentioned already exists in the form of std::async which gives you a std::future that blocks until the created thread is done, so doing an implicit join. But the many questions about why
std::async(std::launch::async, f);
g();
does not run f and g concurrently indicate how confusing that is. The best approach I'm aware of is to define it to be a programming error and have the programmer fix it, so an assert would be most appropriate. Unfortunately the standard went with std::terminate instead.
If you really want a detaching thread just write a little wrapper around std::thread that does if (thread.joinable()) thread.detach(); in its destructor or whichever handler you want.
Question: "So when the thread object gets destructed why is terminate() called? Why can't the standard simply treat the thread as being detached?"
Answer: Yes, I agree that it terminates the program badly but such design has its reasons. Without the std::terminate() mechanism in the destructor std::thread::~thread, if the users really wanted to do join(), but for some reason "join" didn't execute (for e.g. exception was thrown) then the new_thread will run in the background just like the detach() behaviors. This might cause undefined behaviors because that was not the original intention of the user to have a detached thread.

When can std::thread::join fail due to no_such_process

std::thread::join() is permitted to fail, throwing a std::system_error for no_such_process if the thread is "not valid". Note that the no_such_process case is distinct from a thread that is not joinable (for which the error code is invalid_argument).
In what circumstances might that happen? Alternatively, what must I do to ensure that join() does not fail for that reason? I want a destructor to join() some threads it manages, and of course I want the destructor to never throw exceptions. What can make a (properly constructed and not destroyed) thread "not valid".
In what circumstances might that happen?
On *nix systems, it happens when you try to join a thread whose ID is not in the thread table, meaning the thread does not exist (anymore). This might happen when a thread has already been joined and terminated, or if your thread variable's memory has been corrupted.
Alternatively, what must I do to ensure that join() does not fail for that reason?
You might test std::thread::joinable(), but it might also fail1. Just don't mess with your thread variables, and you're good to go. Simply ignore this possibility, if you encounter such an error your program better core dump and let you analyse the bug.
1) By fail, I mean report true instead of false or the other way around, not throw or crash.
The no_such_process error code corresponds to a ESRCH POSIX error code. On a POSIX system std::thread::join() probably delegates to pthread_join().
Issue 7 of POSIX removed the possibility of an ESRCH.
On Linux, pthread_join may give ESRCH if no thread with the given thread ID could be found. The ID of a C++ thread is private data, so the only way the ID could be not found would be if this does not point to a properly constructed std::thread.
I conclude that this error condition can only occur as a result of an earlier action that had undefined behaviour, such as a bad reinterpret_cast or use of a dangling pointer.

C++11 safely join a thread without using a try / catch block

According to the documentation here and here, the join method of a C++11 thread will throw a std::system_error if joinable() == false. Thus the natural way to wait for a thread to complete execution is something along the lines of:
if (thread2.joinable()) thread2.join();
However, this has the possibility to throw a std::system_error. Consider thread 1 calls thread2.joinable(), returns true, indicating that the thread2 is still running. Then the scheduler pauses thread1 and switches contexts to thread 2. Thread 2 completes, and then thread 1 resumes. Thread 1 calls thread2.join(), but thread2 has already completed, and as a result, std::system_error is thrown.
A possible solution is to wrap the whole thing in a try block:
try {
thread2.join();
catch (std::system_error &e) {}
But then when a legitimate std::system_error is thrown, possibly to indicate that the thread failed to join, the program continues on, acting as though everything is fine and dandy. Is there a proper way to join a thread besides using a try/catch block like this?
joinable does not do what you think it does. All it does is return whether the thread object is not associated with a thread. However, thread::join will fail if the thread object also represents the current thread. So the only reason for thread::join to fail due to lack of joinable is if you tried to join with yourself.
A completed thread (which isn't your own) is still perfectly joinable.

Why does pthread_exit() in rare cases cause a SEGV when called after pthread_detach()?

I am getting a SEGV in C++ that I cannot easily reproduce (it occurs in about one in 100,000 test runs) in my call to pthread_join() as my application is shutting down. I checked the value of errno and it is zero. This is running on Centos v4.
Under what conditions would pthread_join() get a SEGV? This might be some kind of race condition since it is extremely rare. One person suggests I should not be calling pthread_detach() and pthread_exit(), but I am not clear on why.
My first working hypothesis was that pthread_join() is being called while pthread_exit() is still running in the other thread and that this somehow leads to a SEGV, but many have stated this is not an issue.
The failing code getting SEGV in the main thread during application exit looks roughly like this (with error return code checking omitted for brevity):
// During application startup, this function is called to create the child thread:
return_val = pthread_create(&_threadId, &attr,
(void *(*)(void *))initialize,
(void *)this);
// Apparently this next line is the issue:
return_val = pthread_detach(_threadId);
// Later during exit the following code is executed in the main thread:
// This main thread waits for the child thread exit request to finish:
// Release condition so child thread will exit:
releaseCond(mtx(), startCond(), &startCount);
// Wait until the child thread is done exiting so we don't delete memory it is
// using while it is shutting down.
waitOnCond(mtx(), endCond(), &endCount, 0);
// The above wait completes at the point that the child thread is about
// to call pthread_exit().
// It is unspecified whether a thread that has exited but remains unjoined
// counts against {PTHREAD_THREADS_MAX}, hence we must do pthread_join() to
// avoid possibly leaking the threads we destroy.
pthread_join(_threadId, NULL); // SEGV in here!!!
The child thread which is being joined on exit runs the following code which begins at the point above where releaseCond() is called in the main thread:
// Wait for main thread to tell us to exit:
waitOnCond(mtx(), startCond(), &startCount);
// Tell the main thread we are done so it will do pthread_join():
releaseCond(mtx(), endCond(), &endCount);
// At this point the main thread could call pthread_join() while we
// call pthread_exit().
pthread_exit(NULL);
The thread appeared to come up properly and no error codes were produced during its creation during application startup and the thread performed its task correctly which took around five seconds before the application exited.
What might cause this rare SEGV to occur and how might I program defensively against it. One claim is that my call to pthread_detach() is the issue, if so, how should my code be corrected.
Assuming:
pthread_create returns zero (you are checking it, right?)
attr is a valid pthread_attr_t object (How are you creating it? Why not just pass NULL instead?)
attr does not specify that the thread is to be created detached
You did not call pthread_detach or pthread_join on the thread somewhere else
...then it is "impossible" for pthread_join to fail, and you either have some other memory corruption or a bug in your runtime.
[update]
The RATIONALE section for pthread_detach says:
The *pthread_join*() or *pthread_detach*() functions should eventually be
called for every thread that is created so that storage associated
with the thread may be reclaimed.
Although it does not say these are mutually exclusive, the pthread_join documentation specifies:
The behavior is undefined if the value specified by the thread
argument to *pthread_join*() does not refer to a joinable thread.
I am having trouble finding the exact wording that says a detached thread is not joinable, but I am pretty sure it is true.
So, either call pthread_join or pthread_detach, but not both.
If you read the standards documentation for pthread_join and pthread_exit and related pages, the join suspends execution "until the target thread terminates", and the thread calling pthread_exit doesn't terminate until it's done calling pthread_exit, so what you're worried about can't be the problem.
You may have corrupted memory somewhere (as Nemo suggests), or called pthread_exit from a cleanup handler (as user315052 suggests), or something else. But it's not "a race condition between pthread_join() and pthread_exit()", unless you're on a buggy or non-compliant implementation.
There is insufficient information to fully diagnose your problem. I concur with the other posted answers that the problem is more likely undefined behavior in your code than a race condition between pthread_join and pthread_exit. But I would also agree the existence of such a race would constitute a bug in the pthread library implementation.
Regarding pthread_join:
return_val = pthread_create(&_threadId, &attr,
(void *(*)(void *))initialize,
(void *)this);
//...
pthread_join(_threadId, NULL); // SEGV in here!!!
It looks like the join is in a class. This opens up the possibility that the object could be deleted while main is trying to do the join. If pthread_join is accessing freed memory, the result is undefined behavior. I am leaning towards this possibility, since accessing freed memory is very often undetected.
Regarding pthread_exit: The man page on Linux, and the POSIX spec state:
An implicit call to pthread_exit() is made when a thread other than the thread in which main() was first invoked returns from the start routine that was used to create it. The function's return value shall serve as the thread's exit status.
The behavior of pthread_exit() is undefined if called from a cancellation cleanup handler or destructor function that was invoked as a result of either an implicit or explicit call to pthread_exit().
If the pthread_exit call is made in a cleanup handler, you will have undefined behavior.

Is it safe to call CFRunLoopStop from another thread?

The Mac build of my (mainly POSIX) application spawns a child thread that calls CFRunLoopRun() to do an event loop (to get network configuration change events from MacOS).
When it's time to pack things up and go away, the main thread calls CFRunLoopStop() on the child thread's run-loop, at which point CFRunLoopRun() returns in the child thread, the child thread exits, and the main thread (which was blocking waiting for the child thread to exit) can continue.
This appears to work, but my question is: is this a safe/recommended way to do it? In particular, is calling CFRunLoopStop() from another thread liable to cause a race condition? Apple's documentation is silent on the subject, as far as I can tell.
If calling CFRunLoopStop() from the main thread is not the solution, what is a good solution? I know I could have the child thread call CFRunLoopRunInMode() and wake up every so often to check a boolean or something, but I'd prefer not to have the child thread do any polling if I can avoid it.
In the case of CFRunLoopStop - if it could only be called safely on the current run loop, then it would not be necessary to pass it a parameter indicating which run loop to stop.
The presence of the parameter is a strong indication that its ok to use it to stop run loops other than the current run loop.
In particular, is calling CFRunLoopStop() from another thread [safe]?
Here's what Run Loop Management says:
The functions in Core Foundation are generally thread-safe and can be called from any thread.
So maybe CFRunLoopStop is safe. But I do worry about their use of the word “generally”. My rule is: If Apple doesn't say it's safe, you should assume it's not.
To err on the safe side, you might consider creating a run loop source, adding that to your run loop, and signaling that source when it's time to end the thread. That same document includes an example of a custom run loop source.