When can std::thread::join fail due to no_such_process - c++

std::thread::join() is permitted to fail, throwing a std::system_error for no_such_process if the thread is "not valid". Note that the no_such_process case is distinct from a thread that is not joinable (for which the error code is invalid_argument).
In what circumstances might that happen? Alternatively, what must I do to ensure that join() does not fail for that reason? I want a destructor to join() some threads it manages, and of course I want the destructor to never throw exceptions. What can make a (properly constructed and not destroyed) thread "not valid"?

In what circumstances might that happen?
On *nix systems, it happens when you try to join a thread whose ID is not in the thread table, meaning the thread does not exist (anymore). This might happen when a thread has already been joined and terminated, or if your thread variable's memory has been corrupted.
Alternatively, what must I do to ensure that join() does not fail for that reason?
You might test std::thread::joinable(), but it might also fail [1]. Just don't mess with your thread variables, and you're good to go. Simply ignore this possibility; if you encounter such an error, your program had better core dump and let you analyse the bug.
[1] By fail, I mean report true instead of false or the other way around, not throw or crash.

The no_such_process error code corresponds to the ESRCH POSIX error code. On a POSIX system std::thread::join() probably delegates to pthread_join().
Issue 7 of POSIX removed the possibility of an ESRCH.
On Linux, pthread_join may give ESRCH if no thread with the given thread ID could be found. The ID of a C++ thread is private data, so the only way the ID could be not found would be if this does not point to a properly constructed std::thread.
I conclude that this error condition can only occur as a result of an earlier action that had undefined behaviour, such as a bad reinterpret_cast or use of a dangling pointer.
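For the destructor case raised in the question, here is a minimal sketch of what the answers above suggest: join only what is joinable, and treat anything else as a bug. The class name ThreadGroup and its spawn() method are made up for illustration. Because destructors are implicitly noexcept, a std::system_error escaping join() would end in std::terminate, which matches the "let it core dump" advice rather than hiding the problem.

```cpp
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical owner of several worker threads (the name is illustrative).
class ThreadGroup {
public:
    ThreadGroup() = default;
    ThreadGroup(const ThreadGroup&) = delete;
    ThreadGroup& operator=(const ThreadGroup&) = delete;

    // Starts a new thread running f and takes ownership of it.
    template <typename F>
    void spawn(F&& f) {
        threads_.emplace_back(std::forward<F>(f));
    }

    // Implicitly noexcept: if join() ever throws, std::terminate is called.
    ~ThreadGroup() {
        for (std::thread& t : threads_) {
            // Skip threads that were already joined, detached, or moved from;
            // joining them would fail with invalid_argument.
            if (!t.joinable()) continue;
            // Joining the current thread would fail with
            // resource_deadlock_would_occur, so treat it as a logic error.
            assert(t.get_id() != std::this_thread::get_id());
            t.join();
        }
    }

private:
    std::vector<std::thread> threads_;
};
```

As long as the std::thread objects are only touched through this class, the no_such_process case should not be reachable without some earlier undefined behaviour.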

Related

Why must one call join() or detach() before thread destruction?

I don't understand why when an std::thread is destructed it must be in join() or detach() state.
Join waits for the thread to finish, and detach doesn't.
It seems that there is some middle state which I'm not understanding.
Because my understanding is that join and detach are complementary: if I don't call join() then detach() is the default.
Put it this way, let's say you're writing a program that creates a thread and only later in the life of this thread you call join(), so up until you call join the thread was basically running as if it was detached, no?
Logically detach() should be the default behavior for threads because that is the definition of what threads are: they are executed in parallel irrespective of other threads.
So when the thread object gets destructed why is terminate() called? Why can't the standard simply treat the thread as being detached?
I'm not understanding the rationale behind terminating a program when neither join() nor detach() was called before the thread was destructed. What is the purpose of this?
UPDATE:
I recently came across this. Anthony Williams states in his book, C++ Concurrency in Action, "One of the proposals for C++17 was for a joining_thread class that would be similar to std::thread, except that it would automatically join in the destructor much like scoped_thread does. This didn’t get consensus in the committee, so it wasn’t accepted into the standard (though it’s still on track for C++20 as std::jthread)..."
Technically the answer is "because the spec says so" but that is an obtuse answer. We can't read the designers' minds, but here are some issues that may have contributed:
With POSIX pthreads, child threads must be joined after they have exited, or else they continue to occupy system resources (like a process table entry in the kernel). This is done via pthread_join().
Windows has a somewhat analogous issue if the process holds a HANDLE to the child thread; although Windows doesn't require a full join, the process must still call CloseHandle() to release its refcount on the thread.
Since std::thread is a cross-platform abstraction, it's constrained by the POSIX requirement which requires the join.
In theory the std::thread destructor could have called pthread_join() instead of calling std::terminate(), but that (subjectively) may increase the risk of deadlock, whereas a properly written program would know when to insert the join at a safe time.
See also:
https://en.wikipedia.org/wiki/Zombie_process
https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessa
https://learn.microsoft.com/en-us/windows/win32/procthread/terminating-a-process
You're getting confused because you're conflating the std::thread object with the thread of execution it refers to. A std::thread object is a C++ object (a bunch of bytes in memory) that acts as a reference to a thread of execution. When you call std::thread::detach what happens is that the std::thread object is "detached" from the thread of execution -- it no longer refers to (any) thread of execution, and the thread of execution continues running independently. But the std::thread object still exists, until it is destroyed.
When a thread of execution completes, it stores its exit info into the std::thread object that refers to it, if there is one (If it was detached, then there isn't one, so the exit info is just thrown away.) It has no other effect on the std::thread object -- in particular the std::thread object is not destroyed and continues to exist until someone else destroys it.
You might want a thread to completely clean up after itself when it's done leaving no traces. This would mean that you could start a thread and then forget about it.
But you might also want to be able to manage a thread while it was running and get any return value it had provided when it was done. In this case, if a thread cleaned up after itself when it was done, your attempt to manage it could cause a crash because you would be accessing a handle that might be invalid. And to check for the return value when the thread finishes, the return value has to be stored somewhere, which means the thread can't be fully cleaned up because the place where the return value is stored has to be left around.
In most frameworks, by default, you get the second option. You can manage the thread (by interrupting it, sending signals to it, joining it, or whatever) but it can't clean up after itself. If you prefer the first option, there's a function to get that behavior (detach) but that means that you may not be able to access the thread because it may or may not continue to exist.
When a thread handle for an active thread goes out of scope you have a couple of options:
join
detach
kill thread
kill program
Each one of these options is terrible. No matter which one you pick it will be surprising, confusing and not what you wanted in most situations.
Arguably the joining thread you mentioned already exists in the form of std::async which gives you a std::future that blocks until the created thread is done, so doing an implicit join. But the many questions about why
std::async(std::launch::async, f);
g();
does not run f and g concurrently indicate how confusing that is. The best approach I'm aware of is to define it to be a programming error and have the programmer fix it, so an assert would be most appropriate. Unfortunately the standard went with std::terminate instead.
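For the std::async surprise mentioned above, here is a minimal sketch of the usual way around it: keep the returned std::future in a named variable so its blocking destructor runs after g rather than at the end of the std::async statement. The helper name run_concurrently is made up, and the sketch assumes both callables return a non-void value.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Hypothetical helper: runs f on another thread and g on the calling thread,
// then returns both results.
template <typename F, typename G>
auto run_concurrently(F f, G g) -> std::pair<decltype(f()), decltype(g())> {
    // The future lives until the end of this function, so it does not block
    // here the way a discarded temporary future would.
    auto pending = std::async(std::launch::async, std::move(f));
    auto g_result = g();             // may overlap with f
    auto f_result = pending.get();   // explicit wait, similar to a join
    return {std::move(f_result), std::move(g_result)};
}
```

A call such as run_concurrently([] { return 1; }, [] { return 2; }) gives back both values as a pair once both callables have finished.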
If you really want a detaching thread just write a little wrapper around std::thread that does if (thread.joinable()) thread.detach(); in its destructor or whichever handler you want.
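A minimal sketch of such a wrapper could look like the following. The class name DetachingThread and the get() accessor are illustrative and not part of the standard library.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical wrapper: detaches in the destructor instead of letting
// std::thread call std::terminate.
class DetachingThread {
public:
    template <typename F>
    explicit DetachingThread(F&& f) : thread_(std::forward<F>(f)) {}

    DetachingThread(const DetachingThread&) = delete;
    DetachingThread& operator=(const DetachingThread&) = delete;

    ~DetachingThread() {
        // Only detach if nobody joined or detached the thread already.
        if (thread_.joinable()) thread_.detach();
    }

    // Access to the wrapped thread, e.g. for an explicit join().
    std::thread& get() { return thread_; }

private:
    std::thread thread_;
};
```

Anything the detached thread touches has to outlive it, which is exactly the kind of surprise the list of options above warns about.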
Question: "So when the thread object gets destructed why is terminate() called? Why can't the standard simply treat the thread as being detached?"
Answer: Yes, I agree that terminating the program is harsh, but the design has its reasons. Without the std::terminate() call in the destructor std::thread::~thread, if the user really intended to call join() but for some reason the join did not execute (e.g. because an exception was thrown), then the new thread would keep running in the background just as if detach() had been called. This might cause undefined behavior, because the user never intended to have a detached thread.
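The exception scenario described here can be sketched with a small RAII guard, in the spirit of the scoped_thread mentioned in the question's update. The names JoinGuard and work_then_throw are made up for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <thread>

// Hypothetical RAII guard: joins the referenced thread when the scope exits,
// whether normally or through an exception.
class JoinGuard {
public:
    explicit JoinGuard(std::thread& t) : thread_(t) {}
    JoinGuard(const JoinGuard&) = delete;
    JoinGuard& operator=(const JoinGuard&) = delete;
    ~JoinGuard() {
        if (thread_.joinable()) thread_.join();
    }

private:
    std::thread& thread_;
};

// Starts a worker and then throws before any explicit join() could run.
void work_then_throw(int& result) {
    std::thread worker([&result] { result = 42; });
    JoinGuard guard(worker);  // declared after worker, so it is destroyed first
    throw std::runtime_error("failure between thread start and join");
}
```

Without the guard, stack unwinding would destroy a joinable std::thread and the program would terminate. With it, the worker is joined before result goes out of scope in the caller.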

C++11 Thread is joinable but join() raises an exception

I have a strange issue with C++11 threads.
Unfortunately I cannot paste the full example (given the complexity) and I cannot replicate the issue on a simpler example.
So the problem is that I have a thread which is running (neither join nor detach has been called on it).
At some point another thread wants to stop this thread. The implementation simply sets a boolean variable to false, and then calls join to wait for thread termination.
Well, the problem is the join.
I checked that the current thread (calling the join) is different from the joined thread and joinable() returns true.
Nevertheless this exception occurs:
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: thread::join failed: No such process
This happens on macOS 10.11, but I had a colleague of mine test it on Linux and it does not occur there.
Any clue?
This might happen if you call fork() after creating additional threads in the parent process. One important difference between the child process and the parent is that the child has only one thread.
So all C++ code in the child that thinks those threads still exist will be fooled, and join() will throw "No such process", because in this case the native call returns ESRCH.
You shouldn't create threads before calling fork().

what will happen if we signal semaphore without wait?

While implementing my project I came across this scenario: I have a binary semaphore which is taken by one thread. While that thread is executing, the semaphore is signaled multiple times by another thread. Is this an issue, or will it cause any undefined behavior?
It is an error to signal a semaphore without a corresponding wait. What happens if you do this is implementation dependent.
If a call to ReleaseSemaphore on a Windows Semaphore object would result in the maximum count being exceeded, ReleaseSemaphore returns FALSE. It will not throw an exception or cause a fatal runtime error.
Under Linux, a call to sem_post that would exceed the maximum count returns -1, and errno is set to EOVERFLOW. Again, this will not be fatal to your application.
Under .NET, a call to Release that would exceed the semaphore's maximum value results in SemaphoreFullException being thrown.
It's a logic error to release a semaphore more often than it's acquired. If your program does that, you have a latent bug. It might be okay in your particular situation, but if you try this with anything other than a binary semaphore, you're likely to end up with some very difficult to find bugs.
I would strongly recommend that you check the return value when you release the semaphore, and treat a failure as a fatal exception.
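To make "check the return value when you release" concrete, here is a minimal sketch of a binary semaphore built from a mutex and a condition variable whose release() reports an unmatched signal, similar in spirit to ReleaseSemaphore returning FALSE. The class CheckedBinarySemaphore is hypothetical and not an OS or standard library API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical binary semaphore that reports a release without a matching acquire.
class CheckedBinarySemaphore {
public:
    explicit CheckedBinarySemaphore(bool available = false)
        : available_(available) {}

    // Blocks until the semaphore is signaled, then takes it.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return available_; });
        available_ = false;
    }

    // Returns false if the semaphore was already signaled, i.e. the count
    // would exceed its maximum of one.
    bool release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (available_) return false;
            available_ = true;
        }
        cv_.notify_one();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool available_;
};
```

A caller that treats a false return as a fatal logic error will notice the extra signals from the other thread right away instead of silently losing them.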
A binary semaphore is called a Mutex.
Nothing will happen, the mutex can either be taken or not taken, those are the only two states. Releasing a non acquired mutex has no negative effects.
However, take into account that the logic may be affected. It doesn't seem like you're controlling too well when and how you signal the mutex, which may result in you releasing a mutex that another thread has acquired.

Could std::mutex::lock throw even if everything looks "good"?

CPPReference does not say explicitly that the lock function of std::mutex won't throw if the lock won't result in a deadlock.
The pthread lock function only has a deadlock error. I don't know about the Windows implementation of threads. I also don't know whether there are other thread implementations used as a backend for std::thread/std::mutex.
So my question is: should I write my code as if, sometimes, for no special reason, a lock may fail?
I actually need to lock a mutex in some noexcept methods, and I want to be sure that they are noexcept.
The std::mutex::lock() member function is not declared as noexcept, and section 30.4.1.2 Mutex types of the C++11 standard (draft n3337), clause 6, says:
The expression m.lock() shall be well-formed and have the following semantics:
...
Throws: system_error when an exception is required (30.2.2).
Error conditions:
operation_not_permitted — if the thread does not have the privilege to perform the operation.
resource_deadlock_would_occur — if the implementation detects that a deadlock would occur.
device_or_resource_busy — if the mutex is already locked and blocking is not possible.
This implies that any function that uses mutex::lock() cannot be marked noexcept, unless that function is capable of handling the exception itself and prevents it from propagating to the caller.
I am unable to comment on the likelihood of these error conditions occurring, but in relation to std::mutex, resource_deadlock_would_occur (which might be thrown) indicates a bug in the code as opposed to a runtime failure, as this error might be raised if a thread attempts to lock a std::mutex it already owns. From section 30.4.1.2.1 Class mutex, clause 4:
[ Note: A program may deadlock if the thread that owns a mutex object calls lock() on that object. If the implementation can detect the deadlock, a resource_deadlock_would_occur error condition may be observed. —end note ]
By selecting std::mutex as the lock type the programmer is explicitly stating that an attempt by the same thread to lock a mutex it already has locked is not possible.
If it is a legal path of execution for a thread to re-lock a mutex, then a std::recursive_mutex is the more appropriate choice (but changing to a std::recursive_mutex does not mean the lock() function is exception free).
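As a minimal sketch of the re-locking case, the hypothetical Account class below calls one locking member function from another on the same thread. With std::mutex that would be a bug; with std::recursive_mutex it is allowed.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical class whose member functions lock the same mutex re-entrantly.
class Account {
public:
    void deposit(int amount) {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        balance_ += amount;
    }

    void deposit_twice(int amount) {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        // Each call re-locks the mutex this thread already owns.
        deposit(amount);
        deposit(amount);
    }

    int balance() const {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        return balance_;
    }

private:
    mutable std::recursive_mutex mutex_;
    int balance_ = 0;
};
```

lock() on a std::recursive_mutex can still throw in principle, for example if the implementation's recursion limit is exceeded, so this changes which programs are correct, not the exception specification.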
On a POSIX system, std::mutex will probably be implemented using POSIX mutexes, and std::mutex::lock() will eventually delegate to pthread_mutex_lock(). Although C++ mutexes are not required to be implemented using POSIX mutexes, the authors of the C++ standard's multi-threading facilities seem to have modelled the possible error conditions on the POSIX error conditions, so examining those can be instructive. As user hmjd says, the C++ error conditions permitted for the lock method are operation_not_permitted, resource_deadlock_would_occur and device_or_resource_busy.
The POSIX error conditions are:
EINVAL: if a POSIX-specific lock-priority feature is misused, which can never happen if you use only the standard C++ multi-threading facilities. This case might correspond to the operation_not_permitted C++ error code.
EINVAL: if the mutex has not been initialized, which would correspond to a corrupted std::mutex object, use of a dangling reference, or some other undefined behaviour that indicates a program bug.
EAGAIN: if the mutex is recursive and the recursion is too deep. That can't happen to a std::mutex, but could happen for a std::recursive_mutex. This would seem to correspond to the device_or_resource_busy error condition.
EDEADLK: if deadlock would occur because the thread already holds the lock. This would correspond to the resource_deadlock_would_occur C++ error code, but would indicate a program bug, because a program should not attempt to lock a std::mutex it already holds a lock on (use a std::recursive_mutex if you really want to do that).
The C++ operation_not_permitted error code is evidently intended to correspond to the POSIX EPERM error status. The pthread_mutex_lock() function never gives this status code. But the POSIX manual page that describes that function also describes the pthread_mutex_unlock() function, which may give EPERM if you try to unlock a lock you have not locked. Perhaps the C++ standards authors included operation_not_permitted by a mistaken reading of the POSIX manual page. As C++ has no concept of lock "permissions", it is hard to see how any correctly constructed and manipulated lock (used in accordance with the C++ standard, without invoking any undefined behaviour) could result in EPERM and thus operation_not_permitted.
device_or_resource_busy is not permitted from C++17, which suggests it never really happens in practice, and its inclusion for C++11 was an error.
To summarize, the only cases in which std::mutex::lock() could throw an exception indicate program bugs. So it can be reasonable to assume the method "never" throws an exception.
It's safe to assume that the mutex won't throw if you can guarantee that none of the error conditions (as outlined in hmjd's answer) are present. How to put that call into a noexcept function depends on how you want to handle a (pretty much impossible) failure. If the default behavior of noexcept (calling std::terminate) is acceptable, you don't need to do anything. If you want to log the impossible error, wrap the call in a try/catch block.
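Both options can be sketched roughly as follows. The Counter class and its member names are made up for illustration, and the logging is reduced to a message on stderr.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <system_error>

// Hypothetical class that locks a std::mutex inside noexcept member functions.
class Counter {
public:
    // Option 1: rely on noexcept. If lock() ever throws, the exception
    // cannot leave the function and std::terminate is called.
    void increment() noexcept {
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
    }

    // Option 2: log the "impossible" error before giving up.
    void increment_logged() noexcept {
        try {
            std::lock_guard<std::mutex> lock(mutex_);
            ++value_;
        } catch (const std::system_error& e) {
            std::fprintf(stderr, "mutex lock failed: %s\n", e.what());
            std::abort();
        }
    }

    int value() const noexcept {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
};
```

Either way the caller sees a function that never throws. The only difference is how much diagnostic output you get if the program turns out to have a bug.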

Why does pthread_exit() in rare cases cause a SEGV when called after pthread_detach()?

I am getting a SEGV in C++ that I cannot easily reproduce (it occurs in about one in 100,000 test runs) in my call to pthread_join() as my application is shutting down. I checked the value of errno and it is zero. This is running on CentOS v4.
Under what conditions would pthread_join() get a SEGV? This might be some kind of race condition since it is extremely rare. One person suggests I should not be calling pthread_detach() and pthread_exit(), but I am not clear on why.
My first working hypothesis was that pthread_join() is being called while pthread_exit() is still running in the other thread and that this somehow leads to a SEGV, but many have stated this is not an issue.
The failing code getting SEGV in the main thread during application exit looks roughly like this (with error return code checking omitted for brevity):
// During application startup, this function is called to create the child thread:
return_val = pthread_create(&_threadId, &attr,
(void *(*)(void *))initialize,
(void *)this);
// Apparently this next line is the issue:
return_val = pthread_detach(_threadId);
// Later during exit the following code is executed in the main thread:
// This main thread waits for the child thread exit request to finish:
// Release condition so child thread will exit:
releaseCond(mtx(), startCond(), &startCount);
// Wait until the child thread is done exiting so we don't delete memory it is
// using while it is shutting down.
waitOnCond(mtx(), endCond(), &endCount, 0);
// The above wait completes at the point that the child thread is about
// to call pthread_exit().
// It is unspecified whether a thread that has exited but remains unjoined
// counts against {PTHREAD_THREADS_MAX}, hence we must do pthread_join() to
// avoid possibly leaking the threads we destroy.
pthread_join(_threadId, NULL); // SEGV in here!!!
The child thread which is being joined on exit runs the following code which begins at the point above where releaseCond() is called in the main thread:
// Wait for main thread to tell us to exit:
waitOnCond(mtx(), startCond(), &startCount);
// Tell the main thread we are done so it will do pthread_join():
releaseCond(mtx(), endCond(), &endCount);
// At this point the main thread could call pthread_join() while we
// call pthread_exit().
pthread_exit(NULL);
The thread appeared to come up properly and no error codes were produced during its creation during application startup and the thread performed its task correctly which took around five seconds before the application exited.
What might cause this rare SEGV to occur, and how might I program defensively against it? One claim is that my call to pthread_detach() is the issue; if so, how should my code be corrected?
Assuming:
pthread_create returns zero (you are checking it, right?)
attr is a valid pthread_attr_t object (How are you creating it? Why not just pass NULL instead?)
attr does not specify that the thread is to be created detached
You did not call pthread_detach or pthread_join on the thread somewhere else
...then it is "impossible" for pthread_join to fail, and you either have some other memory corruption or a bug in your runtime.
[update]
The RATIONALE section for pthread_detach says:
The pthread_join() or pthread_detach() functions should eventually be
called for every thread that is created so that storage associated
with the thread may be reclaimed.
Although it does not say these are mutually exclusive, the pthread_join documentation specifies:
The behavior is undefined if the value specified by the thread
argument to pthread_join() does not refer to a joinable thread.
I am having trouble finding the exact wording that says a detached thread is not joinable, but I am pretty sure it is true.
So, either call pthread_join or pthread_detach, but not both.
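A minimal sketch of the join-only variant, assuming default (joinable) thread attributes. The function names double_value and run_and_join are made up for illustration and are not taken from the question's code.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical start routine: doubles the int it is given.
extern "C" void* double_value(void* arg) {
    int* value = static_cast<int*>(arg);
    *value *= 2;
    return nullptr;  // returning from the start routine exits the thread
}

// Creates a joinable thread and joins it exactly once.
// Returns 0 on success, otherwise the error code of the failing call.
int run_and_join(int& value) {
    pthread_t tid;
    // Null attributes request the defaults, which create a joinable thread.
    int rc = pthread_create(&tid, nullptr, double_value, &value);
    if (rc != 0) return rc;
    // No pthread_detach() here: the thread is reclaimed by the join below.
    return pthread_join(tid, nullptr);
}
```

If the thread should clean up after itself instead, call pthread_detach() and drop the pthread_join(), using some other mechanism such as the condition variables in the question to know when it has finished.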
If you read the standards documentation for pthread_join and pthread_exit and related pages, the join suspends execution "until the target thread terminates", and the thread calling pthread_exit doesn't terminate until it's done calling pthread_exit, so what you're worried about can't be the problem.
You may have corrupted memory somewhere (as Nemo suggests), or called pthread_exit from a cleanup handler (as user315052 suggests), or something else. But it's not "a race condition between pthread_join() and pthread_exit()", unless you're on a buggy or non-compliant implementation.
There is insufficient information to fully diagnose your problem. I concur with the other posted answers that the problem is more likely undefined behavior in your code than a race condition between pthread_join and pthread_exit. But I would also agree the existence of such a race would constitute a bug in the pthread library implementation.
Regarding pthread_join:
return_val = pthread_create(&_threadId, &attr,
(void *(*)(void *))initialize,
(void *)this);
//...
pthread_join(_threadId, NULL); // SEGV in here!!!
It looks like the join is in a class. This opens up the possibility that the object could be deleted while main is trying to do the join. If pthread_join is accessing freed memory, the result is undefined behavior. I am leaning towards this possibility, since accessing freed memory is very often undetected.
Regarding pthread_exit: The man page on Linux, and the POSIX spec state:
An implicit call to pthread_exit() is made when a thread other than the thread in which main() was first invoked returns from the start routine that was used to create it. The function's return value shall serve as the thread's exit status.
The behavior of pthread_exit() is undefined if called from a cancellation cleanup handler or destructor function that was invoked as a result of either an implicit or explicit call to pthread_exit().
If the pthread_exit call is made in a cleanup handler, you will have undefined behavior.