SetThreadAffinityMask of pooled thread - c++

I am wondering whether it is possible to set the processor affinity of a thread obtained from a thread pool. More specifically, the thread is obtained through the TimerQueue API, which I use to implement periodic tasks.
As a side note: I found TimerQueues the easiest way to implement periodic tasks, but since these are usually long-lived tasks, might it be more appropriate to use dedicated threads for this purpose? Furthermore, I anticipate that synchronization primitives such as semaphores and mutexes will be needed to synchronize the various periodic tasks. Are pooled threads suitable for these?
Thanks!
EDIT1: As Leo has pointed out, the above question is actually two only loosely related ones. The first is about the processor affinity of pooled threads. The second is whether pooled threads obtained from the TimerQueue API behave just like manually created threads when it comes to synchronization objects. I will move this second question to a separate topic.

If you do this, make sure you return things to how they were every time you release a thread back to the pool. You don't own those threads, and other code that uses them may have other requirements or assumptions.
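One way to make the "put it back" step hard to forget is a small RAII guard. This is only a sketch: ScopedThreadAffinity and periodic_task are made-up names, the mask value 1 (logical CPU 0) is an arbitrary choice, and the non-Windows branch is a stand-in that pins nothing so the snippet compiles elsewhere. It relies on SetThreadAffinityMask returning the previous mask (or 0 on failure).

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>

// Pins the calling thread for the lifetime of the guard, then restores the
// mask the thread had before. Hypothetical helper, not part of the Windows API.
class ScopedThreadAffinity {
public:
    explicit ScopedThreadAffinity(DWORD_PTR mask)
        // SetThreadAffinityMask returns the previous mask, or 0 on failure.
        : previous_(SetThreadAffinityMask(GetCurrentThread(), mask)) {
        assert(mask != 0);  // an empty affinity mask is never valid
    }
    ~ScopedThreadAffinity() {
        if (previous_ != 0) SetThreadAffinityMask(GetCurrentThread(), previous_);
    }
    ScopedThreadAffinity(const ScopedThreadAffinity&) = delete;
    ScopedThreadAffinity& operator=(const ScopedThreadAffinity&) = delete;
    bool pinned() const { return previous_ != 0; }

private:
    DWORD_PTR previous_;
};
#else
// Stand-in for other platforms so the sketch compiles; it changes nothing.
class ScopedThreadAffinity {
public:
    explicit ScopedThreadAffinity(unsigned long long mask) {
        assert(mask != 0);
        (void)mask;
    }
    ScopedThreadAffinity(const ScopedThreadAffinity&) = delete;
    ScopedThreadAffinity& operator=(const ScopedThreadAffinity&) = delete;
    bool pinned() const { return false; }
};
#endif

// Hypothetical body of a periodic task that runs on a pooled thread.
int periodic_task(int input) {
    ScopedThreadAffinity pin(1);  // bit 0 set: logical CPU 0 only
    int result = input * 2;       // placeholder for the real work
    return result;                // the guard restores the old mask here
}
```

Because the guard lives on the callback's stack, the pooled thread gets its original mask back on every exit path, including early returns.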
Are you sure you actually need to do this, though? It's very, very rare to need to set processor affinity. (I don't think I've ever needed to do it in anything I've written.)
Thread affinity can mean two quite different things. (Thanks to bk1e's comment to my original answer for pointing this out. I hadn't realised myself.)
1. What I would call processor affinity: Where a thread needs to be run consistently on the same processor. This is what SetThreadAffinityMask deals with and it's very rare for code to care about it. (Usually it's due to very low-level issues like CPU caching in high performance code. Usually the OS will do its best to keep threads on the same CPU and it's usually counterproductive to force it to do otherwise.)
2. What I would call thread affinity: Where objects use thread-local storage (or some other state tied to the thread they're accessed from) and will go wrong if a sequence of actions is not done on the same thread.
From your question it sounds like you may be confusing #1 with #2. The thread itself will not change while your callback is running. While a thread is running it may jump between CPUs but that is normal and not something you have to worry about (except in very special cases).
Mutexes, semaphores, etc. do not care if a thread jumps between CPUs.
If your callback is executed by the thread pool multiple times, there is (depending on how the pool is used) usually no guarantee that the same thread will be used each time. i.e. Your callback may jump between threads, but not while it is in the middle of running; it may only change threads each time it runs again.
Some synchronization objects will care if your callback code runs on one thread and then, still thinking it is holding locks on those objects, runs again on a different thread. (The first thread will still hold the locks, not the second one, although it depends which kind of synchronization object you use. Some don't care.) That isn't #1, though; that's #2, and not something you'd use SetThreadAffinityMask to deal with.
As an example, Mutexes (CreateMutex) are owned by a thread. If you acquire a mutex on Thread A then any other thread which tries to acquire the mutex will block until you release the mutex on Thread A. (It is also an error for a thread to release a mutex it does not own.) So if your callback acquired a mutex, then exited, then ran again on another thread and released the mutex from there, it would be wrong.
On the other hand, an Event (CreateEvent) does not care which threads create, signal or destroy it. You can signal an event on one thread and then reset it on another and that's fine (normal, in fact).
It'd also be rare to hold a synchronization object between two separate runs of your callback (that would invite deadlocks, although there are certainly situations where you could legitimately want/do such a thing). However, if you created (for example) an apartment-threaded COM object then that would be something you would want to only access from one specific thread.

You shouldn't. You're only supposed to use that thread for the job at hand, on the processor it's running on at that point. Apart from the obvious inefficiency, the threadpool might destroy every thread as soon as you're done, and create a new one for your next job. The affinity masks wouldn't disappear that soon in practice, but it's even harder to debug if they disappear at random.

Related

Fire-and-forget std::thread objects cleaning themselves up

When implementing a service, etc, there may be a need for fire-and-forget functionality where one creates a thread and leaves it to its own devices. However, one needs to keep the std::thread object around somewhere to prevent it going out of scope, but when the thread completes there is no neat delete this support, and even if there was, that's going to be a problem for non-pointer allocations. Similarly, higher-level libraries may have Timer objects, where a one-shot timer may be fired off but needs to be cleaned up when done.
One can perhaps keep a collection of std::thread and Timer objects, and every so often go through the list and delete the finished objects, but that seems bothersome. Is there some useful idiom for managing these kinds of temporaries?
My immediate solution has been to use a combination of std::mutex and std::atomic to get my service to return BUSY so I only ever have one thread around, but that feels like a Code Smell
This is what std::thread::detach() is for -- if you want the thread to be "fire and forget", then you call detach on it after creating it. This causes the std::thread object to no longer refer to the actual thread of execution, so you can destroy the std::thread and it has no effect on the execution.
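A minimal sketch of the detach approach. The function name fire_and_forget and the doubling "work" are placeholders, and the future exists only so the effect can be observed; real fire-and-forget code would not wait for anything.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Starts a detached worker and hands back a future for its result.
std::future<int> fire_and_forget(int input) {
    std::promise<int> done;
    std::future<int> result = done.get_future();
    assert(result.valid());
    // The promise is moved into the thread, so the worker owns everything it
    // touches and nothing dangles when this function returns.
    std::thread worker([p = std::move(done), input]() mutable {
        p.set_value(input * 2);  // placeholder for the real work
    });
    worker.detach();  // the std::thread object can now be destroyed freely
    return result;
}
```

The thing to watch with detached threads is lifetime: the worker must not capture locals of the launching function by reference, since that function may return long before the thread finishes.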
If you have a lot of fire-and-forget work, consider using a thread pool. Creating and destroying threads has overhead, and many of them can eat your RAM at run time. A thread pool starts X threads ahead of time and runs every fire-and-forget task on them for you.
A simple thread pool solved the problem for me.
std::mutex isn't for managing threads' resources; it's for synchronizing threads.
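For illustration, here is roughly what such a pool can look like. This is a sketch and not the pool the answer refers to: SimplePool, post and run_jobs are invented names, and the destructor policy (finish all queued jobs, then join) is just one possible choice.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class SimplePool {
public:
    explicit SimplePool(unsigned thread_count) {
        for (unsigned i = 0; i < thread_count; ++i)
            workers_.emplace_back([this] { run(); });
    }
    // Lets the workers drain the queue, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_) worker.join();
    }
    // Queue a job; the caller does not keep any handle to it.
    void post(std::function<void()> job) {
        assert(job);  // an empty std::function would throw when invoked
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Posts `jobs` trivial tasks and reports how many actually ran.
int run_jobs(unsigned threads, int jobs) {
    std::atomic<int> finished{0};
    {
        SimplePool pool(threads);
        for (int i = 0; i < jobs; ++i) pool.post([&finished] { ++finished; });
    }  // the pool's destructor waits for every queued job
    return finished.load();
}
```

The caller never has to track or clean up individual std::thread objects; the pool owns a fixed set of threads for its whole lifetime.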

Do mutexes guarantee ordering of acquisition? Unlocking thread takes it again while others are still waiting

A coworker had an issue recently that boiled down to what we believe was the following sequence of events in a C++ application with two threads:
Thread A holds a mutex.
While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
It appears that thread A is given the mutex again; thread B is still waiting, even though it "asked" for the lock first.
Does this sequence of events fit with the semantics of, say, C++11's std::mutex and/or pthreads? I can honestly say I've never thought about this aspect of mutexes before.
Are there any fairness guarantees to prevent starvation of other threads for too long, or any way to get such guarantees?
Known problem. C++ mutexes are a thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair. They do not care about FIFO order.
The other side of the same coin is that threads are usually not pre-empted until they run out of their time slice. As a result, thread A in this scenario was likely to continue to be executed, and got the mutex right away because of that.
The guarantee of a std::mutex is to enable exclusive access to shared resources. Its sole purpose is to eliminate the race condition when multiple threads attempt to access shared resources.
The implementer of a mutex may choose to favor the current thread acquiring a mutex (over another thread) for performance reasons. Allowing the current thread to acquire the mutex and make forward progress without requiring a context switch is often a preferred implementation choice supported by profiling/measurements.
Alternatively, the mutex could be constructed to prefer another (blocked) thread for acquisition (perhaps chosen according to FIFO order). This likely requires a thread context switch (on the same or another processor core), increasing latency/overhead. NOTE: FIFO mutexes can behave in surprising ways. E.g. thread priorities must be considered in FIFO support, so acquisition won't be strictly FIFO unless all competing threads have the same priority.
Adding a FIFO requirement to a mutex's definition constrains implementers to provide suboptimal performance in nominal workloads. (see above)
Protecting a queue of callable objects (std::function) with a mutex would enable sequenced execution. Multiple threads can acquire the mutex, enqueue a callable object, and release the mutex. The callable objects can be executed by a single thread (or a pool of threads if synchrony is not required).
•Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
•Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
In the real world, while the program is running, no threading library or OS provides any such guarantee. Here "shortly thereafter" may mean a lot to the OS and the hardware. If you mean 2 minutes, then thread B would definitely get it. If you mean 200 ms or less, there is no promise about whether A or B gets it.
The number of cores, the load on different processors/cores/threading units, contention, thread switching, kernel/user switches, pre-emption, priorities, deadlock detection schemes, et al. will make a lot of difference. Seeing a green signal from far away does not guarantee that it will still be green when you reach it.
If you want thread B to get the resource, you may use an IPC mechanism to instruct thread B to acquire it.
You are inadvertently suggesting that threads should synchronise access to the synchronisation primitive. Mutexes are, as the name suggests, about Mutual Exclusion. They are not designed for control flow. If you want to signal a thread to run from another thread you need to use a synchronisation primitive designed for control flow i.e. a signal.
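As a sketch of what "a primitive designed for control flow" can look like in standard C++, the following uses a condition variable plus a turn flag so that two threads strictly alternate. The names take_turns and a_turn are invented for the example; the mutex only protects the shared state, while the flag decides who runs next.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Two threads append their name to a log, strictly alternating A, B, A, B...
std::string take_turns(int rounds) {
    assert(rounds >= 0);
    std::mutex mutex;
    std::condition_variable turn_changed;
    bool a_turn = true;  // control-flow state, guarded by `mutex`
    std::string log;

    auto worker = [&](char name, bool runs_on_a_turn) {
        for (int i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lock(mutex);
            // Sleep until it is explicitly this thread's turn.
            turn_changed.wait(lock, [&] { return a_turn == runs_on_a_turn; });
            log += name;
            a_turn = !runs_on_a_turn;  // hand the turn to the other thread
            lock.unlock();
            turn_changed.notify_all();
        }
    };

    std::thread a(worker, 'A', true);
    std::thread b(worker, 'B', false);
    a.join();
    b.join();
    return log;
}
```

Here the ordering comes from the flag, not from who happens to win the mutex, so it does not matter whether the mutex itself is fair.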
You can use a fair mutex to solve your task, i.e. a mutex that will guarantee the FIFO order of your operations. Unfortunately, the C++ standard library doesn't have a fair mutex.
Thankfully, there are open-source implementations, for example yamc (a header-only library).
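To show the idea behind a fair mutex, here is a ticket-lock sketch built from std::mutex and std::condition_variable. It is not yamc's implementation; TicketMutex and contended_count are invented names, and it ignores thread priorities entirely. Each caller takes a ticket and is served strictly in ticket order.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// FIFO lock: threads acquire it in the order in which they called lock().
class TicketMutex {
public:
    void lock() {
        std::unique_lock<std::mutex> guard(state_);
        const unsigned long my_ticket = next_ticket_++;
        served_.wait(guard, [&] { return now_serving_ == my_ticket; });
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> guard(state_);
            ++now_serving_;  // admit the next ticket holder
        }
        served_.notify_all();  // all waiters wake; only one ticket matches
    }

private:
    std::mutex state_;
    std::condition_variable served_;
    unsigned long next_ticket_ = 0;
    unsigned long now_serving_ = 0;
};

// Several threads increment one counter under the ticket lock.
long contended_count(int threads, int iterations) {
    assert(threads > 0);
    TicketMutex mutex;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<TicketMutex> lock(mutex);  // needs lock()/unlock() only
                ++counter;
            }
        });
    }
    for (std::thread& worker : workers) worker.join();
    return counter;
}
```

With this lock, a thread that releases and immediately re-locks goes to the back of the line behind anyone already waiting. The price is the extra wake-ups and context switches that the other answers describe.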
The logic here is very simple - the thread is not preempted based on mutexes, because that would require a cost incurred for each mutex operation, which is definitely not what you want. The cost of grabbing a mutex is high enough without forcing the scheduler to look for other threads to run.
If you want to fix this you can always yield the current thread. You can use std::this_thread::yield() - http://en.cppreference.com/w/cpp/thread/yield - and that might offer the chance to thread B to take over the mutex. But before you do that, allow me to tell you that this is a very fragile way of doing things, and offers no guarantee. You could, alternatively, investigate the issue deeper:
Why is it a problem that the B thread is not started when A releases the resource? Your code should not depend on such logic.
Consider using alternative thread synchronization objects like barriers (boost::barrier or http://linux.die.net/man/3/pthread_barrier_wait ) instead, if you really need this sort of logic.
Investigate whether you really need to release the mutex from A at that point. I find rapidly locking and releasing a mutex more than once a code smell; it usually hurts performance terribly. See if you can group the extraction of data into immutable structures which you can then play around with.
Ambitious, but try to work without mutexes: use lock-free structures and a more functional approach instead, including a lot of immutable structures. I have often found quite a performance gain from updating my code to not use mutexes (while still working correctly from the multithreading point of view).
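The "group extraction of data" suggestion can be sketched like this: take the lock once, copy out everything you need, and do the rest of the work on the private copy. Readings and sum_of_snapshot are invented names for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

class Readings {
public:
    void add(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        values_.push_back(value);
    }
    // A single lock acquisition copies out everything the caller needs.
    std::vector<int> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> values_;
};

long sum_of_snapshot(const Readings& readings) {
    // The lock is held only while copying, not during the computation.
    const std::vector<int> copy = readings.snapshot();
    long total = 0;
    for (int value : copy) total += value;  // private data, no lock needed
    return total;
}
```

Whether the copy is cheaper than repeated locking depends on how much data there is, so this is worth measuring rather than assuming.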
How do you know this:
While thread A is holding the mutex, thread B attempts to lock it.
Since it is held, thread B is suspended.
How do you know thread B is suspended? How do you know that it has not just finished the line of code before trying to grab the lock, but not yet grabbed the lock:
Thread B:
x = 17; // is the thread here?
// or here? ('between' lines of code)
mtx.lock(); // or suspended in here?
// how can you tell?
You can't tell. At least not in theory.
Thus the order of acquiring the lock is, to the abstract machine (ie the language), not definable.

Reduce Context Switches Between Threads With Same Priority

I am writing an application that uses a third-party library to perform heavy computations.
This library implements parallelism internally and spawns a given number of threads. I want to run several (dynamic count) instances of this library and therefore end up quite heavily oversubscribing the CPU.
Is there any way I can increase the "time quantum" of all the threads in a process so that e.g. all the threads with normal priority rarely context switch (yield) unless they are explicitly yielded through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the cpu. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this is to perform thread scheduling manually.
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop which resumes/suspends the threads every e.g. 40 ms and makes sure no more threads than the current CPU count are running.
Any major drawbacks with this approach? Not sure what the overhead of resume/suspending a thread is?
There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those. The scheduler only performs an unforced context switch when that has a benefit. The performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and related other tracking information.
More threads generally means more contention (for example, when memory is allocated) which can mean more forced context switches where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.
Any major drawbacks with this approach? Not sure what the overhead of resume/suspending a thread is?
Yes. Suspending and resuming threads from user-mode code is a very dangerous activity, so it should (almost) never be used. Moreover, we should not use these mechanisms to achieve something that any modern scheduler already does for us. This is also mentioned in the other answer to this question.
The above applies to any operating system, but from the question's tags it appears to be about Microsoft Windows. If we read the MSDN documentation for SuspendThread(), we find the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider a scenario in which a thread has acquired some resource (perhaps implicitly, i.e. not in your own code but inside a library or in kernel mode). If we suspend that thread, the result is a mysterious deadlock, because other threads of the process will be waiting for that resource. Since we can never be sure, at any point in our program, which resources a running thread has acquired, suspending and resuming threads is not a good idea.

Design and Technical issue in Multi Threaded Application

I wanted to discuss the design and technical issues/challenges related to multithreaded applications.
Issues I faced:
1. I came across a situation where multiple threads using a shared function/variable crashed the application, so a proper guard is required in that case.
2. State machines and multithreading.
There are several points one should remember before delving into a multithreaded application.
There can be issues related to 1. memory 2. handles 3. sockets, etc.
Please share your experience on the following points:
What are the common mistakes one makes in a multithreaded application?
Any specific issues related to multithreading?
Should we pass data by value or by reference in the thread function?
Well, there are so many...
1) Shared functions/procedures - they are just code and, unless the code modifies itself, there can be no problem. Local variables are no problem because each thread runs on a separate stack, (almost by definition:). Any other data can be an issue and may need protection. 99.99% of all household API calls on a multitasking OS are thread-safe, again, almost by definition. Another poster has already warned about thread-local storage...
2) State machines. Can be a little awkward. You can easily lock all the events firing into the SM, so ensuring the integrity of the state, but you must not make blocking calls from inside the SM while it is locked, (might seem obvious, but I have done this.. once :).
I occasionally run state-machines from one thread only, queueing event objects to it. This moves the locking to the input queue and means that the SM is somewhat easier to debug. It also means that the thread running the SM can implement timeouts on an internal delta queue and so itself fire timeout calls to the objects on the delta queue, (classic example: TCP server sockets with connection timeouts - thousands of socket objects that each need an independent timeout).
3) 'Should we pass data by value or by reference in the thread function.'. Not sure what you mean, here. Most OSes allow one pointer to be passed on thread creation - do with it what you will. You could pass it an event it should signal on work completion or a queue object upon which it is to wait for work requests. After creation, you need some form of inter-thread comms to send requests and get results, (unless you are going to use the direct 'read/write/waitForExit' mechanism - AV/deadlock/noClose generator).
I usually use a simple semaphore/CS producer-consumer queue to send/receive comms objects between worker threads, and the PostMessage API to send them to a UI thread. Apart from the locking in the queue, I don't often need any more locking. You have to try quite hard to deadlock a threaded system based on message-passing and things like thread pools become trivial - just make [no. of CPU] threads and pass each one the same queue to wait on.
Common mistakes. See the other posters for many, to which I would add:
a) Reading/writing directly to thread fields to pass parameters and return results, (esp. between UI threads and 'worker' threads), ie 'Create thread suspended, load parameters into thread fields, resume thread, wait on thread handle for exit, read results from thread fields, free thread object'. This causes a performance hit from continually creating/terminating/destroying threads and often forces the developer to ensure that threads are terminated when exiting an app to prevent AV/216/217 exceptions on close. This can be very tricky, in some cases impossible because a few APIs block with no way of unblocking them. If developers would stop this nasty practice, there would be far fewer app close problems.
b) Trying to build multithreaded apps in a procedural fashion, eg. trying to wait for results from a work thread in a UI event handler. Much safer to build a thread request object, load it with parameters, queue it to a work thread and exit the event handler. The thread can get the object, do work, put results back into the object and, (on Windows, anyway), PostMessage the object back. A UI message-handler can deal with the results and dispose of the object, (or recycle, reuse:). This approach means that, since the UI and worker are always operating on different data that can outlive them both, no locking is needed and, (usually), there is no need to ensure that the work thread is freed when closing the app, (problems with this are legendary).
Rgds,
Martin
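The request-object pattern from point b) can be sketched in portable C++ as follows. This is an illustration under assumptions: BlockingQueue, Request and square_all are invented names, a second queue stands in for PostMessage to the UI thread, and a null pointer is used as the shutdown message.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Producer-consumer queue; the only locking in the whole example lives here.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};

// The request carries its own parameters and its own result slot.
struct Request {
    int input = 0;
    int result = 0;
};
using RequestPtr = std::unique_ptr<Request>;

std::vector<int> square_all(const std::vector<int>& inputs) {
    BlockingQueue<RequestPtr> to_worker;
    BlockingQueue<RequestPtr> to_ui;  // stands in for PostMessage to the UI

    std::thread worker([&] {
        while (RequestPtr request = to_worker.pop()) {  // null = shut down
            request->result = request->input * request->input;
            to_ui.push(std::move(request));  // ownership travels with the message
        }
    });

    for (int value : inputs) {
        auto request = std::make_unique<Request>();
        request->input = value;
        to_worker.push(std::move(request));
    }
    to_worker.push(nullptr);

    std::vector<int> results;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        results.push_back(to_ui.pop()->result);
    worker.join();
    assert(results.size() == inputs.size());
    return results;
}
```

At any moment exactly one thread owns a given Request, which is why the request data itself needs no lock.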
The biggest issues people face in multithreaded applications are race conditions, deadlocks, and failing to use semaphores of some sort to protect globally accessible variables.
You are facing these problems when using thread locks.
Deadlock
Priority Inversion
Convoying
“Async-signal-safety”
Kill-tolerant availability
Preemption tolerance
Overall performance
If you want to look at more advanced threading techniques you can look at the lock free threading, where many threads work on the same problem in case they are waiting.
Deadlocks, memory corruption (of shared resources) due to lack of proper synchronization, buffer overflows (which can themselves be caused by memory corruption), and improper usage of thread-local storage are the most common problems.
It also depends on the platform and technology you're using to implement the threads. For example, in Microsoft Windows, if you use MFC, several MFC objects are not really shareable across threads because they rely heavily on thread-local storage (e.g. the CSocket and CWnd classes).

Why would I want to start a thread "suspended"?

The Windows and Solaris thread APIs both allow a thread to be created in a "suspended" state. The thread only actually starts when it is later "resumed". I'm used to POSIX threads which don't have this concept, and I'm struggling to understand the motivation for it. Can anyone suggest why it would be useful to create a "suspended" thread?
Here's a simple illustrative example. WinAPI allows me to do this:
t = CreateThread(NULL,0,func,NULL,CREATE_SUSPENDED,NULL);
// A. Thread not running, so do... something here?
ResumeThread(t);
// B. Thread running, so do something else.
The (simpler) POSIX equivalent appears to be:
// A. Thread not running, so do... something here?
pthread_create(&t,NULL,func,NULL);
// B. Thread running, so do something else.
Does anyone have any real-world examples where they've been able to do something at point A (between CreateThread & ResumeThread) which would have been difficult on POSIX?
To preallocate resources and later start the thread almost immediately.
You have a mechanism that reuses a thread (resumes it), but you don't actually have a thread to reuse and you must create one.
It can be useful to create a thread in a suspended state in many instances (I find) - you may wish to get the handle to the thread and set some of its properties before allowing it to start using the resources you're setting up for it.
Starting it suspended is much safer than starting it and then suspending it - you have no idea how far it's got or what it's doing.
Another example might be for when you want to use a thread pool - you create the necessary threads up front, suspended, and then when a request comes in, pick one of the threads, set the thread information for the task, and then set it as schedulable.
I dare say there are ways around not having CREATE_SUSPENDED, but it certainly has its uses.
There are some example of uses in 'Windows via C/C++' (Richter/Nasarre) if you want lots of detail!
There is an implicit race condition in CreateThread: you cannot obtain the thread ID until after the thread started running. It is entirely unpredictable when the call returns, for all you know the thread might have already completed. If the thread causes any interaction in the rest of that process that requires the TID then you've got a problem.
It is not an unsolvable problem if the API doesn't support starting the thread suspended, simply have the thread block on a mutex right away and release that mutex after the CreateThread call returns.
However, there's another use for CREATE_SUSPENDED in the Windows API that is very difficult to deal with if API support is lacking. The CreateProcess() call also accepts this flag; it suspends the startup thread of the process. The mechanism is identical: the process gets loaded and you'll get a PID, but no code runs until you release the startup thread. That's very useful. I've used this feature to set up a process guard that detects process failure and creates a minidump. The CREATE_SUSPENDED flag allowed me to detect and deal with initialization failures, which are normally very hard to troubleshoot.
You might want to start a thread with some other (usually lower) priority or with a specific affinity mask. If you spawn it as usual it can run with undesired priority/affinity for some time. So you start it suspended, change the parameters you want, then resume the thread.
The threads we use are able to exchange messages, and we have arbitrarily configurable priority-inherited message queues (described in the config file) that connect those threads. Until every queue has been constructed and connected to every thread, we cannot allow the threads to execute, since they will start sending messages off to nowhere and expect responses. Until every thread was constructed, we cannot construct the queues since they need to attach to something. So, no thread can be allowed to do work until the very last one was configured. We use boost.threads, and the first thing they do is wait on a boost::barrier.
I stumbled on a similar problem once upon a time. The reasons for a suspended initial state are covered in the other answers.
My solution with pthreads was to use a mutex and cond_wait, but I don't know whether it is a good solution or whether it covers all possible needs. Moreover, I don't know whether the thread can be considered suspended (at the time, I took "blocked" in the manual as a synonym, but likely it is not).
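For what it's worth, the mutex-plus-condition-variable approach can be sketched with the standard C++ equivalents of pthread_mutex_t and pthread_cond_wait. StartGate and start_gated are invented names, and the value 7 is arbitrary. Note that the worker here is blocked rather than suspended: it has started running, it just waits at the gate before doing any real work.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot gate: threads wait() until some other thread calls open().
class StartGate {
public:
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        opened_.wait(lock, [this] { return open_; });
    }
    void open() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            open_ = true;
        }
        opened_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable opened_;
    bool open_ = false;
};

// Emulates "create suspended, configure, resume" and returns what the
// worker saw once it was released.
int start_gated() {
    StartGate gate;
    int config = 0;
    int seen = -1;

    std::thread worker([&] {
        gate.wait();    // first thing the thread does: block at the gate
        seen = config;  // safe: config was written before open()
    });

    config = 7;   // "point A": set things up while the worker is parked
    gate.open();  // plays the role of ResumeThread
    worker.join();
    return seen;
}
```

This covers cases where the creator only needs to finish its own setup first. It cannot change properties that take effect before the thread's first instruction runs, which is where a real CREATE_SUSPENDED still has an edge.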