I looked at other SO questions/answers about this, but none of them resolved my issue.
My threads have the default scheduling policy SCHED_OTHER and their priority is 0, so they don't have a priority level. In other words, you can't change the priority using (sched_param) param.sched_priority.
So, this means using system call pthread_setschedparam is ruled out.
pthread_setschedprio(std::thread::native_handle(), -1) - this doesn't affect the thread's priority. I verified using getpriority (PRIO_PROCESS, tid). Can we use pthread_setschedprio() for default schedulers?
After carefully reading this page, I understood that I need to change the thread's dynamic priority by tweaking the nice value which can be achieved by either of these:
nice(19);
I tried this and it doesn't have any effect. I suspect it changes the nice value process-wide.
I ruled this out too.
setpriority(PRIO_PROCESS, id, 19)
It always returns -1 and errno is ESRCH (No such process). Why is this not working?
According to this, it can be used to change the priority of the thread.
syscall(SYS_sched_setattr, id, &attr, flags)
struct sched_attr attr;
memset(&attr, 0, sizeof(attr)); // zero the fields I don't set explicitly
attr.size = sizeof(attr);
attr.sched_policy = SCHED_OTHER;
attr.sched_nice = 6;
unsigned int flags = 0;
This also didn't work, no luck.
I want to lower the priority of these threads (they consume more CPU) in a process without changing the policy, so that other threads with the same priority can get CPU time.
Preliminary: context suggests that you're on Linux, so portions of this answer are Linux specific.
My threads have the default scheduling policy SCHED_OTHER and their priority is 0, so they don't have a priority level. In other words, you can't change the priority using (sched_param) param.sched_priority.
That seems to mischaracterize the situation a bit. Per sched(7),
For threads scheduled under one of the normal scheduling policies (SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in scheduling decisions (it must be specified as 0).
You appear to be interpreting that, or some alternative formulation of it, as speaking about the behavior of certain system interfaces, such as pthread_setschedprio(). Perhaps that comes from "it must be specified as 0", but the part to focus on is rather "not used in scheduling decisions". Threads scheduled according to the SCHED_OTHER policy effectively do not have a priority property at all. Their scheduling does not take any such property into account.
So, this means using system call pthread_setschedparam is ruled out.
Well, no, not exactly. With pthread_setschedparam() you can set both the policy and the priority. If you set the policy to one that considers thread priorities, then you can also meaningfully set the priority. Not that I'm recommending that.
pthread_setschedprio(std::thread::native_handle(), -1) - this doesn't affect the thread's priority.
It shouldn't if the thread's scheduling policy is SCHED_OTHER. Again, threads scheduled according to that policy do not actually have a priority, and according to the docs, the (unused) value of their priority property should be 0.
I verified using getpriority (PRIO_PROCESS, tid).
You should verify by observing that pthread_setschedprio() fails with EINVAL.
Can we use pthread_setschedprio() for default schedulers?
If by "default schedulers" you mean SCHED_OTHER then no. See above. If you include the realtime schedulers SCHED_FIFO and SCHED_RR then yes, for threads scheduled via those schedulers.
After carefully reading this page, I understood that I need to change the thread's dynamic priority by tweaking the nice value
Traditionally and per POSIX, individual threads do not have "nice" values. These are properties of processes. On systems where that is the case, you cannot use niceness to produce different scheduling behavior for different threads of the same process.
On Linux, however, individual threads do have their own niceness. I don't particularly recommend relying on that, but if you
do not require portability beyond Linux,
must provide preferential scheduling of some threads of one process over other threads of that process, and
must use the SCHED_OTHER scheduling policy
then it's probably your best option.
setpriority(PRIO_PROCESS, id, 19)
It always returns -1 and errno is ESRCH (No such process). Why is this not working?
The system is telling you that id is not a valid process ID. I speculate that it is a pthread_t obtained via std::thread::native_handle(), but note that although on Linux, setpriority(PRIO_PROCESS, ...) can set per-thread niceness, the pid_t identifier it expects is not drawn from the same space as pthreads thread identifiers. See gettid(2).
syscall(SYS_sched_setattr, id, &attr, flags)
If you're using the same id that elicits an ESRCH from setpriority(), then I would not expect the sched_setattr syscall to like it any better just because you invoke it directly.
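If you do take the Linux-specific per-thread niceness route described above, here is a minimal sketch (my own, not from the answer) of how the identifiers fit together: the thread obtains its own kernel thread ID via the gettid syscall and passes that, not a pthread_t, to setpriority().

#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Must be called on the thread whose niceness you want to change.
// SYS_gettid returns the kernel thread ID, which is what
// setpriority(PRIO_PROCESS, ...) expects on Linux (see gettid(2)).
void lower_current_thread_priority(int nice_value) {
    pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
    if (setpriority(PRIO_PROCESS, tid, nice_value) == -1) {
        std::perror("setpriority");
    }
}

For example, each CPU-heavy thread could call lower_current_thread_priority(19) at its start. (The syscall form is used because the glibc gettid() wrapper only appeared in relatively recent glibc versions.)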
I have a timer class which uses std::condition_variable::wait_until (I have also tried wait_for). I am using std::chrono::steady_clock to wait until a specific time in the future.
This is meant to be monotonic, but there has been a long-standing issue where the wait actually uses the system clock and fails to work correctly when the system time is changed.
It has been fixed in libstdc++ as described here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41861.
The issue is that this fix is still pretty new (~2019) and only available in GCC 10. I have some cross compilers that are only up to GCC ~8.
I am wondering if there is a way to get this fix into my versions of GCC (I have quite a few cross compilers)? But this might prove difficult to maintain if I have to rebuild the cross compilers each time I update them.
So a better question might be: what is a solution to this issue until I can get all my tools up to GCC 10? How can I make my timer resistant to system time changes?
Updated notes:
Rustyx mentions the version of glibc needed is 2.30+ (more for my reference: use ldd --version to check).
The libstdc++ change log showing the relevant entry for the fix, supplied by Daniel Langr: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/ChangeLog-2019#L2093
The patch with the fix (supplied by rustyx): https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=ad4d1d21ad5c515ba90355d13b14cbb74262edd2
Create a data structure that contains a list of condition variables each with a use count protected by a mutex.
When a thread is about to block on a condition variable, first acquire the mutex and add the condition variable to the list (or bump its use count if it's already on the list).
When done blocking on the condition variable, have the thread again acquire the mutex that protects the list and decrement the use count of the condition variable it was blocked on. Remove the condition variable from the list if its use count drops to zero.
Have a dedicated thread to watch the system clock. If it detects a clock jump, acquire the mutex that protects the list of condition variables and broadcast every condition variable.
That's it. That solves the problem.
If necessary, you can also add a boolean to each entry in the list and set it to false when the entry is added. If the clock watcher thread has broadcast the condition variable, have it set the bool to true so the woken threads will know why they were woken.
If you wish, you can just add the condition variable to the list when it's created and remove it from the list when it's destroyed. This will result in broadcasting condition variables no threads are blocked on if the clock jumps, but that's harmless.
Here are some implementation suggestions:
Use a dedicated thread to watch the clock. An easy thing to look at is the offset between wall time and the system's uptime clock.
One simple thing to do is to keep a count of the number of time jumps observed and increment it each time you sense a time jump. When you wait for a condition, you can use the following logic:
1. Note the number of time jumps.
2. Block on the condition.
3. When you wake up, recheck the condition.
4. If the condition isn't satisfied, check the number of time jumps.
5. If the counts from steps 1 and 4 don't match, handle it as a time-jump wakeup.
You can wrap this all up so that there's no ugliness in the calling code. It just becomes another possible return value from your version of wait_for.
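As a rough illustration of those suggestions (all names here are mine, not from the answer): the clock-watcher thread bumps an atomic jump counter before broadcasting the registered condition variables, and the wait wrapper compares the counter before and after the wait.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Incremented by the dedicated clock-watcher thread whenever it detects a
// jump between wall time and the uptime clock, just before it broadcasts
// every registered condition variable.
std::atomic<unsigned> time_jump_count{0};

enum class WaitResult { Satisfied, TimedOut, TimeJump };

template <class Pred>
WaitResult wait_for_with_jump_detection(std::condition_variable& cv,
                                        std::unique_lock<std::mutex>& lk,
                                        std::chrono::milliseconds timeout,
                                        Pred pred)
{
    const unsigned jumps_before = time_jump_count.load();            // step 1
    bool woke = cv.wait_for(lk, timeout, [&] {                       // steps 2-3
        return pred() || time_jump_count.load() != jumps_before;     // step 4
    });
    if (!woke)
        return WaitResult::TimedOut;
    if (pred())
        return WaitResult::Satisfied;
    return WaitResult::TimeJump;                                      // step 5
}

Callers then treat WaitResult::TimeJump as "recompute the deadline and wait again", which keeps the ugliness out of the calling code as suggested above.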
I am writing an application in C++14 that consists of a master thread and multiple slave threads. The master thread coordinates the slave threads, which cooperatively perform a search, each exploring a part of the search space. A slave thread sometimes encounters a bound on the search. It then communicates this bound to the master thread, which sends the bound to all other slave threads so that they can possibly narrow their searches.
A slave thread must very frequently check whether there is a new bound available, possibly at the entrance of a loop.
What would be the best way to communicate the bound to the slave threads? I can think of using std::atomic<int>, but I am afraid of the performance implications this has whenever the variable is read inside the loop.
The simplest way here is, IMO, to not overthink this. Just use a std::mutex for each thread, protecting a std::queue that holds the boundary information. Have the main thread wait on a std::condition_variable; each child locks the associated mutex, writes to a shared "new boundary" queue, then signals the cv, at which point the main thread wakes up and copies the value to each child's queue one at a time. As you said in your question, at the top of their loops the child threads can check their thread-specific queue to see if there are additional bounding conditions.
You actually don't NEED the "main thread" in this. You could have the children write to all other children's queues directly (still mutex-protected); as long as you're careful to avoid deadlock, it would work that way too.
All of these classes can be seen in the thread support library, with decent documentation here.
Yes, there are interrupt-based ways of doing things, but in this case polling is relatively cheap because it's not a lot of threads smashing on one mutex, but mostly thread-specific mutexes, and mutexes aren't all that expensive to lock, check, and unlock quickly. You're not "holding" on to them for long periods, so it's OK. It's a bit of a test, really: do you NEED the additional complexity of lock-free? If it's only a dozen (or fewer) threads, then probably not.
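A compact sketch of that per-thread-queue scheme (types, names, and the worker count are mine; thread startup and error handling omitted):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// One mutex-protected queue of bounds per worker thread.
struct BoundMailbox {
    std::mutex m;
    std::queue<int> bounds;
};

std::vector<BoundMailbox> mailboxes(/*num_workers=*/8);

// Shared "new boundary" queue that workers report into for the main thread.
std::mutex report_m;
std::condition_variable report_cv;
std::queue<int> reported_bounds;

// Called by a worker when it finds a new bound.
void report_bound(int bound) {
    {
        std::lock_guard<std::mutex> lk(report_m);
        reported_bounds.push(bound);
    }
    report_cv.notify_one();  // wake the main thread
}

// Main thread: wait for one report, then fan it out to every worker's mailbox.
void distribute_bounds_once() {
    int bound;
    {
        std::unique_lock<std::mutex> lk(report_m);
        report_cv.wait(lk, [] { return !reported_bounds.empty(); });
        bound = reported_bounds.front();
        reported_bounds.pop();
    }
    for (auto& box : mailboxes) {
        std::lock_guard<std::mutex> lk(box.m);
        box.bounds.push(bound);
    }
}

// Worker: cheap check at the top of its search loop.
bool try_get_new_bound(BoundMailbox& box, int& bound) {
    std::lock_guard<std::mutex> lk(box.m);
    if (box.bounds.empty()) return false;
    bound = box.bounds.front();
    box.bounds.pop();
    return true;
}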
Basically you could make a bet with your architecture that a single write to a primitive data type is atomic. As you only have one writer, your program would not break if you used the volatile keyword to prevent compiler optimizations that might otherwise keep updates only in local copies.
However, everybody serious about doing things right(tm) will tell you otherwise. Have a look at this article to get a pretty good risk assessment: http://preshing.com/20130618/atomic-vs-non-atomic-operations/
So if you want to be on the safe side, which I recommend, you need to follow the C++ standard. As the C++ standard does not guarantee any atomicity even for the simplest operations, you are stuck with using std::atomic. But honestly, I don't think it is too bad. Sure there is a lock involved, but you can balance out the reading frequency with the benefit of knowing the new boundary early.
To prevent polling the atomic variable, you could use the POSIX signal mechanism to notify slave threads of an update (make sure it works on the platform you are programming for). Whether that benefits performance remains to be seen.
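Coming back to the std::atomic suggestion, a minimal sketch (variable and function names are mine, not from the answer): with a single writer, a store on the master side and a relaxed load at the top of each slave's loop is all that is needed.

#include <atomic>
#include <climits>

// Single writer (the master thread) updates this; many slave threads read it.
std::atomic<int> best_bound{INT_MAX};

// Master thread, when a new bound should be published to the slaves:
void publish_bound(int b) {
    best_bound.store(b, std::memory_order_relaxed);
}

// Slave thread, at the top of its search loop:
void slave_iteration() {
    int bound = best_bound.load(std::memory_order_relaxed);  // cheap read
    // ... use 'bound' to prune this part of the search ...
    (void)bound;
}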
This is actually very simple. You only have to be aware of how things work to be confident the simple solution is not broken. So, what you need is two things:
1. Be sure the variable is written/read to/from memory every time you access it.
2. Be sure you read it in an atomic way, which means you have to read the full value in one go, or if it is not done naturally, have a cheap test to verify it.
To address #1, you have to declare it volatile. Make sure the volatile keyword is applied to the variable itself, not to its pointer or anything like that.
To address #2, it depends on the type. On x86/64, accesses to integer types are atomic as long as they are aligned to their size. That is, int32_t has to be aligned to a 4 byte boundary, and int64_t has to be aligned to an 8 byte boundary.
So you may have something like this:
struct Params {
volatile uint64_t bound __attribute__((aligned(8)));
};
If your bounds variable is more complex (a struct) but still fits in 64 bits, you may union it with uint64_t and use the same attribute and volatile as above.
If it's too big for 64 bits, you will need some sort of a lock to ensure you do not read a half-stale value. The best lock for your circumstances (single writer, multiple readers) is a sequence lock. A sequence lock is simply a volatile int, like above, that serves as the version of the data. Its value starts from 0 and advances by 2 on every update. You increment it by 1 before updating the protected value, and again afterwards. The net result is that even numbers are stable states and odd numbers are transient (value updating). In the readers you do this:
1. Read the version. If not changed - return
2. Read till you get an even number
3. Read the protected variable
4. Read the version again. If you get the same number as before - you're good
5. Otherwise - back to step 2
This is actually one of the topics in my next article. I'll implement that in C++ and let you know. Meanwhile, you can look at the seqlock in the Linux kernel.
Another word of caution: you need compiler barriers between your memory accesses so that the compiler does not reorder things it really should not. This is how you do it in gcc:
asm volatile ("":::"memory");
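Putting the pieces above together, here is a minimal single-writer seqlock sketch (my own names; it uses the gcc compiler barrier shown above, and on weakly-ordered CPUs you would additionally need hardware memory barriers):

#include <cstdint>

#define COMPILER_BARRIER() asm volatile ("" ::: "memory")

struct SeqLocked {
    volatile unsigned seq = 0;     // even = stable, odd = update in progress
    volatile uint64_t bound = 0;   // the protected value
};

// Single writer:
void write_bound(SeqLocked& s, uint64_t value) {
    s.seq = s.seq + 1;             // now odd: update in progress
    COMPILER_BARRIER();
    s.bound = value;
    COMPILER_BARRIER();
    s.seq = s.seq + 1;             // even again: stable
}

// Any number of readers:
uint64_t read_bound(const SeqLocked& s) {
    for (;;) {
        unsigned v1 = s.seq;
        if (v1 & 1) continue;      // writer in progress, retry
        COMPILER_BARRIER();
        uint64_t value = s.bound;
        COMPILER_BARRIER();
        unsigned v2 = s.seq;
        if (v1 == v2) return value; // version unchanged: value is consistent
    }
}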
I have a queue with elements which need to be processed. I want to process these elements in parallel. There will be some sections on each element which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access is performed here, so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element) // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q ) // algorithm-is-finished condition
    {
        e = get_elem_from_queue(q) // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to use the Intel TBB concurrent_queue for the queue container. But then, will I be able to use pthreads functions (mutexes, conditions)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads at one point in time? I was thinking to create the threads once and then, after one element is processed, access the queue and get the next element. However, it is more complicated than that because I have no guarantee that if there is no element in the queue the algorithm is finished.
My question
Before I start implementing, I'd like to know if there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want, more precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue, which, upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
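A minimal sketch of that approach (my own names, pthreads used from C++, error handling omitted). The TERMINATE sentinel is re-queued by each worker that sees it, as described above, and the leftover sentinel can be removed by a designated thread afterwards.

#include <pthread.h>
#include <queue>
#include <vector>

struct Element { bool terminate; /* ... payload ... */ };

static std::queue<Element> work_queue;
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cv    = PTHREAD_COND_INITIALIZER;

static void enqueue(const Element& e) {
    pthread_mutex_lock(&queue_mutex);
    work_queue.push(e);
    pthread_cond_signal(&queue_cv);
    pthread_mutex_unlock(&queue_mutex);
}

static Element dequeue() {
    pthread_mutex_lock(&queue_mutex);
    while (work_queue.empty())
        pthread_cond_wait(&queue_cv, &queue_mutex);
    Element e = work_queue.front();
    work_queue.pop();
    pthread_mutex_unlock(&queue_mutex);
    return e;
}

static void* worker(void*) {
    for (;;) {
        Element e = dequeue();
        if (e.terminate) {
            enqueue(e);   // pass the sentinel on to the next worker
            return nullptr;
        }
        // process_element(e); may call enqueue() for newly generated elements
    }
}

// Start num_threads workers; push a single TERMINATE element, e.g.
// enqueue({true}), once the algorithm is known to be finished.
void run(int num_threads) {
    std::vector<pthread_t> threads(num_threads);
    for (auto& t : threads) pthread_create(&t, nullptr, worker, nullptr);
    // ... seed the queue with initial elements, detect completion ...
    for (auto& t : threads) pthread_join(t, nullptr);
}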
TBB is compatible with other threading packages.
TBB also emphasizes scalability. So when you port your program from a dual core to a quad core, you do not have to adjust it. With data-parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application, because it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance costs.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with a std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing till after all "older" items). Also, you don't have to worry about termination: parallel_do completes automatically as soon as all initial queue items and new items created on the fly are processed.
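A rough sketch of that shape using the classic TBB parallel_do API (in newer oneTBB this was renamed parallel_for_each; the element type, the condition, and the mutex are placeholders of mine). Note that parallel_do needs an iterable container for the initial items, so a std::deque is used here:

#include <tbb/parallel_do.h>
#include <tbb/spin_mutex.h>
#include <deque>

struct Element { /* payload */ };

tbb::spin_mutex matrix_mutex;   // protects the synchronized matrix section

void process_queue(std::deque<Element>& initial_items) {
    tbb::parallel_do(initial_items.begin(), initial_items.end(),
        [](Element& e, tbb::parallel_do_feeder<Element>& feeder) {
            {
                tbb::spin_mutex::scoped_lock lock(matrix_mutex);
                // synchronized matrix access goes here
            }
            // ... unsynchronized work on e ...
            bool condition = false;        // placeholder for the real test
            if (condition) {
                Element new_element;       // generate_new_element()
                feeder.add(new_element);   // hand new work to parallel_do on the fly
            }
        });
    // parallel_do returns once the initial items and everything added
    // through the feeder have been processed.
}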
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.
I have this problem:
I have C++ code that uses some threads. These threads are pthreads.
In my iPhone app I use NSOperationQueue and also some C++ code.
The problem is this: the C++ pthreads always have lower priority than the NSOperationQueue threads.
How can I fix this? I have also tried to give low priority to the NSOperationQueue, but this fix does not work.
If you have to resort to twiddling priority (notably upwards), it's usually indicative of a design flaw in concurrent models. This should be reserved for very special cases, like a realtime thread (e.g. audio playback).
First assess how your threads and tasks operate, and make sure you have no other choice. Typically, you can do something simple, like reducing the operation queue's max operation count, reducing total thread count, or by grouping your tasks by the resource they require.
What method are you using to determine the threads' priorities?
Also note that setting an operation's priority affects the ordering of enqueued operations (not the thread itself).
I've always been able to solve this problem by tweaking distribution. You should stop reading now :)
Available, but NOT RECOMMENDED:
To lower an operation's priority, you could approach it in your operation's main:
- (void)main
{
    @autoreleasepool {
        const double priority = [NSThread threadPriority];
        const bool isMainThread = [NSThread isMainThread];
        if (!isMainThread) {
            [NSThread setThreadPriority:priority * 0.5];
        }
        do_your_work_here
        if (!isMainThread) {
            [NSThread setThreadPriority:priority];
        }
    }
}
If you really need to push the kernel after all that, this is how you can set a pthread's priority:
pthreads with real time priority
How to increase thread priority in pthreads?
I have written a small test program in which I try to use the Windows API call SetThreadAffinityMask to lock the thread to a single NUMA node. I retrieve the CPU bitmask of a node with the GetNumaNodeProcessorMask API call, then pass that bitmask to SetThreadAffinityMask along with the thread handle returned by GetCurrentThread. Here is a greatly simplified version of my code:
// Inside a function called from a boost::thread
unsigned long long nodeMask = 0;
GetNumaNodeProcessorMask(1, &nodeMask);
HANDLE thread = GetCurrentThread();
SetThreadAffinityMask(thread, nodeMask);
DoWork(); // make-work function
I of course check whether the API calls return 0 in my code, and I've also printed out the NUMA node mask and it is exactly what I would expect. I've also followed advice given elsewhere and printed out the mask returned by a second identical call to SetThreadAffinityMask, and it matches the node mask.
However, from watching the resource monitor when the DoWork function executes, the work is split among all cores instead of only those it is ostensibly bound to. Are there any trip-ups I may have missed when using SetThreadAffinityMask? I am running Windows 7 Professional 64-bit, and the DoWork function contains a loop parallelized with OpenMP which performs operations on the elements of three very large arrays (which combined are still able to fit in the node).
Edit: To expand on the answer given by David Schwartz, on Windows any threads spawned with OpenMP do NOT inherit the affinity of the thread which spawned them. The problem lies with that, not SetThreadAffinityMask.
Did you confirm that the particular thread whose affinity mask you set was running on a core in another NUMA node? Otherwise, it's working as intended. You are setting the processor mask on one thread and then observing the behavior of a group of threads.
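Following up on the edit above, a minimal sketch (assuming a nodeMask obtained as in the question; the function name and loop are placeholders of mine) that sets the affinity inside the OpenMP parallel region, so every worker thread binds itself rather than relying on inherited affinity:

#include <windows.h>
#include <omp.h>

void DoWorkOnNode(unsigned long long nodeMask)
{
    #pragma omp parallel
    {
        // Each OpenMP worker binds itself; affinity is not inherited
        // from the thread that entered the parallel region.
        SetThreadAffinityMask(GetCurrentThread(),
                              static_cast<DWORD_PTR>(nodeMask));

        #pragma omp for
        for (int i = 0; i < 1000000; ++i) {
            // ... work on the large arrays ...
        }
    }
}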