Is it really impossible to suspend two std/posix threads at the same time? - c++

I want to briefly suspend multiple C++ std threads, running on Linux, at the same time.
It seems this is not supported by the OS.
The threads work on tasks that take an uneven and unpredictable amount of time (several seconds).
I want to suspend them when the CPU temperature rises above a threshold.
It is impractical to check for suspension within the tasks; checking is only possible in between tasks.
I would like to simply have all workers suspend operation for a few milliseconds.
How could that be done?
What I'm currently doing
I'm currently using a condition variable in a slim, custom binary semaphore class (think C++20 std::binary_semaphore).
A worker checks for suspension before starting the next task by acquiring and immediately releasing the semaphore.
A separate control thread occupies the control semaphore for a few milliseconds if the temperature is too high.
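A minimal sketch of that gate, with illustrative names (in effect the same as acquiring and immediately releasing a binary semaphore):

    #include <condition_variable>
    #include <mutex>

    // Workers call wait_if_suspended() between tasks; the control thread
    // calls suspend()/resume() around the cool-down period.
    class SuspendGate {
        std::mutex m_;
        std::condition_variable cv_;
        bool suspended_ = false;
    public:
        void wait_if_suspended() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !suspended_; });
        }
        void suspend() {
            std::lock_guard<std::mutex> lk(m_);
            suspended_ = true;
        }
        void resume() {
            {
                std::lock_guard<std::mutex> lk(m_);
                suspended_ = false;
            }
            cv_.notify_all();
        }
    };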
This often works well and the CPU temperature is stable.
I do not care much about a slight delay in suspending the threads.
However, when one task takes some seconds longer than the others, its thread will continue to run alone.
This activates CPU turbo mode, which is the opposite of what I want to achieve (it is comparatively power inefficient, thus bad for thermals).
I cannot deactivate CPU turbo as I do not control the hardware.
In other words, the tasks take too long to complete.
So I want to forcefully pause them from outside.

I want to suspend them when the CPU temperature rises above a threshold.
In general, that is putting the cart before the horse.
Properly designed hardware should have adequate cooling for maximum load and your program should not be able to exceed that cooling capacity.
In addition, since you are talking about Turbo, we can assume an Intel CPU, which will thermally throttle on its own, making your program run slower without you doing anything.
In other words, the tasks take too long to complete
You could break the tasks into smaller parts, and check the semaphore more often.
A separate control thread occupies the control semaphore for a few milliseconds
It's really unlikely that your hardware can react to millisecond delays -- that's too short a timescale for anything thermal. You will probably be better off monitoring the temperature and simply reducing the number of tasks you are scheduling when the temperature is rising and getting close to your limits.
I've now implemented it with pthread_kill and SIGRT.
Note that suspending threads in an unknown state (whatever the target task was doing at the time of signal receipt) is a recipe for deadlocks. The task may be inside malloc, may be holding arbitrary locks, and so on.
If your "control thread" also needs that lock, it will block and you lose. Your control thread must execute only direct system calls, may not call into libc, etc. etc.
This solution is ~impossible to test, and ~impossible to implement correctly.
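For completeness, the mechanism mentioned in that comment looks roughly like this; it is a sketch only, all of the caveats above apply, and the choice of SIGRTMIN and a 5 ms pause is illustrative:

    #include <signal.h>
    #include <time.h>
    #include <pthread.h>
    #include <vector>

    // Pause the interrupted thread for a fixed interval. nanosleep() is
    // async-signal-safe, but the thread may still hold arbitrary locks
    // (malloc, I/O, ...) while paused, which is exactly the deadlock risk
    // described above.
    static void pause_handler(int /*sig*/) {
        timespec ts{0, 5 * 1000 * 1000};  // 5 ms
        nanosleep(&ts, nullptr);
    }

    void install_pause_handler() {
        struct sigaction sa{};
        sa.sa_handler = pause_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, nullptr);
    }

    // Called by the control thread when the temperature is too high.
    void pause_workers(const std::vector<pthread_t>& workers) {
        for (pthread_t t : workers)
            pthread_kill(t, SIGRTMIN);  // delivery timing is up to the kernel
    }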

Related

How do hardware timers work and how do they affect software performance?

I want to use asynchronous function calls, and I chose boost::deadline_timer.
For me, a hardware timer is a specific piece of hardware (surprisingly) that works independently of the CPU and whose only duty is monitoring time. At the same time, if I understand correctly, it can also be used to set a timeout and generate an interrupt when the timeout has been reached.
The primary advantage of that is asynchronous execution: the thread that set a timer can continue working, and the callback function will be triggered in the same thread that set the timer.
Let me describe how I picture it in action.
The application contains one or more worker threads, e.g. they process input items and filter them. Let's say the application has 5 threads and each thread sets one timer (5 seconds).
The application is running; say the current thread is thread-3.
The timer set by thread-0 expires and generates (probably the wrong term) an interrupt.
A thread context switch occurs (thread-3 -> thread-0);
The callback function executes;
The timer set by thread-1 expires and generates an interrupt.
...
And so on.
P.S. I understand that this is not the only possible scenario for a multi-threaded application.
Questions:
Did I describe the process correctly?
Do I understand correctly that even if the current thread is thread-0, a context switch still occurs, since the thread has to stop executing its current code and switch to executing the callback function?
If each thread sets 100k or 500k timers, how will that affect performance?
Does the hardware have a limit on the number of timers?
How expensive is it to update the timeout of a timer?
A hardware timer is, at its core, just a count-up counter and a set of comparators (or a count-down counter that uses the borrow of the MSb as an implicit comparison with 0).
Picture it as a register with a specialized operation Increment (or Decrement) that is triggered on every cycle of a clock (the simplest kind of counter with this operation is the ripple counter).
Each cycle the counter value is also fed to the comparator, previously loaded with a value, and its output will be the input to the CPU (as an interrupt or in a specialized pin).
In the case of a count-down counter, the borrow from the MSb acts as the signal that the value rolled over zero.
These timers have usually more functions, like the ability to stop after they reach the desired value (one-shot), to reset (periodic), to alternate the output state low and high (square wave generator), and other fancy features.
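As a rough illustration only (no real device works exactly like this), the mechanism can be pictured in code:

    #include <cstdint>

    // Toy model of the counter/comparator pair described above.
    struct HwTimer {
        uint32_t counter = 0;
        uint32_t compare = 0;     // value loaded into the comparator
        bool periodic = false;    // periodic vs one-shot
        bool irq_line = false;    // what the CPU sees as the interrupt/pin

        void clock_tick() {       // invoked once per input clock cycle
            ++counter;
            if (counter == compare) {
                irq_line = true;            // raise the interrupt
                if (periodic) counter = 0;  // reset and count again
            }
        }
    };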
There is no hard limit on how many timers you can put in a package; of course, even though they are simple circuits, they still have a cost in terms of money and space.
Most MCUs have one or two timers, when two, the idea is to use one for generic scheduling and the other for high-priority tasks orthogonal to the OS scheduling.
It's worth noting that having many hardware timers (to be used by the software) is useless unless there are also many CPUs/MCUs since it's easier to use software timers.
On x86 the HPET timer is actually made of at most 32 timers, each with 8 comparators, for a total of 256 timers as seen from the software POV.
The idea was to assign each timer to a specific application.
Applications in an OS don't use the hardware timers directly, because there can possibly be a lot of applications but just one or two timers.
So what the OS does is share the timer.
It does this by programming the timer to generate an interrupt every X units of time and by registering an ISR (Interrupt Service Routine) for such an event.
When a thread/task/program sets up a timer, the OS appends the timer information (periodic vs one-shot, period, ticks left, and callback) to a priority queue keyed on the absolute expiration time (as Peter Cordes points out), or to a plain list for simple OSes.
Each time the ISR is called the OS will peek at the queue and see if the element on top is expired.
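A minimal sketch of that bookkeeping, with the hardware ISR reduced to a plain function and illustrative types (a real OS stores function pointers plus arguments rather than std::function):

    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <vector>

    struct SoftTimer {
        uint64_t expires_at;       // absolute tick of expiry (the queue key)
        uint64_t period;           // 0 = one-shot
        std::function<void()> cb;  // callback
    };

    struct Later {
        bool operator()(const SoftTimer& a, const SoftTimer& b) const {
            return a.expires_at > b.expires_at;   // min-heap on expiry
        }
    };

    std::priority_queue<SoftTimer, std::vector<SoftTimer>, Later> timers;
    uint64_t now = 0;

    void tick_isr() {              // stands in for the hardware timer ISR
        ++now;
        while (!timers.empty() && timers.top().expires_at <= now) {
            SoftTimer t = timers.top();
            timers.pop();
            t.cb();                // or defer to a timer task (see below)
            if (t.period)          // periodic: re-arm
                timers.push({now + t.period, t.period, t.cb});
        }
    }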
What happens when a software timer is expired is OS-dependent.
Some embedded and small OS may call the timer's callback directly from the context of the ISR.
This is often true if the OS doesn't really have a concept of thread/task (and so of context switch).
Other OSes may append the timer's callback to a list of "to be called soon" functions.
This list will be walked and processed by a specialized task. This is how FreeRTOS does it if the timer task is enabled.
This approach keeps the ISR short and allows programming the hardware timer with a shorter period (in many architectures interrupts are ignored while in an ISR, either by the CPU automatically masking interrupts or by the interrupt controller).
IIRC Windows does something similar: it posts an APC (Asynchronous Procedure Call) in the context of the thread whose software timer just expired. When the thread is scheduled, the APC will call the callback (as a window message or not, depending on the specific API used). If the thread was waiting on the timer, I think it is just set to the ready state. In any case, it's not scheduled right away, but it may get a priority boost.
Where the ISR will return is still OS-dependent.
An OS may continue executing the interrupted thread/task until it's scheduled out. In this case, you don't have step 4 immediately after step 3; instead, thread-3 will run until its quantum expires.
The other way around, an OS may signal the end of the ISR to the hardware and then schedule the thread with the callback.
This approach doesn't work if two or more timers expire in the same tick, so a better approach would be to trigger a rescheduling, letting the scheduler pick the most appropriate thread.
The scheduling may also take into account other hints given by the thread during the creation of the software timer.
The OS may also just switch context, execute the callback and get back to the ISR context where it continues peeking at the queue.
The OS may even do any of that based on the period of the timer and other hints.
So it works pretty much like you imagined, except that a thread may not be called immediately upon the timer's expiration.
Updating a timer is not expensive.
While all in all the total work is not much, the timer ISR is meant to be called many many times a second.
In fact, I'm not even sure an OS will allow you to create such a huge number (500k) of timers.
Windows can manage a lot of timers (and their backing threads) but probably not 500k.
The main problem with having a lot of timers is that even if each one performs little work, the total work performed may be too much to keep up with the rate of ticking.
If 100 timers expire every X units of time (e.g. 1 ms), you have X/100 units of time (e.g. 10 µs) to execute each callback, and a callback's code may just be too long to execute in that slice of time.
When this happens the callbacks will be called less often than desired.
More CPUs/cores will allow some callbacks to execute in parallel and alleviate the pressure.
In general, you need different timers if they run at different rates, otherwise, a single timer that walks a data structure filled with elements of work/data is fine.
Multi-threading can provide concurrency if your tasks are IO-bounded (files, network, input, and so on) or parallelism if you have a multi-processor system.

Threads: How to calculate precisely the execution time of an algorithm (duration of function) in C or C++?

There is an easy way to calculate the duration of any function, described here: How to Calculate Execution Time of a Code Snippet in C++
start_timestamp = get_current_uptime();
// measured algorithm
duration_of_code = get_current_uptime() - start_timestamp;
But it does not give a clean duration, because time spent executing other threads is included in the measured time.
So the question is: how do I account for the time the code spends in other threads?
OS X code preferred, although it's great to look at Windows or Linux code as well...
Update: the ideal(?) concept in code:
start_timestamp = get_this_thread_current_uptime();
// measured algorithm
duration_of_code = get_this_thread_current_uptime() - start_timestamp;
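On Linux this per-thread counter exists as CLOCK_THREAD_CPUTIME_ID (recent macOS versions also provide clock_gettime, though the supported clock IDs vary); it counts only CPU time consumed by the calling thread, so time spent while other threads run is excluded. A minimal sketch:

    #include <ctime>
    #include <cstdio>

    // CPU time consumed by the calling thread only.
    static double thread_cpu_seconds() {
        timespec ts;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main() {
        double start = thread_cpu_seconds();
        // measured algorithm (placeholder work)
        volatile double sink = 0;
        for (int i = 0; i < 10000000; ++i) sink += i * 0.5;
        double duration_of_code = thread_cpu_seconds() - start;
        std::printf("thread CPU time: %.6f s\n", duration_of_code);
    }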
I'm sorry to say that in the general case there is no way to do what you want. You are looking for worst-case execution time, and there are several methods to get a good approximation for this, but there is no perfect way as WCET is equivalent to the Halting problem.
If you want to exclude the time spent in other threads then you could disable task context switches upon entering the function that you want to measure. This is RTOS dependent but one possibility is to raise the priority of the current thread to the maximum. If this thread is max priority then other threads won't be able to run. Remember to reset the thread priority again at the end of the function. This measurement may still include the time spent in interrupts, however.
Another idea is to disable interrupts altogether. This could remove other threads and interrupts from your measurement. But with interrupts disabled the timer interrupt may not function properly. So you'll need to set up a hardware timer appropriately and rely on the timer's counter value register (rather than any time value derived from a timer interrupt) to measure the time. Also make sure your function doesn't call any RTOS routines that allow for a context switch. And remember to restore interrupts at the end of your function.
Another idea is to run the function many times and record the shortest duration measured over those many times. Longer durations probably include time spent in other threads but the shortest duration may be just the function with no other threads.
Another idea is to set a GPIO pin upon entry to and clear it upon exit from the function. Then monitor the GPIO pin with an oscilloscope (or logic analyzer). Use the oscilloscope to measure the period for when the GPIO pin is high. In order to remove the time spent in other threads you would need to modify the RTOS scheduler routine that selects the thread to run. Clear the GPIO pin in the scheduler when another thread runs and set it when the scheduler returns to your function's thread. You might also consider clearing the GPIO pin in interrupt handlers.
Your question is entirely OS-dependent. The only way you can accomplish this is to somehow get a guarantee from the OS that it won't preempt your process to perform some other task, and to my knowledge this is simply not possible in most consumer OSes.
RTOSes often do provide ways to accomplish this, though. With Windows CE, anything running at priority 0 will (in theory) not be preempted by another thread unless it makes a function/OS API/library call that requires servicing from another thread.
I'm not super familiar with OS X, but after glancing at the documentation, OS X is a "soft" realtime operating system. This means that technically what you want can't be guaranteed: the OS may decide that there is "something" more important than your process that NEEDS to be done.
OSX does however allow you to specify a Real-time process which means the OS will make every effort to honor your request to not be interrupted and will only do so if it deems absolutely necessary.
Mac OS X Scheduling documentation provides examples on how to set up real-time threads
OSX is not an RTOS, so the question is mistitled and mistagged.
In a true RTOS you can lock the scheduler, disable interrupts or raise the task to the highest priority (with round-robin scheduling disabled if other tasks share that priority) to prevent preemption - although only interrupt disable will truly prevent preemption by interrupt handlers. In a GPOS, even if it has a priority scheme, that normally only controls the number of timeslices allowed to a process in what is otherwise round-robin scheduling, and does not prevent preemption.
One approach is to make many repeated tests and take the smallest value obtained, since that is likely to be the one where the fewest preemptions occurred. It will also help to set the process to the highest priority in order to minimise the number of preemptions. But bear in mind that on a GPOS many interrupts from devices such as the mouse, keyboard, and system clock will occur and consume a small (and possibly negligible) amount of time.

Linux, need accurate program timing. Scheduler wake up program

I have a thread running on a Linux system which I need to execute at intervals as accurate as possible, e.g. execute once every millisecond.
Currently this is done by creating a timer with
timerfd_create(CLOCK_MONOTONIC, 0)
, and then passing the desired sleep time in a struct with
timerfd_settime (fd, 0, &itval, NULL);
A blocking read call is performed on this timer which halts thread execution and reports lost wakeup calls.
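For reference, that setup looks roughly like this (a sketch assuming a 1 ms period; error handling omitted):

    #include <sys/timerfd.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Periodic 1 ms timer on the monotonic clock.
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);
        itimerspec itval{};
        itval.it_interval.tv_nsec = 1000000;  // 1 ms period
        itval.it_value.tv_nsec    = 1000000;  // first expiration after 1 ms
        timerfd_settime(fd, 0, &itval, nullptr);

        for (;;) {
            uint64_t expirations = 0;
            // Blocks until the timer fires; a value > 1 means missed wakeups.
            read(fd, &expirations, sizeof expirations);
            if (expirations > 1)
                std::fprintf(stderr, "missed %llu wakeups\n",
                             (unsigned long long)(expirations - 1));
            // ... periodic work ...
        }
    }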
The problem is that at higher frequencies, the system starts losing deadlines, even though CPU usage is below 10%. I think this is due to the scheduler not waking the thread often enough to check the blocking call. Is there a command I can use to tell the scheduler to wake the thread at certain intervals, as far as that is possible?
Busy-waiting is a bad option since the system handles many other tasks.
Thank you.
You need to get RT linux*, and then increase the RT priority of the process that you want to wake up at regular intervals.
Other than that, I do not see problems in your code, and if your process is not getting blocked, it should work fine.
(*) RT Linux - an OS with some real-time scheduling patches applied.
One way to reduce scheduler latency is to run your process using a realtime scheduling class such as SCHED_FIFO. See sched_setscheduler.
This will generally improve latency a lot, but there's still little guarantee; to further reduce latency spikes, you'll need to move to the realtime branch of Linux, or to a realtime OS such as VxWorks, RTEMS or QNX.
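A sketch of moving the calling thread into SCHED_FIFO (priority 50 is an arbitrary mid-range choice; this needs root or CAP_SYS_NICE):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Put the calling thread into the SCHED_FIFO real-time class.
    bool make_realtime() {
        sched_param sp{};
        sp.sched_priority = 50;
        int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (rc != 0) {
            std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
            return false;
        }
        return true;
    }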
You won't be able to do what you want unless you run it on an actual "Real Time OS".
If this is Linux on an x86 system, I would choose the HPET timer. I think all modern PCs have this hardware timer built in, and it is very, very accurate. It allows you to define a callback that will be called every millisecond, and in this callback you can do your calculations (if they are simple) or just trigger work in another thread using some synchronization object (a condition variable, for example).
Here is an example of how to use this timer: http://blog.fpmurphy.com/2009/07/linux-hpet-support.html
Along with other advice such as setting the scheduling class to SCHED_FIFO, you will need to use a Linux kernel compiled with a high enough tick rate that it can meet your deadline.
For example, a kernel compiled with CONFIG_HZ of 100 or 250 Hz (timer interrupts per second) can never respond to timer events faster than that.
You must also set your timer to be just a little bit faster than you actually need, because timers are allowed to go beyond their requested time but never expire early; this will give you better results. If you need 1 ms, then I'd recommend asking for 999 µs instead.

Could someone explain this interesting behaviour with Sleep(1)?

I was testing how long various Win32 API calls will wait when asked to wait for 1 ms. I tried:
::Sleep(1)
::WaitForSingleObject(handle, 1)
::GetQueuedCompletionStatus(handle, &bytes, &key, &overlapped, 1)
I was measuring the elapsed time using QueryPerformanceCounter and QueryPerformanceFrequency. The elapsed time was about 15 ms most of the time, which is expected and documented all over the Internet. However, for a short period of time the waits were taking about 2 ms! It happened consistently for a few minutes, but now it is back to 15 ms. I did not use timeBeginPeriod() or timeEndPeriod() calls! Then I tried the same app on another machine, and the waits consistently take about 2 ms! Both machines have Windows XP SP2 and the hardware should be identical. Is there something that explains why the wait times vary by so much? TIA
Thread.Sleep(0) will let any threads of the same priority execute. Thread.Sleep(1) will let any threads of the same or lower priority execute.
Each thread is given an interval of time to execute in, before the scheduler lets another thread execute. As Billy ONeal states, calling Thread.Sleep will give up the rest of this interval to other threads (subject to the priority considerations above).
Windows balances threads across the entire OS - not just within your process. This means that other threads on the OS can also cause your thread to be pre-empted (i.e. interrupted, with the rest of the time interval given to another thread).
There is an article that might be of interest on the topic of Thread.Sleep(x) at:
Priority-induced starvation: Why Sleep(1) is better than Sleep(0) and the Windows balance set manager
Changing the timer's resolution can be done by any process on the system, and the effect is seen globally. See this article on how the Hotspot Java compiler deals with times on windows, specifically:
Note that any application can change the timer interrupt and that it affects the whole system. Windows only allows the period to be shortened, thus ensuring that the shortest requested period by all applications is the one that is used. If a process doesn't reset the period then Windows takes care of it when the process terminates. The reason why the VM doesn't just arbitrarily change the interrupt rate when it starts - it could do this - is that there is a potential performance impact to everything on the system due to the 10x increase in interrupts. However other applications do change it, typically multi-media viewers/players.
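To see the effect yourself, a small sketch that requests 1 ms resolution around the wait and measures it with QueryPerformanceCounter (the 1 ms value is illustrative):

    #include <windows.h>
    #include <cstdio>
    #pragma comment(lib, "winmm.lib")  // timeBeginPeriod / timeEndPeriod

    int main() {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        timeBeginPeriod(1);            // request 1 ms global timer resolution

        QueryPerformanceCounter(&t0);
        Sleep(1);
        QueryPerformanceCounter(&t1);

        timeEndPeriod(1);              // always pair with timeBeginPeriod

        double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
        std::printf("Sleep(1) took %.3f ms\n", ms);
    }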
The biggest thing Sleep(1) does is give up the rest of your thread's quantum. How much that is depends entirely upon how much of your thread's quantum remains when you call Sleep.
To aggregate what was said before:
CPU time is assigned in quantums (time slices)
The thread scheduler picks the thread to run. This thread may run for the entire time slice, even if threads of higher priority become ready to run.
Typical time slices are 8..15ms, depending on architecture.
The thread can "give up" the time slice - typically with Sleep(0) or Sleep(1). Sleep(0) allows another thread of the same or higher priority to run for the next time slice. Sleep(1) allows "any" thread.
The time slice is global and can be affected by all processes
Even if you don't change the time slice, someone else could.
Even if the time slice doesn't change, you may "jump" between the two different times.
For simplicity, assume a single core, your thread and another thread X.
If Thread X runs at the same priority as yours, crunching numbers, Your Sleep(1) will take an entire time slice, 15ms being typical on client systems.
If Thread X runs at a lower priority, and gives up its own time slice after 4 ms, your Sleep(1) will take 4 ms.
I would say it just depends on how loaded the CPU is; if there aren't many other processes/threads, it could get back to the calling thread a lot faster.

Allocate more processor cycles to my program

I've been working with Win32, C, and C++ for a while. I code in Visual Studio. Most of the time I see the System Idle Process using most of the CPU. Is there a way to allocate more processor cycles to my program to make it run faster? I understand there might be limitations from I/O; in those cases this question doesn't make any sense.
OR
did I misunderstand the Task Manager numbers? I'm confused; please help me out.
I want to do something in the program itself, and by the way, I'll be happy if answers are specific to Windows.
Thanks in advance
~calvin
If your program is the only program that has something to do (i.e. it is not waiting for I/O), its thread will always be assigned to a processor core.
However, if you have a multi-core processor and a single-threaded program, the CPU usage of your process displayed in the Task Manager will always be limited to 100/Ncores.
For example, if you have a quad-core machine, your process will be at 25% (using one core), and the idle process at around 75%. You can only get additional CPU power by dividing your tasks into chunks that can be worked on by separate threads, which will then run on the idle cores.
The idle process only "runs" when no other process needs to. If you want to use more CPU cycles, then use them.
If your program is idling, it doesn't do anything, i.e. there is nothing that could be done any faster. So the CPU is probably not the bottle-neck in your case.
Are you maybe waiting for data coming from the disk or network?
In case your processor has multiple cores and your program uses only one core to its full extent, making your program multi-threaded could work.
In a multitasking/multithreading OS, processor time is split among threads.
If you want a specific thread to get a bigger chunk of time, you can set its priority with the SetThreadPriority function; it is not wise to do so, though.
Only special software should mess with those settings.
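For illustration only, raising the current thread one priority notch looks like this; as said above, it only changes scheduling, it does not create CPU time:

    #include <windows.h>

    void boost_current_thread() {
        // GetCurrentThread() returns a pseudo-handle for the calling thread.
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
    }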
It's common for windowed applications to have a low CPU usage percentage (which we see in Task Manager)
because most of the time they just wait for messages.
Use threads to:
abstract away all the I/O waits.
assign work to all cores.
also, remove all sleep-wait states from main thread.
Defer all I/O to a thread, so that wait states are confined within it. Keep the actual computations in the foreground thread, and use synchronization mechanisms that make the I/O slave thread wait for your main thread when communicating.
If your CPU is multi-core and your problem is parallelizable, create as many threads as you have cores, research "set affinity" functions to assign them to the cores, and still keep a separate thread for all I/O.
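A minimal Win32 sketch of such a "set affinity" call (illustrative; letting the scheduler place threads is usually fine):

    #include <windows.h>

    // Pin the given thread to core 'core' (0-based).
    void pin_to_core(HANDLE thread, unsigned core) {
        SetThreadAffinityMask(thread, DWORD_PTR(1) << core);
    }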
Also pay attention not to wait in your main thread - usleep(1) doesn't put you to sleep for 1 microsecond, but for "no less than" that, and in practice that may mean anything between 1 ms and 100 ms, hardly ever less, and never anything close to a microsecond.