Multiple timeouts in C++

I have several objects I need to maintain with time-to-live parameters, and trigger some kind of event when they time out.
Implementing this as a timer that just waits for the object with the smallest TTL before popping it off a queue doesn't seem very efficient, because I need to be able to add/remove objects sporadically from this queue (and they can have any timeout value), possibly before they time out. For example, ugly things would happen if I decide that the shortest TTL is 10 seconds and block the timeout thread for that long, but during this period an object with 3 seconds to live is added to the queue.
The naive way of doing
while (true) {
    update();
}

void update() {
    // get delta since last call to update()
    // subtract delta from each object and time out if ttl < 0
}
is pretty slow, since a lot of memory gets shuffled around for the sole purpose of updating each ttl at microsecond resolution.
Is there a good way to do this without creating a separate thread for each object?
I have access to the C++11 std lib but no boost or any external libraries.

One easy but somewhat crappy option is to poll for updates to the queue, say every second or every tenth of a second. Poll too often and you never yield the CPU for productive work; too infrequently and your timing capability becomes very crude. You can use std::this_thread::sleep_for(std::chrono::milliseconds(n)) for the inter-poll delay, or something like select if you are doing other I/O too. Have all accesses to the queue arbitrated by a mutex, and use, say, a std::map<std::chrono::steady_clock::time_point, Task> to keep Tasks sorted by expiry time; each time the poll period elapses, iterate from .begin() towards .end(), exiting early at the first time_point that has not elapsed yet.
...without creating a separate thread for each object?
The above can be done in a single background thread if desired.
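A minimal sketch of that polling approach, assuming a hypothetical Task type with an expire() member; tasks are keyed by their absolute expiry time in a multimap guarded by a single mutex:

#include <chrono>
#include <map>
#include <mutex>
#include <thread>

struct Task { void expire() { /* handle the timeout */ } };  // hypothetical task type

std::mutex m;
std::multimap<std::chrono::steady_clock::time_point, Task> queue;  // sorted by expiry time

void add_task(std::chrono::milliseconds ttl, Task t) {
    std::lock_guard<std::mutex> lock(m);
    queue.emplace(std::chrono::steady_clock::now() + ttl, std::move(t));
}

void poll_loop() {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // poll period
        std::lock_guard<std::mutex> lock(m);
        auto now = std::chrono::steady_clock::now();
        // expire everything whose time_point has passed; stop at the first future one
        auto it = queue.begin();
        while (it != queue.end() && it->first <= now) {
            it->second.expire();
            it = queue.erase(it);
        }
    }
}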
Another option is a non-Standard, OS-provided asynchronous notification mechanism such as a signal-raising alarm, e.g. alarm. A signal handler is typically only allowed to perform a fairly restricted set of operations, so the normal advice is to set a flag so the interrupted thread knows there's work for it to do - that's not much different from having to check the queue for expired Tasks anyway, but the advantage is that the signal itself can force some blocking operations to terminate early (e.g. without the SA_RESTART flag to sigaction), with error codes indicating the reason for the interruption. Decades ago I came across blocking operations on some Operating Systems that made limited guarantees about the state of the I/O buffers the interrupted routine had been using, making it impossible to build a robust resumption of that I/O - check your OS docs.
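A rough sketch of the flag-setting pattern with alarm on a POSIX system (an assumption about the platform; error handling omitted):

#include <csignal>
#include <unistd.h>   // alarm()

volatile std::sig_atomic_t expired = 0;  // the only thing the handler does is set this

extern "C" void on_alarm(int) { expired = 1; }

int main() {
    std::signal(SIGALRM, on_alarm);
    alarm(3);  // request a SIGALRM in roughly 3 seconds
    for (;;) {
        // ... do real work, possibly blocked in a call the signal can interrupt ...
        if (expired) {
            expired = 0;
            // check the queue for expired Tasks here, then re-arm the alarm
            alarm(3);
        }
    }
}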

How do hardware timers work, and how do they affect software performance?

I want to use asynchronous function calls, so I chose boost::deadline_timer.
As I understand it, a hardware timer is a specific piece of hardware (surprisingly) that works independently of the CPU and whose only duty is keeping track of time. At the same time, if I understand correctly, it can also be used to set a timeout and generate an interrupt when that timeout has been reached. (timers)
The primary advantage of that is asynchronous execution: the thread that set a timer can continue working, and the callback function will be triggered in the same thread where the timer was set.
Let me describe as I see it in action.
The application contains one or more worker threads, e.g. threads that process and filter input items. Let's say the application has 5 threads and each thread sets one timer (5 seconds).
The application is running; e.g. the current thread is thread-3.
The timer set by thread-0 expires and generates (probably the wrong term) an interrupt.
Thread-context switch (thread-3 -> thread-0);
Callback function execution;
The timer set by thread-1 expires and generates an interrupt.
...
And so on.
P.S. I understand that this is not the only possible scenario for a multi-threaded application.
Questions:
Did I describe the process correctly?
Do I understand correctly that even if the current thread is thread-0, this still leads to a context switch, since the thread has to stop executing its current code and switch to executing the code from the callback function?
If each thread sets 100k or 500k timers, how will that affect performance?
Does the hardware have a limit on the number of timers?
How expensive is it to update the timeout of a timer?
A hardware timer is, at its core, just a count-up counter and a set of comparators (or a count-down counter that uses the borrow of the MSb as an implicit comparison with 0).
Picture it as a register with a specialized Increment (or Decrement) operation that is triggered on every cycle of a clock (the simplest kind of counter with this operation is the ripple counter).
Each cycle the counter value is also fed to a comparator, previously loaded with a value, and the comparator's output is routed to the CPU (as an interrupt or on a specialized pin).
In the case of a count-down counter, the borrow from the MSb acts as the signal that the value rolled over zero.
These timers have usually more functions, like the ability to stop after they reach the desired value (one-shot), to reset (periodic), to alternate the output state low and high (square wave generator), and other fancy features.
There is no hard limit on how many timers you can put in a package; of course, even though they are simple circuits, they still have a cost in terms of money and space.
Most MCUs have one or two timers; when there are two, the idea is to use one for generic scheduling and the other for high-priority tasks orthogonal to the OS scheduling.
It's worth noting that having many hardware timers (to be used by the software) is useless unless there are also many CPUs/MCUs since it's easier to use software timers.
On x86 the HPET timer is actually made of at most 32 timers, each with 8 comparators, for a total of 256 timers as seen from the software POV.
The idea was to assign each timer to a specific application.
Applications in an OS don't use the hardware timers directly, because there can possibly be a lot of applications but just one or two timers.
So what the OS does is share the timer.
It does this by programming the timer to generate an interrupt every X units of time and by registering an ISR (Interrupt Service Routine) for such an event.
When a thread/task/program sets up a timer, the OS appends the timer information (periodic vs one-shot, period, ticks left, and callback) to a priority queue keyed by the absolute expiration time (see Peter Cordes' comments below), or to a plain list in simple OSes.
Each time the ISR is called the OS will peek at the queue and see if the element on top is expired.
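A simplified user-space sketch of that bookkeeping (just the idea, not any particular OS's implementation): a min-heap keyed by absolute expiration time, with the periodic tick handler firing everything that has expired.

#include <chrono>
#include <functional>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

struct SoftTimer {
    Clock::time_point deadline;      // absolute expiration time (the queue key)
    std::function<void()> callback;  // what to run on expiry
    bool operator>(const SoftTimer& other) const { return deadline > other.deadline; }
};

// min-heap: the timer expiring soonest sits on top
std::priority_queue<SoftTimer, std::vector<SoftTimer>, std::greater<SoftTimer>> timers;

// called from the periodic tick "ISR" (here just an ordinary function)
void on_tick() {
    const auto now = Clock::now();
    while (!timers.empty() && timers.top().deadline <= now) {
        auto cb = timers.top().callback;
        timers.pop();
        cb();  // a real OS would typically defer this to a timer task instead of running it here
    }
}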
What happens when a software timer is expired is OS-dependent.
Some embedded and small OS may call the timer's callback directly from the context of the ISR.
This is often true if the OS doesn't really have a concept of thread/task (and so of context switch).
Other OSes may append the timer's callback to a list of "to be called soon" functions.
This list will be walked and processed by a specialized task. This is how FreeRTOS does it if the timer task is enabled.
This approach keeps the ISR short and allows programming the hardware timer with a shorter period (in many architectures interrupts are ignored while in an ISR, either by the CPU automatically masking interrupts or by the interrupt controller).
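A user-space analogy of that "to be called soon" handoff (not FreeRTOS's actual implementation): the ISR-side code only appends to a list and notifies, while a dedicated worker drains the list and runs the callbacks.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
std::deque<std::function<void()>> pending;  // "to be called soon" callbacks

// what the (short) ISR-side code would do: just enqueue and notify
void defer(std::function<void()> cb) {
    {
        std::lock_guard<std::mutex> lock(mtx);
        pending.push_back(std::move(cb));
    }
    cv.notify_one();
}

// the dedicated "timer task" that actually runs the callbacks
void timer_task() {
    for (;;) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return !pending.empty(); });
        auto cb = std::move(pending.front());
        pending.pop_front();
        lock.unlock();
        cb();  // run outside the lock so more work can be queued meanwhile
    }
}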
IIRC Windows does something similar: it posts an APC (Asynchronous Procedure Call) in the context of the thread whose software timer just expired. When the thread is scheduled, the APC will call the callback (as a window message or not, depending on the specific API used). If the thread was waiting on the timer, I think it is just set to the ready state. In any case, it's not scheduled right away, but it may get a priority boost.
Where the ISR will return is still OS-dependent.
An OS may continue executing the interrupted thread/task until it's scheduled out. In this case the context switch and the callback do not happen immediately after the interrupt; instead, thread-3 will run until its quantum expires.
Going the other way, an OS may signal the end of the ISR to the hardware and then schedule the thread with the callback.
This approach doesn't work if two or more timers expire in the same tick, so a better approach is to trigger a rescheduling and let the scheduler pick the most appropriate thread.
The scheduling may also take into account other hints given by the thread during the creation of the software timer.
The OS may also just switch context, execute the callback and get back to the ISR context where it continues peeking at the queue.
The OS may even do any of that based on the period of the timer and other hints.
So it works pretty much like you imagined, except that a thread may not be called immediately upon the timer's expiration.
Updating a timer is not expensive.
While the total work per tick is not much, keep in mind that the timer ISR is meant to be called many, many times a second.
In fact, I'm not even sure an OS will allow you to create such a huge number (500k) of timers.
Windows can manage a lot of timers (and their backing threads) but probably not 500k.
The main problem with having a lot of timers is that even if each one performs little work, the total work performed may be too much to keep up with the rate of ticking.
If each X units (e.g. 1ms) of time 100 timers expire, you have X/100 units of time (e.g. 10us) to execute each callback and the callback's code may just be too long to execute in that slice of time.
When this happens the callbacks will be called less often than desired.
More CPUs/cores will allow some callbacks to execute in parallel and alleviate the pressure.
In general, you need different timers if they run at different rates, otherwise, a single timer that walks a data structure filled with elements of work/data is fine.
Multi-threading can provide concurrency if your tasks are IO-bounded (files, network, input, and so on) or parallelism if you have a multi-processor system.

What is the best way to handle timed callbacks that affect an object shared across different threads?

I have a class, let's call it Foo, and an instance foo that holds data useful to multiple threads. These threads may call read and write operations (e.g. foo->emplace(something)), which I protect with a mutex inside Foo, taking the lock in each operation. Here's where I'm uncertain about the implementation: I have to add another piece of shared information to foo, where I call foo->emplace2(somethingElse) and this stores somethingElse in a std::set, but it should only be stored for a minute.
What is the right approach to this? Do I create a new thread from inside foo whenever emplace2 is called, and in that thread emplace, sleep for 60 seconds, then erase? I feel like there is a better way than creating lots of threads every time emplace2 is called.
Not looking for code, just general implementation advice.
You have a few options:
One thread per request
a.k.a. your solution (sketched below)
Advantages:
easy to implement
precise remove times
Disadvantages:
lots of threads created. But then again all the threads sleep for their whole life so it might not be a problem worth solving.
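A minimal sketch of this option, assuming the shared container is a mutex-protected std::set inside Foo (names are illustrative, and the detached thread assumes foo outlives the 60-second window):

#include <chrono>
#include <mutex>
#include <set>
#include <thread>

class Foo {
public:
    void emplace2(int value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            set2_.insert(value);
        }
        // one short-lived (mostly sleeping) thread per request
        std::thread([this, value] {
            std::this_thread::sleep_for(std::chrono::seconds(60));
            std::lock_guard<std::mutex> lock(m_);
            set2_.erase(value);
        }).detach();
    }

private:
    std::mutex m_;
    std::set<int> set2_;
};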
One queue and one thread that periodically checks expired requests
Create a queue that stores all the requests along with their expiry time. Have one thread that periodically wakes up and deletes all expired requests.
Advantages:
easy-ish to implement
just one thread
Disadvantages:
A compromise must be made:
increase the wake-up and check frequency: this improves the precision of the removal times, but increases the number of useless wake-ups
lower the wake-up and check frequency: this reduces the precision of the removal times, but decreases the number of useless wake-ups
One thread with precise wake-ups
Create one queue with all the requests along with their expire time. Have one thread that wakes up only when the next request expires.
When a request is added: signal and wake the thread, recompute the next expire and sleep until that time.
On wake-up: delete the expiring request, compute the time until the next expiry, and sleep until then (a sketch follows below).
Advantages:
Best performance
precise remove times
Disadvantage:
difficult-ish to implement
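A rough sketch of this variant, assuming requests are keyed by their absolute expiry time and a condition variable provides both the precise wake-up and the "new request added" signal (names are illustrative):

#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

using Clock = std::chrono::steady_clock;

std::mutex m;
std::condition_variable cv;
std::multimap<Clock::time_point, std::string> requests;  // sorted by expiry time

void add_request(std::string r, std::chrono::seconds ttl) {
    {
        std::lock_guard<std::mutex> lock(m);
        requests.emplace(Clock::now() + ttl, std::move(r));
    }
    cv.notify_one();  // wake the worker so it can recompute its next wake-up time
}

void expiry_thread() {
    std::unique_lock<std::mutex> lock(m);
    for (;;) {
        if (requests.empty()) {
            cv.wait(lock);  // nothing to expire; sleep until something is added
        } else {
            // sleep until the earliest expiry, or until a new request arrives
            cv.wait_until(lock, requests.begin()->first);
        }
        const auto now = Clock::now();
        while (!requests.empty() && requests.begin()->first <= now) {
            requests.erase(requests.begin());  // expired: remove it
        }
    }
}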
What is the best way to ...
As always, the answer is: "It depends". I gave you some options along with a brief analysis of each. It's up to you to decide which to implement; the decision is a balancing act between performance requirements and implementation and maintenance costs.

Threads: How to calculate precisely the execution time of an algorithm (duration of function) in C or C++?

There is an easy way to calculate the duration of any function, described here: How to Calculate Execution Time of a Code Snippet in C++
start_timestamp = get_current_uptime();
// measured algorithm
duration_of_code = get_current_uptime() - start_timestamp;
But it does not give a clean duration, because time spent executing other threads will be included in the measured interval.
So the question is: how do I exclude the time the code spends suspended while other threads run?
OS X code preferred, although Windows or Linux code would also be great to see...
upd: the ideal(?) concept of the code:
start_timestamp = get_this_thread_current_uptime();
// measured algorithm
duration_of_code = get_this_thread_current_uptime() - start_timestamp;
I'm sorry to say that in the general case there is no way to do what you want. You are looking for worst-case execution time, and there are several methods to get a good approximation for this, but there is no perfect way as WCET is equivalent to the Halting problem.
If you want to exclude the time spent in other threads then you could disable task context switches upon entering the function that you want to measure. This is RTOS dependent but one possibility is to raise the priority of the current thread to the maximum. If this thread is max priority then other threads won't be able to run. Remember to reset the thread priority again at the end of the function. This measurement may still include the time spent in interrupts, however.
Another idea is to disable interrupts altogether. This could remove other threads and interrupts from your measurement. But with interrupts disabled the timer interrupt may not function properly. So you'll need to setup a hardware timer appropriately and rely on the timer's counter value register (rather than any time value derived from a timer interrupt) to measure the time. Also make sure your function doesn't call any RTOS routines that allow for a context switch. And remember to restore interrupts at the end of your function.
Another idea is to run the function many times and record the shortest duration measured over those many times. Longer durations probably include time spent in other threads but the shortest duration may be just the function with no other threads.
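A minimal sketch of that repeat-and-take-the-minimum idea using the C++11 clock (the function being measured is a placeholder):

#include <algorithm>
#include <chrono>
#include <cstdio>

void algorithm_under_test() { /* ... the code being measured ... */ }

int main() {
    using Clock = std::chrono::steady_clock;
    auto best = std::chrono::nanoseconds::max();
    for (int i = 0; i < 1000; ++i) {  // many repetitions
        const auto start = Clock::now();
        algorithm_under_test();
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start);
        best = std::min(best, elapsed);  // keep the shortest run
    }
    std::printf("best run: %lld ns\n", static_cast<long long>(best.count()));
}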
Another idea is to set a GPIO pin upon entry to and clear it upon exit from the function. Then monitor the GPIO pin with an oscilloscope (or logic analyzer). Use the oscilloscope to measure the period for when the GPIO pin is high. In order to remove the time spent in other threads you would need to modify the RTOS scheduler routine that selects the thread to run. Clear the GPIO pin in the scheduler when another thread runs and set it when the scheduler returns to your function's thread. You might also consider clearing the GPIO pin in interrupt handlers.
Your question is entirely OS dependent. The only way you can accomplish this is to somehow get a guarantee from the OS that it won't preempt your process to perform some other task, and to my knowledge this is simply not possible in most consumer OS's.
RTOSes often do provide ways to accomplish this, though. With Windows CE, anything running at priority 0 will (in theory) not be preempted by another thread unless it makes a function/OS API/library call that requires servicing from another thread.
I'm not super familiar with OS X, but after glancing at the documentation, OS X is a "soft" real-time operating system. This means that technically what you want can't be guaranteed: the OS may decide that there is "something" more important than your process that NEEDS to be done.
OS X does, however, allow you to specify a real-time process, which means the OS will make every effort to honor your request not to be interrupted and will only do so if it deems it absolutely necessary.
The Mac OS X scheduling documentation provides examples of how to set up real-time threads.
OSX is not an RTOS, so the question is mistitled and mistagged.
In a true RTOS you can lock the scheduler, disable interrupts, or raise the task to the highest priority (with round-robin scheduling disabled if other tasks share that priority) to prevent preemption - although only disabling interrupts will truly prevent preemption by interrupt handlers. In a GPOS, even if it has a priority scheme, that normally only controls the number of timeslices allotted to a process in what is otherwise round-robin scheduling, and it does not prevent preemption.
One approach is to make many repeated tests and take the smallest value obtained, since that is likely to be the one with the fewest preemptions. It also helps to set the process to the highest priority in order to minimise the number of preemptions. But bear in mind that on a GPOS many interrupts from devices such as the mouse, keyboard, and system clock will occur and consume a small (and possibly negligible) amount of time.

Scheduling of Process(s) waiting for Semaphore

It is always said that when the count of a semaphore is 0, processes requesting the semaphore are blocked and added to a wait queue.
When some process releases the semaphore and the count increases from 0 to 1, a blocked process is activated. This can be any process, picked seemingly at random from the blocked processes.
Now my question is:
If they are added to a queue, why is the activation of blocked processes NOT in FIFO order? It would be easy to pick the next process from the queue rather than picking one at random and granting it the semaphore. If there is some idea behind this random logic, please explain. Also, how does the kernel select a process at random from the queue? Picking a random element is awkward as far as a queue data structure is concerned.
tags: various OSes, as each has a kernel usually written in C++, and mutexes share a similar concept
A FIFO is the simplest data structure for the waiting list in a system that doesn't support priorities, but it's not the absolute answer otherwise. Depending on the scheduling algorithm chosen, different threads might have different absolute priorities, or some sort of decaying priority might be in effect, in which case the OS might choose the thread which has had the least CPU time in some preceding interval. Since such strategies are widely used (particularly the latter), the usual rule is to consider that you don't know (although with absolute priorities, it will be one of the threads with the highest priority).
When a process is scheduled "at random", it's not that a process is randomly chosen; it's that the selection process is not predictable.
The algorithm used by Windows kernels is that there is a queue of threads (Windows schedules "threads", not "processes") waiting on a semaphore. When the semaphore is released, the kernel schedules the next thread waiting in the queue. However, scheduling the thread does not immediately make that thread start executing; it merely makes the thread able to execute by putting it in the queue of threads waiting to run. The thread will not actually run until a CPU has no threads of higher priority to execute.
While the thread is waiting in the scheduling queue, another thread that is actually executing may wait on the same semaphore. In a traditional queue system, that new thread would have to stop executing and go to the end of the queue waiting in line for that semaphore.
In recent Windows kernels, however, the new thread does not have to stop and wait for that semaphore. If the thread that was assigned the semaphore is still sitting in the run queue, the semaphore may be reassigned to the new thread, causing the old thread to go back to waiting on the semaphore again.
The advantage of this is that the thread that was about to have to wait in the queue for the semaphore and then wait in the queue to run will not have to wait at all. The disadvantage is that you cannot predict which thread will actually get the semaphore next, and it's not fair so the thread waiting on the semaphore could potentially starve.
It is not that it CAN'T be FIFO; in fact, I'd bet many implementations ARE, for just the reasons that you state. The spec isn't that the process is chosen at random; it is that it isn't specified, so your program shouldn't rely on it being chosen in any particular way. (It COULD be chosen at random; just because it isn't the fastest approach doesn't mean it can't be done.)
All of the other answers here are great descriptions of the basic problem - especially around thread priorities and ready queues. Another thing to consider however is IO. I'm only talking about Windows here, since it is the only platform I know with any authority, but other kernels are likely to have similar issues.
On Windows, when an IO completes, something called a kernel-mode APC (Asynchronous Procedure Call) is queued against the thread which initiated the IO in order to complete it. If the thread happens to be waiting on a scheduler object (such as the semaphore in your example) then the thread is removed from the wait queue for that object which causes the (internal kernel mode) wait to complete with (something like) STATUS_ALERTED. Now, since these kernel-mode APCs are an implementation detail, and you can't see them from user mode, the kernel implementation of WaitForMultipleObjects restarts the wait at that point which causes your thread to get pushed to the back of the queue. From a kernel mode perspective, the queue is still in FIFO order, since the first caller of the underlying wait API is still at the head of the queue, however from your point of view, way up in user mode, you just got pushed to the back of the queue due to something you didn't see and quite possibly had no control over. This makes the queue order appear random from user mode. The implementation is still a simple FIFO, but because of IO it doesn't look like one from a higher level of abstraction.
I'm guessing a bit more here, but I would have thought that unix-like OSes have similar constraints around signal delivery and places where the kernel needs to hijack a process to run in its context.
Now this doesn't always happen, but the documentation has to be conservative, and unless the order is explicitly guaranteed to be FIFO (which, as described above, it can't be - for Windows at least), the ordering is described in the documentation as "random" or "undocumented" or something similar, because an unpredictable process controls it. It also gives the OS vendors latitude to change the ordering at some later time.
Process scheduling algorithms are very specific to system functionality and operating system design. It will be hard to give a good answer to this question. If I am on a general PC, I want something with good throughput and average wait/response time. If I am on a system where I know the priority of all my jobs and know I absolutely want all my high priority jobs to run first (and don't care about preemption/starvation), then I want a Priority algorithm.
As far as a random selection goes, the motivation could be various things, one being an attempt at good throughput, as mentioned above. However, it would be non-deterministic (hypothetically) and impossible to prove. This property could be an exploitation of probability (random samples, etc.), but, again, the proofs could only be based on empirical data about whether this really works.

Intel Thread Building Blocks Concurrent Queue: Using pop() over pop_if_present()

What is the difference in using the blocking call pop() as compared to,
while(pop_if_present(...))
Which should be preferred over the other? And why?
I am looking for a deeper understanding of the tradeoff between polling yourself as in the case of while(pop_if_present(...)) with respect to letting the system doing it for you. This is quite a general theme. For example, with boost::asio I could do a myIO.run() which blocks or do the following:
while (1)
{
    myIO.poll();
}
One possible explanation is that the thread that invokes while(pop_if_present(...)) will remain busy, so this is bad. But someone or something has to poll for the async event. Why and how can this be cheaper when it is delegated to the OS or the library? Is it because the OS or the library is smart about polling, for example by doing an exponential backoff?
Intel's TBB library is open source, so I took a look...
It looks like pop_if_present() essentially checks if the queue is empty and returns immediately if it is. If not, it attempts to get the element on the top of the queue (which might fail, since another thread may have come along and taken it). If it misses, it performs an "atomic_backoff" pause before checking again. The atomic_backoff will simply spin the first few times it's called (doubling its spin loop count each time), but after a certain number of pauses it'll just yield to the OS scheduler instead of spinning on the assumption that since it's been waiting a while, it might as well do it nicely.
For the plain pop() function, if there isn't anything in the queue, it will perform atomic_backoff waits until there is something in the queue for it to get.
Note that there are at least 2 interesting things (to me anyway) about this:
the pop() function performs spin waits (up to a point) for something to show up in the queue; it's not going to yield to the OS unless it has to wait for more than a short moment. So, as you might expect, there's not much reason to spin yourself calling pop_if_present() unless you have something else you're going to do between calls to pop_if_present()
when pop() does yield to the OS, it does so by simply giving up its time slice. It doesn't block the thread on a synchronization object that can be signaled when an item is placed on the queue - it seems to go into a sleep/poll cycle to check the queue for something to pop. This surprised me a little.
Take this analysis with a grain of salt... The source I used for this analysis might be a bit old (it's actually from concurrent_queue_v2.h and .cpp) because the more recent concurrent_queue has a different API - there's no pop() or pop_if_present(), just a try_pop() function in the latest class concurrent_queue interface. The old interface has been moved (possibly changed somewhat) to the concurrent_bounded_queue class. It appears that the newer concurrent_queues can be configured when the library is built to use OS synchronization objects instead of busy waits and polling.
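A rough illustration of the backoff pattern described above (this is not TBB's atomic_backoff, just the same idea in plain C++11):

#include <thread>

// Spin with an exponentially growing pause count for a while,
// then start yielding the rest of the time slice to the OS.
class backoff {
    int count_ = 1;
    static const int spin_limit = 16;
public:
    void pause() {
        if (count_ <= spin_limit) {
            for (volatile int i = 0; i < count_; ++i) {
                // busy-wait; a real implementation would issue a CPU "pause" hint here
            }
            count_ *= 2;  // double the spin count each time
        } else {
            std::this_thread::yield();  // been waiting a while: be nice to the scheduler
        }
    }
};

A consumer loop would then look something like: backoff b; while (!q.pop_if_present(item)) b.pause();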
With the while(pop_if_present(...)) you are doing brute-force busy wait (also called spinning) on the queue. When the queue is empty you waste cycles by keeping CPU busy until either an item is pushed into the queue by another thread running on different CPU, or OS deciding to give your CPU to some other, possibly unrelated thread/process.
You can see how this could be bad if you have only one CPU - the producer thread would not be able to push (and thus stop the consumer's spinning) until at least the end of the consumer's time quantum, plus the overhead of a context switch. Clearly a mistake.
With multiple CPUs this might be better if the OS selects (or you enforce) the producer thread to run on different CPU. This is the basic idea of spin-lock - a synchronization primitive built directly on special processor instructions such as compare-and-swap or load-linked/store conditional and commonly used inside the operating system to communicate between interrupt handlers and rest of the kernel, and to build higher level constructs such as semaphores.
With a blocking pop(), if the queue is empty you enter a sleep wait, i.e. you ask the OS to put the consumer thread into a non-schedulable state until an event - a push onto the queue - occurs from another thread. The key here is that the processor is available for other (hopefully useful) work. The TBB implementation actually tries hard to avoid the sleep, since it's expensive (entering the kernel, rescheduling, etc.). The goal is to optimize the normal case where the queue is not empty and the item can be retrieved quickly.
The choice is really simple though - always sleep-wait, i.e. do a blocking pop(), unless you have to busy-wait (and that is in real-time systems, OS interrupt context, and some very specialized applications).
Hope this helps a bit.
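For reference, a minimal sketch of what a sleep-waiting (blocking) pop looks like when built on standard primitives rather than TBB; the waiting consumer is descheduled by the OS until a producer notifies it:

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class blocking_queue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();  // wake one sleeping consumer, if any
    }

    // blocking pop: the calling thread sleeps while the queue is empty
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }

    // non-blocking variant, comparable to pop_if_present()/try_pop()
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};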