I have two different computational tasks that have to execute at certain frequencies. One has to be performed every 1ms and the other every 13.3ms. The tasks share some data.
I am having a hard time figuring out how to schedule these tasks and how to share data between them. One way that I thought might work is to create two threads, one for each task.
The first task is relatively simpler and can be handled in 1ms itself. But, when the second task (that is relatively more time-consuming) is going to launch, it will make a copy of the data that was just used by task 1, and continue to work on them.
Do you think this would work? How can it be done in C++?
There are multiple ways to do that in C++.
One simple way is to have 2 threads, as you described. Each thread does its action and then sleeps till the next period start. A working example:
#include <functional>
#include <iostream>
#include <chrono>
#include <thread>
#include <atomic>
#include <mutex>
std::mutex mutex;
std::atomic<bool> stop = {false};
unsigned last_result = 0; // Whatever thread_1ms produces.
void thread_1ms_action() {
// Do the work.
// Update the last result.
{
std::unique_lock<std::mutex> lock(mutex);
++last_result;
}
}
void thread_1333us_action() {
// Copy thread_1ms result.
unsigned last_result_copy;
{
std::unique_lock<std::mutex> lock(mutex);
last_result_copy = last_result;
}
// Do the work.
std::cout << last_result_copy << '\n';
}
void periodic_action_thread(std::chrono::microseconds period, std::function<void()> const& action) {
auto const start = std::chrono::steady_clock::now();
while(!stop.load(std::memory_order_relaxed)) {
// Do the work.
action();
// Wait till the next period start.
auto now = std::chrono::steady_clock::now();
auto iterations = (now - start) / period;
auto next_start = start + (iterations + 1) * period;
std::this_thread::sleep_until(next_start);
}
}
int main() {
std::thread a(periodic_action_thread, std::chrono::milliseconds(1), thread_1ms_action);
std::thread b(periodic_action_thread, std::chrono::microseconds(13333), thread_1333us_action);
std::this_thread::sleep_for(std::chrono::seconds(1));
stop = true;
a.join();
b.join();
}
If an action takes longer than one period to execute, the thread sleeps until the start of the next period, skipping one or more periods. That is, every action is scheduled for a time of the form start_time + N * period, so there is no accumulated time drift regardless of how long the action takes.
All access to the shared data is protected by the mutex.
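The drift-free rule used in the loop above can be isolated into a small helper. The following is a minimal sketch, assuming a hypothetical function name `next_period_start` (it is not part of the answer's code); it reproduces the same arithmetic so that it can be checked with fixed time points instead of a running clock.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: returns the first period boundary strictly after `now`,
// where the boundaries are start, start + period, start + 2 * period, ...
Clock::time_point next_period_start(Clock::time_point start,
                                    Clock::time_point now,
                                    std::chrono::microseconds period) {
    assert(period.count() > 0 && now >= start);
    // Integer division: the number of whole periods that have fully elapsed.
    auto elapsed_periods = (now - start) / period;
    return std::chrono::time_point_cast<Clock::duration>(
        start + (elapsed_periods + 1) * period);
}
```

With a 1ms period, a call made 300us after the start yields start + 1ms, and a call made 2.5ms after the start yields start + 3ms, which is the "skip the missed periods" behaviour described above.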
So I'm thinking that task1 needs to make the copy, because it knows when it is safe to do so. Here is one simplistic model:
Shared:
atomic<Result*> latestResult = {nullptr};
Task1:
Perform calculation
Result* pNewResult = new Result
Copy result to *pNewResult
Result* pOld = latestResult.exchange(pNewResult)
if (pOld)
delete pOld; // Task2 didn't take it!
Task2:
Result* pResult = latestResult.exchange(nullptr);
if (pResult) process result // null if Task1 has not published anything new
delete pResult;
In this model task1 and task2 only ever contend when swapping a simple atomic pointer, which is quite painless.
Note that this makes many assumptions about your calculation. Could your task1 usefully calculate the result straight into the buffer, for example? Also note that at the start Task2 may find the pointer is still null.
Also it inefficiently new()s the buffers. You need 3 buffers to ensure there is never any significant contention between the tasks, but you could just manage three buffer pointers under mutexes, so that Task 1 has one set of data ready and is writing another set, while Task 2 is reading from a third set.
Note that even if you have task 2 copy the buffer, Task 1 still needs 2 buffers to avoid stalls.
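The single-pointer model above can be written as compilable C++ using `std::atomic<T*>::exchange` for the swap step. This is only a sketch under assumptions: `Result` is a placeholder payload, and the class name `LatestResult` and its `publish`/`take` members are hypothetical. It still allocates a buffer per result, so it does not address the buffer-reuse point made above.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

struct Result { unsigned value; };  // placeholder payload (assumption)

// Single-slot mailbox: the producer overwrites, the consumer takes.
class LatestResult {
public:
    ~LatestResult() { delete slot_.exchange(nullptr); }

    // Task 1: publish a new result; an older result nobody took is freed.
    void publish(std::unique_ptr<Result> fresh) {
        assert(fresh != nullptr);
        Result* old = slot_.exchange(fresh.release(), std::memory_order_acq_rel);
        delete old;  // non-null only if Task 2 did not take it
    }

    // Task 2: take the latest result, or nullptr if nothing new was published.
    std::unique_ptr<Result> take() {
        return std::unique_ptr<Result>(
            slot_.exchange(nullptr, std::memory_order_acq_rel));
    }

private:
    std::atomic<Result*> slot_{nullptr};
};
```

Task 2 has to handle the `nullptr` case, which covers both the start-up situation and the case where it polls faster than Task 1 publishes.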
You can use C++ threads and facilities such as std::thread and clocks such as steady_clock, as described in the previous answer, but whether this solution works depends strongly on the platform your code is running on.
1ms and 13.3ms are pretty short time intervals, and if your code is running on a non-real-time OS like Windows or non-RTOS Linux, there is no guarantee that the OS scheduler will wake up your threads at exact times.
C++11 has the class high_resolution_clock, which should use a high-resolution timer if your platform supports one, but that still depends on the implementation of the class. A bigger problem than the timer is the C++ wait functions: neither sleep_until nor sleep_for guarantees that it will wake up your thread at the specified time. Here is a quote from the C++ documentation:
sleep_for - blocks the execution of the current thread for at least the specified sleep_duration.
Fortunately, most operating systems have special facilities, like Windows Multimedia Timers, that you can use if your threads are not woken up at the expected times.
Here are more details: Precise thread sleep needed. Max 1ms error
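Whether the standard sleep functions are precise enough on a given machine can be checked empirically before reaching for OS-specific timers. A minimal sketch, assuming a hypothetical helper named `worst_oversleep` (not from the answers here), which records how far `sleep_for` overshoots the requested duration:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: returns the worst observed overshoot of
// sleep_for(requested) over `samples` attempts.
std::chrono::microseconds worst_oversleep(std::chrono::microseconds requested,
                                          int samples) {
    assert(samples > 0);
    using Clock = std::chrono::steady_clock;
    std::chrono::microseconds worst{0};
    for (int i = 0; i < samples; ++i) {
        auto before = Clock::now();
        std::this_thread::sleep_for(requested);
        auto slept = std::chrono::duration_cast<std::chrono::microseconds>(
            Clock::now() - before);
        if (slept - requested > worst) worst = slept - requested;
    }
    return worst;
}
```

Calling `worst_oversleep(std::chrono::microseconds(1000), 1000)` on the target machine, ideally under realistic load, gives a rough idea of whether a 1ms period is achievable with plain sleeps. The numbers vary by OS, hardware, and load, so none are quoted here.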
I will say up front that high throughput is essential and that calling ExecutePackets is very expensive.
The ExecutePackets function needs to process many packets in parallel, submitted from different threads.
struct Packet {
bool responseStatus;
char data[1024];
};
struct PacketPool {
int packet_count;
Packet* packets[10];
}packet_pool;
std::mutex queue_mtx;
std::mutex request_mtx;
bool ParallelExecutePacket(Packet* p_packet) {
p_packet->responseStatus = false;
struct QueuePacket {
bool executed;
Packet* p_packet;
}queue_packet{ false, p_packet };
static std::list<std::reference_wrapper<QueuePacket>> queue;
//make queue
queue_mtx.lock();
queue.push_back(queue_packet);
queue_mtx.unlock();
request_mtx.lock();
if (!queue_packet.executed)
{
ZeroMemory(&packet_pool, sizeof(packet_pool));
//move queue to packet_pool and clear queue
queue_mtx.lock();
auto iter = queue.begin();
while (iter != queue.end())
if (!(*iter).get().executed)
{
int current_count = packet_pool.packet_count++;
packet_pool.packets[current_count] = (*iter).get().p_packet;
(*iter).get().executed = true;
queue.erase(iter++);
}
else ++iter;
queue_mtx.unlock();
//execute packets
ExecutePackets(&packet_pool);
}
request_mtx.unlock();
return p_packet->responseStatus;
}
The ParallelExecutePacket function can be called from multiple threads at the same time. I want packets to be processed in batches: whichever thread gets to run should process the entire queue. That way the number of ExecutePackets calls is reduced without reducing the number of packets processed.
However, in my code with multiple threads, the total number of packets processed is equal to the number of packets processed by one thread. And I don't understand why this is happening.
In my test, I created several threads and in each thread called ParallelExecutePacket in a loop.
The results are the number of processed requests per second.
Multithread:
Summ:91902
Thread 0 : 20826
Thread 1 : 40031
Thread 2 : 6057
Thread 3 : 12769
Thread 4 : 12219
Singlethread:
Summ:104902
Thread 0 : 104902
And if my version cannot work, how do I implement what I need?
queue_mtx.lock();
auto iter = queue.begin();
while (iter != queue.end())
queue.erase(iter++);
queue_mtx.unlock();
Only one execution thread locks the queue at a time, drains all messages from it, and then unlocks it. Even if a thousand execution threads are available here only one of them will be able to do any work. All others get blocked.
The length of time queue_mtx is held must be minimized. It should be held no longer than the absolute minimum needed to pluck one message out of the queue and remove it completely; the queue should then be unlocked while all the real work is done.
int current_count = packet_pool.packet_count++;
packet_pool.packets[current_count] = (*iter).get().p_packet;
This appears to be the extent of the work that's done here. Currently the shown code is protected by queue_mtx. If it is no longer protected by that mutex, thread safety must be implemented here in some other way, if it's needed at all (it's unclear what any of this is, and whether there's a thread synchronization issue here).
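One common way to keep the lock hold time short while still taking a whole batch is to swap the shared container into a local one under the lock and do the expensive work afterwards. The sketch below assumes a hypothetical `PacketQueue` class and uses a `std::vector` instead of the question's `std::list`; the `Packet` struct is copied from the question.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Packet {
    bool responseStatus;
    char data[1024];
};

// Hypothetical queue: producers push, one caller at a time drains a batch.
class PacketQueue {
public:
    void push(Packet* p) {
        assert(p != nullptr);
        std::lock_guard<std::mutex> lock(mtx_);
        pending_.push_back(p);
    }

    // Takes everything queued so far. The lock is held only for the swap,
    // not while the batch is being processed.
    std::vector<Packet*> drain() {
        std::vector<Packet*> batch;
        std::lock_guard<std::mutex> lock(mtx_);
        batch.swap(pending_);
        return batch;
    }

private:
    std::mutex mtx_;
    std::vector<Packet*> pending_;
};
```

The caller would then pass the returned batch to the expensive processing step (ExecutePackets in the question) with no queue lock held, so other threads can keep pushing in the meantime.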
You hold request_mtx for the entire if block, and that block includes the call to ExecutePackets, so your thread blocks all of the others until it has finished executing every packet it found.
Also note that you won't actually see any speedup from this style of parallelism. To have n threads of parallelism with this code, you need to have n callers calling into ParallelExecutePacket. This is exactly the same parallelism that would happen if you just let each one work on its own. Indeed, statistically speaking, you will find that almost always every thread just runs its own task. Every now and then you'll get thread contention that causes one thread to execute another's task. When this occurs, both threads slow down to the slower of the two.
I want to generate an interrupt every 100 microseconds on Windows. I couldn't do this with the OS facilities, because Windows does not guarantee timer intervals shorter than 500 microseconds. So I created two threads: one is a timer that polls QueryPerformanceCounter, and the other does the actual work. When the timer reaches 100 microseconds, it changes the state of the worker thread. But I have a problem with race conditions, because I don't want the threads to wait for each other; they must always be running. So what I actually need is interrupts. How do I write such a fast interrupt on Windows with C++?
To avoid having two threads communicating when you have these short time windows, I'd put both the work and the timer in a loop in one thread.
Take a sample of the clock when the thread starts and add 100μs to that each loop.
Sleep until the calculated time occurs. Normally, one would use std::this_thread::sleep_until to do such a sleep, but in this case, when the naps are so short, it often becomes a little too inaccurate, so I suggest busy-waiting in a tight loop that just checks the time.
Do your work.
In this example a worker thread runs for 10s without doing any real work. On my machine I could add work consisting of ~3000 additions in the slot where you are supposed to do your work before the whole loop started taking more than 100μs, so you'd better do what you aim to do really fast.
Example:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
using namespace std::chrono_literals;
static std::atomic<bool> running = true;
using myclock = std::chrono::steady_clock;
void worker() {
int loops = 0;
auto sleeper = myclock::now();
while(running) {
++loops; // count loops to check that it's good enough afterwards
// add 100us to the sleeper time_point
sleeper += 100us;
// busy-wait until it's time to do some work
while(myclock::now() < sleeper);
// do your work here
}
std::cout << loops << " (should be ~100000)\n";
}
int main() {
auto th = std::thread(worker);
// let the thread work for 10 seconds
std::this_thread::sleep_for(10s);
running = false;
th.join();
}
Possible output:
99996 (should be ~100000)
It takes a few clock cycles to get the thread started so don't worry about the number of loops not being exactly on target. Double the time the thread runs and you should still stay close to the target number of loops. What matters is that it's pretty good (but not realtime-good) once it's started running.
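A pure busy-wait keeps one core fully occupied. For longer periods (for example 1ms rather than 100μs), a possible variation is to sleep for most of the wait and spin only for the last stretch. This is not from the answer above; it is a sketch with a hypothetical helper `wait_until_hybrid`, and a suitable `spin_margin` would have to be found by measurement on the target machine.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: sleeps until `spin_margin` before the deadline,
// then busy-waits so that an imprecise sleep does not overshoot the deadline.
void wait_until_hybrid(Clock::time_point deadline,
                       std::chrono::microseconds spin_margin) {
    assert(spin_margin.count() >= 0);
    if (deadline - Clock::now() > spin_margin) {
        std::this_thread::sleep_until(deadline - spin_margin);
    }
    while (Clock::now() < deadline) {
        // busy-wait for the remaining time
    }
}
```

If the sleep overshoots by more than `spin_margin`, the deadline is still missed, so for very short periods such as 100μs the plain busy-wait from the example remains the safer choice.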
I'm writing a C++ ThreadPool implementation and using pthread_cond_wait in my worker's main function. I was wondering how much time passes between signaling the condition variable and the waiting thread or threads waking up.
Do you have any idea how I can estimate or calculate this time?
Thank you very much.
It depends on the cost of a context switch, which in turn depends on:
the OS
the CPU
whether the switch is to a thread in the same process or to a different process
the load of the machine
whether the thread resumes on the same core it last ran on
the working set size
the time since it last ran
Linux best case, i7, 1100ns, thread in same process, same core as it ran in last, ran as the last thread, no load, working set 1 byte.
Bad case, flushed from cache, different core, different process, just expect 30µs of CPU overhead.
Where does the cost go:
Save last process context 70-400 cycles,
load new context 100-400 cycles
if different process, flush TLB, reload 3 to 5 page walks, which potentially could be from memory taking ~300 cycles each. Plus a few page walks if more than one page is touched, including instructions and data.
OS overhead, we all like the nice statistics, for example add 1 to context switch counter.
Scheduling overhead, which task to run next
potential cache misses on new core ~12 cycles per cache line on own L2 cache, and downhill from there the farther away the data is and the more there is of it.
As mentioned, the time it takes for a condition variable to react depends on many factors. One option is to actually measure it: start a thread that waits on a condition variable. Then another thread takes a timestamp right before signaling the condition variable, and the waiting thread takes a timestamp the moment it wakes up. This gives a rough approximation of the time it takes for the thread to notice the signaled condition.
#include <mutex>
#include <condition_variable>
#include <thread>
#include <chrono>
#include <stdio.h>
typedef std::chrono::time_point<std::chrono::high_resolution_clock> timep;
int main()
{
std::mutex mx;
std::condition_variable cv;
timep t0, t1;
bool done = false;
std::thread th([&]() {
while (!done)
{
std::unique_lock lock(mx);
cv.wait(lock);
t1 = std::chrono::high_resolution_clock::now();
}
});
for (int i = 0; i < 25; ++i) // measure 25 times
{
std::this_thread::sleep_for(std::chrono::milliseconds(10));
t0 = std::chrono::high_resolution_clock::now();
cv.notify_one();
std::this_thread::sleep_for(std::chrono::milliseconds(10));
std::unique_lock lock(mx);
printf("test#%-2d: cv reaction time: %6.3f micro\n", i,
1000000 * std::chrono::duration<double>(t1 - t0).count());
}
{
std::unique_lock lock(mx);
done = true;
}
cv.notify_one();
th.join();
}
Try it on coliru, it produced this output:
test#0 : cv reaction time: 50.488 micro
test#1 : cv reaction time: 55.057 micro
test#2 : cv reaction time: 53.765 micro
test#3 : cv reaction time: 50.973 micro
test#4 : cv reaction time: 51.015 micro
test#5 : cv reaction time: 57.166 micro
and so on...
On my Windows 11 laptop I got values roughly 5-10x faster (5-10 microseconds).
In the following example (an idealized "game") there are two threads: the main thread, which updates data, and RenderThread, which "renders" it to the screen. What I need is for those two to be synchronized. I cannot afford to run several update iterations without running a render for every single one of them.
I use a condition_variable to sync those two, so ideally the faster thread will spend some time waiting for the slower one. However, condition variables don't seem to do the job if one of the threads completes an iteration in a very short time. It seems to quickly reacquire the lock of the mutex before the wait in the other thread is able to acquire it, even though notify_one is called.
#include <iostream>
#include <thread>
#include <chrono>
#include <atomic>
#include <functional>
#include <mutex>
#include <condition_variable>
using namespace std;
bool isMultiThreaded = true;
struct RenderThread
{
RenderThread()
{
end = false;
drawing = false;
readyToDraw = false;
}
void Run()
{
while (!end)
{
DoJob();
}
}
void DoJob()
{
unique_lock<mutex> lk(renderReadyMutex);
renderReady.wait(lk, [this](){ return readyToDraw; });
drawing = true;
// RENDER DATA
this_thread::sleep_for(chrono::milliseconds(15)); // simulated render time
cout << "frame " << count << ": " << frame << endl;
++count;
drawing = false;
readyToDraw = false;
lk.unlock();
renderReady.notify_one();
}
atomic<bool> end;
mutex renderReadyMutex;
condition_variable renderReady;
//mutex frame_mutex;
int frame = -10;
int count = 0;
bool readyToDraw;
bool drawing;
};
struct UpdateThread
{
UpdateThread(RenderThread& rt)
: m_rt(rt)
{}
void Run()
{
this_thread::sleep_for(chrono::milliseconds(500));
for (int i = 0; i < 20; ++i)
{
// DO GAME UPDATE
// when this is uncommented everything is fine
// this_thread::sleep_for(chrono::milliseconds(10)); // simulated update time
// PREPARE RENDER THREAD
unique_lock<mutex> lk(m_rt.renderReadyMutex);
m_rt.renderReady.wait(lk, [this](){ return !m_rt.drawing; });
m_rt.readyToDraw = true;
// SUPPLY RENDER THREAD WITH DATA TO RENDER
m_rt.frame = i;
lk.unlock();
m_rt.renderReady.notify_one();
if (!isMultiThreaded)
m_rt.DoJob();
}
m_rt.end = true;
}
RenderThread& m_rt;
};
int main()
{
auto start = chrono::high_resolution_clock::now();
RenderThread rt;
UpdateThread u(rt);
thread* rendering = nullptr;
if (isMultiThreaded)
rendering = new thread(bind(&RenderThread::Run, &rt));
u.Run();
if (rendering)
rendering->join();
auto duration = chrono::high_resolution_clock::now() - start;
cout << "Duration: " << double(chrono::duration_cast<chrono::microseconds>(duration).count())/1000 << endl;
return 0;
}
Here is the source of this small example. As you can see, even in ideone's run the output is frame 0: 19, which means that the render thread completed a single iteration while the update thread completed all 20 of its own.
If we uncomment the simulated update time (the commented-out sleep_for in UpdateThread::Run), everything runs fine: every update iteration has an associated render iteration.
Is there a way to really truly sync those threads, even if one of them completes an iteration in mere nanoseconds, but also without having a performance penalty if they both take some reasonable amount of milliseconds to complete?
If I understand correctly, you want the two threads to work alternately: the updater waits until the renderer finishes before iterating again, and the renderer waits until the updater finishes before iterating again. Part of the computation could run in parallel, but the number of iterations must stay matched between the two.
You need 2 locks:
one for the updating
one for the rendering
Updater:
wait (renderingLk)
update
signal(updaterLk)
Renderer:
wait (updaterLk)
render
signal(renderingLk)
EDITED:
Even if it looks simple, there are several problems to solve:
Allowing part of the calculations to run in parallel: as written in the snippet above, update and render are not parallel but sequential, so there is no benefit to using multiple threads. In a real solution, most of the update calculation should be done before the wait, and only the copy of the new values needs to sit between the wait and the signal. The same goes for rendering: only fetching the values belongs between the wait and the signal, and the actual rendering should happen after the signal.
The implementation also needs to take care of the initial state, so that no rendering is performed before the first update.
The termination of both threads, so that neither stays blocked or loops forever after the other terminates.
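The wait/signal scheme above can be sketched with two small semaphores built from a mutex and a condition variable. The names `Semaphore` and `run_alternating` are hypothetical, and the sketch sidesteps the termination problem by giving both threads the same fixed iteration count. The initial state is handled by starting `renderDone` at 1 and `updateDone` at 0.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) { assert(initial >= 0); }
    void signal() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Runs `iterations` update/render pairs; returns the frames in render order.
std::vector<int> run_alternating(int iterations) {
    Semaphore renderDone(1);  // starts signaled: the first update may proceed
    Semaphore updateDone(0);  // starts empty: nothing renders before an update
    int frame = -1;           // shared data, handed over via the semaphores
    std::vector<int> rendered;

    std::thread renderer([&] {
        for (int i = 0; i < iterations; ++i) {
            updateDone.wait();
            rendered.push_back(frame);  // stands in for "render"
            renderDone.signal();
        }
    });
    for (int i = 0; i < iterations; ++i) {
        renderDone.wait();
        frame = i;                      // stands in for "update"
        updateDone.signal();
    }
    renderer.join();
    return rendered;
}
```

As in the pseudocode, this version is fully sequential. Moving the heavy update work before `renderDone.wait()` and the heavy render work after `renderDone.signal()` (on a private copy of the data) is what would allow the two threads to overlap.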
I think a mutex (alone) is not the right tool for the job. You might want to consider using a semaphore (or something similar) instead. What you describe sounds a lot like a producer/consumer problem, i.e., one process is allowed to run once every time another process has finished a task. Therefore you might also have a look at producer/consumer patterns. For example, this series might give you some ideas:
A multi-threaded Producer Consumer with C++11
There, a std::mutex is combined with a std::condition_variable to mimic the behavior of a semaphore, an approach that appears quite reasonable. You would probably not count up and down but rather toggle a boolean variable between true and false, with "needs redraw" semantics.
For reference:
http://en.cppreference.com/w/cpp/thread/condition_variable
C++0x has no semaphores? How to synchronize threads?
This is because you use a separate drawing variable that is only set when the rendering thread reacquires the mutex after a wait, which may be too late. The problem disappears when the drawing variable is removed and the predicate for the wait in the update thread is replaced with !m_rt.readyToDraw (which is already set by the update thread and hence not susceptible to the logical race).
Modified code and results
That said, since the threads do not work in parallel, I don't really get the point of having two threads. Unless you should choose to implement double (or even triple) buffering later.
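The single-flag fix can be reduced to a small self-contained handshake. This is a sketch rather than the modified code referred to above: the function name `run_single_flag` is hypothetical, the render and update work are replaced by one-line stand-ins, and both threads use a fixed iteration count instead of the `end` flag.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reduction of the question's code: one flag, one mutex, one
// condition variable. Returns the frame values in the order they were rendered.
std::vector<int> run_single_flag(int iterations) {
    assert(iterations >= 0);
    std::mutex m;
    std::condition_variable cv;
    bool readyToDraw = false;  // true: a frame is waiting to be rendered
    int frame = -1;
    std::vector<int> rendered;

    std::thread renderer([&] {
        for (int i = 0; i < iterations; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return readyToDraw; });
            rendered.push_back(frame);  // stands in for rendering
            readyToDraw = false;
            lk.unlock();
            cv.notify_one();
        }
    });
    for (int i = 0; i < iterations; ++i) {
        std::unique_lock<std::mutex> lk(m);
        // Wait on the flag this thread sets itself, not on a `drawing` flag
        // that the renderer only sets after it has reacquired the mutex.
        cv.wait(lk, [&] { return !readyToDraw; });
        frame = i;
        readyToDraw = true;
        lk.unlock();
        cv.notify_one();
    }
    renderer.join();
    return rendered;
}
```

Because the updater cannot pass its wait until the renderer has cleared `readyToDraw`, it can no longer overwrite a frame that has not been rendered, however fast an update iteration is.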
A technique often used in computer graphics is to use a double-buffer. Instead of having the renderer and the producer operate on the same data in memory, each one has its own buffer. This is implemented by using two independent buffers, and switch them when needed. The producer updates one buffer, and when it is done, it switches the buffer and fills the second buffer with the next data. Now, while the producer is processing the second buffer, the renderer works with the first one and displays it.
You could use this technique by letting the renderer lock the swap operation such that the producer may have to wait until rendering is finished.
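A minimal double-buffer sketch along these lines, assuming a hypothetical `DoubleBuffer` class that holds two `std::vector<int>` buffers as stand-ins for the render data. The producer fills the back buffer without locking, and the swap takes the same mutex the renderer holds while it reads the front buffer.

```cpp
#include <array>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical double buffer for one producer thread and one renderer thread.
class DoubleBuffer {
public:
    // Producer only: the buffer that is not being rendered. No lock is needed
    // because only the producer thread ever changes front_.
    std::vector<int>& back() { return buffers_[1 - front_]; }

    // Producer only: publish the back buffer. Blocks while the renderer is
    // inside with_front().
    void swap() {
        std::lock_guard<std::mutex> lock(m_);
        assert(front_ == 0 || front_ == 1);
        front_ = 1 - front_;
    }

    // Renderer only: run `render` on the front buffer; swaps wait meanwhile.
    template <class F>
    void with_front(F render) {
        std::lock_guard<std::mutex> lock(m_);
        render(buffers_[front_]);
    }

private:
    std::mutex m_;
    std::array<std::vector<int>, 2> buffers_;
    int front_ = 0;
};
```

The producer does its expensive work while writing into `back()`, so the only point where it may have to wait for the renderer is the short `swap()` call. This sketch does not by itself guarantee one render per update; that still needs a handshake such as a flag or semaphore.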
I have been struggling for days to come up with a mechanism for launching a few timers without having them block the main program execution. I have tried combinations of .join() and .detach(), wait_until(), etc.
What I have is a vector of std::thread and I want to:
execute the first position
wait for it to finish
execute the next position
wait for it to finish
meanwhile the rest of my app is running along, users clicking things, etc. Everything I come up with seems to either:
block the main program from running while the timers are going
or
detach from the main thread, but then the timers run concurrently, whereas I want each one to start only after the previous one has finished.
I even posted: C++11 std::threads and waiting for threads to finish but no resolution that I can seem to make sense of either.
Should I be using std::launch::async maybe?
EDIT: I am not sure why this is so hard for me to grasp. I mean video games do this all the time. Take Tiny Tower for example. You stock your floors and each one of those operations has a delay from when you start the stock, until when that item is stocked and it triggers a HUD that pops up and says, "Floor is now stocked". Meanwhile the whole game stays running for you to do other things. I must be dense because I cannot figure this out.
This snippet of code will execute a std::vector of nullary tasks in a separate thread.
#include <chrono>
#include <functional>
#include <future>
#include <utility>
#include <vector>
typedef std::vector<std::function< void() >> task_list;
typedef std::chrono::high_resolution_clock::duration timing;
typedef std::vector< timing > timing_result;
timing_result do_tasks( task_list list ) {
timing_result retval;
for (auto&& task: list) {
std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
task();
std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
retval.push_back( end-start );
}
return retval;
}
std::future<timing_result> execute_tasks_in_order_elsewhere( task_list list ) {
return std::async( std::launch::async, do_tasks, std::move(list) );
}
This should run each of the tasks in series outside the main thread, and return a std::future that contains the timing results.
If you want the timing results in smaller chunks (ie, before they are all ready), you'll have to do more work. I'd start with std::packaged_task and return a std::vector<std::future< timing >> and go from there.
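One way the std::packaged_task idea might look is sketched below. The names `timed_job`, `progress_handle`, and `execute_tasks_with_progress` are hypothetical. Each task is wrapped in a packaged_task that returns its own run time, and the wrapped tasks are run in order by a single std::async call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <utility>
#include <vector>

typedef std::vector<std::function<void()>> task_list;
typedef std::chrono::high_resolution_clock::duration timing;
typedef std::packaged_task<timing()> timed_job;

struct progress_handle {
    std::vector<std::future<timing>> timings;  // ready one by one, in task order
    std::future<void> worker;                  // ready when the whole list is done
};

progress_handle execute_tasks_with_progress(task_list list) {
    std::vector<timed_job> jobs;
    std::vector<std::future<timing>> timings;
    for (auto& task : list) {
        assert(static_cast<bool>(task));  // empty std::function objects are rejected
        jobs.emplace_back([task]() -> timing {
            auto start = std::chrono::high_resolution_clock::now();
            task();
            return std::chrono::high_resolution_clock::now() - start;
        });
        timings.push_back(jobs.back().get_future());
    }
    // The helper thread takes ownership of the jobs and runs them in series.
    std::future<void> worker = std::async(
        std::launch::async,
        [](std::vector<timed_job> owned) { for (auto& job : owned) job(); },
        std::move(jobs));
    return progress_handle{std::move(timings), std::move(worker)};
}
```

The caller can poll or wait on `timings[i]` to learn when task i has finished, without waiting for the rest of the list. Keeping `worker` alive matters, because the future returned by std::async blocks in its destructor until the helper thread is done.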
The above code is untested/uncompiled, but shouldn't have any fundamental flaws.
You'll note that the above does not use std::thread. std::thread is a low level tool that you should build tools on top of, not something you should use directly (it is quite fragile due to the requirement that it be joined or detached prior to destruction, among other things).
While std::async is nothing to write home about, it is great for quick-and-dirty multiple threading, where you want to take a serial task and do it "somewhere else". The lack of decent signaling via std::future makes it less than completely general (and is a reason why you might want to write higher level abstractions around std::thread).
Here is one that will run a sequence of tasks with a minimum amount of delay between them:
#include <chrono>
#include <iostream>
#include <vector>
#include <functional>
#include <thread>
#include <future>
typedef std::chrono::high_resolution_clock::duration duration;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::vector<std::pair<duration, std::function< void() >>> delayed_task_list;
void do_delayed_tasks( delayed_task_list list ) {
time_point start = std::chrono::high_resolution_clock::now();
time_point last = start;
for (auto&& task: list) {
time_point next = last + task.first;
duration wait_for = next - std::chrono::high_resolution_clock::now();
std::this_thread::sleep_for( wait_for );
task.second();
last = next;
}
}
std::future<void> execute_delayed_tasks_in_order_elsewhere( delayed_task_list list ) {
return std::async( std::launch::async, do_delayed_tasks, std::move(list) );
}
int main() {
delayed_task_list meh;
meh.emplace_back( duration(), []{ std::cout << "hello world\n"; } );
std::future<void> f = execute_delayed_tasks_in_order_elsewhere( meh );
f.wait(); // wait for the task list to complete: you can instead store the `future`
}
which should make the helper async thread sleep for (at least as long as) the durations you use before running each task. As written, time taken to execute each task is not counted towards the delays, so if the tasks take longer than the delays, you'll end up with the tasks running with next to no delay between them. Changing that should be easy, if you want to.
Your trouble is understandable, because what you need in order to have timers that don't block your event loop, is an event loop, and C++ doesn't yet have a standard one. You need to use other frameworks (such as Qt, Boost.Asio(?) or non-portable APIs (select(), etc)) to write event loops.
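If the application already has some kind of main loop, one low-tech option is to poll the `std::future` once per iteration with a zero timeout instead of blocking on it. The sketch below is an illustration only: the function name `run_main_loop_until_done` is hypothetical, the loop body is a stand-in for real event handling, and the future is assumed to come from `std::async` with `std::launch::async` (a deferred future would never report ready here).

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for an application main loop: keeps iterating until
// the background work is finished; returns how many iterations ran meanwhile.
int run_main_loop_until_done(std::future<void>& background) {
    assert(background.valid());
    int iterations = 0;
    // wait_for with a zero timeout only checks the state; it does not block.
    while (background.wait_for(std::chrono::seconds(0)) !=
           std::future_status::ready) {
        ++iterations;  // handle input, redraw, and so on
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // one "frame"
    }
    background.get();  // rethrows any exception from the background work
    return iterations;
}
```

A future such as the one returned by `execute_delayed_tasks_in_order_elsewhere` above could be passed to this function, so the timers run in series on the helper thread while the loop stays responsive.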