I'm currently writing code for a simulator to sync with ROS time.
Essentially, the problem becomes "write a get_time and sleep that scale according to ROS time". Doing this means no changes to the codebase are needed beyond linking against the custom get_time and sleep. get_time seems to work perfectly; however, I've been having trouble getting the sleep to run accurately.
My current design is like this (code attached at the bottom):
Thread calls sleep
Sleep will add the time when to unlock this thread (current_time + sleep_time) into a priority queue, and then wait on a condition variable.
A separate thread (let's call it watcher) will constantly loop and check the top of the queue; if the top of the prio queue <= current time, it calls notify_all on the condition variable and then pops the prio queue
However, it seems like the watcher thread is not accurate enough (I see discrepancies of 0~50ms), meaning the sleep calls make the threads sleep too long sometimes. I also visibly notice lag/jagged behavior in the simulator compared to if I were to replace the sleep with a usleep(1000*ms).
Unfortunately, I'm not too experienced at these types of designs, and I feel like there are lots of ways to optimize/rewrite this to make it run more accurately.
So my question is, are condition variables the right way? Am I even using them correctly? Here are some things I tried:
reduce the number of unnecessary notify_all calls by having an array of condition variables and assigning them based on time like this: (ms/100)%256. The idea being that close together times will share the same cv because they are likely to actually wake up from the notify_all. This made the performance worse
keep the threads and prio_queue pushing etc. but instead use usleep. I found out that the usleep will make it work so much better, which probably means the mutex, locking, and pushing/popping operations do not contribute to a noticeable amount of lag, meaning it must be in the condition variable part
Code:
Watcher (this is run on startup)
void watcher()
{
    while (true)
    {
        usleep(1);
        {
            std::lock_guard<std::mutex> lk(m_queue);
            if (prio_queue.empty())
                continue;
            if (get_time_in_ms() >= prio_queue.top())
            {
                cv.notify_all();
                prio_queue.pop();
            }
        }
    }
}
Sleep
void sleep(int ms)
{
    int wakeup = get_time_in_ms() + ms;
    {
        std::lock_guard<std::mutex> lk(m_queue);
        prio_queue.push(wakeup);
    }
    std::unique_lock<std::mutex> lk(m_time);
    cv.wait(lk, [wakeup] {return get_time_in_ms() >= wakeup;});
    lk.unlock();
}
Any help would be appreciated.
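For comparison, here is a minimal sketch of a sleep that needs no polling watcher: each sleeper converts its remaining simulated time into wall time and does a timed wait on the condition variable. SimClock, its fixed scale factor, and the wake_all hook are names assumed for illustration only; this is not the ROS API and not the code above.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical simulated clock: sim time runs `scale` times as fast as wall time.
class SimClock {
public:
    explicit SimClock(double scale)
        : scale_(scale), start_(std::chrono::steady_clock::now()) {}

    // Simulated milliseconds elapsed since construction.
    double now_ms() const {
        const auto wall = std::chrono::steady_clock::now() - start_;
        return std::chrono::duration<double, std::milli>(wall).count() * scale_;
    }

    // Block the calling thread for `ms` simulated milliseconds.
    void sleep_ms(double ms) {
        const double wakeup = now_ms() + ms;
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            const double remaining = wakeup - now_ms();
            if (remaining <= 0.0) break;
            // Timed wait for the remaining sim time, converted to wall time.
            const std::chrono::duration<double, std::milli> wall(remaining / scale_);
            cv_.wait_for(lk, std::chrono::duration_cast<std::chrono::microseconds>(wall));
        }
    }

    // Wake every sleeper so it re-checks its deadline (e.g. after a time jump).
    void wake_all() { cv_.notify_all(); }

private:
    const double scale_;
    const std::chrono::steady_clock::time_point start_;
    std::mutex m_;
    std::condition_variable cv_;
};
```

With this shape the accuracy of a sleep depends on the OS timer behind wait_for rather than on how often a watcher thread gets scheduled. Whether that removes the 0~50ms discrepancy would still need to be measured.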
In short: Does an un-delayed while loop consume significant processing power, compared to a similar loop which is slowed down by a delay?
In not-so-short:
I have run into this question more often. I am writing the core part of a program (either microcontroller unit or computer application) and it consists of a semi-infinite while loop to stay alive and look for events.
I will take this example: I have a small application that uses an SDL window and the console. In a while loop I would like to listen to events for this SDL window, but I would also like to break this loop according to the command line input by means of a global variable. Possible solution (pseudo-code):
// Global
bool running = true;
// ...
while (running)
{
    if (getEvent() == quit)
    {
        running = false;
    }
}
shutdown();
The core while loop will quit from the listened event or something external. However, this loop is run continuously, maybe even a 1000 times per second. That's a little over-kill, I don't need that response time. Therefore I often add a delaying statement:
while (running)
{
    if (getEvent() == quit)
    {
        running = false;
    }
    delay(50); // Wait 50 milliseconds
}
This limits the refresh rate to 20 times per second, which is plenty.
So. Is there a real difference between the two? Is it significant? Would it be more significant on the microcontroller unit (where processing power is very limited (but nothing else besides the program needs to run...))?
Well, in fact it's not a question about C++, but rather the answer depends on CPU architecture / Host OS / delay() implementation.
If it's a multi-tasking environment then delay() could (and probably will) help the OS scheduler do its job more effectively. However, the real difference could be too small to notice (except under old cooperative multi-tasking, where delay() is a must).
If it's a single-task environment (possibly some microcontroller) then delay() could still be useful if the underlying implementation is able to execute some dedicated low power consumption instructions instead of your ordinary loop. But, of course, there's no guarantee it will, unless your manual explicitly states so.
Considering performance issues, well, it's obvious that you can receive and process an event with a significant delay (or even miss it completely), but if you believe that's not the case then there are no other cons against delay().
You will make your code much harder to read and you are doing asynchronism the old-style way: you explicitly wait for something to happen, instead of relying on a mechanism that does the job for you.
Also, you delay by 50ms. Is it always optimal? Does it depend on which programs are running?
In C++11 you can use condition_variable. This allows you to wait for an event to happen, without coding the waiting loops.
Documentation here:
http://en.cppreference.com/w/cpp/thread/condition_variable
I have adapted the example to make it simpler to understand. Just waiting for a single event.
Here is an example for you, adapted to your context
// Example program
#include <iostream>
#include <string>
#include <thread>
#include <mutex>
#include <chrono>
#include <condition_variable>

std::mutex m;
std::condition_variable cv;
std::string data;
bool processed = false;
using namespace std::chrono_literals;

void worker_thread()
{
    // Take the lock before touching the shared state
    std::unique_lock<std::mutex> lk(m);
    std::cout << "Worker thread starts processing data\n";
    std::this_thread::sleep_for(10s); // simulates the work
    data += " after processing";
    // Send data back to main()
    processed = true;
    std::cout << "Worker thread signals data processing completed" << std::endl;
    std::cout << "Corresponds to your getEvent()==quit" << std::endl;
    // Manual unlocking is done before notifying, to avoid waking up
    // the waiting thread only to block again (see notify_one for details)
    lk.unlock();
    cv.notify_one();
}

int main()
{
    data = "Example data";
    std::thread worker(worker_thread);
    // wait for the worker
    {
        std::unique_lock<std::mutex> lk(m);
        // this means I wait for the processing to be finished and I will be woken when it is done.
        // No explicit waiting
        cv.wait(lk, []{return processed;});
    }
    std::cout << "data processed" << std::endl;
    // a joinable std::thread must be joined before it is destroyed
    worker.join();
}
In my experience, you must do something that will relinquish the processor. sleep works OK, and on most Windows systems even sleep(1) is adequate to completely unload the processor in a loop.
You can get the best of all worlds, however, if you use something like std::condition_variable. It is possible to come up with constructions using condition variables (similar to 'events' and WaitForSingleObject in Windows API).
One thread can block on a condition variable that is notified by another thread. This way, one thread can do condition_variable.wait_for(lock, some_time), and it will either wait for the timeout period (without loading the processor), or it will continue execution immediately when another thread notifies it.
I use this method where one thread is sending messages to another thread. I want the receiving thread to respond as soon as possible, not after waiting for a sleep(20) to complete. The receiving thread has a condition_variable.wait_for(lock, 20ms), for example. The sending thread sends a message, and does a corresponding condition_variable.notify_one(). The receiving thread will immediately wake up and process the message.
This solution gives very fast response to messages, and does not unduly load the processor.
If you don't care about portability, and you happen to be using Windows, events and WaitForSingleObject do the same thing.
Your loop would look something like:
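// mtx is the mutex that guards the message queue (name assumed here)
std::unique_lock<std::mutex> lock(mtx);
while(!done)
{
    // returns on a notification or after 20 ms, whichever comes first
    cond_var.wait_for(lock, std::chrono::milliseconds(20));
    // process messages...
    msg = dequeue_message();
    if(msg == done_message)
        done = true;
    else
        process_message(msg);
}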
In another thread...
void send_message(string msg)
{
    enqueue_message(msg);
    cond_var.notify_one();
}
Your message processing loop will spend most of its time idle, waiting on the condition variable. When a message is sent, and the condition variable is notified by the send thread, your receive thread will immediately respond.
This allows your receive thread to loop at a minimum rate set by the wait time, and a maximum rate determined by the sending thread.
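A self-contained version of that idea might look like the sketch below. MessageQueue, send and receive are names made up for this illustration; the only library calls used are std::condition_variable::wait_for (with a predicate) and notify_one.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical queue: receive() blocks up to `timeout`, send() wakes it early.
class MessageQueue {
public:
    void send(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();  // wake one waiting receiver
    }

    // Returns true and fills `out` if a message arrived before the timeout.
    bool receive(std::string& out, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m_);
        // The predicate guards against spurious wakeups and missed notifications.
        if (!cv_.wait_for(lk, timeout, [this] { return !q_.empty(); }))
            return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};
```

The receiving loop would call receive(msg, std::chrono::milliseconds(20)) and treat a false return as "nothing arrived, do the periodic work".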
What you are asking is how to properly implement an event loop. Use OS calls. You ask the OS for an event or message. If no message is present, the OS simply sends the process to sleep. In a micro-controller environment you probably don't have an OS. There the concept of interrupts has to be used, which are pretty much a "message" (or event) at a lower level.
And for microcontrollers you don't have concepts like sleeping or interrupts, so you end with just looping.
In your example, a properly implemented getEvent() should block and do nothing until something actually happens, e.g. a key press.
The best way to determine that is to measure it yourself.
An undelayed loop will result in 100% usage of the specific core the app is running on. With the delay statement, it will be around 0 - 1%.
(counting on immediate response of getEvent function)
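One rough way to measure it is to count how often the loop body runs in a fixed window, with and without the delay. count_iterations below is a hypothetical helper, and the iteration count is only a proxy for CPU load; an OS tool such as a task manager gives the actual percentage.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Count how many times a polling loop body runs inside `window`.
// A zero `delay` models the undelayed loop; a non-zero one models delay(50).
std::uint64_t count_iterations(std::chrono::milliseconds window,
                               std::chrono::milliseconds delay) {
    const auto end = std::chrono::steady_clock::now() + window;
    std::uint64_t n = 0;
    while (std::chrono::steady_clock::now() < end) {
        ++n;  // stands in for the getEvent() check
        if (delay.count() > 0)
            std::this_thread::sleep_for(delay);
    }
    return n;
}
```

Comparing count_iterations(window, 0ms) with count_iterations(window, 50ms) shows how much polling work the delay removes.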
Well, that depends on a few factors - if you don't need to run anything else besides that loop in parallel, it makes no performance difference, obviously.
But a problem that might come up is power consumption - depending on how long this loop is, you might save like 90% of the power consumed by the microcontroller in the second variant.
To call it a bad practice overall doesn't seem right to me - it works in a lot of scenarios.
As far as I know, with a while loop the process is still kept in RAM either way, so the delay does not free that resource. The only difference the second version makes is the number of loop iterations executed in a given amount of time. This helps if the program runs for a long time; otherwise there is no problem with the first version.
So I have a Kinect program that has three main functions that collect data and saves it. I want one of these functions to execute as much as possible, while the other two run maybe 10 times every second.
while(1)
{
    ...
    //multi-threading to make sure color and depth events are aligned -> get skeletal data
    if (WaitForSingleObject(colorEvent, 0) == 0 && WaitForSingleObject(depthEvent, 0) == 0)
    {
        std::thread first(getColorImage, std::ref(colorEvent), std::ref(colorStreamHandle), std::ref(colorImage));
        std::thread second(getDepthImage, std::ref(depthEvent), std::ref(depthStreamHandle), std::ref(depthImage));
        if (WaitForSingleObject(skeletonEvent, INFINITE) == 0)
        {
            first.join();
            second.join();
            std::thread third(getSkeletonImage, std::ref(skeletonEvent), std::ref(skeletonImage), std::ref(colorImage), std::ref(depthImage), std::ref(myfile));
            third.join();
        }
        //if (check == 1)
        //check = 2;
    }
}
Currently my threads are making them all run at the same exact time, but this slows down my computer a lot and I only need to run 'getColorImage' and 'getDepthImage' maybe 5-10 times/second, whereas 'getSkeletonImage' I would want to run as much as possible.
I want 'getSkeletonImage' to run at max frequency (~30 times/second through the while loop) and then the 'getColorImage' and 'getDepthImage' to time synchronize (~5-10 times/second through the while loop)
What is a way I can do this? I am already using threads, but I need one to run consistently, and then the other two to join in intermittently essentially. Thank you for your help.
Currently, your main loop is creating the threads every iteration, which suggests each thread function runs once to completion. That introduces the overhead of creating and destroying threads every time.
Personally, I wouldn't bother with threads at all. Instead, in the main thread I'd do
void runSkeletonEvent(int n)
{
    for (int i = 0; i < n; ++i)
    {
        // wait required time (i.e. to next multiple of 1/30 second)
        skeletonEvent();
    }
}
// and, in your main function ....
while (termination_condition_not_met)
{
    runSkeletonEvent(3);
    colorEvent();
    runSkeletonEvent(3);
    depthEvent();
}
This interleaves the events, so skeletonEvent() runs six times for every time depthEvent() and colorEvent() are run. Just adjust the numbers as needed to get required behaviour.
You'll need to design the code for all the events so they don't run over time (if they do, all subsequent events will be delayed - there is no means to stop that).
The problem you'll then need to resolve is how to wait for the time to fire the skeleton event. A process of retrieving clock time, calculating how long to wait, and sleeping for that interval will do it. By sleeping (the thread yielding its time slice) your program will also be a bit better mannered (e.g. it won't be starving other processes of processor time).
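A minimal sketch of that waiting scheme, using absolute deadlines so that small delays do not accumulate. run_at_fixed_rate and is_due are hypothetical helper names, and the tick callback stands in for the skeleton/colour/depth calls.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Call `tick(i)` `count` times, one call per `period`.
// Sleeping until an absolute deadline keeps the average rate steady even
// when an individual tick runs a little long.
void run_at_fixed_rate(int count, std::chrono::milliseconds period,
                       const std::function<void(int)>& tick) {
    auto next = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        tick(i);
        next += period;
        std::this_thread::sleep_until(next);  // yields the CPU while waiting
    }
}

// True on every `divisor`-th tick, e.g. for the slower colour/depth work.
bool is_due(int tick, int divisor) { return tick % divisor == 0; }
```

For example, with a 33 ms period the callback could run the skeleton step on every tick and the colour and depth steps only when is_due(i, 3) holds, which gives roughly 30 and 10 runs per second.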
One advantage is that, if data is to be shared between the "events" (e.g. all of the events modify some global data) there is no need for synchronisation, because the looping above guarantees that only one "event" accesses shared data at one time.
Note: your usage of WaitForSingleObject() indicates you are using windows. Windows (except, arguably CE in a weak sense) is not really a realtime system, so does not guarantee precise timing. In other words, the actual intervals you achieve will vary.
It is still possible to restructure to use threads. From your description, there is no evidence you really need anything like that, so I'll leave this reply at that.
I have a performance issue with boost::barrier. I measured the time of the wait method call: in the single-thread case, when the call to wait is repeated around 100000 times, it takes around 0.5 seconds. Unfortunately, in the two-thread scenario this time expands to 3 seconds, and it gets worse with every additional thread (I have an 8-core processor).
I implemented a custom method that provides the same functionality, and it is much faster.
Is it normal for this method to be so slow? Is there a faster way to synchronize threads in Boost (so that all threads wait for completion of the current job by all threads and then proceed to the next task; just synchronization, no data transmission is required)?
I have been asked for my current code.
What I want to achieve: in a loop I run a function. This function can be divided among many threads; however, all threads should finish the current loop run before the next run starts.
My current solution
volatile int barrierCounter1 = 0; //it will store number of threads which completed current loop run
volatile bool barrierThread1[NumberOfThreads]; //it will store go signal for all threads with id > 0. All values are set to false at the beginning
boost::mutex mutexSetBarrierCounter; //mutex for barrierCounter1 modification
void ProcessT(int threadId)
{
    do
    {
        DoWork(); //function which should be executed by every thread
        mutexSetBarrierCounter.lock();
        barrierCounter1++; //every thread notifies that it finish execution of function
        mutexSetBarrierCounter.unlock();
        if(threadId == 0)
        {
            //main thread (0) awaits for completion of all threads
            while(barrierCounter1!=NumberOfThreads)
            {
                //I assume that the number of threads is lower than the number of processor cores
                //so this loop should not have an impact of overall performance
            }
            //if all threads completed, notify other thread that they can proceed to the consecutive loop
            for(int i = 0; i<NumberOfThreads; i++)
            {
                barrierThread1[i] = true;
            }
            //clear counter, no lock is utilized because rest of threads await in else loop
            barrierCounter1 = 0;
        }
        else
        {
            //rest of threads await for "go" signal
            while(barrierThread1[threadId]==false)
            {
            }
            //if thread is allowed to proceed then it should only clean up its barrier thread array
            //no lock is utilized because '0' thread would not modify this value until all threads complete loop run
            barrierThread1[threadId] = false;
        }
    }
    while(!end);
}
Locking runs counter to concurrency. Lock contention is always worst behaviour.
IOW: Thread synchronization (in itself) never scales.
Solution: only use synchronization primitives in situations where the contention will be low (the threads need to synchronize "relatively rarely"[1]), or do not try to employ more than one thread for the job that contends for the shared resource.
Your benchmark seems to magnify the very worst-case behavior, by making all threads always wait. If you have a significant workload on all workers between barriers, then the overhead will dwindle, and could easily become insignificant.
Trust your profiler
Profile only your application code (no silly synthetic benchmarks)
Prefer non-threading to threading (remember: asynchrony != concurrency)
[1] Which is highly relative and subjective
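For reference, a blocking (non-spinning) reusable barrier can be sketched with a mutex, a condition variable and a generation counter. This is an illustration built on the standard library, not the boost::barrier implementation; Barrier and run_rounds are hypothetical names, and the counter increment stands in for DoWork().

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: the last thread to arrive starts a new generation.
class Barrier {
public:
    explicit Barrier(std::size_t count) : threshold_(count), remaining_(count) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(m_);
        const std::size_t gen = generation_;
        if (--remaining_ == 0) {
            ++generation_;            // release everyone waiting on `gen`
            remaining_ = threshold_;  // re-arm for the next round
            cv_.notify_all();
        } else {
            cv_.wait(lk, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    const std::size_t threshold_;
    std::size_t remaining_;
    std::size_t generation_ = 0;
};

// Runs `rounds` barrier-separated rounds on `n` threads; returns true if no
// thread ever passed a barrier before all `n` had finished that round.
bool run_rounds(std::size_t n, int rounds) {
    Barrier barrier(n);
    std::atomic<long> done{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n; ++t) {
        pool.emplace_back([&] {
            for (int r = 0; r < rounds; ++r) {
                ++done;  // stands in for DoWork()
                barrier.arrive_and_wait();
                if (done.load() < static_cast<long>(n) * (r + 1)) ok = false;
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load() && done.load() == static_cast<long>(n) * rounds;
}
```

As the answer says, with an empty work step like this the barrier cost dominates; it only becomes negligible when each round does real work.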
I was told when writing Microsoft specific C++ code that writing Sleep(1) is much better than Sleep(0) for spinlocking, due to the fact that Sleep(0) will use more of the CPU time, moreover, it only yields if there is another equal-priority thread waiting to run.
However, with the C++11 thread library, there isn't much documentation (at least that I've been able to find) about the effects of std::this_thread::yield() vs. std::this_thread::sleep_for( std::chrono::milliseconds(1) ); the second is certainly more verbose, but are they both equally efficient for a spinlock, or does it suffer from potentially the same gotchas that affected Sleep(0) vs. Sleep(1)?
An example loop where either std::this_thread::yield() or std::this_thread::sleep_for( std::chrono::milliseconds(1) ) would be acceptable:
void SpinLock( const bool& bSomeCondition )
{
    // Wait for some condition to be satisfied
    while( !bSomeCondition )
    {
        /*Either std::this_thread::yield() or
        std::this_thread::sleep_for( std::chrono::milliseconds(1) )
        is acceptable here.*/
    }
    // Do something!
}
The Standard is somewhat fuzzy here, as a concrete implementation will largely be influenced by the scheduling capabilities of the underlying operating system.
That being said, you can safely assume a few things on any modern OS:
yield will give up the current timeslice and re-insert the thread into the scheduling queue. The amount of time that expires until the thread is executed again is usually entirely dependent upon the scheduler. Note that the Standard speaks of yield as an opportunity for rescheduling. So an implementation is completely free to return from a yield immediately if it desires. A yield will never mark a thread as inactive, so a thread spinning on a yield will always produce a 100% load on one core. If no other threads are ready, you are likely to lose at most the remainder of the current timeslice before you get scheduled again.
sleep_* will block the thread for at least the requested amount of time. An implementation may turn a sleep_for(0) into a yield. The sleep_for(1) on the other hand will send your thread into suspension. Instead of going back to the scheduling queue, the thread goes to a different queue of sleeping threads first. Only after the requested amount of time has passed will the scheduler consider re-inserting the thread into the scheduling queue. The load produced by a small sleep will still be very high. If the requested sleep time is smaller than a system timeslice, you can expect that the thread will only skip one timeslice (that is, one yield to release the active timeslice and then skipping the one afterwards), which will still lead to a cpu load close or even equal to 100% on one core.
A few words about which is better for spin-locking. Spin-locking is a tool of choice when expecting little to no contention on the lock. If in the vast majority of cases you expect the lock to be available, spin-locks are a cheap and valuable solution. However, as soon as you do have contention, spin-locks will cost you. If you are worrying about whether yield or sleep is the better solution here spin-locks are the wrong tool for the job. You should use a mutex instead.
For a spin-lock, the case that you actually have to wait for the lock should be considered exceptional. Therefore it is perfectly fine to just yield here - it expresses the intent clearly and wasting CPU time should never be a concern in the first place.
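As an illustration of "just yield", here is a minimal spin-lock sketch built on std::atomic_flag. SpinLock and hammer are names chosen for this example; hammer only exists to exercise the lock from several threads.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal spin-lock: spin on an atomic flag, yielding while it is taken.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();  // contended case, expected to be rare
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Increment a plain counter from several threads under the spin-lock.
long hammer(int threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

If the contended path turns out to be hit often, that is the signal from the answer above to switch to std::mutex.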
I just did a test with Visual Studio 2013 on Windows 7, 2.8GHz Intel i7, default release mode optimizations.
sleep_for(nonzero) appears to sleep for a minimum of around one millisecond and takes no CPU resources in a loop like:
for (int k = 0; k < 1000; ++k)
std::this_thread::sleep_for(std::chrono::nanoseconds(1));
This loop of 1,000 sleeps takes about 1 second if you use 1 nanosecond, 1 microsecond, or 1 millisecond. On the other hand, yield() takes about 0.25 microseconds each but will spin the CPU to 100% for the thread:
for (int k = 0; k < 4,000,000; ++k) (commas added for clarity)
std::this_thread::yield();
std::this_thread::sleep_for(std::chrono::nanoseconds(0)) seems to be about the same as yield() (test not shown here).
In comparison, locking an atomic_flag for a spinlock takes about 5 nanoseconds. This loop is 1 second:
std::atomic_flag f = ATOMIC_FLAG_INIT;
for (int k = 0; k < 200,000,000; ++k)
f.test_and_set();
Also, a mutex takes about 50 nanoseconds, 1 second for this loop:
for (int k = 0; k < 20,000,000; ++k)
std::lock_guard<std::mutex> lock(g_mutex);
Based on this, I probably wouldn't hesitate to put a yield in the spinlock, but I almost certainly wouldn't use sleep_for. If you think your locks will be spinning a lot and are worried about CPU consumption, I would switch to std::mutex if that's practical in your application. Hopefully, the days of really bad performance on std::mutex in Windows are behind us.
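Numbers like these depend heavily on the OS, hardware and compiler, so it is worth re-measuring on your own machine. A small timing helper along these lines would do; average_ns is a hypothetical name, not something from the test above.

```cpp
#include <chrono>
#include <thread>

// Average wall-clock cost of one call to `fn`, in nanoseconds.
template <typename Fn>
double average_ns(int repetitions, Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < repetitions; ++k)
        fn();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() / repetitions;
}
```

For example, average_ns(1000, [] { std::this_thread::yield(); }) and average_ns(1000, [] { std::this_thread::sleep_for(std::chrono::nanoseconds(1)); }) reproduce the two loops compared above.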
What you want is probably a condition variable. A condition variable with a conditional wake-up function is typically implemented like what you are writing, with the sleep or yield inside the loop replaced by a wait on the condition.
Your code would look like:
std::unique_lock<std::mutex> lck(mtx);
while(!bSomeCondition) {
    cv.wait(lck);
}
Or
std::unique_lock<std::mutex> lck(mtx);
// the predicate returns true once waiting should stop
cv.wait(lck, [&bSomeCondition](){ return bSomeCondition; });
All you need to do is notify the condition variable on another thread when the data is ready. However, you cannot avoid a lock there if you want to use condition variable.
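Putting both sides together, a minimal sketch could look like this. Gate is a hypothetical wrapper name; the flag, mutex and condition variable play the roles of bSomeCondition, mtx and cv.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// One-shot gate: waiters block until another thread calls open().
class Gate {
public:
    void open() {
        {
            std::lock_guard<std::mutex> lk(m_);
            open_ = true;  // the condition changes under the lock
        }
        cv_.notify_all();
    }

    // Blocks without spinning until the gate is open.
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return open_; });
    }

    // Same, but gives up after `timeout`; returns whether the gate is open.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m_);
        return cv_.wait_for(lk, timeout, [this] { return open_; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool open_ = false;
};
```

The thread that used to call SpinLock(bSomeCondition) would call wait(), and the thread that makes the condition true would call open().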
If you are interested in CPU load while using yield: it's very bad, except in one case (only your application is running, and you are aware that it will basically eat all your resources).
Here is more explanation:
Running yield in a loop ensures that the CPU releases execution of the thread; still, when the system comes back to the thread it will just repeat the yield operation. This can make the thread use a full 100% of a CPU core.
Running sleep() or sleep_for() is also a mistake: this will block thread execution, but you will have something like wait time on the CPU. Don't be mistaken, this IS a working CPU, but at the lowest possible priority. While this somehow works for simple usage examples (a fully loaded CPU on sleep() is half as bad as a fully loaded working processor), if you want to ensure application responsiveness, you would want something like a third option:
Combining both:
std::chrono::milliseconds duration(1);
while (true)
{
    if(!mutex.try_lock())
    {
        std::this_thread::yield();
        std::this_thread::sleep_for(duration);
        continue;
    }
    return;
}
Something like this ensures that the CPU yields as soon as this operation is executed, and sleep_for() ensures that the CPU waits some time before even trying to execute the next iteration. This time can of course be dynamically (or statically) adjusted to suit your needs.
cheers :)
What is the difference between C++11 std::this_thread::yield() and std::this_thread::sleep_for()? How to decide when to use which one?
std::this_thread::yield tells the implementation to reschedule the execution of threads, that should be used in a case where you are in a busy waiting state, like in a thread pool:
...
while(true) {
    if(pool.try_get_work()) {
        // do work
    }
    else {
        std::this_thread::yield(); // other threads can push work to the queue now
    }
}
std::this_thread::sleep_for can be used if you really want to wait for a specific amount of time. This can be used for tasks where timing really matters, e.g. if you really only want to wait for 2 seconds. (Note that the implementation might wait longer than the given time duration.)
std::this_thread::sleep_for()
will make your thread sleep for a given time (the thread is stopped for a given time).
(http://en.cppreference.com/w/cpp/thread/sleep_for)
std::this_thread::yield()
will stop the execution of the current thread and give priority to other processes/threads (if there are other processes/threads waiting in the queue).
The thread is not suspended; it just releases the CPU.
(http://en.cppreference.com/w/cpp/thread/yield)