Low performance of boost::barrier, wait operation - c++

I have a performance issue with boost::barrier. I measured the time of the wait method call: for a single thread, around 100,000 repeated calls to wait take about 0.5 s. Unfortunately, for the two-thread scenario this time expands to 3 seconds, and it gets worse with every additional thread (I have an 8-core processor).
I implemented a custom method that provides the same functionality, and it is much faster.
Is it normal for this method to be so slow? Is there a faster way to synchronize threads in boost, so that all threads wait for all the others to complete the current job and then proceed to the next task? Just synchronization; no data transmission is required.
I have been asked for my current code.
What I want to achieve: in a loop I run a function that can be divided among many threads, but all threads should finish the current loop run before execution of the next run.
My current solution:
volatile int barrierCounter1 = 0; // stores the number of threads that have completed the current loop run
volatile bool barrierThread1[NumberOfThreads]; // stores the "go" signal for all threads with id > 0; all values start as false
boost::mutex mutexSetBarrierCounter; // mutex for barrierCounter1 modification

void ProcessT(int threadId)
{
    do
    {
        DoWork(); // function which should be executed by every thread
        mutexSetBarrierCounter.lock();
        barrierCounter1++; // every thread notifies that it finished executing the function
        mutexSetBarrierCounter.unlock();
        if (threadId == 0)
        {
            // main thread (0) awaits completion of all threads
            while (barrierCounter1 != NumberOfThreads)
            {
                // I assume that the number of threads is lower than the number of processor cores,
                // so this loop should not have an impact on overall performance
            }
            // all threads completed, so notify the other threads that they can proceed to the next loop run
            for (int i = 0; i < NumberOfThreads; i++)
            {
                barrierThread1[i] = true;
            }
            // clear the counter; no lock is used because the rest of the threads wait in the else branch
            barrierCounter1 = 0;
        }
        else
        {
            // the remaining threads wait for the "go" signal
            while (barrierThread1[threadId] == false)
            {
            }
            // once a thread is allowed to proceed, it only cleans up its own slot in the barrier array;
            // no lock is used because thread 0 will not modify this value until all threads complete the loop run
            barrierThread1[threadId] = false;
        }
    } while (!end);
}
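A side note on this code: volatile provides neither atomicity nor ordering guarantees in C++, so the busy-waiting above is formally a data race. Below is a minimal sketch of the same scheme using std::atomic instead (NumberOfThreads, DoWork() and end are assumed to exist as in the question); it also resets the counter before raising the go signals, since a released thread could otherwise check in for the next round before the reset wipes that check-in out:

#include <atomic>

std::atomic<int> barrierCounter{0};
std::atomic<bool> goSignal[NumberOfThreads]; // globals are zero-initialized, i.e. all false

void ProcessT(int threadId)
{
    do
    {
        DoWork();
        barrierCounter.fetch_add(1, std::memory_order_acq_rel);
        if (threadId == 0)
        {
            // spin until every thread has checked in
            while (barrierCounter.load(std::memory_order_acquire) != NumberOfThreads)
                ;
            // reset the counter before releasing the workers
            barrierCounter.store(0, std::memory_order_relaxed);
            for (int i = 1; i < NumberOfThreads; i++) // thread 0 needs no signal
                goSignal[i].store(true, std::memory_order_release);
        }
        else
        {
            // spin until thread 0 raises our go signal, then clear it for the next round
            while (!goSignal[threadId].load(std::memory_order_acquire))
                ;
            goSignal[threadId].store(false, std::memory_order_relaxed);
        }
    } while (!end);
}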

Locking runs counter to concurrency. Lock contention is the worst-case behaviour.
IOW: thread synchronization (in itself) never scales.
Solution: only use synchronization primitives in situations where contention will be low (the threads need to synchronize "relatively rarely"[1]), or do not try to employ more than one thread for a job that contends for a shared resource.
Your benchmark seems to magnify the very worst-case behavior by making all threads always wait. If there is a significant workload on all workers between barriers, the overhead will dwindle and could easily become insignificant.
Trust your profiler.
Profile only your application code (no silly synthetic benchmarks).
Prefer non-threading to threading (remember: asynchrony != concurrency).
[1] Which is highly relative and subjective

Related

How to improve performance of pushing data to a mutex locked queue

I have a queue of "jobs" (function pointers and data) that a main thread pushes onto; it then notifies worker threads to pop the jobs off and run them.
The functions are pretty basic and look like this:
class JobQueue {
public:
    // usually called by the main thread, but other threads can use this too
    void push(Job job) {
        {
            std::lock_guard<std::mutex> lock(mutex); // this takes 40% of the thread's time (when NOT sync'ing)
            queue.emplace_back(job);
        }
        cv.notify_one(); // this also takes another 40% of the thread's time
    }
    // only called by worker threads
    Job pop() {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [&]{ return !queue.empty(); }); // wait on the queue itself; a never-reset 'ready' flag would let pop() run on an empty queue
        Job job = queue.front();
        queue.pop_front();
        return job;
    }
private:
    std::list<Job> queue;
    std::mutex mutex;
    std::condition_variable cv;
};
But I have a major problem: push() is really slow. The worker threads outpace the main thread, and in my test, adding jobs is all the main thread does. (The worker threads each perform 20 4x4 matrix rotations that feed into each other and get printed at the end, so they're not optimized away.) This also seems to get worse as the number of worker threads grows. If each "Job" is bigger, say 100 matrix operations, this problem goes away and more threads == better, but the jobs I would give it in practice are much smaller than that.
The hottest calls are the mutex lock and notify_one(), which take up 40% of the time each; everything else seems negligible. Also, the mutex lock is rarely waiting; it is nearly always available.
I'm not sure what I should do here. Is there an obvious or not-so-obvious optimization I can make that will help, or have I made a mistake? Any insight would be greatly appreciated.
(Here are some metrics I took, in case they help; they don't count the time it takes to create threads, and the pattern is the same even for billions of jobs.)
Time to calc 2000000 matrix rotations
(20 rotations x 100000 jobs)
threads 0: 149 ms << no-pool baseline
threads 1: 151 ms << single threaded w/pool
threads 2: 89 ms
threads 3: 120 ms
threads 4: 216 ms
threads 8: 269 ms
threads 12: 311 ms << hardware hint
threads 16: 329 ms
threads 24: 332 ms
threads 96: 336 ms
(All worker threads show the same pattern in the profiler trace: green is execution, red is waiting on synchronization.)
TL;DR: Do more work in each task. (Perhaps take more than one task off the queue each time; there are many other possibilities.)
Your tasks are (computationally) too small. A 4x4 matrix multiplication is just a few multiplies and adds, roughly 60-70 operations, and 20 of them chained together isn't much more expensive, around 1,500 (pipelined) arithmetic operations. The cost of the thread switch, including waking a thread waiting on the cv and then the actual context switch, is likely higher than this, possibly much higher.
Also, the cost of the synchronization itself (the manipulation of the mutex and the cv) is very expensive, especially under contention, and especially on a multi-core system, where the hardware's native synchronization operations are much more expensive than arithmetic (because of cache-coherency enforcement between the cores).
This is why you observe that the problem lessens when each task does 100 of these matrix operations instead of 20: with only 20 MMs to do, the workers were going back to the well for more work too often, causing contention; giving them 100 slows them down enough that contention is reduced.
(In a comment you indicate that there is only one supplier, which pretty much eliminates the producer side as a source of contention on the queue. But even there, the more tasks that can be enqueued together while under the cv lock, the better, up to the point where doing so blocks workers from taking tasks.)
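As an illustration of the batching idea, here is a sketch of a batched pop for the JobQueue from the question (the name popBatch and the batch size of 32 are invented for this example; it needs <vector>):

// inside JobQueue: drain up to maxBatch jobs per lock acquisition,
// amortizing the mutex/cv cost over several jobs instead of paying it per job
std::vector<Job> popBatch(std::size_t maxBatch = 32) {
    std::vector<Job> batch;
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, [&]{ return !queue.empty(); });
    while (!queue.empty() && batch.size() < maxBatch) {
        batch.push_back(std::move(queue.front()));
        queue.pop_front();
    }
    return batch;
}

A worker then runs every job in the batch before coming back for more, which directly reduces how often it goes back to the well.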
I suggest using an event handler.
The events are of two types:
New job arrives
Worker completes job
The main thread maintains the job queue, accessed only by the main thread (so no mutex locking).
When a job arrives, it is placed on the job queue.
When a worker completes a job, the next job is popped and passed to that worker.
You will also need a queue of free workers, for startup and for when no jobs are available.
You will also need an event handler. These are tricky, so it is best to use a well-tested library rather than rolling your own. I use boost::asio.
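This answer gives no code; a minimal sketch of the general shape using boost::asio's thread_pool, which already provides the internal queue and dispatch loop (the worker count of 4 is an arbitrary choice):

#include <boost/asio.hpp>

int main() {
    boost::asio::thread_pool pool(4); // worker count chosen arbitrarily for the sketch

    for (int i = 0; i < 1000; ++i) {
        // post() is the "new job arrives" event; asio hands it to a free worker
        boost::asio::post(pool, [i] {
            (void)i; // do the actual work for job i here
        });
    }

    pool.join(); // block until all posted jobs have completed
}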

Thread pools and context switching slowdowns

I have a thread pool with idling threads that wait for jobs to be pushed to a queue, in a Windows application.
I have a loop in my main application thread that adds 1000 jobs to the pool's queue sequentially (it adds a job, then waits for that job to finish, then adds another job, 1000 times over). So no actual parallel processing is happening. Here's some pseudocode:
////threadpool:
class ThreadPool
{
    ....
    std::condition_variable job_cv;
    std::condition_variable finished_cv;
    std::mutex job_mutex;
    std::queue<std::function<void(void)>> job_queue;

    void addJob(std::function<void(void)> jobfn)
    {
        std::unique_lock<std::mutex> lock(job_mutex);
        job_queue.emplace(std::move(jobfn));
        job_cv.notify_one();
    }

    void waitForJobToFinish()
    {
        std::unique_lock<std::mutex> lock(job_mutex);
        finished_cv.wait(lock, [this]() { return job_queue.empty(); });
    }
    ....
    void threadFunction() // called by each thread when it's first started
    {
        std::function<void(void)> job;
        while (true)
        {
            std::unique_lock<std::mutex> latch(job_mutex);
            job_cv.wait(latch, [this]() { return !job_queue.empty(); });
            job = std::move(job_queue.front());
            job_queue.pop();
            latch.unlock();
            job();
            latch.lock();
            finished_cv.notify_one();
        }
    }
};
...
////main application:
void jobfn()
{
    // do some lightweight calculation
}

void main()
{
    // test 1000 calls to the lightweight jobfn from the thread pool
    for (int q = 0; q < 1000; q++)
    {
        threadPool->addJob(&jobfn);
        threadPool->waitForJobToFinish();
    }
}
So basically what's happening is: a job is added to the queue and the main loop begins to wait; a waiting thread picks the job up; and when the thread finishes, it notifies the application that the main loop can continue and another job can be added to the queue, and so on. That way, 1000 jobs are processed sequentially.
It's worth noting that the jobs themselves are tiny and complete in a few milliseconds.
However, I've noticed something strange...
The time it takes for the loop to complete is essentially O(n), where n is the number of threads in the thread pool. So even though jobs are processed one at a time in all scenarios, a 10-thread pool takes 10x longer to complete the full 1000-job task than a 1-thread pool.
I'm trying to figure out why, and my only guess so far is that context switching is the bottleneck. Maybe less (or zero?) context-switching overhead is required when only one thread is grabbing jobs, but when 10 threads continually take turns processing a single job at a time, there's some extra processing required? That doesn't make sense to me, though: wouldn't unblocking thread A for a job be the same operation as unblocking thread B, C, D...? Is there some OS-level caching going on, where a thread doesn't lose context until a different thread is given it, so calling on the same thread over and over is faster than calling threads A, B, C sequentially?
But that's a complete guess at this point. Maybe someone else can shed some insight on why I'm getting these results. Intuitively, I assumed that as long as only one thread executes at a time, I could have a thread pool with an arbitrarily large number of threads and the total completion time for [x] jobs would be the same (as long as each job is identical and the total number of jobs is the same). Why is that wrong?
Your "guess" is correct; it's simply a resource contention issue.
Your 10 threads are not idle, they're waiting. This means that the OS has to iterate over the currently active threads for your application, which means a context switch likely occurs.
The active thread is pushed back, a "waiting" thread pulled to the front, in which the code checks if the signal has been notified and the lock can be acquired, since it likely can't in the time slice for that thread, it continues to iterate over the remaining threads, each trying to see if the lock can be acquired, which it can't because your "active" thread hasn't been allotted a time slice to complete yet.
A single-thread pool doesn't have this issue because no additional threads need to be iterated over at the OS level; granted, a single-thread pool is still slower than just calling job 1000 times.
Hope that can help.

Correct Way to Write a Custom Sleep

I'm currently writing code for a simulator to sync with ROS time.
Essentially, the problem becomes: "write a get_time and sleep that scale according to ROS time." Doing this will allow the codebase to stay unchanged and just require linking against the custom get_time and sleep. get_time seems to work perfectly; however, I've been having trouble getting the sleep to run accurately.
My current design is like this (code attached at the bottom):
A thread calls sleep.
sleep adds the time at which to unlock this thread (current_time + sleep_time) to a priority queue, and then waits on a condition variable.
A separate thread (let's call it the watcher) constantly loops and checks the top of the queue; if the current time has passed the top of the priority queue, it calls notify_all on the condition variable and then pops the priority queue.
However, it seems the watcher thread is not accurate enough (I see discrepancies of 0-50 ms), meaning the sleep calls sometimes make threads sleep too long. I also notice visible lag/jagged behavior in the simulator compared to replacing the sleep with a usleep(1000*ms).
Unfortunately, I'm not very experienced with these kinds of designs, and I feel there are lots of ways to optimize/rewrite this to make it run more accurately.
So my question is: are condition variables the right way? Am I even using them correctly? Here are some things I tried:
Reducing the number of unnecessary notify_all calls by having an array of condition variables and assigning them based on time, like this: (ms/100)%256. The idea was that close-together times would share the same cv, because they are likely to actually wake up from the notify_all. This made the performance worse.
Keeping the threads and the prio_queue pushing etc., but using usleep instead. I found that the usleep makes it work much better, which probably means the mutex, locking, and pushing/popping operations do not contribute a noticeable amount of lag; the problem must be in the condition variable part.
Code:
Watcher (this is run on startup)
void watcher()
{
    while (true)
    {
        usleep(1);
        {
            std::lock_guard<std::mutex> lk(m_queue);
            if (prio_queue.empty())
                continue;
            // assumes prio_queue is a min-heap, i.e. top() is the earliest wakeup time
            if (get_time_in_ms() >= prio_queue.top())
            {
                cv.notify_all();
                prio_queue.pop();
            }
        }
    }
}
Sleep
void sleep(int ms)
{
    int wakeup = get_time_in_ms() + ms;
    {
        std::lock_guard<std::mutex> lk(m_queue);
        prio_queue.push(wakeup);
    }
    std::unique_lock<std::mutex> lk(m_time); // note: a different mutex than the one the watcher holds when notifying
    cv.wait(lk, [wakeup] { return get_time_in_ms() >= wakeup; });
    lk.unlock();
}
Any help would be appreciated.
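No answer is recorded here, but one thing stands out in the code above: the sleepers wait on cv under m_time, while the watcher calls notify_all while holding only m_queue. Because the notifier does not hold the waiters' mutex, a notify_all can land in the window between a sleeper's predicate check and the moment it actually blocks; the watcher has already popped the entry, so that sleeper then oversleeps until some later notify. A sketch of the same design under a single mutex, which closes that window (identifiers as in the question; usleep and get_time_in_ms are assumed to exist, and the priority queue is declared as an explicit min-heap so top() is the earliest deadline):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

std::mutex m;               // single mutex guarding both the queue and the cv
std::condition_variable cv;
std::priority_queue<int, std::vector<int>, std::greater<int>> prio_queue; // min-heap

void watcher() // run on startup, as in the question
{
    while (true)
    {
        usleep(1);
        std::lock_guard<std::mutex> lk(m);
        if (!prio_queue.empty() && get_time_in_ms() >= prio_queue.top())
        {
            prio_queue.pop();
            cv.notify_all(); // the notifier holds the waiters' mutex: no lost wakeups
        }
    }
}

void sleep(int ms)
{
    int wakeup = get_time_in_ms() + ms;
    std::unique_lock<std::mutex> lk(m);
    prio_queue.push(wakeup);
    cv.wait(lk, [wakeup] { return get_time_in_ms() >= wakeup; });
}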

Windows critical sections fairness

I've a question about the fairness of the critical sections on Windows, using EnterCriticalSection and LeaveCriticalSection methods. The MSDN documentation specifies: "There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads."
The problem showed up in an application I wrote, where some threads are blocked and never enter the critical section, even after a long time; so I performed some tests with a simple C program to verify this behaviour, and I noticed strange results when you have many threads and some wait times inside.
This is the code of the test program:
#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION CriticalSection;

DWORD WINAPI ThreadFunc(void* data) {
    int me;
    int i, c = 0;
    me = *(int *) data;
    printf(" %d started\n", me);
    for (i = 0; i < 10000; i++) {
        EnterCriticalSection(&CriticalSection);
        printf(" %d Trying to connect (%d)\n", me, c);
        if (i != 3 && i != 4 && i != 5)
            Sleep(500);
        else
            Sleep(10);
        LeaveCriticalSection(&CriticalSection);
        c++;
        Sleep(500);
    }
    return 0;
}

int main() {
    int i;
    int a[20];
    HANDLE thread[20];
    InitializeCriticalSection(&CriticalSection);
    for (i = 0; i < 20; i++) {
        a[i] = i;
        thread[i] = CreateThread(NULL, 0, ThreadFunc, (LPVOID) &a[i], 0, NULL);
    }
    // wait for the workers, so the process doesn't exit immediately
    WaitForMultipleObjects(20, thread, TRUE, INFINITE);
    return 0;
}
The result of this is that some threads are blocked for many, many cycles, while some others enter the critical section very often. I also noticed that if you change the faster Sleep (the 10 ms one), everything may return to being fair, but I didn't find any link between sleep times and fairness.
However, this test example works much better than my real application code, which is much more complicated and actually shows starvation for some threads. To be sure that the starved threads are alive and working, I ran a test (in my application) in which I kill each thread after it enters the critical section 5 times: the result is that, at the end, every thread enters, so I'm sure all of them are alive and blocked on the mutex.
Do I have to assume that Windows is really NOT fair with threads?
Do you know any solution for this problem?
EDIT: The same code on Linux with pthreads works as expected (no thread starves).
EDIT2: I found a working solution that forces fairness, using a CONDITION_VARIABLE.
It can be inferred from this post (link), with the required modifications.
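The linked post isn't reproduced here, but a common way to force FIFO fairness with a CONDITION_VARIABLE is a ticket lock. A sketch of that idea (an illustration, not the code from the link; InitializeCriticalSection and InitializeConditionVariable must be called once at startup):

CRITICAL_SECTION cs;
CONDITION_VARIABLE cond;
LONG next_ticket = 0; // next ticket to hand out
LONG now_serving = 0; // ticket currently allowed to proceed

void FairEnter() {
    EnterCriticalSection(&cs);
    LONG my_ticket = next_ticket++;
    while (my_ticket != now_serving)
        SleepConditionVariableCS(&cond, &cs, INFINITE); // releases cs while sleeping
    LeaveCriticalSection(&cs);
    // the caller now holds the logical lock, in strict FIFO order
}

void FairLeave() {
    EnterCriticalSection(&cs);
    now_serving++;
    WakeAllConditionVariable(&cond); // wake everyone; only the next ticket proceeds
    LeaveCriticalSection(&cs);
}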
You're going to encounter starvation issues here anyway, since the critical section is held for so long.
I think MSDN is probably suggesting that the scheduler is fair about waking up threads, but since there is no lock-acquisition order, it may not actually be 'fair' in the way you expect.
Have you tried using a mutex instead of a critical section? Also, have you tried adjusting the spin count?
If you can avoid locking the critical section for extended periods of time, that is probably a better way to deal with this.
For example, you could restructure your code to have a single thread deal with your long-running operation, and have the other threads queue requests to that thread, blocking on a completion event. You only need to lock the critical section for short periods of time when managing the queue. Of course, if these operations must also be mutually exclusive with other operations, you would need to be careful with that. If all of this work can't run concurrently, you may as well serialize it via the queue too.
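A sketch of that restructuring in portable C++ (the class name SerialWorker and the use of std::packaged_task as the completion event are choices made for illustration; shutdown handling is omitted):

#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// One worker owns the long-running operation; other threads enqueue
// requests and block on a completion event (a std::future here).
class SerialWorker {
public:
    SerialWorker() : worker_([this] { run(); }) {}

    std::future<void> submit(std::function<void()> op) {
        std::packaged_task<void()> task(std::move(op));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(m_); // held only briefly, to manage the queue
            queue_.push(std::move(task));
        }
        cv_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) { // shutdown omitted for brevity
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                task = std::move(queue_.front());
                queue_.pop();
            }
            task(); // the long-running operation runs outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> queue_;
    std::thread worker_; // declared last so the other members are ready when it starts
};

A client thread simply calls worker.submit(longRunningOp).wait(); the lock is only ever held while pushing or popping the queue, never for the duration of the operation itself.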
Alternatively, perhaps take a look at using boost::asio. You could use a thread pool and strands to prevent multiple async handlers from running concurrently where synchronization would otherwise be an issue.
I think you should review a few things:
In 9,997 of 10,000 cases you branch to Sleep(500). Each thread holds the critical section for as much as 500 ms on almost every successful attempt to acquire it.
The threads do another Sleep(500) after releasing the critical section. As a result, a single thread occupies almost 50% (49.985%) of the available time by holding the critical section, no matter what.
Behind the scenes, Joe Duffy: "The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues."
Assuming you did that on purpose to show the behavior: starting 20 of those threads may result in a minimum wait time of 10 seconds (20 threads x 500 ms) for the last thread to get access to the critical section on a single logical processor, even when the processor is completely available for this test.
For how long did you run the test? On what CPU? And what Windows version? You should be able to write down some more facts: a histogram of thread activity vs. thread id could tell a lot about fairness.
Critical sections should be acquired for short periods of time. In most cases shared resources can be dealt with much more quickly. A Sleep inside a critical section almost certainly points to a design flaw.
Hint: reduce the time spent inside the critical section, or investigate Semaphore Objects.

C++11 Thread waiting behaviour: std::this_thread::yield() vs. std::this_thread::sleep_for( std::chrono::milliseconds(1) )

I was told when writing Microsoft-specific C++ code that writing Sleep(1) is much better than Sleep(0) for spinlocking, because Sleep(0) uses more CPU time and, moreover, only yields if there is another equal-priority thread waiting to run.
However, with the C++11 thread library, there isn't much documentation (at least that I've been able to find) about the effects of std::this_thread::yield() vs. std::this_thread::sleep_for(std::chrono::milliseconds(1)). The second is certainly more verbose, but are they both equally efficient for a spinlock, or does one potentially suffer from the same gotchas that affected Sleep(0) vs. Sleep(1)?
An example loop where either std::this_thread::yield() or std::this_thread::sleep_for( std::chrono::milliseconds(1) ) would be acceptable:
void SpinLock( const bool& bSomeCondition )
{
    // Wait for some condition to be satisfied
    while( !bSomeCondition )
    {
        /* Either std::this_thread::yield() or
           std::this_thread::sleep_for( std::chrono::milliseconds(1) )
           is acceptable here. */
    }
    // Do something!
}
The Standard is somewhat fuzzy here, as a concrete implementation will largely be influenced by the scheduling capabilities of the underlying operating system.
That being said, you can safely assume a few things on any modern OS:
yield will give up the current timeslice and re-insert the thread into the scheduling queue. The amount of time that passes until the thread is executed again is usually entirely dependent upon the scheduler. Note that the Standard speaks of yield as an opportunity for rescheduling, so an implementation is completely free to return from a yield immediately if it desires. A yield will never mark a thread as inactive, so a thread spinning on a yield will always produce a 100% load on one core. If no other threads are ready, you are likely to lose at most the remainder of the current timeslice before you get scheduled again.
sleep_* will block the thread for at least the requested amount of time. An implementation may turn a sleep_for(0) into a yield. A sleep_for(1), on the other hand, will send your thread into suspension: instead of going back to the scheduling queue, the thread first goes to a different queue of sleeping threads, and only after the requested amount of time has passed will the scheduler consider re-inserting it into the scheduling queue. The load produced by a small sleep will still be very high: if the requested sleep time is smaller than a system timeslice, you can expect the thread to skip only one timeslice (that is, one yield to release the active timeslice, then skipping the one after it), which still leads to a CPU load close to or even equal to 100% on one core.
A few words about which is better for spin-locking. Spin-locking is the tool of choice when expecting little to no contention on the lock. If in the vast majority of cases you expect the lock to be available, spin-locks are a cheap and valuable solution. But as soon as you do have contention, spin-locks will cost you. If you are worrying about whether yield or sleep is the better solution here, spin-locks are the wrong tool for the job; you should use a mutex instead.
For a spin-lock, the case where you actually have to wait for the lock should be considered exceptional. Therefore it is perfectly fine to just yield here: it expresses the intent clearly, and wasting CPU time should never be a concern in the first place.
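To make the "just yield" variant concrete, here is a minimal test-and-set spinlock that yields on the (expected-rare) contended path; a sketch, not a production-quality lock:

#include <atomic>
#include <thread>

class YieldingSpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // the uncontended path is a single atomic RMW; the contended
        // path gives up the timeslice instead of burning it
        while (flag_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};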
I just did a test with Visual Studio 2013 on Windows 7, on a 2.8 GHz Intel i7, with default release-mode optimizations.
sleep_for(nonzero) appears to sleep for a minimum of around one millisecond and takes no CPU resources in a loop like:
for (int k = 0; k < 1000; ++k)
    std::this_thread::sleep_for(std::chrono::nanoseconds(1));
This loop of 1,000 sleeps takes about 1 second whether you use 1 nanosecond, 1 microsecond, or 1 millisecond. On the other hand, yield() takes about 0.25 microseconds per call but will spin the CPU to 100% for the thread:
for (int k = 0; k < 4,000,000; ++k) (commas added for clarity)
    std::this_thread::yield();
std::this_thread::sleep_for(std::chrono::nanoseconds(0)) seems to be about the same as yield() (test not shown here).
In comparison, locking an atomic_flag for a spinlock takes about 5 nanoseconds. This loop runs in 1 second:
std::atomic_flag f = ATOMIC_FLAG_INIT;
for (int k = 0; k < 200,000,000; ++k)
    f.test_and_set();
Also, a mutex lock/unlock takes about 50 nanoseconds; this loop runs in 1 second:
for (int k = 0; k < 20,000,000; ++k)
    std::lock_guard<std::mutex> lock(g_mutex);
Based on this, I probably wouldn't hesitate to put a yield in the spinlock, but I almost certainly wouldn't use sleep_for. If you think your locks will spin a lot and you are worried about CPU consumption, switch to std::mutex if that's practical in your application. Hopefully, the days of really bad std::mutex performance on Windows are behind us.
What you want is probably a condition variable. A condition variable with a conditional wake-up is typically implemented like what you are writing, with the sleep or yield inside the loop replaced by a wait on the condition.
Your code would look like:
std::unique_lock<std::mutex> lck(mtx);
while (!bSomeCondition) {
    cv.wait(lck);
}
Or:
std::unique_lock<std::mutex> lck(mtx);
cv.wait(lck, [&bSomeCondition]() { return bSomeCondition; });
All you need to do is notify the condition variable from another thread when the data is ready. However, you cannot avoid the lock if you want to use a condition variable.
If you are interested in CPU load while using yield: it is very bad, except in one case (yours is the only application running, and you accept that it will basically eat all your resources).
Here is more explanation:
Running yield in a loop ensures that the CPU releases the thread's execution; still, whenever the system comes back to the thread, it just repeats the yield. This can make the thread use the full 100% load of a CPU core.
Running sleep() or sleep_for() alone is also a mistake. It blocks the thread's execution, but you get something like wait time on the CPU. Don't be mistaken, this IS working the CPU, just at the lowest possible priority. While this somehow works for simple usage examples (a CPU fully loaded by sleep() is half as bad as a fully loaded working processor), if you want to ensure application responsiveness, you would want something like the third example:
Combining the two:
std::chrono::milliseconds duration(1);
while (true)
{
    if (!mutex.try_lock())
    {
        std::this_thread::yield();
        std::this_thread::sleep_for(duration);
        continue;
    }
    return;
}
Something like this will ensure that the CPU yields as soon as the operation executes, and sleep_for() will also ensure that the CPU waits some time before even trying to execute the next iteration. This time can of course be adjusted dynamically (or statically) to suit your needs.
Cheers :)