Priority Preemptive Scheduling of infinite loop tasks

There is plenty of material about priority preemptive scheduling available on Google and Stack Overflow, but I am still confused about how infinite-loop tasks are scheduled by a priority preemptive kernel. Let's consider the following case:
An RTOS starts two tasks T1 and T2 with priority 50 and 100 respectively. Both tasks look like:
void T1()
{
    while(1)
    {
        perform_some_task1();
        usleep(100);
    }
}
and
void T2()
{
    while(1)
    {
        perform_some_task2();
        usleep(100);
    }
}
As far as I understand, the kernel will schedule T2 because of its higher priority and suspend T1 because of its lower priority. And because T2 is an infinite loop, it should never relinquish the CPU to T1 until some other, higher-priority task preempts T2.
BUT it seems that my understanding is not correct, because I have tested the above case on an RTOS and I see output on the console printed by both tasks.
Can somebody comment on my understanding of the matter and on the actual behavior of an RTOS in the above case?

In that case, both tasks are suspended once perform_some_taskN() has been executed, releasing the CPU to be used by other threads. According to the documentation:
The usleep() function will cause the calling thread to be suspended from execution until either the number of real-time microseconds specified by the argument useconds has elapsed or a signal is delivered to the calling thread and its action is to invoke a signal-catching function or to terminate the process. The suspension time may be longer than requested due to the scheduling of other activity by the system.
By the way, usleep() is deprecated (use nanosleep() instead):
POSIX.1-2001 declares this function obsolete;
use nanosleep(2) instead. POSIX.1-2008 removes the specification of
usleep().
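As a hedged sketch (not code from the question's RTOS), the same task body rewritten with nanosleep() would look roughly like this; perform_some_task2() is the function from the question, and the delay matches the original usleep(100), i.e. 100 microseconds:

#include <time.h>

void perform_some_task2(void);                   /* provided elsewhere, as in the question */

void T2(void)
{
    struct timespec delay = { 0, 100 * 1000 };   /* 100 microseconds = 100,000 ns */

    while (1)
    {
        perform_some_task2();
        nanosleep(&delay, NULL);                 /* calling thread is suspended; T1 can now run */
    }
}

While T2 is suspended inside nanosleep(), the scheduler is free to run the lower-priority T1, which is why both tasks produce console output.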

Related

Recommended pattern for a queue accessed by multiple threads...what should the worker thread do?

I have a queue of objects that is being added to by a thread A. Thread B is removing objects from the queue and processing them. There may be many threads A and many threads B.
I am using a mutex when the queue is being "push"ed to, and also when it is "front"ed and "pop"ped from, as shown in the pseudo-code below:
Thread A calls this to add to the queue:
void Add(object)
{
    mutex->lock();
    queue.push(object);
    mutex->unlock();
}
Thread B processes the queue as follows:
object GetNextTargetToWorkOn()
{
    object = NULL;
    mutex->lock();
    if (!queue.empty())
    {
        object = queue.front();
        queue.pop();
    }
    mutex->unlock();
    return(object);
}
void DoTheWork(int param)
{
    while(true)
    {
        object structure;
        while( (object = GetNextTargetToWorkOn()) == NULL)
            boost::thread::sleep(100ms); // sleep a very short time
        // do something with the object
    }
}
What bothers me is the while / get-object / sleep-if-no-object pattern. While there are objects to process it is fine, but while the thread is waiting for work there are two problems:
a) The while loop is spinning, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
Is there a better pattern to achieve the same thing?
You're using spin-waiting; a better design is to use a monitor. Read more about the details on Wikipedia.
And a cross-platform solution using std::condition_variable with a good example can be found here.
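For illustration, a minimal sketch of such a condition-variable-based queue might look like the following (the WorkQueue name and its members are invented for this sketch and are not taken from the linked example):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class WorkQueue
{
public:
    void Add(T object)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(object));
        }
        m_cv.notify_one();                  // wake one waiting worker
    }

    T GetNextTargetToWorkOn()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Block (no spinning, no sleep) until an item is available.
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        T object = std::move(m_queue.front());
        m_queue.pop();
        return object;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<T> m_queue;
};

The worker thread then simply loops on GetNextTargetToWorkOn() and blocks inside the wait while the queue is empty, so neither the busy loop nor the fixed sleep is needed.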
a) The while loop is spinning, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
It has been my experience that the sleep you used actually 'fixes' both of these issues.
a) The resources consumed are a small amount of RAM and a remarkably small fraction of the available CPU cycles.
b) Sleep is not wasted time on the OSes I've worked on.
c) Sleep can affect 'reaction time' (aka latency), but that has seldom been an issue (outside of interrupts).
The time spent in sleep is likely to be several orders of magnitude longer than the time spent in this simple loop. i.e. It is not significant.
IMHO - this is an ok implementation of the 'good neighbor' policy of relinquishing the processor as soon as possible.
On my desktop, AMD64 Dual Core, Ubuntu 15.04, a semaphore enforced context switch takes ~13 us.
100 ms ==> 100,000 us .. that is 4 orders of magnitude difference, i.e. VERY insignificant.
In the OSes I have worked on (Linux, vxWorks, OSE, and several other embedded OSes), sleep (or its equivalent) is the correct way to relinquish the processor, so that it is not blocked from running another thread while the sleeping thread waits.
Note: It is feasible that some OS's sleep might not relinquish the processor. So, you should always confirm. I've not found one. Oh, but I admit I have not looked / worked much on Windows.

Windows critical sections fairness

I've a question about the fairness of the critical sections on Windows, using EnterCriticalSection and LeaveCriticalSection methods. The MSDN documentation specifies: "There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads."
The problem comes from an application I wrote, which blocks some threads that never enter the critical section, even after a long time; so I performed some tests with a simple C program to verify this behaviour, but I noticed strange results when you have many threads and some wait times inside.
This is the code of the test program:
#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION CriticalSection;

DWORD WINAPI ThreadFunc(void* data) {
    int me;
    int i, c = 0;
    me = *(int *) data;
    printf(" %d started\n", me);
    for (i = 0; i < 10000; i++) {
        EnterCriticalSection(&CriticalSection);
        printf(" %d Trying to connect (%d)\n", me, c);
        if (i != 3 && i != 4 && i != 5)
            Sleep(500);
        else
            Sleep(10);
        LeaveCriticalSection(&CriticalSection);
        c++;
        Sleep(500);
    }
    return 0;
}

int main() {
    int i;
    int a[20];
    HANDLE thread[20];
    InitializeCriticalSection(&CriticalSection);
    for (i = 0; i < 20; i++) {
        a[i] = i;
        thread[i] = CreateThread(NULL, 0, ThreadFunc, (LPVOID) &a[i], 0, NULL);
    }
    /* wait for all threads so the process does not exit immediately */
    WaitForMultipleObjects(20, thread, TRUE, INFINITE);
    return 0;
}
The result of this is that some threads are blocked for many, many cycles, and some others enter the critical section very often. I also noticed that if you change the faster Sleep (the 10 ms one), everything might return to being fair, but I didn't find any link between sleep times and fairness.
However, this test example works much better than my real application code, which is much more complicated and actually shows starvation for some threads. To be sure that the starved threads are alive and working, I made a test (in my application) in which I kill threads after they have entered the critical section 5 times: the result is that, at the end, every thread enters, so I'm sure all of them are alive and blocked on the mutex.
Do I have to assume that Windows is really NOT fair with threads?
Do you know any solution for this problem?
EDIT: The same code on Linux with pthreads works as expected (no thread starves).
EDIT2: I found a working solution that forces fairness, using a CONDITION_VARIABLE.
It can be inferred from this post (link), with the required modifications.
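For illustration only (the linked post is not reproduced here), a fairness-forcing lock of this kind can be sketched as a ticket lock built from a CRITICAL_SECTION plus a CONDITION_VARIABLE; the FairLock name and its fields are invented for this sketch:

#include <windows.h>

// Threads are served strictly in the order in which they asked for the lock.
struct FairLock
{
    CRITICAL_SECTION   cs;
    CONDITION_VARIABLE cv;
    unsigned long      next_ticket = 0;   // next ticket to hand out
    unsigned long      now_serving = 0;   // ticket currently allowed in
};

void FairLockInit(FairLock* l)
{
    InitializeCriticalSection(&l->cs);
    InitializeConditionVariable(&l->cv);
}

void FairLockAcquire(FairLock* l)
{
    EnterCriticalSection(&l->cs);
    unsigned long my_ticket = l->next_ticket++;
    while (my_ticket != l->now_serving)
        SleepConditionVariableCS(&l->cv, &l->cs, INFINITE);
    LeaveCriticalSection(&l->cs);          // we now "own" the fair lock
}

void FairLockRelease(FairLock* l)
{
    EnterCriticalSection(&l->cs);
    l->now_serving++;
    LeaveCriticalSection(&l->cs);
    WakeAllConditionVariable(&l->cv);      // waiters re-check their ticket
}

Each thread would call FairLockAcquire/FairLockRelease around the code it currently brackets with EnterCriticalSection/LeaveCriticalSection; the CRITICAL_SECTION here only protects the ticket counters and is held very briefly.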
You're going to encounter starvation issues here anyway since the critical section is held for so long.
I think MSDN is probably suggesting that the scheduler is fair about waking up threads but since there is no lock acquisition order then it may not actually be 'fair' in the way that you expect.
Have you tried using a mutex instead of a critical section? Also, have you tried adjusting the spin count?
If you can avoid locking the critical section for extended periods of time then that is probably a better way to deal with this.
For example, you could restructure your code to have a single thread that deals with your long running operation and the other threads queue requests to that thread, blocking on a completion event. You only need to lock the critical section for short periods of time when managing the queue. Of course if these operations must also be mutually exclusive to other operations then you would need to be careful with that. If all of this stuff can't operate concurrently then you may as well serialize that via the queue too.
Alternatively, perhaps take a look at using boost asio. You could use a threadpool and strands to prevent multiple async handlers from running concurrently where synchronization would otherwise be an issue.
I think you should review a few things:
in 9997 of 10000 cases you branch to Sleep(500). Each thread holds the critical section for as much as 500 ms on almost every successful attempt to acquire the critical section.
The threads do another Sleep(500) after releasing the critical section. As a result a single thread occupies almost 50 % (49.985 %) of the available time by holding the critical section - no matter what!
Behind the scenes: Joe Duffy: The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues.
Assuming you did that on purpose to show the behavior: starting 20 of those threads may result in a minimum wait time of 10 seconds for the last thread to get access to the critical section on a single logical processor, even when the processor is completely available for this test.
For how long did you run the test? On what CPU? And what Windows version? You should be able to write down some more facts: a histogram of thread activity vs. thread id could tell a lot about fairness.
Critical sections shall be acquired for short periods of time. In most cases shared resources can be dealt with much quicker. A Sleep inside a critical section almost certainly points to a design flaw.
Hint: Reduce the time spent inside the critical section or investigate Semaphore Objects.
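As a rough sketch of the semaphore hint, a kernel semaphore with a maximum count of 1 can stand in for the critical section, so that waiting is handled by the kernel rather than in user mode; the helper names below are invented for the example:

#include <windows.h>

// A binary kernel semaphore used as a lock (initial count 1, maximum count 1).
HANDLE hLock = CreateSemaphore(NULL, 1, 1, NULL);

void LockWithSemaphore(void)
{
    WaitForSingleObject(hLock, INFINITE);   // decrement the count, or block while it is 0
}

void UnlockWithSemaphore(void)
{
    ReleaseSemaphore(hLock, 1, NULL);       // increment the count back to 1
}

Note that, like the critical section, a Windows semaphore does not document a strict FIFO wake-up order, so this changes the waiting mechanism rather than guaranteeing fairness by itself.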

Porting threads to windows. Critical sections are very slow

I'm porting some code to windows and found threading to be extremely slow. The task takes 300 seconds on windows (with two xeon E5-2670 8 core 2.6ghz = 16 core) and 3.5 seconds on linux (xeon E5-1607 4 core 3ghz). Using vs2012 express.
I've got 32 threads all calling EnterCriticalSection(), popping an 80-byte job off a std::stack, calling LeaveCriticalSection, and doing some work (250k jobs in total).
Before and after every critical section call I print the thread ID and current time.
The wait time for a single thread's lock is ~160ms
To pop the job off the stack takes ~3ms
Calling leave takes ~3ms
The job takes ~1ms
(roughly same for Debug/Release, Debug takes a little longer. I'd love to be able to properly profile the code :P)
Commenting out the job call makes the whole process take 2 seconds (still more than linux).
I've tried both queryperformancecounter and timeGetTime, both give approx the same result.
AFAIK the job never makes any sync calls, but I can't explain the slowdown unless it does.
I have no idea why copying from a stack and calling pop takes so long.
Another very confusing thing is why a call to leave() takes so long.
Can anyone speculate on why it's running so slowly?
I wouldn't have thought the difference in processor would give a 100x performance difference, but could it be at all related to dual CPUs? (having to sync between separate CPUs than internal cores).
By the way, I'm aware of std::thread but want my library code to work with pre C++11.
edit
//in a while(hasJobs) loop...
EVENT qwe1 = {"lock", timeGetTime(), id};
events.push_back(qwe1);
scene->jobMutex.lock();
EVENT qwe2 = {"getjob", timeGetTime(), id};
events.push_back(qwe2);
hasJobs = !scene->jobs.empty();
if (hasJobs)
{
    job = scene->jobs.front();
    scene->jobs.pop();
}
EVENT qwe3 = {"gotjob", timeGetTime(), id};
events.push_back(qwe3);
scene->jobMutex.unlock();
EVENT qwe4 = {"unlock", timeGetTime(), id};
events.push_back(qwe4);
if (hasJobs)
    scene->performJob(job);
and the mutex class, with linux #ifdef stuff removed...
CRITICAL_SECTION mutex;
...
Mutex::Mutex()
{
    InitializeCriticalSection(&mutex);
}
Mutex::~Mutex()
{
    DeleteCriticalSection(&mutex);
}
void Mutex::lock()
{
    EnterCriticalSection(&mutex);
}
void Mutex::unlock()
{
    LeaveCriticalSection(&mutex);
}
Windows' CRITICAL_SECTION spins in a tight loop when you first try to enter it. It does not suspend the thread that called EnterCriticalSection unless a substantial period has elapsed in the spin loop. So having 32 threads contending for the same critical section will burn and waste a lot of CPU cycles. Try a mutex instead (see CreateMutex).
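A minimal sketch of that suggestion, with the wrapper above re-backed by a kernel mutex instead of a CRITICAL_SECTION (the KernelMutex name is illustrative and error handling is omitted):

#include <windows.h>

class KernelMutex
{
public:
    KernelMutex()  { handle = CreateMutex(NULL, FALSE, NULL); }  // unowned, unnamed mutex
    ~KernelMutex() { CloseHandle(handle); }

    void lock()    { WaitForSingleObject(handle, INFINITE); }    // blocks in the kernel, no spinning
    void unlock()  { ReleaseMutex(handle); }

private:
    HANDLE handle;
};

A kernel mutex always suspends a waiting thread instead of spinning, at the cost of a kernel transition on every lock/unlock, so it is worth measuring whether it actually helps for this workload.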
It seems like your windows threads are facing super contention. They seem totally serialized. You have about 7ms of total processing time in your critical section and 32 threads. If all the threads are queued up on the lock, the last thread in the queue wouldn't get to run until after sleeping about 217ms. This is not too far off your 160ms observed wait time.
So, if the threads have nothing else to do than to enter the critical section, do work, then leave the critical section, this is the behavior I would expect.
Try to characterize the Linux profiling behavior, and see if the program behavior is really an apples-to-apples comparison.

C++11 Thread waiting behaviour: std::this_thread::yield() vs. std::this_thread::sleep_for( std::chrono::milliseconds(1) )

I was told when writing Microsoft-specific C++ code that Sleep(1) is much better than Sleep(0) for spinlocking, due to the fact that Sleep(0) will use more of the CPU time; moreover, it only yields if there is another equal-priority thread waiting to run.
However, with the C++11 thread library, there isn't much documentation (at least that I've been able to find) about the effects of std::this_thread::yield() vs. std::this_thread::sleep_for( std::chrono::milliseconds(1) ); the second is certainly more verbose, but are they both equally efficient for a spinlock, or does it suffer from potentially the same gotchas that affected Sleep(0) vs. Sleep(1)?
An example loop where either std::this_thread::yield() or std::this_thread::sleep_for( std::chrono::milliseconds(1) ) would be acceptable:
void SpinLock( const bool& bSomeCondition )
{
    // Wait for some condition to be satisfied
    while( !bSomeCondition )
    {
        /* Either std::this_thread::yield() or
           std::this_thread::sleep_for( std::chrono::milliseconds(1) )
           is acceptable here. */
    }
    // Do something!
}
The Standard is somewhat fuzzy here, as a concrete implementation will largely be influenced by the scheduling capabilities of the underlying operating system.
That being said, you can safely assume a few things on any modern OS:
yield will give up the current timeslice and re-insert the thread into the scheduling queue. The amount of time that expires until the thread is executed again is usually entirely dependent upon the scheduler. Note that the Standard speaks of yield as an opportunity for rescheduling. So an implementation is completely free to return from a yield immediately if it desires. A yield will never mark a thread as inactive, so a thread spinning on a yield will always produce a 100% load on one core. If no other threads are ready, you are likely to lose at most the remainder of the current timeslice before you get scheduled again.
sleep_* will block the thread for at least the requested amount of time. An implementation may turn a sleep_for(0) into a yield. The sleep_for(1) on the other hand will send your thread into suspension. Instead of going back to the scheduling queue, the thread goes to a different queue of sleeping threads first. Only after the requested amount of time has passed will the scheduler consider re-inserting the thread into the scheduling queue. The load produced by a small sleep will still be very high. If the requested sleep time is smaller than a system timeslice, you can expect that the thread will only skip one timeslice (that is, one yield to release the active timeslice and then skipping the one afterwards), which will still lead to a cpu load close or even equal to 100% on one core.
A few words about which is better for spin-locking. Spin-locking is a tool of choice when expecting little to no contention on the lock. If in the vast majority of cases you expect the lock to be available, spin-locks are a cheap and valuable solution. However, as soon as you do have contention, spin-locks will cost you. If you are worrying about whether yield or sleep is the better solution here spin-locks are the wrong tool for the job. You should use a mutex instead.
For a spin-lock, the case that you actually have to wait for the lock should be considered exceptional. Therefore it is perfectly fine to just yield here - it expresses the intent clearly and wasting CPU time should never be a concern in the first place.
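A minimal spin-lock sketch along those lines, spinning on a std::atomic_flag and yielding on contention (the SpinLock class name is just for this example):

#include <atomic>
#include <thread>

class SpinLock
{
public:
    void lock()
    {
        // Spin until we observe the flag clear; yield while someone else holds it.
        while (flag.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }

    void unlock()
    {
        flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};

Under low contention the test_and_set almost always succeeds immediately, which is exactly the case a spin-lock is meant for.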
I just did a test with Visual Studio 2013 on Windows 7, 2.8GHz Intel i7, default release mode optimizations.
sleep_for(nonzero) appears to sleep for a minimum of around one millisecond and takes no CPU resources in a loop like:
for (int k = 0; k < 1000; ++k)
    std::this_thread::sleep_for(std::chrono::nanoseconds(1));
This loop of 1,000 sleeps takes about 1 second if you use 1 nanosecond, 1 microsecond, or 1 millisecond. On the other hand, yield() takes about 0.25 microseconds each but will spin the CPU to 100% for the thread:
for (int k = 0; k < 4,000,000; ++k) // (commas added for clarity)
    std::this_thread::yield();
std::this_thread::sleep_for(std::chrono::nanoseconds(0)) seems to be about the same as yield() (test not shown here).
In comparison, locking an atomic_flag for a spinlock takes about 5 nanoseconds. This loop is 1 second:
std::atomic_flag f = ATOMIC_FLAG_INIT;
for (int k = 0; k < 200,000,000; ++k)
    f.test_and_set();
Also, a mutex takes about 50 nanoseconds, 1 second for this loop:
for (int k = 0; k < 20,000,000; ++k)
    std::lock_guard<std::mutex> lock(g_mutex);
Based on this, I probably wouldn't hesitate to put a yield in the spinlock, but I almost certainly wouldn't use sleep_for. If you think your locks will be spinning a lot and you are worried about CPU consumption, I would switch to std::mutex if that's practical in your application. Hopefully, the days of really bad std::mutex performance on Windows are behind us.
What you want is probably a condition variable. A condition variable is typically used exactly where you would otherwise write what you are writing: the sleep or yield inside the loop becomes a wait on the condition.
Your code would look like:
std::unique_lock<std::mutex> lck(mtx);
while (!bSomeCondition) {
    cv.wait(lck);
}
Or
std::unique_lock<std::mutex> lck(mtx);
cv.wait(lck, [&bSomeCondition]() { return bSomeCondition; }); // predicate is true when the wait should end
All you need to do is notify the condition variable from another thread when the data is ready. However, you cannot avoid a lock there if you want to use a condition variable.
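The notifying side might look roughly like this; the mtx, cv, and bSomeCondition objects reuse the names from the snippets above and are declared here only so the sketch is self-contained:

#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool bSomeCondition = false;

void SignalConditionMet()
{
    {
        std::lock_guard<std::mutex> lck(mtx);
        bSomeCondition = true;          // publish the state change under the lock
    }
    cv.notify_one();                    // wake the thread blocked in cv.wait()
}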
If you are interested in CPU load while using yield: it's very bad, except in one case (only your application is running, and you are aware that it will basically eat all your resources).
Here is more explanation:
Running yield in a loop will ensure that the CPU releases execution of the thread, but when the system comes back to the thread it will just repeat the yield operation. This can make the thread use the full 100% load of a CPU core.
Running sleep() or sleep_for() is also a mistake; it will block thread execution, but you will have something like wait time on the CPU. Don't be mistaken, this IS a working CPU, just at the lowest possible priority. While this somehow works for simple usage examples (a fully loaded CPU in sleep() is half as bad as a fully loaded working processor), if you want to ensure application responsiveness you would want something like this third example:
Combining the two:
std::chrono::milliseconds duration(1);
while (true)
{
    if (!mutex.try_lock())
    {
        std::this_thread::yield();
        std::this_thread::sleep_for(duration);
        continue;
    }
    return;
}
Something like this will ensure that the CPU yields as soon as this operation is executed, and sleep_for() will also make the CPU wait some time before it even tries the next iteration. This time can of course be adjusted dynamically (or statically) to suit your needs.
Cheers :)

std::this_thread::yield() vs std::this_thread::sleep_for()

What is the difference between C++11 std::this_thread::yield() and std::this_thread::sleep_for()? How to decide when to use which one?
std::this_thread::yield tells the implementation to reschedule the execution of threads. It should be used in a case where you are in a busy-waiting state, like in a thread pool:
...
while(true) {
    if(pool.try_get_work()) {
        // do work
    }
    else {
        std::this_thread::yield(); // other threads can push work to the queue now
    }
}
std::this_thread::sleep_for can be used if you really want to wait for a specific amount of time. This can be used for tasks where timing really matters, e.g. if you really only want to wait for 2 seconds. (Note that the implementation might wait longer than the given duration.)
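A minimal illustration of that timed-wait case, using nothing beyond the standard library:

#include <chrono>
#include <thread>

int main()
{
    // Block the current thread for roughly two seconds (possibly slightly longer).
    std::this_thread::sleep_for(std::chrono::seconds(2));
}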
std::this_thread::sleep_for()
will make your thread sleep for a given time (the thread is suspended for that time).
(http://en.cppreference.com/w/cpp/thread/sleep_for)
std::this_thread::yield()
will give other processes/threads the chance to run (if there are any waiting in the queue).
The execution of the thread is not stopped; it just releases the CPU and stays ready to run.
(http://en.cppreference.com/w/cpp/thread/yield)