Is there no concept of a queue in Windows critical sections?
I have the following render loop in a dedicated thread:
while (!viewer->finish)
{
    EnterCriticalSection(&viewer->lock);
    viewer->renderer->begin();
    viewer->root->render(viewer->renderer);
    viewer->renderer->end();
    LeaveCriticalSection(&viewer->lock);
}
The main thread does the message processing, and when I handle mouse events I try to enter the same critical section, but for some reason the rendering thread runs for about a thousand more iterations (around 10 seconds) before the main thread finally enters the critical section. What's causing this? Even if there is no 'queue' to enter the section, shouldn't it be more like 50/50 instead of 99.9/0.1, as in my case? Both threads have priority 0.
And what is a good way to add such a queue? Would a simple flag like bDoNotRenderAnything suffice?
Edit: the solution in my case was simply to add an event object (a boolean variable would probably work too) that is set every time the message handler needs access to the critical section and reset after it is done with it. The renderer does not enter the section while the variable/event is set, so the message handler never has to wait for more than one rendering iteration.
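A minimal sketch of that fix, assuming a manual-reset event handle (here called wantInput) has been added to the viewer struct; the name and placement are illustrative only:
// Created once at startup:
//   viewer->wantInput = CreateEvent(NULL, TRUE, FALSE, NULL);   // manual-reset, initially non-signaled

// Render thread: skip the lock while the message handler wants it.
while (!viewer->finish)
{
    if (WaitForSingleObject(viewer->wantInput, 0) == WAIT_OBJECT_0)
    {
        Sleep(0);   // give up the rest of the time slice
        continue;   // do not take the lock this iteration
    }
    EnterCriticalSection(&viewer->lock);
    viewer->renderer->begin();
    viewer->root->render(viewer->renderer);
    viewer->renderer->end();
    LeaveCriticalSection(&viewer->lock);
}

// Message handler (e.g. on a mouse event):
SetEvent(viewer->wantInput);              // announce the intent to enter the section
EnterCriticalSection(&viewer->lock);
// ... handle the event ...
LeaveCriticalSection(&viewer->lock);
ResetEvent(viewer->wantInput);            // let the renderer resume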
In older versions of Windows, critical sections were guaranteed to be acquired on a first-come, first-served basis. This is no longer the case starting with Windows Server 2003 SP1.
From the MSDN:
Starting with Windows Server 2003 with Service Pack 1 (SP1), threads waiting on a critical section do not acquire the critical section on a first-come, first-serve basis. This change increases performance significantly for most code. However, some applications depend on first-in, first-out (FIFO) ordering and may perform poorly or not at all on current versions of Windows (for example, applications that have been using critical sections as a rate-limiter). To ensure that your code continues to work correctly, you may need to add an additional level of synchronization. For example, suppose you have a producer thread and a consumer thread that are using a critical section object to synchronize their work. Create two event objects, one for each thread to use to signal that it is ready for the other thread to proceed. The consumer thread will wait for the producer to signal its event before entering the critical section, and the producer thread will wait for the consumer thread to signal its event before entering the critical section. After each thread leaves the critical section, it signals its event to release the other thread.
Windows Server 2003 and Windows XP: Threads that are waiting on a critical section are added to a wait queue; they are woken and generally acquire the critical section in the order in which they were added to the queue. However, if threads are added to this queue at a fast enough rate, performance can be degraded because of the time it takes to awaken each waiting thread.
Threads waiting on a critical section do not acquire the critical section on a first-come, first-serve basis (MSDN)
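A rough sketch of the two-event handshake that quote describes, with invented names (producerDone, consumerDone) and an assumed, already-initialized critical section cs:
HANDLE producerDone = CreateEvent(NULL, FALSE, FALSE, NULL); // auto-reset, non-signaled
HANDLE consumerDone = CreateEvent(NULL, FALSE, TRUE,  NULL); // auto-reset, signaled so the producer may start

// Producer thread
for (;;)
{
    WaitForSingleObject(consumerDone, INFINITE);  // wait until the consumer has had its turn
    EnterCriticalSection(&cs);
    // ... produce ...
    LeaveCriticalSection(&cs);
    SetEvent(producerDone);                       // release the consumer
}

// Consumer thread
for (;;)
{
    WaitForSingleObject(producerDone, INFINITE);  // wait until the producer has had its turn
    EnterCriticalSection(&cs);
    // ... consume ...
    LeaveCriticalSection(&cs);
    SetEvent(consumerDone);                       // release the producer
}
The two events force strict alternation, which is exactly the extra level of synchronization the quote recommends.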
Most of the time your worker thread owns the lock, because it re-locks immediately after releasing it. So there's not much time for the other thread to wake up and catch the lock while it's free.
According to MSDN:
There is no guarantee about the order in which waiting threads will acquire ownership of the critical section.
so it is not certain in which order the threads will execute. And if your rather short
viewer->renderer->begin();
viewer->root->render(viewer->renderer);
viewer->renderer->end();
sequence manages to re-acquire the critical section before the waiting thread gets it, this is what can happen.
You can try a quick fix by calling SwitchToThread in your rendering loop (after a certain number of iterations), although I doubt it will be a good enough solution.
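A possible shape of that quick fix (the iteration count of 10 is arbitrary):
int iteration = 0;
while (!viewer->finish)
{
    EnterCriticalSection(&viewer->lock);
    viewer->renderer->begin();
    viewer->root->render(viewer->renderer);
    viewer->renderer->end();
    LeaveCriticalSection(&viewer->lock);

    if (++iteration % 10 == 0)
        SwitchToThread();   // yield to any other thread that is ready to run on this processor
}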
Related
In a project I ran into a case like this (on Windows 7):
When several threads are busy (all my CPU cores are busy working), there can be a delay before a thread receives a semaphore (incremented from 0 to 1). It may be as long as 1.5 ms.
I solved this by caching a few things and incrementing the semaphore earlier.
So to me it seems that signaling a semaphore is slow; it is not received immediately by the waiting threads (especially when the CPUs are busy). But if you signal it earlier, before a thread begins to wait on it, there is no delay.
I once thought an event was just a semaphore with a maximum value of 1. Having met this case, I'm beginning to wonder whether an event is faster than a semaphore at waking threads up.
Sorry, I tried but couldn't come up with a demo; I'm not very good at threading yet.
EDIT:
Is it true that Event is faster than Semaphore on Windows?
A delay of 1.5 milliseconds is not explained just by the overhead difference between multithreading primitives.
To simplify, threads have three states:
blocked
runnable
running
If a thread is waiting on a semaphore or an event, then it's blocked. When the event is signalled, it becomes runnable.
So the real question is, "When does a runnable thread actually run?" This varies according to the scheduler algorithm, etc., but obviously it needs to run on a core, and that means nothing else can be "running" on that core at the same time. The scheduler will normally remove the currently running thread from a core when one of the following happens:
It waits on a semaphore/event, and so becomes 'blocked'.
It's been running continually for a certain time (time-based, or round-robin scheduling).
A higher-priority thread becomes runnable.
The 1.5 milliseconds is probably round-robin or time-based scheduling: your thread is runnable but just hasn't started yet. If the thread must start, and should boot out the current thread, then you can try to increase its priority via SetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx
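A minimal sketch of that call; hWorker is just an assumed thread handle and the priority level is only an example, since raising priorities should be done sparingly:
#include <windows.h>

// From the creating thread, given a handle returned by CreateThread:
SetThreadPriority(hWorker, THREAD_PRIORITY_ABOVE_NORMAL);             // hWorker is assumed

// Or a thread can raise its own priority:
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);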
If a thread is waiting on a semaphore and it gets signaled, the thread will, in my limited testing, become running in ~10 µs on a box that is not overloaded.
Signaling, and subsequent dispatching onto a core, will take longer if:
The signaled thread is in a different process from the thread it preempts.
Running the signaled thread requires a thread running on another core to be preempted.
The box is already overloaded with higher-priority threads.
1.5ms must represent an extreme case where your box is very busy.
In such a case, replacing the semaphore with an event is unlikely to result in any significant improvement to overall signaling latency, because the bulk of the work/delay required by the inter-thread signaling is tied up in the scheduling/dispatching, which is required in either case.
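For what it's worth, here is a rough, self-contained sketch of how one could measure that wake-up latency on Windows; the structure and the 100 ms settle delay are illustrative, not taken from the answer above:
#include <windows.h>
#include <stdio.h>

static HANDLE sem;
static LARGE_INTEGER signaled;

DWORD WINAPI Waiter(LPVOID)
{
    WaitForSingleObject(sem, INFINITE);       // blocked until ReleaseSemaphore below
    LARGE_INTEGER woke, freq;
    QueryPerformanceCounter(&woke);
    QueryPerformanceFrequency(&freq);
    printf("wake-up latency: %.1f us\n",
           (woke.QuadPart - signaled.QuadPart) * 1e6 / freq.QuadPart);
    return 0;
}

int main()
{
    sem = CreateSemaphore(NULL, 0, 1, NULL);
    HANDLE t = CreateThread(NULL, 0, Waiter, NULL, 0, NULL);
    Sleep(100);                               // let the waiter block first
    QueryPerformanceCounter(&signaled);
    ReleaseSemaphore(sem, 1, NULL);           // the waiter becomes runnable here
    WaitForSingleObject(t, INFINITE);
    return 0;
}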
What happens when a thread is put to sleep by another thread, possibly the main thread, in the middle of its execution?
Assume I have a Producer function. What if the Consumer sleep()s the Producer in the middle of producing one unit?
Suppose the unit is half produced and the thread is then put to sleep(); the integrity of the system may be compromised.
The thread that sleep is invoked on is put in the idle queue by the thread scheduler and is context-switched out of the CPU it is running on, so other threads can take its place.
All context (registers, stack pointer, base pointer, etc.) is saved on the thread's stack, so when it runs next it can continue from where it left off.
The OS is constantly doing context switches between threads in order to make your system seem like it's doing multiple things. The OS thread scheduler algorithm takes care of that.
Thread scheduling and threading is a big subject, if you want to really understand it, I suggest you start reading up on it. :)
EDIT: Using sleep for thread synchronization purposes is not advised; you should use proper synchronization mechanisms to tell the thread to wait for other threads, etc.
There is no problem associated with this, unless some state is mutated while the thread sleeps, so it wakes up with a different set of values than before going to sleep.
Threads are switched in and out of execution by the CPU all the time, but that does not affect the overall outcome of their execution, assuming no data races or other bugs are present.
It would be inadvisable for one thread to forcibly and synchronously interfere with the execution of another thread. One thread could send an asynchronous message to another, requesting that it reschedule itself in some way, but that would be handled by the other thread when it was in a suitable state to do so.
Assuming they communicate using channels that are thread-safe, nothing bad should happen, as the sleeping thread will wake up eventually and grab data from its task queue or see that some semaphore has been set and read the produced data.
If the threads communicate using nonvolatile variables or direct function calls that change state, that's when Bad Things occur.
I don't know of a way for a thread to forcibly cause another thread to sleep. If two threads are accessing a shared resource (like an input/output queue, which seems likely for your Producer/Consumer example), then both threads may contend for the same lock. The losing thread must wait for the other thread to release the lock if the contention is not of the "trylock" variety. The thread that waits is placed into a waiting queue associated with the lock and is removed from the scheduler's run queue. When the winning thread releases the lock, the code checks the queue to see if there are threads still waiting to acquire it. If there are, one is chosen as the winner, given the lock, and placed in the scheduler's run queue.
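A small sketch of that contention, assuming a shared queue guarded by one lock; the Unit type, make_unit helper, and all names are invented for illustration:
#include <windows.h>
#include <deque>

struct Unit { int data; };
Unit make_unit() { Unit u; u.data = 42; return u; }   // stand-in for the real production work

CRITICAL_SECTION queueLock;        // InitializeCriticalSection(&queueLock) at startup
std::deque<Unit> sharedQueue;

void producer_step()
{
    Unit u = make_unit();                  // being preempted or put to sleep here is harmless
    EnterCriticalSection(&queueLock);      // from here the consumer may have to wait...
    sharedQueue.push_back(u);              // ...so it never sees a half-produced unit
    LeaveCriticalSection(&queueLock);      // ...until here, when a waiter can be released
}

void consumer_step()
{
    EnterCriticalSection(&queueLock);      // blocks if the producer currently owns the lock
    if (!sharedQueue.empty())
    {
        Unit u = sharedQueue.front();
        sharedQueue.pop_front();
        // ... consume u ...
    }
    LeaveCriticalSection(&queueLock);
}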
I have a program that should get the maximum out of my CPU.
It is multithreaded via pthreads that do their job well, apart from the fact that they "only" get my cores to about 60% load, which is not enough in my opinion.
I am searching for the reason and am asking myself (and hereby you) whether the blocking functions mutex_lock/cond_wait are candidates.
What happens when a thread cannot continue in such a function?
Does pthread switch to another thread it handles, or
does the thread yield its time to the system? And if the latter is the case, can I change this behavior?
More Information
The setup is one main thread that fills the task pool and numerous workers that fetch jobs from there and wait on a condition variable that is signaled via broadcast when a serialized calculation is done. They go on with the values from this calculation until they are done, deliver their results, and fetch the next job...
On a typical modern pthreads implementation, each thread is managed by the kernel not unlike a separate process. Any blocking call like pthread_mutex_lock or pthread_cond_wait (but also, say, read) will yield its time to the system. The system will then find another eligible thread to schedule, whether in your process or another process, and run it.
If your program is only taking 60% of the CPU, it is more likely blocked on I/O than on pthread operations, unless you have done something way too granular with your pthread operations.
If a thread is waiting on a mutex/condition variable, it doesn't use resources (well, just a tiny amount). Whenever the thread enters the waiting state, control switches to other threads. When the mutex is released (or the condition variable is signalled), the thread wakes up and may acquire the mutex (if no other thread grabs it first) and continue to run. If, however, some other thread acquires the mutex first (this can happen if several threads are waiting for it), the thread returns to the sleeping state.
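As a rough sketch of that blocking behaviour, shaped like the question's worker loop (task_ready, queue_mutex, task_count and the commented-out main-thread part are all invented for illustration):
#include <pthread.h>

pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  task_ready  = PTHREAD_COND_INITIALIZER;
int task_count = 0;                          // stands in for the real task pool

void *worker(void *arg)
{
    (void)arg;
    for (;;)
    {
        pthread_mutex_lock(&queue_mutex);
        while (task_count == 0)
            pthread_cond_wait(&task_ready, &queue_mutex); // blocked here: no CPU used, the core is free for others
        --task_count;                        // "fetch" a job
        pthread_mutex_unlock(&queue_mutex);
        // ... do the CPU-bound work outside the lock ...
    }
    return NULL;
}

// Main thread, after a serialized calculation finishes:
//   pthread_mutex_lock(&queue_mutex);
//   task_count += new_jobs;                 // new_jobs is illustrative
//   pthread_cond_broadcast(&task_ready);    // wake all waiting workers
//   pthread_mutex_unlock(&queue_mutex);
If the cores still sit at 60% with a structure like this, the time is most likely spent blocked on I/O or waiting for the serialized part, not in the pthread primitives themselves.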
When using boost::conditional_variable, ACE_Conditional, or pthread_cond_wait directly, is there any overhead for the waiting itself? These are the more specific issues that trouble me:
After the waiting thread is unscheduled, will it be scheduled back before the wait expires and then unscheduled again, or will it stay unscheduled until signaled?
Does wait periodically acquire the mutex? In that case, I guess it wastes some CPU time each iteration on system calls to lock and release the mutex. Is it the same as constantly acquiring and releasing a mutex?
Also, then, how much time passes between the signal and the return from wait?
AFAIK, when using semaphores, the responsiveness of the acquire calls depends on the scheduler's time-slice size. How does it work in pthread_cond_wait? I assume this is platform dependent. I am more interested in Linux, but if someone knows how it works on other platforms, that will help too.
And one more question: are there any additional system resources allocated for each condition variable? I won't create 30000 mutexes in my code, but should I worry about 30000 condition variables that all use the same mutex?
Here's what is written in the pthread_cond man page:
pthread_cond_wait atomically unlocks the mutex and waits for the condition variable cond to be signaled. The thread execution is suspended and does not consume any CPU time until the condition variable is signaled.
So from here I'd answer to the questions as following:
The waiting thread won't be scheduled back before the wait is signaled or canceled.
There are no periodic mutex acquisitions. The mutex is reacquired only once, before wait returns.
The time that passes between the signal and the return from wait is similar to the normal thread-scheduling delay that follows a mutex release.
Regarding the resources, on the same man page:
In the LinuxThreads implementation, no resources are associated with condition variables, thus pthread_cond_destroy actually does nothing except checking that the condition has no waiting threads.
Update: I dug into the sources of pthread_cond_* functions and the behavior is as follows:
All the pthread conditionals in Linux are implemented using futex.
When a thread calls wait it is suspended and unscheduled. The thread id is inserted at the tail of a list of waiting threads.
When a thread calls signal the thread at the head of the list is scheduled back.
So, the waking is as efficient as the scheduler, no OS resources are consumed and the only memory overhead is the size of the waiting list (see futex_wake function).
You should only call pthread_cond_wait if the variable is already in the "wrong" state. Since it always waits, there is always the overhead associated with putting the current thread to sleep and switching.
When the thread is unscheduled, it is unscheduled. It should not use any resources, but of course an OS can in theory be implemented badly. It is allowed to re-acquire the mutex, and even to return, before the signal (which is why you must double-check the condition), but the OS will be implemented so this doesn't impact performance much, if it happens at all. It doesn't happen spontaneously, but rather in response to another, possibly-unrelated signal.
30000 mutexes shouldn't be a problem, but some OSes might have a problem with 30000 sleeping threads.
If a critical section lock is currently owned by a thread and other threads try to acquire that same lock, then all threads other than the owner enter a wait queue until the lock is released.
When the owning thread releases the critical section lock, one of the threads in the wait queue is selected, given the lock, and allowed to run.
How is the next thread to run selected, given that it is not guaranteed that the thread that arrived first will be the next owner of the lock?
If threads are not served in FIFO fashion, how is the next owner thread selected from the wait queue?
The next thread to get the critical section is chosen non-deterministically. The only thing that you should be concerned about is whether the critical section is implemented fairly, i.e., that no thread waits infinitely long to get its turn. If you need to run threads in a specific order, you have to implement this yourself.
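One way to "implement this yourself" is a ticket scheme layered on the critical section plus a Windows CONDITION_VARIABLE (available since Vista); a minimal sketch with invented names, assuming cs and turnChanged have already been set up with InitializeCriticalSection and InitializeConditionVariable:
CRITICAL_SECTION cs;
CONDITION_VARIABLE turnChanged;
LONG nextTicket = 0;   // next ticket to hand out
LONG nowServing = 0;   // ticket currently allowed to proceed

void enter_in_order()
{
    EnterCriticalSection(&cs);
    LONG myTicket = nextTicket++;
    while (myTicket != nowServing)
        SleepConditionVariableCS(&turnChanged, &cs, INFINITE); // releases cs while sleeping
    // ... do the work that must happen in ticket order, still holding cs ...
    nowServing++;
    WakeAllConditionVariable(&turnChanged);   // let the next ticket holder proceed
    LeaveCriticalSection(&cs);
}
This guarantees service in ticket order at the cost of an extra wake-up broadcast per release.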
The next thread is chosen in quasi-FIFO order. However, many system-level factors may cause this to appear non-deterministic:
From Concurrent Programming on Windows by Joe Duffy (Chapter 5):
... When a fixed number of threads needs to be awakened, the OS uses a semi-fair algorithm to choose between them: as threads wait they are placed in a FIFO queue that the awakening logic consults when determining which thread to wake up. Threads that have been waiting for the longest time are thus preferred over threads that have been waiting less time. Although the OS does use a strict FIFO data structure to manage wait lists; ... this ordering is regularly perturbed by other system code and is not reliable.
POSIX threads do use the FIFO queue.
What about the thread scheduling algorithm? The threads in the waiting state get priority according to the thread scheduling algorithm.
Please correct me if I am wrong.