I want to measure time very precisely (in milliseconds) and start other threads (4 in total) at certain times. I tried it with:
#include <ctime>

std::clock_t start;

double Time()
{
    // elapsed time since start, converted from clock ticks to milliseconds
    double duration = 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC;
    return duration;
}
//...
start = std::clock();
while (Time() < 1000)
{
    //Start thread...
    //...
}
It works, but in every experiment I received a different result (a small difference).
Is that even possible? Does it depend on how many programs are running in the background (which slows down my computer)? If it is possible, what should I use? Thanks
(sorry for my English)
The operating system runs in quanta - little chunks of processing which are below our level of perception.
Within a single quantum, the CPU should behave reasonably stably. If your task uses more than one quantum of time, then the operating system is free to use slices of time for other tasks.
Using a condition variable, you can notify_all to wake up any waiting threads.
So start the threads, but before they are measured and start working, have them wait on a condition_variable. Then, when notify_all is called on the condition_variable, the threads become runnable. Because they are started at the same time, you should get synchronized, stable results.
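A minimal sketch of that pattern (assuming 4 worker threads and a steady clock for the timing; the measured work itself is a placeholder):

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
bool go = false;

void worker(int id)
{
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, []{ return go; });   // all workers block here until released
    }
    // ... measured work goes here ...
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);

    auto start = std::chrono::steady_clock::now();
    {
        std::lock_guard<std::mutex> lk(m);
        go = true;
    }
    cv.notify_all();                     // release all workers at (nearly) the same time

    for (auto& t : threads)
        t.join();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << "elapsed: " << ms << " ms\n";
}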
Variance still occurs, for several reasons:
Not scheduled - the cores on your CPU are doing other things, so one or more threads miss their quantum.
Blocked on IO - if a thread needs to interact with the disk, that can block it until data is available.
Blocked on a mutex - if the threads modify a shared resource, waiting for the resource to become free adds time.
Cache behavior - some operations cause the caches of all the CPUs to be flushed, which affects the performance of all the threads.
Whether data is in the cache or not - the CPU runs faster from the L1 cache than from main memory. If the threads read the same data, they help each other get that data cached and run at roughly the same speed.
I want to briefly suspend multiple C++ std threads, running on Linux, at the same time.
It seems this is not supported by the OS.
The threads work on tasks that take an uneven and unpredictable amount of time (several seconds).
I want to suspend them when the CPU temperature rises above a threshold.
It is impractical to check for suspension within the tasks, only in between tasks.
I would like to simply have all workers suspend operation for a few milliseconds.
How could that be done?
What I'm currently doing
I'm currently using a condition variable in a slim, custom binary semaphore class (think C++20 Semaphore).
A worker checks for suspension before starting the next task by acquiring and immediately releasing the semaphore.
A separate control thread occupies the control semaphore for a few milliseconds if the temperature is too high.
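A simplified sketch of that gate (illustrative only; the real class differs), built on a mutex and a condition variable:

#include <condition_variable>
#include <mutex>

class BinarySemaphore {
    std::mutex m;
    std::condition_variable cv;
    bool available = true;
public:
    void acquire()
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return available; });
        available = false;
    }
    void release()
    {
        { std::lock_guard<std::mutex> lk(m); available = true; }
        cv.notify_one();
    }
};

// Worker, before starting the next task:   sem.acquire(); sem.release();
// Control thread, when the CPU is too hot: sem.acquire(); /* sleep a few ms */ sem.release();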
This often works well and the CPU temperature is stable.
I do not care much about a slight delay in suspending the threads.
However, when one task takes some seconds longer than the others, its thread will continue to run alone.
This activates CPU turbo mode, which is the opposite of what I want to achieve (it is comparatively power inefficient, thus bad for thermals).
I cannot deactivate CPU turbo as I do not control the hardware.
In other words, the tasks take too long to complete.
So I want to forcefully pause them from outside.
I want to suspend them when the CPU temperature rises above a threshold.
In general, that is putting the cart before the horse.
Properly designed hardware should have adequate cooling for maximum load and your program should not be able to exceed that cooling capacity.
In addition, since you are talking about Turbo, we can assume an Intel CPU, which will thermally throttle all on its own, making your program run slower without you doing anything.
In other words, the tasks take too long to complete
You could break the tasks into smaller parts, and check the semaphore more often.
A separate control thread occupies the control semaphore for a few milliseconds
It's really unlikely that your hardware can react to millisecond delays -- that's too short a timescale for anything thermal. You will probably be better off monitoring the temperature and simply reducing the number of tasks you are scheduling when the temperature is rising and getting close to your limits.
I've now implemented it with pthread_kill and SIGRT.
Note that suspending threads in an unknown state (whatever the target task was doing at the time of signal receipt) is a recipe for deadlocks. The task may be inside malloc, may be holding arbitrary locks, etc.
If your "control thread" also needs that lock, it will block and you lose. Your control thread must execute only direct system calls, may not call into libc, etc. etc.
This solution is ~impossible to test, and ~impossible to implement correctly.
I heard that the optimal number of threads depends on whether they are CPU bound or not. But what exactly does that mean?
Suppose that most of the time my threads will sleep via the Sleep function from WinAPI. Should I consider such threads non-CPU bound and increase their number beyond the CPU core count?
A thread is bound by a resource if it spends most of its time using it, and thus its speed is bound by the speed of that resource.
Given the above definition, a thread is CPU bound if its most used resource is the computing power of the CPU, that is, it is a thread that does heavy computation. You gain nothing from running more of these than there are available cores, because they will just compete for CPU time.
You can, instead, run more threads than there are cores when the threads are bound by other resources (most commonly files), because they will spend most of their time waiting for those to be ready, and thus leave the CPU available for other threads.
A thread that spends most time sleeping does not use the CPU very much, and thus it is not CPU bound.
EDIT: examples of non-CPU bound threads are threads that read files, wait for network connections, talk to PCI-connected devices, spend most of their time waiting on condition variables, and GUI threads that wait for user input.
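As a small illustration of the sizing rule (the oversubscription factor for non-CPU-bound work is just an assumption to tune per workload):

#include <algorithm>
#include <thread>

// Pick a thread count depending on whether the work is CPU bound.
unsigned pickThreadCount(bool cpuBound)
{
    unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    return cpuBound ? hw       // one thread per core: more would only compete for CPU time
                    : hw * 4;  // threads that mostly sleep or wait can be oversubscribed
}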
As said here: How to reduce CUDA synchronize latency / delay
There are two approaches for waiting for a result from the device:
"Polling" - burn CPU in a spin loop - to decrease latency while waiting for the result
"Blocking" - the thread sleeps until an interrupt occurs - to increase overall performance
For "Polling" I need to use cudaDeviceScheduleSpin.
But for "Blocking", which do I need to use: cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync?
What is the difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield?
cudaDeviceScheduleYield, as documented at http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g18074e885b4d89f5a0fe1beab589e0c8.html :
"Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device." - i.e. it waits for the result without burning CPU in a spin loop - i.e. "Blocking". And cudaDeviceScheduleBlockingSync does the same - it waits for the result without burning CPU in a spin loop. But what is the difference?
To my understanding, both approaches use polling to synchronize. In pseudo-code, cudaDeviceScheduleSpin is:
while (!IsCudaJobDone())
{
}
whereas cudaDeviceScheduleYield is:
while (!IsCudaJobDone())
{
    std::this_thread::yield();
}
i.e. cudaDeviceScheduleYield tells the operating system that it may interrupt the polling thread and activate another thread doing other work. This increases the performance of other CPU threads but also increases latency, in case the CUDA job finishes at the very moment when a thread other than the polling one is active.
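For completeness, a small host-side sketch of selecting the blocking behaviour (assuming the flag is set before the CUDA context is created):

#include <cuda_runtime.h>

int main()
{
    // Ask CUDA to block (sleep) the calling host thread on synchronization,
    // instead of spinning or yielding.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    // ... launch kernels ...

    cudaDeviceSynchronize();   // the host thread sleeps until the device is done
    return 0;
}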
I'm using busy waiting to synchronize access to critical regions, like this:
while (p1_flag != T_ID);
/* begin: critical section */
for (int i=0; i<N; i++) {
...
}
/* end: critical section */
p1_flag++;
p1_flag is a global volatile variable that is updated by another concurrent thread. In fact, I have two critical sections inside a loop, and two threads (both executing the same loop) that alternate execution of these critical regions. For instance, the critical regions are named A and B.
Thread 1    Thread 2
A
B           A
A           B
B           A
A           B
B           A
            B
The parallel code executes faster than the serial one, however not as much as I expected. Profiling the parallel program with VTune Amplifier, I noticed that a large amount of time is spent in the synchronization directives, that is, the while(...) and the flag update. I'm not sure why I'm seeing such a large overhead on these "instructions", since region A is exactly the same as region B. My best guess is that this is due to cache coherence latency: I'm using an Intel i7 Ivy Bridge machine, and this micro-architecture resolves cache coherence at the L3. VTune also reports that the while (...) instruction is consuming all the front-end bandwidth, but why?
To make the question(s) clear: Why are the while(...) and flag-update instructions taking so much execution time? Why would the while(...) instruction saturate the front-end bandwidth?
The overhead you're paying may very well be due to passing the sync variable back and forth between the core caches.
Cache coherency dictates that when you modify the cache line (p1_flag++) you need to have ownership of it. This means it must invalidate any copy existing in other cores and wait for any changes made by those cores to be written back to a shared cache level. The line is then provided to the requesting core in M state and the modification is performed.
However, the other core would by then be constantly reading this line, a read that would snoop the first core and ask whether it has a copy of that line. Since the first core holds an M copy of that line, the line would be written back to the shared cache and the core would lose ownership.
Now this depends on the actual implementation in hardware, but if the line was snooped before the change was actually made, the first core would have to attempt to gain ownership of it again. In some cases I'd imagine this might take several iterations of attempts.
If you're set on using a busy wait, you should at least use a pause inside it: the _mm_pause intrinsic, or just __asm("pause"). This both gives the other thread a chance to get the lock and release you from waiting, and reduces the CPU effort spent busy waiting (an out-of-order CPU would fill all its pipelines with parallel instances of this busy wait, consuming lots of power - a pause serializes it so that only a single iteration can run at any given time - much less power hungry, with the same effect).
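For illustration, such a spin-wait might look like this (a sketch using std::atomic instead of volatile, which is the portable way to share the flag; T_ID follows the question's naming):

#include <atomic>
#include <immintrin.h>   // _mm_pause (x86)

std::atomic<int> p1_flag{0};

void wait_for_turn(int T_ID)
{
    // Spin until it is this thread's turn, hinting to the CPU that this is a
    // spin-wait loop so it backs off instead of filling the pipelines.
    while (p1_flag.load(std::memory_order_acquire) != T_ID)
        _mm_pause();
}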
A busy-wait is almost never a good idea in multithreaded applications.
When you busy-wait, thread scheduling algorithms will have no way of knowing that your loop is waiting on another thread, so they must allocate time as if your thread is doing useful work. And it does take processor time to check that variable over, and over, and over, and over, and over, and over...until it is finally "unlocked" by the other thread. In the meantime, your other thread will be preempted by your busy-waiting thread again and again, for no purpose at all.
This is an even worse problem if the scheduler is a priority-based one, and the busy-waiting thread is at a higher priority. In this situation, the lower-priority thread will NEVER preempt the higher-priority thread, thus you have a deadlock situation.
You should ALWAYS use semaphores or mutex objects or messaging to synchronize threads. I've never seen a situation where a busy-wait was the right solution.
When you use a semaphore or mutex, then the scheduler knows never to schedule that thread until the semaphore or mutex is released. Thus your thread will never be taking time away from threads that do real work.
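As a sketch of that alternative (not the question's code), the turn can be handed over with a mutex and a condition variable, so the waiting thread sleeps instead of spinning:

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
int turn = 0;   // which thread may enter its critical section next

void wait_for_turn(int my_id)
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&]{ return turn == my_id; });   // sleeps; no CPU time wasted
}

void pass_turn(int next_id)
{
    { std::lock_guard<std::mutex> lk(m); turn = next_id; }
    cv.notify_all();
}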
How can I measure the amount of time a mutex gives back to the OS? The main goal is to detect the mutex that blocks threads for the largest amount of time.
PS: I tried OProfile. It reports 30% of the time spent inside vmlinux/.poll_idle. This is unexpected, because the app is designed to use 100% of its core. Therefore, I suspect that the time is given back to the OS while waiting for some mutex, and OProfile reports it as IDLE time.
Profile.
Whenever the question is "What really takes the [most|least] time?", the answer is always "Profile to find out.".
As suggested - profile, but decide beforehand what you want to measure: elapsed time (how long threads were blocked), or user/kernel time (what it cost you to perform the synchronization). In different scenarios you might want to measure one or the other, or both.
You could profile your program using, say, OProfile on Linux. Then filter your results to look at the time spent in pthread_mutex_lock() for each mutex, or in the higher-level function that performs the locking. Since the program blocks inside the lock function call until the mutex is acquired, profiling the time spent in that function should give you an idea of which mutexes are your most expensive.
#include <chrono>
#include <mutex>

// somewhere with access to the mutex being measured (here called mtx):
auto start = std::chrono::steady_clock::now();
mtx.lock();                                    // blocks until the mutex is acquired
auto stop = std::chrono::steady_clock::now();
auto elapsedTime = stop - start;
elapsedTime is the amount of time it took to grab the mutex. If it is bigger than some small value, it's because another thread has the mutex. This won't show how long the OS has the mutex, only that another thread has it.