How to minimize CPU consumption with a fixed thread loop period - c++

I am using pthread library 2.8, and the OS kernel is Linux 2.6.37 on ARM. In my program, thread A uses the pthread interfaces to set its scheduling priority to the halfway point between sched_get_priority_min(policy) and sched_get_priority_max(policy).
In the thread function, the loop is essentially:

{
    // do my work
    pthread_cond_timedwait(..., ..., 15 ms timeout);
}
I find that this thread consumes about 3% CPU. If I change the timeout to 30 ms, that drops to 1.3%. However, I cannot increase the timeout. Is there a way to reduce the CPU consumption without increasing the timeout? The cost seems to come from thread switching.
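For reference, the loop has roughly this shape (a simplified sketch, not my real code; the mutex/condvar names are placeholders and the actual work happens elsewhere):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond  = PTHREAD_COND_INITIALIZER;

static void* worker(void*)
{
    for (;;) {
        // ... do my work ...

        // Build an absolute deadline 15 ms from now. The default condvar
        // clock is CLOCK_REALTIME, so the deadline uses the same clock.
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 15L * 1000 * 1000;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec  += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&g_mutex);
        // Returns either when signalled or when the 15 ms timeout expires.
        pthread_cond_timedwait(&g_cond, &g_mutex, &deadline);
        pthread_mutex_unlock(&g_mutex);
    }
    return NULL;
}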

Using this sort of construct will cause approximately 67 task switches per second, and most likely each of those is a switch to a different process, which means a complete context switch, including page tables. It's been a while since I looked at what that involves on ARM, but I'm sure it's not a "lightweight" operation. If we count backwards, 1.75% of the CPU spread over those ~67 switches per second comes out to about 210k clock cycles per task switch, which seems like quite a lot. But I'm not sure how much work is involved in scrubbing TLBs, caches and the like.
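As a back-of-the-envelope check of that arithmetic (the core clock rate is an assumption; the question doesn't state it):

#include <cstdio>

int main()
{
    const double clock_hz = 800e6;   // assumed ~800 MHz ARM core clock (not given in the question)
    const double period_s = 0.015;   // 15 ms pthread_cond_timedwait timeout
    const double cpu_frac = 0.0175;  // CPU share attributed to the switching

    const double switches_per_sec  = 1.0 / period_s;                         // ~67
    const double cycles_per_switch = clock_hz * cpu_frac / switches_per_sec; // ~210k

    std::printf("%.0f switches/s, ~%.0f cycles per switch\n",
                switches_per_sec, cycles_per_switch);
    return 0;
}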

Related

high overhead of std::condition_variable::wait - synchronizing two threads 44100 times/second

Background: synchronization of an emulated microcontroller (Motorola MC68331) running at 20 MHz in thread A with an emulated DSP (Motorola 56300) running at 120 MHz in thread B.
I need to synchronize them at audio rate, i.e. 44100 times per second.
The current approach is to use a std::condition_variable, but the overhead of wait() is too high. At the moment I'm profiling on a Windows system; however, this has to work on Windows, Mac and possibly Linux/Android, too.
In particular, the issue is a jmp instruction inside SleepConditionVariableSRW, which is very costly.
I have already tried some other options. The various sleep variants are too imprecise and usually take far too long; the best one can get out of Windows is roughly one millisecond, whereas the maximum sleep time here should be no more than 1/44100 seconds, i.e. about 22 us.
The closest one can get on Windows is CreateWaitableTimerEx with the high-resolution flag, but even then the overhead of those calls is higher than that of the std::condition_variable.
I also tried a spin loop with std::this_thread::yield, but that results in higher CPU usage overall.
Is there anything I am missing / could try? More than 50% of the CPU time is wasted in the wait code, which I'd like to eliminate.
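For context, the synchronization is roughly this pattern (a simplified sketch, not the actual emulator code; the names are made up):

#include <condition_variable>
#include <cstdint>
#include <mutex>

std::mutex              m;
std::condition_variable cv;
uint64_t                samplesReady = 0;   // advanced by thread A

// Thread A (emulated MC68331): runs one sample worth of cycles, then signals.
void threadA_advanceOneSample()
{
    // ... emulate the microcontroller for 1/44100 s of emulated time ...
    {
        std::lock_guard<std::mutex> lock(m);
        ++samplesReady;
    }
    cv.notify_one();
}

// Thread B (emulated DSP56300): blocks until the next sample is available.
// On Windows the wait() ends up in SleepConditionVariableSRW, which is
// where the profiler shows the cost.
void threadB_waitForSample(uint64_t expected)
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return samplesReady >= expected; });
    // ... emulate the DSP for 1/44100 s of emulated time ...
}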
Thanks in advance!

Execution time of thread changes

I'm using this method: Measure execution time in C (on Windows) to measure the execution time of all threads. I'm on Windows and I'm using VC++.
Something like this:
// initialize start, end, frequency
std::thread thread1(...);
// ... and so on: initialize all the threads
thread1.join();
// ... join all the threads
// time = (end - start) / frequency
printf(time);
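Expanded, the timing part looks roughly like this (a sketch; the thread bodies are placeholders for the real work):

#include <windows.h>
#include <cstdio>
#include <thread>

void work() { /* ... real thread body ... */ }

int main()
{
    LARGE_INTEGER frequency, start, end;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);

    std::thread thread1(work);
    std::thread thread2(work);   // ... and so on for the other threads
    thread1.join();
    thread2.join();

    QueryPerformanceCounter(&end);
    // Elapsed wall-clock time, not CPU time.
    double seconds = double(end.QuadPart - start.QuadPart) / double(frequency.QuadPart);
    std::printf("%f s\n", seconds);
    return 0;
}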
I run it multiple times and I often get different execution times, with about 30% variation (I get values between 70 s and 100 s). I don't understand why; 30% is not a small variation.
Thanks
Many things can affect an execution time, from system load to memory usage to the stars aligning. You don't specify the order of magnitude of your execution times, but for shorter runs a 30% deviation is certainly not unheard of. The main takeaway is that an execution time on its own basically tells us nothing about the system.
QueryPerformanceCounter measures elapsed wall-clock time (it is a system-wide counter), so your result depends on system load and the particulars of process scheduling.
You could use QueryProcessCycleTime, which returns
The number of CPU clock cycles used by the threads of the process. This value includes cycles spent in both user mode and kernel mode.
You can get a handle to the current process using GetCurrentProcess.
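For example, a minimal sketch of that approach, wrapping the same create/join section as in the question:

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE  process = GetCurrentProcess();   // pseudo-handle, no CloseHandle needed
    ULONG64 cyclesBefore = 0, cyclesAfter = 0;

    QueryProcessCycleTime(process, &cyclesBefore);

    // ... create and join all the threads here ...

    QueryProcessCycleTime(process, &cyclesAfter);
    std::printf("CPU cycles used by the process: %llu\n",
                static_cast<unsigned long long>(cyclesAfter - cyclesBefore));
    return 0;
}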

Time difference between execution of two statements is not consistent

Could you please tell me why the value of timediff printed by the following program is usually 4 microseconds (printed between 90 and 1000 times, depending on the run), but sometimes 70 or more microseconds (for 2 to 10 of the prints, depending on the run):
#include <iostream>
#include <sys/time.h>

using namespace std;

#define MAXQ 1000000
#define THRDS 3

double GetMicroSecond()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return (double)(((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}

int main()
{
    double timew, timer, timediff;
    bool flagarray[MAXQ];
    int x = 0, y = 0;

    for (int i = 0; i < MAXQ; ++i)
        flagarray[i] = false;

    while (y < MAXQ)
    {
        x++;
        if (x % 1000 == 0)
        {
            timew = GetMicroSecond();
            flagarray[y++] = true;
            timer = GetMicroSecond();
            timediff = timer - timew;
            if (timediff > THRDS)
                cout << timer - timew << endl;
        }
    }
}
Compiled using: g++ testlatency.cpp -o testlatency
Note: My system has 12 cores. The performance was checked with only this program running on the system.
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
The statement flagarray[y++]=true; will take much less than a microsecond to execute on a modern computer if flagarray[y++] happens to be in the level 1 cache. The statement will take longer to execute if that location is in level 2 cache but not in level 1 cache, much longer if it is in level 3 cache but not in level 1 or level 2 cache, and much, much longer yet if it isn't in any of the caches.
Another thing that can make timer-timew exceed three microseconds is your program yielding to the OS. A page fault can result in a yield; so can a system call. The function gettimeofday is a system call, and as a general rule you should expect any system call to yield the processor.
Note: My system has 12 cores. The performance was checked with only this program running on the system.
This is not true. There are always many other programs, and many, many other threads, running on your 12-core computer. These include the operating system itself (which comprises many threads in its own right), plus lots and lots of little daemons. Whenever your program yields, the OS can decide to temporarily suspend it so that one of the myriad other threads that are waiting for the CPU gets to run for a bit.
One of those daemons is the Network Time Protocol daemon (ntpd). This does all kinds of funky little things to your system clock to keep it close to in sync with atomic clocks. With a tiny little instruction such as flagarray[y++]=true being the only thing between successive calls to gettimeofday, you might even see time occasionally go backwards.
When testing for timing, it's a good idea to do the timing at a coarse level. Don't time an individual statement that doesn't involve any function calls. It's much better to time a loop than it is to time individual executions of the loop body. Even then, you should expect some variability in the timing because of cache misses and because the OS can temporarily suspend execution of your program.
Modern Unix-based systems have better timers than gettimeofday (e.g., clock_gettime with CLOCK_MONOTONIC) that are not subject to the adjustments made by the Network Time Protocol daemon. You should use one of these rather than gettimeofday.
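For example, a drop-in replacement for GetMicroSecond() in the program above might look like this (a sketch; on older glibc you may need to link with -lrt):

#include <time.h>

// Same microsecond-returning interface as GetMicroSecond(), but backed by
// CLOCK_MONOTONIC, which is not subject to discontinuous clock adjustments.
double GetMonotonicMicroSecond()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000000.0 + (double)ts.tv_nsec / 1000.0;
}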
Generally, there are many threads sharing a small number of cores. Unless you take steps to ensure that your thread has uninterrupted use of a core, you can't guarantee that the OS won't decide to preempt it between the two GetMicroSecond() calls and let some other thread use the core for a bit.
Even if your code runs uninterrupted, the line you're trying to time:
flagarray[y++]=true;
likely takes much less time to execute than the measurement code itself.
There are many things happening inside a modern OS at the same time as your program executes. Some of them may "steal" CPU from your program, as stated in NPE's answer. A few more examples of what can influence timing:
interrupts from devices (timer, HDD, network interfaces, to name a few);
access to RAM (caching).
None of these are easily predictable.
You can expect consistency if you run your code on a microcontroller, or perhaps under a real-time OS.
There are a lot of variables that might explain different time values seen. I would focus more on
Cache miss/fill
Scheduler Events
Interrupts
bool flagarray[MAXQ];
Since you defined MAXQ to 1000000, let's assume that flagarray takes up 1MB of space.
You can compute how many cache misses can occur, based on your L1/L2 D-cache sizes. Then you can work out how many iterations it takes to fill all of L1 and start missing, and likewise for L2. The OS may also deschedule your process and reschedule it, but I hope that is less likely given the number of cores you have; the same goes for interrupts. An idle system is never completely idle. You may choose to pin your process to a core, say core N, by doing
taskset 0x<MASK> ./exe and control its execution.
If you are really curious, I would suggest that you use the "perf" tool available on most Linux distros.
You may do
perf stat -e L1-dcache-load-misses
or
perf stat -e LLC-load-misses
Once you have these numbers and the number of iterations, you can start building a picture of the activity that causes the observed lag. You can also monitor OS scheduler events using "perf stat".
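As an alternative to taskset, the affinity can also be set from inside the program; a Linux-specific sketch (the core number is arbitrary):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

int main()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);   // pin to core 2 (an arbitrary choice)

    // pid 0 means "the calling thread".
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }

    // ... run the timing loop from the question here ...
    return 0;
}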

What's the meaning of thread concurrency overhead time in the profiler output?

I'd really appreciate it if someone with good experience of Intel VTune Amplifier could explain this to me.
Recently I received a performance analysis report from other people who ran Intel VTune Amplifier against my program. It says there is high overhead time in the thread concurrency area.
What is the meaning of "Overhead Time"? They don't know (they asked me), and I don't have access to Intel VTune Amplifier.
I have only a vague idea. This program has many thread sleep calls, because pthread condition variables were unreliable (or I used them badly) on the target platform, so I changed many routines to do their work in a loop like the one below:
while (true)
{
    mutex.lock();
    if (event changed)
    {
        mutex.unlock();
        // do something
        break;
    }
    else
    {
        mutex.unlock();
        usleep(3 * 1000);
    }
}
Could this be what gets flagged as Overhead Time?
Any advice?
I found the help documentation about Overhead Time on Intel's site.
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/win/ug_docs/olh/common/overhead_time.html#overhead_time
Excerpt:
Overhead time is a duration that starts with the release of a shared resource and ends with the receipt of that resource. Ideally, the duration of Overhead time is very short because it reduces the time a thread has to wait to acquire a resource. However, not all CPU time in a parallel application may be spent on doing real payload work. In cases when parallel runtime (Intel® Threading Building Blocks, OpenMP*) is used inefficiently, a significant portion of time may be spent inside the parallel runtime wasting CPU time at high concurrency levels. For example, this may result from low granularity of work split in recursive parallel algorithms: when the workload size becomes too low, the overhead on splitting the work and performing the housekeeping work becomes significant.
Still confusing... Could it mean "you are locking unnecessarily / too frequently"?
I am also not much of an expert on that, though I have tried to use pthread a bit myself.
To demonstrate my understanding of overhead time, let us take the example of a simple single-threaded program to compute an array sum:
for (i = 0; i < NUM; i++) {
    sum += array[i];
}
In a simple [reasonably done] multi-threaded version of that code, the array could be broken into one piece per thread, each thread keeps its own sum, and after the threads are done, the sums are summed.
In a very poorly written multi-threaded version, the array could be broken down as before, and every thread could atomicAdd to a global sum.
In this case, the atomic addition can only be done by one thread at a time. I believe that overhead time is a measure of how long all of the other threads spend while waiting to do their own atomicAdd (you could try writing this program to check if you want to be sure).
Of course, it also takes into account the time it takes to deal with switching the semaphores and mutexes around. In your case, it probably means a significant amount of time is spent on the internals of the mutex.lock and mutex.unlock.
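To make the contrast concrete, here is a small sketch of both versions (illustrative only; it uses std::thread and std::atomic rather than raw pthreads):

#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Well-partitioned version: each thread accumulates into its own slot,
// so there is no contention until the final (tiny) reduction.
long sum_per_thread(const std::vector<int>& array, int nthreads)
{
    std::vector<long> partial(nthreads, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = array.size() / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end   = (t == nthreads - 1) ? array.size() : begin + chunk;
            long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += array[i];
            partial[t] = local;
        });
    }
    for (auto& th : threads) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}

// Poorly-partitioned version: every element is added to one shared atomic,
// so the threads serialize on that single location; the time they spend
// waiting on each other is what shows up as overhead/contention.
long sum_shared_atomic(const std::vector<int>& array, int nthreads)
{
    std::atomic<long> total{0};
    std::vector<std::thread> threads;
    const std::size_t chunk = array.size() / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end   = (t == nthreads - 1) ? array.size() : begin + chunk;
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(array[i], std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return total.load();
}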
I parallelized a piece of software a while ago (using pthread_barrier), and had issues where it took longer to run the barriers than it did to just use one thread. It turned out that the loop that had to have 4 barriers in it was executed quickly enough to make the overhead not worth it.
Sorry, I'm not an expert on pthread or Intel VTune Amplifier, but yes, locking a mutex and unlocking it will probably count as overhead time.
Locking and unlocking mutexes can involve system calls (at least when the lock is contended), which the profiler would probably just lump under threading overhead.
I'm not familiar with VTune, but there is overhead in the OS when switching between threads. Each time one thread stops and another is loaded onto a processor, the current thread's context needs to be stored so that it can be restored when that thread next runs, and then the new thread's context needs to be restored so it can carry on processing.
The problem may be that you have too many threads, so the processor spends most of its time switching between them. Multi-threaded applications will run most efficiently if there are the same number of threads as processors.

CPU throttling in C++

I was just wondering if there is an elegant way to set the maximum CPU load for a particular thread doing intensive calculations.
Right now I have located the most time-consuming loop in the thread (it only does compression) and use GetTickCount() and Sleep() with hard-coded values. This makes sure that the loop runs for a certain period and then sleeps for a certain minimum time. It more or less does the job, i.e. it guarantees that the thread will not use more than 50% of CPU. However, the behavior depends on the number of CPU cores (a huge disadvantage) and is simply ugly (a smaller disadvantage :)). Any ideas?
I am not aware of any API to get the OS's scheduler to do what you want (even if your thread is idle-priority, if there are no higher-priority ready threads, yours will run). However, I think you can improvise a fairly elegant throttling function based on what you are already doing. Essentially (I don't have a Windows dev machine handy):
Pick a default amount of time the thread will sleep each iteration. Then, on each iteration (or on every nth iteration, such that the throttling function doesn't itself become a significant CPU load),
Compute the amount of CPU time your thread used since the last time your throttling function was called (I'll call this dCPU). You can use the GetThreadTimes() API to get the amount of time your thread has been executing.
Compute the amount of real time elapsed since the last time your throttling function was called (I'll call this dClock).
dCPU / dClock is the percent CPU usage (of one CPU). If it is higher than you want, increase your sleep time; if lower, decrease it.
Have your thread sleep for the computed time.
Depending on how your watchdog computes CPU usage, you might want to use GetProcessAffinityMask() to find out how many CPUs the system has. dCPU / (dClock * CPUs) is the percentage of total CPU time available.
You will still have to pick some magic numbers for the initial sleep time and the increment/decrement amount, but I think this algorithm could be tuned to keep a thread running at fairly close to a determined percent of CPU.
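A rough sketch of that throttling function (untested, Windows-only; the initial sleep time and the 1 ms adjustment step are arbitrary starting points):

#include <windows.h>

// Call Throttle(target) once per iteration of the compression loop; it adjusts
// its own sleep time so that dCPU / dClock stays near `targetFraction` of one CPU.
void Throttle(double targetFraction)
{
    static DWORD     sleepMs      = 10;                 // initial guess
    static ULONGLONG lastClockMs  = GetTickCount64();
    static ULONGLONG lastCpu100ns = 0;

    FILETIME creation, exit, kernel, user;
    GetThreadTimes(GetCurrentThread(), &creation, &exit, &kernel, &user);

    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;
    ULONGLONG cpu100ns = k.QuadPart + u.QuadPart;       // CPU time in 100 ns units

    ULONGLONG nowMs   = GetTickCount64();
    double    dCpuMs  = (cpu100ns - lastCpu100ns) / 10000.0;
    double    dClockMs = double(nowMs - lastClockMs);

    if (dClockMs > 0) {
        double usage = dCpuMs / dClockMs;               // fraction of one CPU
        if (usage > targetFraction && sleepMs < 500) sleepMs += 1;
        else if (usage < targetFraction && sleepMs > 1) sleepMs -= 1;
    }

    lastCpu100ns = cpu100ns;
    lastClockMs  = nowMs;
    Sleep(sleepMs);
}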
On Linux, you can change the scheduling priority of a thread with nice().
I can't think of any cross-platform way of doing what you want (or any guaranteed way, full stop), but since you are using GetTickCount, perhaps you aren't interested in cross-platform :)
I'd use interprocess communication and set the intensive process's nice level to get what you require, but I'm not sure that's appropriate for your situation.
EDIT:
I agree with Bernard, which is why I think a process rather than a thread might be more appropriate, but it might just not suit your purposes.
The problem is that it's not normal to want to leave the CPU idle while you have work to do. Normally you set a background task to IDLE priority and let the OS schedule it in whatever CPU time isn't used by interactive tasks.
It sounds to me like the problem is the watchdog process.
If your background task is CPU-bound then you want it to take all the unused CPU time for its task.
Maybe you should look at fixing the watchdog program?
You may be able to change the priority of a thread, but changing the maximum utilization would require either polling and hacks to limit how much work gets done, or OS tools that can set the maximum utilization of a process.
However, I don't see any circumstance where you would want to do this.