Execution time of threads changes - C++

I'm using the method from Measure execution time in C (on Windows) to measure the execution time of all my threads. I'm on Windows and I'm using VC++.
Something like this:
// initialize start, end and frequency (QueryPerformanceCounter / QueryPerformanceFrequency)
std::thread thread1(...);
// ... and so on, I create all the threads
thread1.join();
// ... join all the threads
// time = (end - start) / frequency
printf("%f\n", time);
I run it multiple times and I often get different execution times: there is about 30% variation (I get values anywhere between 70 s and 100 s). I don't understand why; 30% is not a small variation.
Thanks

Many things can affect an execution time, from system load to memory usage to the stars aligning. You don't say what order of magnitude your execution times are, but for shorter runs a 30% deviation is certainly not unheard of. The main takeaway is that an execution time on its own tells us basically nothing about the system.

QueryPerformanceCounter is a system-wide counter, so your result depends on system load and the particulars of the process scheduling.
You could use QueryProcessCycleTime, which returns
The number of CPU clock cycles used by the threads of the process. This value includes cycles spent in both user mode and kernel mode.
You can get a handle to the current process using GetCurrentProcess.
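A minimal sketch of how that could look (Windows Vista or later); the worker-thread code is elided, just as in the question's pseudocode:
#include <windows.h>
#include <cstdio>

int main()
{
    ULONG64 cyclesStart = 0, cyclesEnd = 0;
    QueryProcessCycleTime(GetCurrentProcess(), &cyclesStart);

    // ... create and join all the worker threads here ...

    QueryProcessCycleTime(GetCurrentProcess(), &cyclesEnd);
    // Cycles consumed by every thread of this process, user + kernel mode.
    printf("CPU cycles used: %llu\n", cyclesEnd - cyclesStart);
    return 0;
}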

Related

Worse performance with multiple threads in C++ [duplicate]

There is a really interesting note here: http://en.cppreference.com/w/cpp/chrono/c/clock
"Only the difference between two values returned by different calls to std::clock is meaningful, as the beginning of the std::clock era does not have to coincide with the start of the program. std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock."
Why does the clock speed up with multithreading? I'm comparing the performance of a C++ program with and without threading, and I'm noticing that the measured times are similar with threading (not better), yet the program feels faster (for example, it reports 8 seconds when only about 3 seconds of wall-clock time have passed).
If more than one core is available and you are running multiple threads, then potentially multiple threads are executing at the same time on different cores. Since clock() measures processor time, it may advance faster than wall-clock time, because multiple threads are advancing it simultaneously.
Just like the example given in the documentation: it creates two threads, and the clock() value it reports is almost double the wall-clock time it reports.
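In the same spirit, a minimal sketch (assuming a POSIX system, where clock() reports processor time, and at least two free cores) that shows clock() advancing roughly twice as fast as the wall clock:
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

static void burnCpu()
{
    // Busy work so each thread consumes CPU time on its own core.
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 200000000ULL; ++i)
        x += i;
}

int main()
{
    std::clock_t cpuStart = std::clock();
    auto wallStart = std::chrono::steady_clock::now();

    std::thread t1(burnCpu);
    std::thread t2(burnCpu);
    t1.join();
    t2.join();

    std::clock_t cpuEnd = std::clock();
    auto wallEnd = std::chrono::steady_clock::now();

    // With two cores doing the work, CPU time should be close to
    // twice the elapsed wall-clock time.
    std::printf("CPU time:  %.2f s\n", double(cpuEnd - cpuStart) / CLOCKS_PER_SEC);
    std::printf("Wall time: %.2f s\n",
                std::chrono::duration<double>(wallEnd - wallStart).count());
    return 0;
}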

Multitasking and measuring time difference

I understand that a preemptive multitasking OS can interrupt a process at any "code position".
Given the following code:
int main() {
    while( true ) {
        doSthImportant(); // needs to be executed at least each 20 msec

        // start of critical section
        int start_usec = getTime_usec();

        doSthElse();

        int timeDiff_usec = getTime_usec() - start_usec;
        // end of critical section

        evalUsedTime( timeDiff_usec );

        sleep_msec( 10 );
    }
}
I would expect this code usually to produce proper results for timeDiff_usec, especially if doSthElse() and getTime_usec() don't take much time, so that they are only rarely interrupted by the OS scheduler.
But from time to time the program will be interrupted somewhere inside the "critical section". The context switch will do what it is supposed to do, and in such a case the program will still produce a wrong result for timeDiff_usec.
This is the only example I have in mind right now, but I'm sure there are other scenarios where multitasking might get a program(mer) into trouble (as time is not the only state that might have changed by the time the program is resumed).
Is there a way to ensure that measuring the time for a certain action works fine?
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Edit:
I changed the sample code to make it more precise.
I want to check the time being spent to make sure that doSthElse() doesn't take like 50 msec or so, and if it does I would look for a better solution.
Is there a way to ensure that measuring the time for a certain action works fine?
That depends on your operating system and your privilege level. On some systems, at some privilege levels, you can give a process or thread a priority that prevents it from being preempted by anything at lower priority. For example, on Linux, you might use sched_setscheduler to give a thread real-time priority. (If you're really serious, you can also set the thread's CPU affinity and the SMP IRQ affinities so that no interrupts are handled on the CPU that's running your thread.)
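A minimal sketch of the Linux part (requires root or CAP_SYS_NICE; the priority value is only an illustration):
#include <sched.h>
#include <cstdio>

int main()
{
    sched_param sp{};
    sp.sched_priority = sched_get_priority_min(SCHED_FIFO);  // lowest real-time priority
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {       // 0 = calling process
        std::perror("sched_setscheduler");
        return 1;
    }
    // ... timing-critical work here; ordinary time-shared threads
    //     can no longer preempt this thread ...
    return 0;
}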
Your system may also provide time tracking that accounts for time spent preempted. For example, POSIX defines the getrusage function, which returns a struct containing ru_utime (the amount of time spent in “user mode” by the process) and ru_stime (the amount of time spent in “kernel mode” by the process). These should sum to the total time the CPU spent on the process, excluding intervals during which the process was suspended. Note that if the kernel needs to, for example, spend time paging on behalf of your process, it's not defined how much (if any) of that time is charged to your process.
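For example, a small POSIX sketch that reads those two fields for the current process:
#include <sys/resource.h>
#include <cstdio>

int main()
{
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);   // CPU time charged to this process so far
    double userSec = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sysSec  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    std::printf("user: %.6f s, kernel: %.6f s\n", userSec, sysSec);
    return 0;
}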
Anyway, the common way to measure the time spent on some critical action is to time it repeatedly (essentially the way your question shows), on an otherwise idle system, throw out the outlier measurements, and then take the mean of what remains, or the median, or the 95th percentile, depending on why you need the measurement.
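A small sketch of that repeated-measurement approach; criticalAction() is just a placeholder for whatever is being timed:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

static void criticalAction()
{
    // Stand-in workload for the code under measurement.
    volatile int sink = 0;
    for (int i = 0; i < 100000; ++i)
        sink += i;
}

int main()
{
    const int runs = 1000;
    std::vector<double> samplesUs;
    samplesUs.reserve(runs);

    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        criticalAction();
        auto t1 = std::chrono::steady_clock::now();
        samplesUs.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }

    std::sort(samplesUs.begin(), samplesUs.end());
    // Report robust statistics rather than a single noisy measurement.
    std::printf("median: %.2f us, 95th percentile: %.2f us\n",
                samplesUs[runs / 2], samplesUs[runs * 95 / 100]);
    return 0;
}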
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Too broad. There are whole books written about this subject.

Time difference between execution of two statements is not consistent

Could you please tell me why the value of timediff printed by the following program is usually 4 microseconds (for roughly 90 to 1000 of the printed iterations, depending on the run), but occasionally 70 or more microseconds (for about 2 to 10 of the printed iterations, depending on the run):
#include <iostream>
#include <sys/time.h>
using namespace std;

#define MAXQ 1000000
#define THRDS 3

double GetMicroSecond()
{
    timeval tv;
    gettimeofday (&tv, NULL);
    return (double) (((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}

int main()
{
    double timew, timer, timediff;
    bool flagarray[MAXQ];
    int x=0, y=0;

    for(int i=0; i<MAXQ; ++i)
        flagarray[i] = false;

    while(y < MAXQ)
    {
        x++;
        if(x%1000 == 0)
        {
            timew = GetMicroSecond();
            flagarray[y++] = true;
            timer = GetMicroSecond();
            timediff = timer - timew;
            if(timediff > THRDS) cout << timer-timew << endl;
        }
    }
}
Compiled using: g++ testlatency.cpp -o testlatency
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
The statement flagarray[y++]=true; will take much less than a microsecond to execute on a modern computer if flagarray[y++] happens to be in the level 1 cache. The statement will take longer to execute if that location is in level 2 cache but not in level 1 cache, much longer if it is in level 3 cache but not in level 1 or level 2 cache, and much, much longer yet if it isn't in any of the caches.
Another thing that can make timer-timew exceed three microseconds is when your program yields to the OS. Cache misses can result in a yield. So can system calls. The function gettimeofday is a system call. As a general rule, you should expect any system call to yield.
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
This is not true. There are always many other programs, and many, many other threads, running on your 12-core computer. These include the operating system itself (which comprises many threads in its own right), plus lots and lots of little daemons. Whenever your program yields, the OS can decide to temporarily suspend it so that one of the myriad other threads that are asking for use of the CPU gets to run.
One of those daemons is the Network Time Protocol daemon (ntpd). That does all kinds of funky little things to your system clock to keep it closely in sync with atomic clocks. With a tiny instruction such as flagarray[y++]=true being the only thing between successive calls to gettimeofday, you might even see time occasionally go backwards.
When testing for timing, it's a good idea to do the timing at a coarse level. Don't time an individual statement that doesn't involve any function calls. It's much better to time a loop than it is to time individual executions of the loop body. Even then, you should expect some variability in timing because of cache misses and because the OS will occasionally suspend execution of your program.
Modern Unix-based systems have better timers than gettimeofday (e.g., clock_gettime with CLOCK_MONOTONIC) that are not subject to the adjustments made by the Network Time Protocol daemon. You should use one of those rather than gettimeofday.
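A possible drop-in replacement for the question's GetMicroSecond(), using CLOCK_MONOTONIC so that NTP stepping the wall clock can't make the measured difference go negative:
#include <time.h>

double GetMonotonicMicroSecond()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   // monotonic, not affected by ntpd stepping the clock
    return (double)ts.tv_sec * 1000000.0 + (double)ts.tv_nsec / 1000.0;
}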
Generally, there are many threads sharing a small number of cores. Unless you take steps to ensure that your thread has uninterrupted use of a core, you can't guarantee that the OS won't decide to preempt your thread between the two GetMicroSecond() calls and let some other thread use the core for a bit.
Even if your code runs uninterrupted, the line you're trying to time:
flagarray[y++]=true;
likely takes much less time to execute than the measurement code itself.
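One way to see this is to time nothing at all; the reported difference is roughly the cost of the measurement machinery itself. A small standalone sketch, reusing the question's GetMicroSecond():
#include <iostream>
#include <sys/time.h>
using namespace std;

// Same helper as in the question.
double GetMicroSecond()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return (double)(((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}

int main()
{
    // Two back-to-back calls: the difference is the overhead of the
    // two gettimeofday system calls, with nothing measured in between.
    double t0 = GetMicroSecond();
    double t1 = GetMicroSecond();
    cout << "measurement overhead: " << (t1 - t0) << " microseconds" << endl;
}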
There are many things happening inside a modern OS at the same time as your program executes. Some of them may "steal" CPU from your program, as stated in NPE's answer. A few more examples of things that can influence timing:
interrupts from devices (the timer, HDD, and network interfaces, to mention a few);
access to RAM (caching).
None of these are easily predictable.
You can expect consistency if you run your code on a microcontroller, or maybe under a real-time OS.
There are a lot of variables that might explain the different time values you are seeing. I would focus more on
Cache miss/fill
Scheduler Events
Interrupts
bool flagarray[MAXQ];
Since you defined MAXQ as 1000000, and sizeof(bool) is typically 1 byte, let's assume that flagarray takes up about 1 MB of space.
You can compute how many cache misses can occur based on your L1/L2 D-cache sizes, and then correlate that with how many iterations it takes to fill all of L1 and start missing, and likewise for L2 (a rough estimate is sketched below). The OS may also deschedule your process and reschedule it, but I hope that is less likely given the number of cores you have. The same goes for interrupts: an idle system is never completely idle. You may choose to pin your process to a core, say core N, by doing
taskset 0x<MASK> ./exe
and controlling its execution.
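To make the cache-miss estimate concrete (the cache sizes below are assumptions for illustration, not something stated in the question): flagarray holds 1,000,000 bools, i.e. roughly 1 MB. With 64-byte cache lines that is about 1,000,000 / 64 ≈ 15,600 lines, so the sequential pass touches a new line roughly once every 64 iterations. A 32 KB L1 D-cache holds only about 512 of those lines and a 256 KB L2 about 4,096, so the array cannot stay resident in either, and periodic misses (plus the occasional much slower trip to L3 or DRAM) are exactly the kind of thing that shows up as an outlier timediff.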
If you are really curious, I would suggest that you use the "perf" tool available on most Linux distros.
You may do
perf stat -e L1-dcache-load-misses ./testlatency
or
perf stat -e LLC-load-misses ./testlatency
Once you have these numbers and the number of iterations, you can start building a picture of the activity that causes the observed lag. You may also monitor OS scheduler events using "perf stat".
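For instance, a single run combining the cache events with the scheduler's context-switch count might look like the line below (exact event names can vary between kernels and CPUs, so treat it as a sketch):
perf stat -e L1-dcache-load-misses,LLC-load-misses,context-switches ./testlatency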

How to minimize CPU consumption with a fixed thread loop period

I use pthread lib 2.8 and the OS kernel is Linux 2.6.37 on ARM. In my program, thread A uses the pthread interfaces to set its scheduling priority to the halfway point between sched_get_priority_min(policy) and sched_get_priority_max(policy).
In the thread function loop:
{
    // do my work
    pthread_cond_timedwait(..., ..., 15 ms)
}
I find this thread consumes about 3% CPU. If I change the timeout to 30 ms, it drops to about 1.3%. However, I cannot increase the timeout. Is there a way to reduce the CPU consumption while keeping the 15 ms timeout? It seems the cost is due to thread switching.
Using this sort of construct will cause approximately 67 task switches per second, and most likely each switch is to a different process, which means a complete context switch including the page tables. It's been a while since I looked at what that involves on ARM, but I'm sure it's not a "lightweight" operation. If we count backwards, 1.75% of the CPU works out to roughly 210k clock cycles per task switch, which seems like quite a lot; but I'm not sure how much work is involved in scrubbing TLBs, caches and such.
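For instance, assuming an 800 MHz core (the question doesn't give the clock speed, so this is only an illustration): 1.75% of 800,000,000 cycles per second is 14,000,000 cycles per second, and spread over roughly 67 switches per second that is about 210,000 cycles per switch.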

CPU throttling in C++

I was just wondering if there is an elegant way to set the maximum CPU load for a particular thread doing intensive calculations.
Right now I have located the most time-consuming loop in the thread (it does only compression) and use GetTickCount() and Sleep() with hardcoded values. It makes sure that the loop runs for a certain period and then sleeps for a certain minimum time. This more or less does the job, i.e. it guarantees that the thread will not use more than 50% of the CPU. However, the behavior depends on the number of CPU cores (a huge disadvantage) and is simply ugly (a smaller disadvantage :)). Any ideas?
I am not aware of any API for getting the OS's scheduler to do what you want (even if your thread is idle-priority, if there are no higher-priority ready threads, yours will run). However, I think you can improvise a fairly elegant throttling function based on what you are already doing. Essentially (I don't have a Windows dev machine handy):
Pick a default amount of time the thread will sleep each iteration. Then, on each iteration (or on every nth iteration, such that the throttling function doesn't itself become a significant CPU load),
Compute the amount of CPU time your thread used since the last time your throttling function was called (I'll call this dCPU). You can use the GetThreadTimes() API to get the amount of time your thread has been executing.
Compute the amount of real time elapsed since the last time your throttling function was called (I'll call this dClock).
dCPU / dClock is the percent CPU usage (of one CPU). If it is higher than you want, increase your sleep time; if lower, decrease it.
Have your thread sleep for the computed time.
Depending on how your watchdog computes CPU usage, you might want to use GetProcessAffinityMask() to find out how many CPUs the system has. dCPU / (dClock * CPUs) is the percentage of total CPU time available.
You will still have to pick some magic numbers for the initial sleep time and the increment/decrement amount, but I think this algorithm could be tuned to keep a thread running fairly close to a chosen percentage of CPU (a rough sketch follows).
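A rough sketch of that loop (Windows; doWorkChunk(), the 50% target and the magic numbers are placeholders, not part of the question):
#include <windows.h>

static void doWorkChunk()
{
    // Placeholder for one slice of the real compression work.
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 5000000ULL; ++i)
        x += i;
}

static ULONGLONG threadCpuTime100ns()
{
    FILETIME creation, exitTime, kernel, user;
    GetThreadTimes(GetCurrentThread(), &creation, &exitTime, &kernel, &user);
    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime; k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;   u.HighPart = user.dwHighDateTime;
    return k.QuadPart + u.QuadPart;   // kernel + user time, in 100-ns units
}

void throttledLoop()
{
    const double targetLoad = 0.50;   // aim for about 50% of one CPU
    DWORD sleepMs = 10;               // initial guess for the per-iteration sleep
    ULONGLONG lastCpu = threadCpuTime100ns();
    ULONGLONG lastWall = GetTickCount64();

    for (;;) {
        doWorkChunk();

        ULONGLONG cpu = threadCpuTime100ns();
        ULONGLONG wall = GetTickCount64();
        double dCpuMs = (cpu - lastCpu) / 10000.0;   // 100 ns -> ms
        double dClockMs = (double)(wall - lastWall);
        lastCpu = cpu;
        lastWall = wall;

        // Raise or lower the sleep depending on the measured duty cycle.
        if (dClockMs > 0.0) {
            double load = dCpuMs / dClockMs;
            if (load > targetLoad && sleepMs < 100) ++sleepMs;
            else if (load < targetLoad && sleepMs > 0) --sleepMs;
        }
        Sleep(sleepMs);
    }
}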
On Linux, you can change the scheduling priority of a thread with nice().
I can't think of any cross-platform way of doing what you want (or any guaranteed way, full stop), but as you are using GetTickCount perhaps you aren't interested in cross-platform :)
I'd use interprocess communication and set the intensive process's nice level to get what you require, but I'm not sure that's appropriate for your situation.
EDIT:
I agree with Bernard, which is why I think a process rather than a thread might be more appropriate, but it just might not suit your purposes.
The problem is that it's not normal to want to leave the CPU idle while you have work to do. Normally you set a background task to IDLE priority and let the OS handle scheduling it onto all the CPU time that isn't used by interactive tasks.
It sounds to me like the problem is the watchdog process.
If your background task is CPU-bound then you want it to take all the unused CPU time for its task.
Maybe you should look at fixing the watchdog program?
You may be able to change the priority of a thread, but changing the maximum utilization would require either polling and hacks to limit how much work gets done, or using OS tools that can set the maximum utilization of a process.
However, I don't see any circumstance where you would want to do this.